1 <?xml version='1.0' encoding='UTF-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
5 <title xml:id="lustreproc.title">Lustre Parameters</title>
6 <para>There are many parameters for Lustre that can tune client and server
7 performance, change behavior of the system, and report statistics about
8 various subsystems. This chapter describes the various parameters and
9 tunables that are useful for optimizing and monitoring aspects of a Lustre
10 file system. It includes these sections:</para>
13 <para><xref linkend="enabling_interpreting_debugging_logs"/></para>
18 <title>Introduction to Lustre Parameters</title>
19 <para>Lustre parameters and statistics files provide an interface to
20 internal data structures in the kernel that enables monitoring and
21 tuning of many aspects of Lustre file system and application performance.
22 These data structures include settings and metrics for components such
23 as memory, networking, file systems, and kernel housekeeping routines,
24 which are available throughout the hierarchical file layout.
26 <para>Typically, metrics are accessed via <literal>lctl get_param</literal>
27 files and settings are changed via <literal>lctl set_param</literal>.
28 They allow getting and setting multiple parameters with a single command,
29 through the use of wildcards in one or more parts of the parameter name.
30 While each of these parameters maps to files in <literal>/proc</literal>
31 and <literal>/sys</literal> directly, the location of these parameters may
32 change between Lustre releases, so it is recommended to always use
33 <literal>lctl</literal> to access the parameters from userspace scripts.
34 Some data is server-only, some data is client-only, and some data is
35 exported from the client to the server and is thus duplicated in both
38 <para>In the examples in this chapter, <literal>#</literal> indicates
39 a command is entered as root. Lustre servers are named according to the
40 convention <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>.
41 The standard UNIX wildcard designation (*) is used to represent any
42 part of a single component of the parameter name, excluding
43 "<literal>.</literal>" and "<literal>/</literal>".
44 It is also possible to use brace <literal>{}</literal> expansion
45 to specify a list of parameter names efficiently.</para>
47 <para>Some examples are shown below:</para>
50 <para> To list available OST targets on a Lustre client:</para>
51 <screen># lctl list_param -F osc.*
52 osc.testfs-OST0000-osc-ffff881071d5cc00/
53 osc.testfs-OST0001-osc-ffff881071d5cc00/
54 osc.testfs-OST0002-osc-ffff881071d5cc00/
55 osc.testfs-OST0003-osc-ffff881071d5cc00/
56 osc.testfs-OST0004-osc-ffff881071d5cc00/
57 osc.testfs-OST0005-osc-ffff881071d5cc00/
58 osc.testfs-OST0006-osc-ffff881071d5cc00/
59 osc.testfs-OST0007-osc-ffff881071d5cc00/
60 osc.testfs-OST0008-osc-ffff881071d5cc00/</screen>
61 <para>In this example, information about OST connections available
62 on a client is displayed (indicated by "osc"). Each of these
63 connections may have numerous sub-parameters as well.</para>
68 <para> To see multiple levels of parameters, use multiple
69 wildcards:<screen># lctl list_param osc.*.*
70 osc.testfs-OST0000-osc-ffff881071d5cc00.active
71 osc.testfs-OST0000-osc-ffff881071d5cc00.blocksize
72 osc.testfs-OST0000-osc-ffff881071d5cc00.checksum_type
73 osc.testfs-OST0000-osc-ffff881071d5cc00.checksums
74 osc.testfs-OST0000-osc-ffff881071d5cc00.connect_flags
75 osc.testfs-OST0000-osc-ffff881071d5cc00.contention_seconds
76 osc.testfs-OST0000-osc-ffff881071d5cc00.cur_dirty_bytes
78 osc.testfs-OST0000-osc-ffff881071d5cc00.rpc_stats</screen></para>
83 <para> To see a specific subset of parameters, use braces, like:
84 <screen># lctl list_param osc.*.{checksum,connect}*
85 osc.testfs-OST0000-osc-ffff881071d5cc00.checksum_type
86 osc.testfs-OST0000-osc-ffff881071d5cc00.checksums
87 osc.testfs-OST0000-osc-ffff881071d5cc00.connect_flags
93 <para> To view a specific file, use <literal>lctl get_param</literal>:
94 <screen># lctl get_param osc.lustre-OST0000*.rpc_stats</screen></para>
97 <para>For more information about using <literal>lctl</literal>, see <xref
98 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="setting_param_with_lctl"/>.</para>
99 <para>Data can also be viewed using the <literal>cat</literal> command
100 with the full path to the file. The form of the <literal>cat</literal>
101 command is similar to that of the <literal>lctl get_param</literal>
102 command with some differences. Unfortunately, as the Linux kernel has
103 changed over the years, the location of statistics and parameter files
104 has also changed, which means that the Lustre parameter files may be
105 located in the <literal>/proc</literal> directory, the
106 <literal>/sys</literal> directory, and/or the
107 <literal>/sys/kernel/debug</literal> directory, depending on the kernel
108 version and the Lustre version being used. The <literal>lctl</literal>
109 command insulates scripts from these changes and is preferred over direct
110 file access, except as part of a high-performance monitoring system.
112 <note condition='l2c'><para>Starting in Lustre 2.12, the
113 <literal>lctl get_param</literal> and <literal>lctl set_param</literal>
114 commands can provide <emphasis>tab completion</emphasis> when using an
115 interactive shell with <literal>bash-completion</literal> installed.
116 This simplifies the use of <literal>get_param</literal> significantly,
117 since it provides an interactive list of available parameters.
119 <para>The <literal>llstat</literal> utility can be used to monitor some
120 Lustre file system I/O activity over a specified time period. For more
122 <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="config_llstat"/></para>
123 <para>Some data is imported from attached clients and is available in a
124 directory called <literal>exports</literal> located in the corresponding
125 per-service directory on a Lustre server. For example:
126 <screen>oss:/root# lctl list_param obdfilter.testfs-OST0000.exports.*
127 # hash ldlm_stats stats uuid</screen></para>
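<para>The per-client data below the <literal>exports</literal> directory can
be read with the same wildcard syntax. For example, to dump the
<literal>stats</literal> file for every attached client (output varies by
client and workload):
<screen>oss:/root# lctl get_param obdfilter.testfs-OST0000.exports.*.stats</screen></para>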
129 <title>Identifying Lustre File Systems and Servers</title>
130 <para>Several parameter files on the MGS list existing
131 Lustre file systems and file system servers. The examples below are for
132 a Lustre file system called
133 <literal>testfs</literal> with one MDT and three OSTs.</para>
136 <para> To view all known Lustre file systems, enter:</para>
137 <screen>mgs# lctl get_param mgs.*.filesystems
141 <para> To view the names of the servers in a file system in which at least one server is
143 enter:<screen>lctl get_param mgs.*.live.<replaceable><filesystem name></replaceable></screen></para>
144 <para>For example:</para>
145 <screen>mgs# lctl get_param mgs.*.live.testfs
153 Secure RPC Config Rules:
155 imperative_recovery_state:
159 notify_duration_total: 0.001000
160 notify_duration_max: 0.001000
161 notify_count: 4</screen>
164 <para>To list all configured devices on the local node, enter:</para>
165 <screen># lctl device_list
167 1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705
168 2 UP mdt MDS MDS_uuid 3
169 3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
170 4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 7
171 5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
172 6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
173 7 UP lov testfs-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa04
174 8 UP mdc testfs-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
175 9 UP osc testfs-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
176 10 UP osc testfs-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05</screen>
177 <para>The information provided on each line includes:</para>
178 <para> - Device number</para>
179 <para> - Device status (UP, INactive, or STopping) </para>
180 <para> - Device name</para>
181 <para> - Device UUID</para>
182 <para> - Reference count (how many users this device has)</para>
185 <para>To display the name of any server, view the device
186 label:<screen>mds# e2label /dev/sda
187 testfs-MDT0000</screen></para>
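<para>For a ZFS-backed target, the equivalent information is stored as a
dataset property rather than a disk label. A sketch, assuming an
illustrative pool and dataset name:
<screen>mds# zfs get -H -o value lustre:svname metapool/mdt0
testfs-MDT0000</screen></para>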
193 <title>Tuning Multi-Block Allocation (mballoc)</title>
194 <para>Capabilities supported by <literal>mballoc</literal> include:</para>
197 <para> Pre-allocation for single files to help to reduce fragmentation.</para>
200 <para> Pre-allocation for a group of files to enable packing of small files into large,
201 contiguous chunks.</para>
204 <para> Stream allocation to help decrease the seek rate.</para>
207 <para>The following <literal>mballoc</literal> tunables are available:</para>
208 <informaltable frame="all">
210 <colspec colname="c1" colwidth="30*"/>
211 <colspec colname="c2" colwidth="70*"/>
215 <para><emphasis role="bold">Field</emphasis></para>
218 <para><emphasis role="bold">Description</emphasis></para>
226 <literal>mb_max_to_scan</literal></para>
229 <para>Maximum number of free chunks that <literal>mballoc</literal> finds before making a
230 final decision, to avoid a livelock situation.</para>
236 <literal>mb_min_to_scan</literal></para>
239 <para>Minimum number of free chunks that <literal>mballoc</literal> searches before
240 picking the best chunk for allocation. This is useful for small requests to reduce
241 fragmentation of big free chunks.</para>
247 <literal>mb_order2_req</literal></para>
250 <para>For requests equal to 2^N, where N >= <literal>mb_order2_req</literal>, a
251 fast search is done using a base 2 buddy allocation service.</para>
257 <literal>mb_small_req</literal></para>
260 <para><literal>mb_small_req</literal> - Defines (in MB) the upper bound of "small
262 <para><literal>mb_large_req</literal> - Defines (in MB) the lower bound of "large
264 <para>Requests are handled differently based on size:<itemizedlist>
266 <para>&lt; <literal>mb_small_req</literal> - Requests are packed together to
267 form large, aggregated requests.</para>
270 <para>&gt; <literal>mb_small_req</literal> and &lt; <literal>mb_large_req</literal>
271 - Requests are primarily allocated linearly.</para>
274 <para>&gt; <literal>mb_large_req</literal> - Requests are allocated without aggregation, since hard disk
275 seek time is less of a concern in this case.</para>
277 </itemizedlist></para>
278 <para>In general, small requests are combined to create larger requests, which are
279 then placed close to one another to minimize the number of seeks required to access
286 <literal>mb_large_req</literal></para>
292 <literal>prealloc_table</literal></para>
295 <para>A table of values used to preallocate space when a new request is received. By
296 default, the table looks like
297 this:<screen>prealloc_table
298 4 8 16 32 64 128 256 512 1024 2048 </screen></para>
299 <para>When a new request is received, space is preallocated at the next higher
300 increment specified in the table. For example, for requests of less than 4 file
301 system blocks, 4 blocks of space are preallocated; for requests between 4 and 8, 8
302 blocks are preallocated; and so forth.</para>
303 <para>Although customized values can be entered in the table, the performance of
304 general usage file systems will not typically be improved by modifying the table (in
305 fact, in ext4 systems, the table values are fixed). However, for some specialized
306 workloads, tuning the <literal>prealloc_table</literal> values may result in smarter
307 preallocation decisions. </para>
313 <literal>mb_group_prealloc</literal></para>
316 <para>The amount of space (in kilobytes) preallocated for groups of small
323 <para>Buddy group cache information found in
324 <literal>/sys/fs/ldiskfs/<replaceable>disk_device</replaceable>/mb_groups</literal> may
325 be useful for assessing on-disk fragmentation. For
326 example:<screen>cat /proc/fs/ldiskfs/loop0/mb_groups
327 #group: free free frags first pa [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9
329 #0 : 2936 2936 1 42 0 [ 0 0 0 1 1 1 1 2 0 1
330 2 0 0 0 ]</screen></para>
331 <para>In this example, the columns show:<itemizedlist>
333 <para>#group number</para>
336 <para>Available blocks in the group</para>
339 <para>Blocks free on a disk</para>
342 <para>Number of free fragments</para>
345 <para>First free block in the group</para>
348 <para>Number of preallocated chunks (not blocks)</para>
351 <para>A series of available chunks of different sizes</para>
353 </itemizedlist></para>
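<para>As a sketch of how the <literal>mballoc</literal> tunables described
above are read and adjusted, assuming the tunables are exposed under
<literal>/sys/fs/ldiskfs</literal> on the kernel in use and an illustrative
device name:
<screen>oss# cat /sys/fs/ldiskfs/sda/mb_max_to_scan
200
oss# echo 500 > /sys/fs/ldiskfs/sda/mb_max_to_scan</screen></para>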
356 <title>Monitoring Lustre File System I/O</title>
357 <para>A number of system utilities are provided to enable collection of data related to I/O
358 activity in a Lustre file system. In general, the data collected describes:</para>
361 <para> Data transfer rates and throughput of inputs and outputs external to the Lustre file
362 system, such as network requests or disk I/O operations performed</para>
365 <para> Data about the throughput or transfer rates of internal Lustre file system data, such
366 as locks or allocations. </para>
370 <para>It is highly recommended that you complete baseline testing for your Lustre file system
371 to determine normal I/O activity for your hardware, network, and system workloads. Baseline
372 data will allow you to easily determine when performance becomes degraded in your system.
373 Two particularly useful baseline statistics are:</para>
376 <para><literal>brw_stats</literal> – Histogram data characterizing I/O requests to the
377 OSTs. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
378 linkend="monitor_ost_block_io_stream"/>.</para>
381 <para><literal>rpc_stats</literal> – Histogram data showing information about RPCs made by
382 clients. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
383 linkend="MonitoringClientRCPStream"/>.</para>
387 <section remap="h3" xml:id="MonitoringClientRCPStream">
389 <primary>proc</primary>
390 <secondary>watching RPC</secondary>
391 </indexterm>Monitoring the Client RPC Stream</title>
392 <para>The <literal>rpc_stats</literal> file contains histogram data showing information about
393 remote procedure calls (RPCs) that have been made since this file was last cleared. The
394 histogram data can be cleared by writing any value into the <literal>rpc_stats</literal>
396 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
397 <screen># lctl get_param osc.testfs-OST0000-osc-ffff810058d2f800.rpc_stats
398 snapshot_time: 1372786692.389858 (secs.usecs)
399 read RPCs in flight: 0
400 write RPCs in flight: 1
401 dio read RPCs in flight: 0
402 dio write RPCs in flight: 0
403 pending write pages: 256
404 pending read pages: 0
407 pages per rpc rpcs % cum % | rpcs % cum %
416 256: 850 100 100 | 18346 99 100
419 rpcs in flight rpcs % cum % | rpcs % cum %
420 0: 691 81 81 | 1740 9 9
421 1: 48 5 86 | 938 5 14
422 2: 29 3 90 | 1059 5 20
423 3: 17 2 92 | 1052 5 26
424 4: 13 1 93 | 920 5 31
425 5: 12 1 95 | 425 2 33
426 6: 10 1 96 | 389 2 35
427 7: 30 3 100 | 11373 61 97
428 8: 0 0 100 | 460 2 100
431 offset rpcs % cum % | rpcs % cum %
432 0: 850 100 100 | 18347 99 99
440 128: 0 0 100 | 4 0 100
443 <para>The header information includes:</para>
446 <para><literal>snapshot_time</literal> - UNIX epoch instant the file was read.</para>
449 <para><literal>read RPCs in flight</literal> - Number of read RPCs issued by the OSC, but
450 not complete at the time of the snapshot. This value should always be less than or equal
451 to <literal>max_rpcs_in_flight</literal>.</para>
454 <para><literal>write RPCs in flight</literal> - Number of write RPCs issued by the OSC,
455 but not complete at the time of the snapshot. This value should always be less than or
456 equal to <literal>max_rpcs_in_flight</literal>.</para>
459 <para><literal>dio read RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
460 read RPCs issued but not completed at the time of the snapshot.</para>
463 <para><literal>dio write RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
464 write RPCs issued but not completed at the time of the snapshot.</para>
467 <para><literal>pending write pages</literal> - Number of pending write pages that have
468 been queued for I/O in the OSC.</para>
471 <para><literal>pending read pages</literal> - Number of pending read pages that have been
472 queued for I/O in the OSC.</para>
475 <para>The tabular data is described in the table below. Each row in the table shows the number
476 of reads or writes (<literal>rpcs</literal>) occurring for the statistic, the relative
477 percentage (<literal>%</literal>) of total reads or writes, and the cumulative percentage
478 (<literal>cum %</literal>) to that point in the table for the statistic.</para>
479 <informaltable frame="all">
481 <colspec colname="c1" colwidth="40*"/>
482 <colspec colname="c2" colwidth="60*"/>
486 <para><emphasis role="bold">Field</emphasis></para>
489 <para><emphasis role="bold">Description</emphasis></para>
496 <para> pages per RPC</para>
499 <para>Shows cumulative RPC reads and writes organized according to the number of
500 pages in the RPC. A single page RPC increments the <literal>0:</literal>
506 <para> RPCs in flight</para>
509 <para> Shows the number of RPCs that are pending when an RPC is sent. When the first
510 RPC is sent, the <literal>0:</literal> row is incremented. If the first RPC is
511 sent while another RPC is pending, the <literal>1:</literal> row is incremented
520 <para> The page index of the first page read from or written to the object by the
527 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
528 <para>This table provides a way to visualize the concurrency of the RPC stream. Ideally, you
529 will see a large clump around the <literal>max_rpcs_in_flight</literal> value, which shows
530 that the network is being kept busy.</para>
531 <para>For information about optimizing the client I/O RPC stream, see <xref
532 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="TuningClientIORPCStream"/>.</para>
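<para>Because the histogram accumulates since it was last cleared, it is
often useful to reset it immediately before a measurement by writing any
value into the file, for example:
<screen>client# lctl set_param osc.testfs-OST0000*.rpc_stats=0</screen></para>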
534 <section xml:id="lustreproc.clientstats" remap="h3">
536 <primary>proc</primary>
537 <secondary>client stats</secondary>
538 </indexterm>Monitoring Client Activity</title>
539 <para>The <literal>stats</literal> file maintains statistics accumulated during typical
540 operation of a client across the VFS interface of the Lustre file system. Only non-zero
541 parameters are displayed in the file. </para>
542 <para>Client statistics are enabled by default.</para>
544 <para>Statistics for all mounted file systems can be discovered by
545 entering:<screen>lctl get_param llite.*.stats</screen></para>
547 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
548 <screen>client# lctl get_param llite.*.stats
549 snapshot_time 1308343279.169704 secs.usecs
550 dirty_pages_hits 14819716 samples [regs]
551 dirty_pages_misses 81473472 samples [regs]
552 read_bytes 36502963 samples [bytes] 1 26843582 55488794
553 write_bytes 22985001 samples [bytes] 0 125912 3379002
554 brw_read 2279 samples [pages] 1 1 2270
555 ioctl 186749 samples [regs]
556 open 3304805 samples [regs]
557 close 3331323 samples [regs]
558 seek 48222475 samples [regs]
559 fsync 963 samples [regs]
560 truncate 9073 samples [regs]
561 setxattr 19059 samples [regs]
562 getxattr 61169 samples [regs]
564 <para> The statistics can be cleared by echoing an empty string into the
565 <literal>stats</literal> file or by using the command:
566 <screen>lctl set_param llite.*.stats=0</screen></para>
567 <para>The statistics displayed are described in the table below.</para>
568 <informaltable frame="all">
570 <colspec colname="c1" colwidth="3*"/>
571 <colspec colname="c2" colwidth="7*"/>
575 <para><emphasis role="bold">Entry</emphasis></para>
578 <para><emphasis role="bold">Description</emphasis></para>
586 <literal>snapshot_time</literal></para>
589 <para>UNIX epoch instant the stats file was read.</para>
595 <literal>dirty_pages_hits</literal></para>
598 <para>The number of write operations that have been satisfied by the dirty page
599 cache. See <xref xmlns:xlink="http://www.w3.org/1999/xlink"
600 linkend="TuningClientIORPCStream"/> for more information about dirty cache
601 behavior in a Lustre file system.</para>
607 <literal>dirty_pages_misses</literal></para>
610 <para>The number of write operations that were not satisfied by the dirty page
617 <literal>read_bytes</literal></para>
620 <para>The number of read operations that have occurred. Three additional parameters
621 are displayed:</para>
626 <para>The minimum number of bytes read in a single request since the counter
633 <para>The maximum number of bytes read in a single request since the counter
640 <para>The accumulated sum of bytes of all read requests since the counter was
650 <literal>write_bytes</literal></para>
653 <para>The number of write operations that have occurred. Three additional parameters
654 are displayed:</para>
659 <para>The minimum number of bytes written in a single request since the
660 counter was reset.</para>
666 <para>The maximum number of bytes written in a single request since the
667 counter was reset.</para>
673 <para>The accumulated sum of bytes of all write requests since the counter was
683 <literal>brw_read</literal></para>
686 <para>The number of pages that have been read. Three additional parameters are
692 <para>The minimum number of bytes read in a single block read/write
693 (<literal>brw</literal>) read request since the counter was reset.</para>
699 <para>The maximum number of bytes read in a single <literal>brw</literal> read
700 request since the counter was reset.</para>
706 <para>The accumulated sum of bytes of all <literal>brw</literal> read requests
707 since the counter was reset.</para>
716 <literal>ioctl</literal></para>
719 <para>The number of combined file and directory <literal>ioctl</literal>
726 <literal>open</literal></para>
729 <para>The number of open operations that have succeeded.</para>
735 <literal>close</literal></para>
738 <para>The number of close operations that have succeeded.</para>
744 <literal>seek</literal></para>
747 <para>The number of times <literal>seek</literal> has been called.</para>
753 <literal>fsync</literal></para>
756 <para>The number of times <literal>fsync</literal> has been called.</para>
762 <literal>truncate</literal></para>
765 <para>The total number of calls to both locked and lockless
766 <literal>truncate</literal>.</para>
772 <literal>setxattr</literal></para>
775 <para>The number of times extended attributes have been set. </para>
781 <literal>getxattr</literal></para>
784 <para>The number of times value(s) of extended attributes have been fetched.</para>
790 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
791 <para>Information is provided about the amount and type of I/O activity taking place on the
796 <primary>proc</primary>
797 <secondary>read/write survey</secondary>
798 </indexterm>Monitoring Client Read-Write Offset Statistics</title>
799 <para>When the <literal>offset_stats</literal> parameter is set, statistics are maintained for
800 occurrences of a series of read or write calls from a process that did not access the next
801 sequential location. The <literal>OFFSET</literal> field is reset to 0 (zero) whenever a
802 different file is read or written.</para>
804 <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
805 <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
806 to reduce monitoring overhead when this information is not needed. The collection of
807 statistics in all three of these files is activated by writing
808 anything, except for 0 (zero) and "disable", into any one of the
811 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
812 <screen># lctl get_param llite.testfs-f57dee0.offset_stats
813 snapshot_time: 1155748884.591028 (secs.usecs)
814 RANGE RANGE SMALLEST LARGEST
815 R/W PID START END EXTENT EXTENT OFFSET
816 R 8385 0 128 128 128 0
817 R 8385 0 224 224 224 -128
818 W 8385 0 250 50 100 0
819 W 8385 100 1110 10 500 -150
820 W 8384 0 5233 5233 5233 0
821 R 8385 500 600 100 100 -610</screen>
822 <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file was
823 read. The tabular data is described in the table below.</para>
824 <para>The <literal>offset_stats</literal> file can be cleared by
825 entering:<screen>lctl set_param llite.*.offset_stats=0</screen></para>
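<para>Note that because collection is disabled by default, it must first be
enabled before output such as the example above can be gathered, for
example by entering:
<screen>client# lctl set_param llite.*.offset_stats=1</screen></para>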
826 <informaltable frame="all">
828 <colspec colname="c1" colwidth="50*"/>
829 <colspec colname="c2" colwidth="50*"/>
833 <para><emphasis role="bold">Field</emphasis></para>
836 <para><emphasis role="bold">Description</emphasis></para>
846 <para>Indicates whether the non-sequential call was a read or write.</para>
854 <para>Process ID of the process that made the read/write call.</para>
859 <para>RANGE START/RANGE END</para>
862 <para>Range in which the read/write calls were sequential.</para>
867 <para>SMALLEST EXTENT </para>
870 <para>Smallest single read/write in the corresponding range (in bytes).</para>
875 <para>LARGEST EXTENT </para>
878 <para>Largest single read/write in the corresponding range (in bytes).</para>
886 <para>Difference between the previous range end and the current range start.</para>
892 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
893 <para>This data provides an indication of how contiguous or fragmented the data is. For
894 example, the fourth entry in the example above shows the writes for this PID were sequential
895 in the range 100 to 1110 with the minimum write 10 bytes and the maximum write 500 bytes.
896 The range started with an offset of -150 from the <literal>RANGE END</literal> of the
897 previous entry in the example.</para>
901 <primary>proc</primary>
902 <secondary>read/write survey</secondary>
903 </indexterm>Monitoring Client Read-Write Extent Statistics</title>
904 <para>For in-depth troubleshooting, client read-write extent statistics can be accessed to
905 obtain more detail about read/write I/O extents for the file system or for a particular
908 <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
909 <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
910 to reduce monitoring overhead when this information is not needed. The collection of
911 statistics in all three of these files is activated by writing
912 anything, except for 0 (zero) and "disable", into any one of the
916 <title>Client-Based I/O Extent Size Survey</title>
917 <para>The <literal>extents_stats</literal> histogram in the
918 <literal>llite</literal> directory shows the statistics for the sizes
919 of the read/write I/O extents. This file does not maintain the
920 per-process statistics.</para>
921 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
922 <screen># lctl get_param llite.testfs-*.extents_stats
923 snapshot_time: 1213828728.348516 (secs.usecs)
925 extents calls % cum% | calls % cum%
927 0K - 4K : 0 0 0 | 2 2 2
928 4K - 8K : 0 0 0 | 0 0 2
929 8K - 16K : 0 0 0 | 0 0 2
930 16K - 32K : 0 0 0 | 20 23 26
931 32K - 64K : 0 0 0 | 0 0 26
932 64K - 128K : 0 0 0 | 51 60 86
933 128K - 256K : 0 0 0 | 0 0 86
934 256K - 512K : 0 0 0 | 0 0 86
935 512K - 1024K : 0 0 0 | 0 0 86
936 1M - 2M : 0 0 0 | 11 13 100</screen>
937 <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file
938 was read. The table shows cumulative extents organized according to size with statistics
939 provided separately for reads and writes. Each row in the table shows the number of RPCs
940 for reads and writes respectively (<literal>calls</literal>), the relative percentage of
941 total calls (<literal>%</literal>), and the cumulative percentage to
942 that point in the table of calls (<literal>cum %</literal>). </para>
943 <para> The file can be cleared by issuing the following command:
944 <screen># lctl set_param llite.testfs-*.extents_stats=1</screen></para>
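<para>A typical workflow, sketched here with an illustrative
<literal>dd</literal> run and mount point, is to reset the histogram, run
the workload of interest, and then read the file:
<screen>client# lctl set_param llite.testfs-*.extents_stats=1
client# dd if=/dev/zero of=/mnt/testfs/testfile bs=1M count=100
client# lctl get_param llite.testfs-*.extents_stats</screen></para>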
947 <title>Per-Process Client I/O Statistics</title>
948 <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size
949 statistics on a per-process basis.</para>
950 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
951 <screen># lctl get_param llite.testfs-*.extents_stats_per_process
952 snapshot_time: 1213828762.204440 (secs.usecs)
954 extents calls % cum% | calls % cum%
957 0K - 4K : 0 0 0 | 0 0 0
958 4K - 8K : 0 0 0 | 0 0 0
959 8K - 16K : 0 0 0 | 0 0 0
960 16K - 32K : 0 0 0 | 0 0 0
961 32K - 64K : 0 0 0 | 0 0 0
962 64K - 128K : 0 0 0 | 0 0 0
963 128K - 256K : 0 0 0 | 0 0 0
964 256K - 512K : 0 0 0 | 0 0 0
965 512K - 1024K : 0 0 0 | 0 0 0
966 1M - 2M : 0 0 0 | 10 100 100
969 0K - 4K : 0 0 0 | 0 0 0
970 4K - 8K : 0 0 0 | 0 0 0
971 8K - 16K : 0 0 0 | 0 0 0
972 16K - 32K : 0 0 0 | 20 100 100
975 0K - 4K : 0 0 0 | 0 0 0
976 4K - 8K : 0 0 0 | 0 0 0
977 8K - 16K : 0 0 0 | 0 0 0
978 16K - 32K : 0 0 0 | 0 0 0
979 32K - 64K : 0 0 0 | 0 0 0
980 64K - 128K : 0 0 0 | 16 100 100
983 0K - 4K : 0 0 0 | 1 100 100
986 0K - 4K : 0 0 0 | 1 100 100
989 <para>This table shows cumulative extents organized according to size for each process ID
990 (PID) with statistics provided separately for reads and writes. Each row in the table
991 shows the number of RPCs for reads and writes respectively (<literal>calls</literal>), the
992 relative percentage of total calls (<literal>%</literal>), and the cumulative percentage
993 to that point in the table of calls (<literal>cum %</literal>). </para>
996 <section xml:id="monitor_ost_block_io_stream">
998 <primary>proc</primary>
999 <secondary>block I/O</secondary>
1000 </indexterm>Monitoring the OST Block I/O Stream</title>
1001 <para>The <literal>brw_stats</literal> parameter file below the
1002 <literal>osd-ldiskfs</literal> or <literal>osd-zfs</literal> directory
1003 contains histogram data showing statistics for the number of I/O requests
1004 sent to the disk, their size, and whether they are contiguous on the
1006 <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
1007 <para>Enter on the OSS or MDS:</para>
1008 <screen>oss# lctl get_param osd-*.*.brw_stats
1009 snapshot_time: 1372775039.769045 (secs.usecs)
1011 pages per bulk r/w rpcs % cum % | rpcs % cum %
1012 1: 108 100 100 | 39 0 0
1017 32: 0 0 100 | 17 0 0
1018 64: 0 0 100 | 12 0 0
1019 128: 0 0 100 | 24 0 0
1020 256: 0 0 100 | 23142 99 100
1023 discontiguous pages rpcs % cum % | rpcs % cum %
1024 0: 108 100 100 | 23245 100 100
1027 discontiguous blocks rpcs % cum % | rpcs % cum %
1028 0: 108 100 100 | 23243 99 99
1029 1: 0 0 100 | 2 0 100
1032 disk fragmented I/Os ios % cum % | ios % cum %
1034 1: 14 12 100 | 23243 99 99
1035 2: 0 0 100 | 2 0 100
1038 disk I/Os in flight ios % cum % | ios % cum %
1039 1: 14 100 100 | 20896 89 89
1040 2: 0 0 100 | 1071 4 94
1041 3: 0 0 100 | 573 2 96
1042 4: 0 0 100 | 300 1 98
1043 5: 0 0 100 | 166 0 98
1044 6: 0 0 100 | 108 0 99
1045 7: 0 0 100 | 81 0 99
1046 8: 0 0 100 | 47 0 99
1047 9: 0 0 100 | 5 0 100
1050 I/O time (1/1000s) ios % cum % | ios % cum %
1053 4: 14 12 100 | 27 0 0
1055 16: 0 0 100 | 31 0 0
1056 32: 0 0 100 | 38 0 0
1057 64: 0 0 100 | 18979 81 82
1058 128: 0 0 100 | 943 4 86
1059 256: 0 0 100 | 1233 5 91
1060 512: 0 0 100 | 1825 7 99
1061 1K: 0 0 100 | 99 0 99
1062 2K: 0 0 100 | 0 0 99
1063 4K: 0 0 100 | 0 0 99
1064 8K: 0 0 100 | 49 0 100
1067 disk I/O size ios % cum % | ios % cum %
1068 4K: 14 100 100 | 41 0 0
1070 16K: 0 0 100 | 1 0 0
1071 32K: 0 0 100 | 0 0 0
1072 64K: 0 0 100 | 4 0 0
1073 128K: 0 0 100 | 17 0 0
1074 256K: 0 0 100 | 12 0 0
1075 512K: 0 0 100 | 24 0 0
1076 1M: 0 0 100 | 23142 99 100
1078 <para>The tabular data is described in the table below. Each row in the
1079 table shows the number of reads and writes occurring for the statistic
1080 (<literal>ios</literal>), the relative percentage of total reads or
1081 writes (<literal>%</literal>), and the cumulative percentage to that
1082 point in the table for the statistic (<literal>cum %</literal>). </para>
1083 <informaltable frame="all">
1085 <colspec colname="c1" colwidth="40*"/>
1086 <colspec colname="c2" colwidth="60*"/>
1090 <para><emphasis role="bold">Field</emphasis></para>
1093 <para><emphasis role="bold">Description</emphasis></para>
1101 <literal>pages per bulk r/w</literal></para>
1104 <para>Number of pages per RPC request, which should match aggregate client
1105 <literal>rpc_stats</literal> (see <xref
1106 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"
1113 <literal>discontiguous pages</literal></para>
1116 <para>Number of discontinuities in the logical file offset of each page in a single
1123 <literal>discontiguous blocks</literal></para>
1126 <para>Number of discontinuities in the physical block allocation in the file system
1127 for a single RPC.</para>
1132 <para><literal>disk fragmented I/Os</literal></para>
1135 <para>Number of I/Os that were not written entirely sequentially.</para>
1140 <para><literal>disk I/Os in flight</literal></para>
1143 <para>Number of disk I/Os currently pending.</para>
1148 <para><literal>I/O time (1/1000s)</literal></para>
1151 <para>Amount of time for each I/O operation to complete.</para>
1156 <para><literal>disk I/O size</literal></para>
1159 <para>Size of each I/O operation.</para>
1165 <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
1166 <para>This data provides an indication of extent size and distribution in the file
1171 <title>Tuning Lustre File System I/O</title>
1172 <para>Each OSC has its own tree of tunables. For example:</para>
1173 <screen>$ lctl list_param osc.*.*
1174 osc.myth-OST0000-osc-ffff8804296c2800.active
1175 osc.myth-OST0000-osc-ffff8804296c2800.blocksize
1176 osc.myth-OST0000-osc-ffff8804296c2800.checksum_dump
1177 osc.myth-OST0000-osc-ffff8804296c2800.checksum_type
1178 osc.myth-OST0000-osc-ffff8804296c2800.checksums
1179 osc.myth-OST0000-osc-ffff8804296c2800.connect_flags
1182 osc.myth-OST0000-osc-ffff8804296c2800.state
1183 osc.myth-OST0000-osc-ffff8804296c2800.stats
1184 osc.myth-OST0000-osc-ffff8804296c2800.timeouts
1185 osc.myth-OST0000-osc-ffff8804296c2800.unstable_stats
1186 osc.myth-OST0000-osc-ffff8804296c2800.uuid
1187 osc.myth-OST0001-osc-ffff8804296c2800.active
1188 osc.myth-OST0001-osc-ffff8804296c2800.blocksize
1189 osc.myth-OST0001-osc-ffff8804296c2800.checksum_dump
1190 osc.myth-OST0001-osc-ffff8804296c2800.checksum_type
1194 <para>The following sections describe some of the parameters that can
1195 be tuned in a Lustre file system.</para>
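<para>Several of these parameters can be queried together by listing them
in a single <literal>lctl</literal> command, for example (the values
returned will vary by client configuration):
<screen>client$ lctl get_param osc.testfs-*.max_rpcs_in_flight osc.testfs-*.max_dirty_mb</screen></para>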
1196 <section remap="h3" xml:id="TuningClientIORPCStream">
1198 <primary>proc</primary>
1199 <secondary>RPC tunables</secondary>
1200 </indexterm>Tuning the Client I/O RPC Stream</title>
1201 <para>Ideally, an optimal amount of data is packed into each I/O RPC
1202 and a consistent number of issued RPCs are in progress at any time.
1203 To help optimize the client I/O RPC stream, several tuning variables
1204 are provided to adjust behavior according to network conditions and
1205 cluster size. For information about monitoring the client I/O RPC
1207 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"/>.</para>
1208 <para>RPC stream tunables include:</para>
1212 <para><literal>osc.<replaceable>osc_instance</replaceable>.checksums</literal>
1213 - Controls whether the client will calculate data integrity
1214 checksums for the bulk data transferred to the OST. Data
1215 integrity checksums are enabled by default. The algorithm used
1216 can be set using the <literal>checksum_type</literal> parameter.
1220 <para><literal>osc.<replaceable>osc_instance</replaceable>.checksum_type</literal>
1221 - Controls the data integrity checksum algorithm used by the
1222 client. The available algorithms are determined by the set of
1223 algorithms supported by both the client and the OST. The checksum algorithm used by default is determined
1224 by first selecting the fastest algorithms available on the OST,
1225 and then selecting the fastest of those algorithms on the client,
1226 which depends on available optimizations in the CPU hardware and
1227 kernel. The default algorithm can be overridden by writing the
1228 algorithm name into the <literal>checksum_type</literal>
1229 parameter. Available checksum types can be seen on the client by
1230 reading the <literal>checksum_type</literal> parameter. Currently
1231 supported checksum types are:
1232 <literal>adler</literal>,
1233 <literal>crc32</literal>,
1234 <literal>crc32c</literal>
1236 <para condition="l2C">
1237 In Lustre release 2.12 additional checksum types were added to
1238 allow end-to-end checksum integration with T10-PI capable
1239 hardware. The client will compute the appropriate checksum
1240 type, based on the checksum type used by the storage, for the
1241 RPC checksum, which will be verified by the server and passed
1242 on to the storage. The T10-PI checksum types are:
1243 <literal>t10ip512</literal>,
1244 <literal>t10ip4K</literal>,
1245 <literal>t10crc512</literal>,
1246 <literal>t10crc4K</literal>
1250 <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal>
1251 - Controls how many MiB of dirty data can be written into the
1252 client pagecache for writes by <emphasis>each</emphasis> OSC.
1253 When this limit is reached, additional writes block until
1254 previously-cached data is written to the server. This may be
1255 changed by the <literal>lctl set_param</literal> command. Only
1256 values larger than 0 and smaller than the lesser of 2048 MiB or
1257 1/4 of client RAM are valid. Performance can suffer if the
1258 client cannot aggregate enough data per OSC to form a full RPC
1259 (as set by the <literal>max_pages_per_rpc</literal> parameter),
1260 unless the application is doing very large writes itself.
1262 <para>To maximize performance, the value for
1263 <literal>max_dirty_mb</literal> is recommended to be at least
1264 4 * <literal>max_pages_per_rpc</literal> *
1265 <literal>max_rpcs_in_flight</literal>. For example, with the default
4 MiB RPC size and 8 RPCs in flight, this works out to at least
128 MiB per OSC.
1269 <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal>
1270 - A read-only value that returns the current number of bytes
1271 written and cached by this OSC.
1275 <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal>
1276 - The maximum number of pages that will be sent in a single RPC
1277 request to the OST. The minimum value is one page and the maximum
1278 value is 16 MiB (4096 on systems with <literal>PAGE_SIZE</literal>
1279 of 4 KiB), with the default value of 4 MiB in one RPC. The upper
1280 limit may also be constrained by <literal>ofd.*.brw_size</literal>
1281 setting on the OSS, and applies to all clients connected to that
1282 OST. It is also possible to specify a units suffix (e.g.
1283 <literal>max_pages_per_rpc=4M</literal>), so the RPC size can be
1284 set independently of the client <literal>PAGE_SIZE</literal>.
1288 <para><literal>osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight</literal>
1289 - The maximum number of concurrent RPCs in flight from an OSC to
1290 its OST. If the OSC tries to initiate an RPC but finds that it
1291 already has the same number of RPCs outstanding, it will wait to
1292 issue further RPCs until some complete. The minimum setting is 1
1293 and maximum setting is 256. The default value is 8 RPCs.
1295 <para>To improve small file I/O performance, increase the
1296 <literal>max_rpcs_in_flight</literal> value.
1300 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_cached_mb</literal>
1301 - Maximum amount of read+write data cached by the client. The
1302 default value is 1/2 of the client RAM.
1308 <para>The values for <literal><replaceable>osc_instance</replaceable></literal>
1309 and <literal><replaceable>fsname_instance</replaceable></literal>
1310 are unique to each mount point to allow associating osc, mdc, lov,
1311 lmv, and llite parameters with the same mount point. However, it is
1312 common for scripts to use a wildcard <literal>*</literal> or a
1313 filesystem-specific wildcard
1314 <literal><replaceable>fsname</replaceable>-*</literal> to specify
1315 the parameter settings uniformly on all clients. For example:
1317 client$ lctl get_param osc.testfs-OST0000*.rpc_stats
1318 osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats=
1319 snapshot_time: 1375743284.337839 (secs.usecs)
1320 read RPCs in flight: 0
1321 write RPCs in flight: 0
1325 <section remap="h3" xml:id="TuningClientReadahead">
1327 <primary>proc</primary>
1328 <secondary>readahead</secondary>
1329 </indexterm>Tuning File Readahead and Directory Statahead</title>
1330 <para>File readahead and directory statahead enable reading of data
1331 into memory before a process requests the data. File readahead prefetches
1332 file content data into memory for <literal>read()</literal> related
1333 calls, while directory statahead fetches file metadata into memory for
1334 <literal>readdir()</literal> and <literal>stat()</literal> related
1335 calls. When readahead and statahead work well, a process that accesses
1336 data finds that the information it needs is available immediately in
1337 memory on the client when requested without the delay of network I/O.
1339 <section remap="h4">
1340 <title>Tuning File Readahead</title>
1341 <para>File readahead is triggered when two or more sequential reads
1342 by an application fail to be satisfied by data in the Linux buffer
1343 cache. The size of the initial readahead is determined by the RPC
1344 size and the file stripe size, but will typically be at least 1 MiB.
1345 Additional readaheads grow linearly and increment until the per-file
1346 or per-system readahead cache limit on the client is reached.</para>
1347 <para>Readahead tunables include:</para>
1350 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_mb</literal>
1351 - Controls the maximum amount of data readahead on all files.
1352 Files are read ahead in RPC-sized chunks (4 MiB, or the size of
1353 the <literal>read()</literal> call, if larger) after the second
1354 sequential read on a file descriptor. Random reads are done at
1355 the size of the <literal>read()</literal> call only (no
1356 readahead). Reads to non-contiguous regions of the file reset
1357 the readahead algorithm, and readahead is not triggered until
1358 sequential reads take place again.
1361 This is the global limit for all files and cannot be larger than
1362 1/2 of the client RAM. To disable readahead, set
1363 <literal>max_read_ahead_mb=0</literal>.
1367 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_per_file_mb</literal>
1368 - Controls the maximum number of megabytes (MiB) of data that
1369 should be prefetched by the client when sequential reads are
1370 detected on one file. This is the per-file readahead limit and
1371 cannot be larger than <literal>max_read_ahead_mb</literal>.
1375 <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_whole_mb</literal>
1376 - Controls the maximum size of a file in MiB that is read in its
1377 entirety upon access, regardless of the size of the
1378 <literal>read()</literal> call. This avoids multiple small read
1379 RPCs on relatively small files, when it is not possible to
1380 efficiently detect a sequential read pattern before the whole
1383 <para>The default value is the greater of 2 MiB or the size of one
1384 RPC, as given by <literal>max_pages_per_rpc</literal>.
1390 <title>Tuning Directory Statahead and AGL</title>
1391 <para>Many system commands, such as <literal>ls -l</literal>,
1392 <literal>du</literal>, and <literal>find</literal>, traverse a
1393 directory sequentially. To make these commands run efficiently, the
1394 directory statahead can be enabled to improve the performance of
1395 directory traversal.</para>
1396 <para>The statahead tunables are:</para>
1399 <para><literal>statahead_max</literal> -
1400 Controls the maximum number of file attributes that will be
1401 prefetched by the statahead thread. By default, statahead is
1402 enabled and <literal>statahead_max</literal> is 32 files.</para>
1403 <para>To disable statahead, set <literal>statahead_max</literal>
1404 to zero via the following command on the client:</para>
1405 <screen>lctl set_param llite.*.statahead_max=0</screen>
1406 <para>To change the maximum statahead window size on a client:</para>
1407 <screen>lctl set_param llite.*.statahead_max=<replaceable>n</replaceable></screen>
1408 <para>The maximum <literal>statahead_max</literal> is 8192 files.
1410 <para>The directory statahead thread will also prefetch the file
1411 size/block attributes from the OSTs, so that all file attributes
1412 are available on the client when requested by an application.
1413 This is controlled by the asynchronous glimpse lock (AGL) setting.
1414 The AGL behaviour can be disabled by setting:</para>
1415 <screen>lctl set_param llite.*.statahead_agl=0</screen>
1418 <para><literal>statahead_stats</literal> -
1419 A read-only interface that provides current statahead and AGL
1420 statistics, such as how many times statahead/AGL has been triggered
1421 since the last mount, how many statahead/AGL failures have occurred
1422 due to an incorrect prediction or other causes.</para>
1424 <para>AGL behaviour is affected by statahead since the inodes
1425 processed by AGL are built by the statahead thread. If
1426 statahead is disabled, then AGL is also disabled.</para>
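<para>The current statahead and AGL counters can be inspected on a client
with, for example:
<screen>client# lctl get_param llite.*.statahead_stats</screen></para>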
1432 <section remap="h3">
1434 <primary>proc</primary>
1435 <secondary>read cache</secondary>
1436 </indexterm>Tuning Server Read Cache</title>
1437 <para>The server read cache feature provides read-only caching of file
1438 data on an OSS or MDS (for Data-on-MDT). This functionality uses the
1439 Linux page cache to store the data and uses as much physical memory
1440 as is allocated.</para>
1441 <para>The server read cache can improve Lustre file system performance
1442 in these situations:</para>
1445 <para>Many clients are accessing the same data set (as in HPC
1446 applications or when diskless clients boot from the Lustre file
1450 <para>One client is writing data while another client is reading
1451 it (i.e., clients are exchanging data via the filesystem).</para>
1454 <para>A client has very limited caching of its own.</para>
1457 <para>The server read cache offers these benefits:</para>
1460 <para>Allows servers to cache read data more frequently.</para>
1463 <para>Improves repeated reads to match network speeds instead of
1464 storage speeds.</para>
1467 <para>Provides the building blocks for server write cache
1468 (small-write aggregation).</para>
1471 <section remap="h4">
1472 <title>Using Server Read Cache</title>
1473 <para>The server read cache is implemented on the OSS and MDS, and does
1474 not require any special support on the client side. Since the server
1475 read cache uses the memory available in the Linux page cache, the
1476 appropriate amount of memory for the cache should be determined based
1477 on I/O patterns. If the data is mostly reads, then more cache is
1478 beneficial on the server than would be needed for mostly writes.
1480 <para>The server read cache is managed using the following tunables.
1481 Many tunables are available for both <literal>osd-ldiskfs</literal>
1482 and <literal>osd-zfs</literal>, but in some cases the implementation
1483 of <literal>osd-zfs</literal> prevents their use.</para>
1486 <para><literal>read_cache_enable</literal> - High-level control of
1487 whether data read from storage during a read request is kept in
1488 memory and available for later read requests for the same data,
1489 without having to re-read it from storage. By default, read cache
1490 is enabled (<literal>read_cache_enable=1</literal>) for HDD OSDs
1491 and automatically disabled for flash OSDs
1492 (<literal>nonrotational=1</literal>).
1493 The read cache cannot be disabled for <literal>osd-zfs</literal>,
1494 and as a result this parameter is unavailable for that backend.
1496 <para>When the server receives a read request from a client,
1497 it reads data from storage into its memory and sends the data
1498 to the client. If read cache is enabled for the target,
1499 and the RPC and object size also meet the other criteria below,
1500 this data may stay in memory after the client request has
1501 completed. If later read requests for the same data are received,
1502 if the data is still in cache the server skips reading it from
1503 storage. The cache is managed by the Linux kernel globally
1504 across all targets on that server so that the infrequently used
1505 cache pages are dropped from memory when the free memory is
1507 <para>If read cache is disabled
1508 (<literal>read_cache_enable=0</literal>), or the read or object
1509 is large enough that it will not benefit from caching, the server
1510 discards the data after the read request from the client is
1511 completed. For subsequent read requests the server again reads
1512 the data from storage.</para>
1513 <para>To disable read cache on all targets of a server, run:</para>
1515 oss1# lctl set_param osd-*.*.read_cache_enable=0
1517 <para>To re-enable read cache on one target, run:</para>
1519 oss1# lctl set_param osd-*.{target_name}.read_cache_enable=1
1521 <para>To check if read cache is enabled on targets on a server, run:
1524 oss1# lctl get_param osd-*.*.read_cache_enable
1528 <para><literal>writethrough_cache_enable</literal> - High-level
1529 control of whether data sent to the server as a write request is
1530 kept in the read cache and available for later reads, or if it is
1531 discarded when the write completes. By default, writethrough
1532 cache is enabled (<literal>writethrough_cache_enable=1</literal>)
1533 for HDD OSDs and automatically disabled for flash OSDs
1534 (<literal>nonrotational=1</literal>).
1535 The write cache cannot be disabled for <literal>osd-zfs</literal>,
1536 and as a result this parameter is unavailable for that backend.
1538 <para>When the server receives write requests from a client, it
1539 fetches data from the client into its memory and writes the data
1540 to storage. If the writethrough cache is enabled for the target,
1541 and the RPC and object size meet the other criteria below,
1542 this data may stay in memory after the write request has
1543 completed. If later read or partial-block write requests for this
1544 same data are received, if the data is still in cache the server
1545 skips reading it from storage.
1547 <para>If the writethrough cache is disabled
1548 (<literal>writethrough_cache_enabled=0</literal>), or the
1549 write or object is large enough that it will not benefit from
1550 caching, the server discards the data after the write request
1551 from the client is completed. For subsequent read requests, or
1552 partial-page write requests, the server must re-read the data
1553 from storage.</para>
1554 <para>Enabling writethrough cache is advisable if clients are doing
1555 small or unaligned writes that would cause partial-page updates,
1556 or if the files written by one node are immediately being read by
1557 other nodes. Some examples where enabling writethrough cache
1558 might be useful include producer-consumer I/O models or
1559 shared-file writes that are not aligned on 4096-byte boundaries.
1561 <para>Disabling the writethrough cache is advisable when files are
1562 mostly written to the file system but are not re-read within a
1563 short time period, or files are only written and re-read by the
1564 same node, regardless of whether the I/O is aligned or not.</para>
1565 <para>To disable writethrough cache on all targets on a server, run:
1568 oss1# lctl set_param osd-*.*.writethrough_cache_enable=0
1570 <para>To re-enable the writethrough cache on one OST, run:</para>
1572 oss1# lctl set_param osd-*.{OST_name}.writethrough_cache_enable=1
1574 <para>To check if the writethrough cache is enabled, run:</para>
1576 oss1# lctl get_param osd-*.*.writethrough_cache_enable
1580 <para><literal>readcache_max_filesize</literal> - Controls the
1581 maximum size of an object that both the read cache and
1582 writethrough cache will try to keep in memory. Objects larger
1583 than <literal>readcache_max_filesize</literal> will not be kept
1584 in cache for either reads or writes regardless of the
1585 <literal>read_cache_enable</literal> or
1586 <literal>writethrough_cache_enable</literal> settings.</para>
1587 <para>Setting this tunable can be useful for workloads where
1588 relatively small objects are repeatedly accessed by many clients,
1589 such as job startup objects, executables, log objects, etc., but
1590 large objects are read or written only once. By not putting the
1591 larger objects into the cache, it is much more likely that more
1592 of the smaller objects will remain in cache for a longer time.
1594 <para>When setting <literal>readcache_max_filesize</literal>,
1595 the input value can be specified in bytes, or can have a suffix
1596 to indicate other binary units such as
1597 <literal>K</literal> (kibibytes),
1598 <literal>M</literal> (mebibytes),
1599 <literal>G</literal> (gibibytes),
1600 <literal>T</literal> (tebibytes), or
1601 <literal>P</literal> (pebibytes).</para>
1603 To limit the maximum cached object size to 64 MiB on all OSTs of
1607 oss1# lctl set_param osd-*.*.readcache_max_filesize=64M
1609 <para>To disable the maximum cached object size on all targets, run:
1612 oss1# lctl set_param osd-*.*.readcache_max_filesize=-1
1615 To check the current maximum cached object size on all targets of
1619 oss1# lctl get_param osd-*.*.readcache_max_filesize
1623 <para><literal>readcache_max_io_mb</literal> - Controls the maximum
1624 size of a single read IO that will be cached in memory. Reads
1625 larger than <literal>readcache_max_io_mb</literal> will be read
1626 directly from storage and bypass the page cache completely.
1627 This avoids significant CPU overhead at high IO rates.
1628 The read cache cannot be disabled for <literal>osd-zfs</literal>,
1629 and as a result this parameter is unavailable for that backend.
1631 <para>When setting <literal>readcache_max_io_mb</literal>, the
1632 input value can be specified in mebibytes, or can have a suffix
1633 to indicate other binary units such as
1634 <literal>K</literal> (kibibytes),
1635 <literal>M</literal> (mebibytes),
1636 <literal>G</literal> (gibibytes),
1637 <literal>T</literal> (tebibytes), or
1638 <literal>P</literal> (pebibytes).</para>
1641 <para><literal>writethrough_max_io_mb</literal> - Controls the
1642 maximum size of a single write IO that will be cached in memory.
1643 Writes larger than <literal>writethrough_max_io_mb</literal> will
1644 be written directly to storage and bypass the page cache entirely.
1645 This avoids significant CPU overhead at high IO rates.
1646 The write cache cannot be disabled for <literal>osd-zfs</literal>,
1647 and as a result this parameter is unavailable for that backend.
1649 <para>When setting <literal>writethrough_max_io_mb</literal>, the
1650 input value can be specified in mebibytes, or can have a suffix
1651 to indicate other binary units such as
1652 <literal>K</literal> (kibibytes),
1653 <literal>M</literal> (mebibytes),
1654 <literal>G</literal> (gibibytes),
1655 <literal>T</literal> (tebibytes), or
1656 <literal>P</literal> (pebibytes).</para>
1663 <primary>proc</primary>
1664 <secondary>OSS journal</secondary>
1665 </indexterm>Enabling OSS Asynchronous Journal Commit</title>
1666 <para>The OSS asynchronous journal commit feature asynchronously writes data to disk without
1667 forcing a journal flush. This reduces the number of seeks and significantly improves
1668 performance on some hardware.</para>
1670 <para>Asynchronous journal commit cannot work with direct I/O-originated writes
1671 (<literal>O_DIRECT</literal> flag set). In this case, a journal flush is forced. </para>
1673 <para>When the asynchronous journal commit feature is enabled, client nodes keep data in the
1674 page cache (a page reference). Lustre clients monitor the last committed transaction number
1675 (<literal>transno</literal>) in messages sent from the OSS to the clients. When a client
1676 sees that the last committed <literal>transno</literal> reported by the OSS is at least
1677 equal to the bulk write <literal>transno</literal>, it releases the reference on the
1678 corresponding pages. To avoid page references being held for too long on clients after a
1679 bulk write, a 7 second ping request is scheduled (the default OSS file system commit time
1680 interval is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity
1681 to report the last committed <literal>transno</literal>.</para>
1682 <para>If the OSS crashes before the journal commit occurs, then intermediate data is lost.
1683 However, OSS recovery functionality incorporated into the asynchronous journal commit
1684 feature causes clients to replay their write requests and compensate for the missing disk
1685 updates by restoring the state of the file system.</para>
1686 <para>By default, <literal>sync_journal</literal> is enabled
1687 (<literal>sync_journal=1</literal>), so that journal entries are committed synchronously.
1688 To enable asynchronous journal commit, set the <literal>sync_journal</literal> parameter to
1689 <literal>0</literal> by entering: </para>
1690 <screen>$ lctl set_param obdfilter.*.sync_journal=0
1691 obdfilter.lol-OST0001.sync_journal=0</screen>
1692 <para>An associated <literal>sync-on-lock-cancel</literal> feature (enabled by default)
1693 addresses a data consistency issue that can result if an OSS crashes after multiple clients
1694 have written data into intersecting regions of an object, and then one of the clients also
1695 crashes. A condition is created in which the POSIX requirement for continuous writes is
1696 violated along with a potential for corrupted data. With
1697 <literal>sync-on-lock-cancel</literal> enabled, if a cancelled lock has any volatile
1698 writes attached to it, the OSS synchronously writes the journal to disk on lock
1699 cancellation. Disabling the <literal>sync-on-lock-cancel</literal> feature may enhance
1700 performance for concurrent write workloads, but it is recommended that you not disable this
1702 <para> The <literal>sync_on_lock_cancel</literal> parameter can be set to the following
1706 <para><literal>always</literal> - Always force a journal flush on lock cancellation
1707 (default when <literal>async_journal</literal> is enabled).</para>
1710 <para><literal>blocking</literal> - Force a journal flush only when the local cancellation
1711 is due to a blocking callback.</para>
1714 <para><literal>never</literal> - Do not force any journal flush (default when
1715 <literal>async_journal</literal> is disabled).</para>
<para>For example, to set <literal>sync_on_lock_cancel</literal> so that it does not force
a journal flush, use a command similar to:</para>
<screen>$ lctl set_param obdfilter.*.sync_on_lock_cancel=never
obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
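<para>The current setting can be verified on all targets with a command
similar to the following (the target name shown is an example):</para>
<screen>$ lctl get_param obdfilter.*.sync_on_lock_cancel
obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>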
1723 <section xml:id="TuningModRPCs" condition='l28'>
1726 <primary>proc</primary>
1727 <secondary>client metadata performance</secondary>
1729 Tuning the Client Metadata RPC Stream
1731 <para>The client metadata RPC stream represents the metadata RPCs issued
in parallel by a client to an MDT target. The metadata RPCs can be split
into two categories: the requests that do not modify the file system
(like the getattr operation), and the requests that do modify the file system
1735 (like create, unlink, setattr operations). To help optimize the client
1736 metadata RPC stream, several tuning variables are provided to adjust
1737 behavior according to network conditions and cluster size.</para>
1738 <para>Note that increasing the number of metadata RPCs issued in parallel
might improve the performance of metadata-intensive parallel applications,
1740 but as a consequence it will consume more memory on the client and on
1743 <title>Configuring the Client Metadata RPC Stream</title>
1744 <para>The MDC <literal>max_rpcs_in_flight</literal> parameter defines
1745 the maximum number of metadata RPCs, both modifying and
non-modifying RPCs, that can be sent in parallel by a client to an MDT
target. This includes all file system metadata operations, such as
file or directory stat, creation, and unlink. The default setting is 8,
1749 minimum setting is 1 and maximum setting is 256.</para>
1750 <para>To set the <literal>max_rpcs_in_flight</literal> parameter, run
1751 the following command on the Lustre client:</para>
1752 <screen>client$ lctl set_param mdc.*.max_rpcs_in_flight=16</screen>
1753 <para>The MDC <literal>max_mod_rpcs_in_flight</literal> parameter
1754 defines the maximum number of file system modifying RPCs that can be
sent in parallel by a client to an MDT target. For example, the Lustre
1756 client sends modify RPCs when it performs file or directory creation,
1757 unlink, access permission modification or ownership modification. The
1758 default setting is 7, minimum setting is 1 and maximum setting is
1760 <para>To set the <literal>max_mod_rpcs_in_flight</literal> parameter,
1761 run the following command on the Lustre client:</para>
1762 <screen>client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=12</screen>
1763 <para>The <literal>max_mod_rpcs_in_flight</literal> value must be
1764 strictly less than the <literal>max_rpcs_in_flight</literal> value.
It must also be less than or equal to the MDT
<literal>max_mod_rpcs_per_client</literal> value. If one of these
1767 conditions is not enforced, the setting fails and an explicit message
1768 is written in the Lustre log.</para>
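<para>For example, to raise both limits on a client while keeping the
modifying limit strictly below the total limit (the values shown are
illustrative only), run:</para>
<screen>client$ lctl set_param mdc.*.max_rpcs_in_flight=16
client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=15</screen>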
1769 <para>The MDT <literal>max_mod_rpcs_per_client</literal> parameter is a
1770 tunable of the kernel module <literal>mdt</literal> that defines the
1771 maximum number of file system modifying RPCs in flight allowed per
1772 client. The parameter can be updated at runtime, but the change is
effective only for new client connections. The default setting is 8.
1775 <para>To set the <literal>max_mod_rpcs_per_client</literal> parameter,
1776 run the following command on the MDS:</para>
1777 <screen>mds$ echo 12 > /sys/module/mdt/parameters/max_mod_rpcs_per_client</screen>
1780 <title>Monitoring the Client Metadata RPC Stream</title>
1781 <para>The <literal>rpc_stats</literal> file contains histogram data
1782 showing information about modify metadata RPCs. It can be helpful to
1783 identify the level of parallelism achieved by an application doing
1784 modify metadata operations.</para>
1785 <para><emphasis role="bold">Example:</emphasis></para>
1786 <screen>client$ lctl get_param mdc.*.rpc_stats
1787 snapshot_time: 1441876896.567070 (secs.usecs)
1788 modify_RPCs_in_flight: 0
1791 rpcs in flight rpcs % cum %
1804 12: 4540 18 100</screen>
1805 <para>The file information includes:</para>
1808 <para><literal>snapshot_time</literal> - UNIX epoch instant the
1809 file was read.</para>
1812 <para><literal>modify_RPCs_in_flight</literal> - Number of modify
1813 RPCs issued by the MDC, but not completed at the time of the
1814 snapshot. This value should always be less than or equal to
1815 <literal>max_mod_rpcs_in_flight</literal>.</para>
1818 <para><literal>rpcs in flight</literal> - Number of modify RPCs
that are pending when an RPC is sent, the relative percentage
1820 (<literal>%</literal>) of total modify RPCs, and the cumulative
1821 percentage (<literal>cum %</literal>) to that point.</para>
1824 <para>If a large proportion of modify metadata RPCs are issued with a
1825 number of pending metadata RPCs close to the
1826 <literal>max_mod_rpcs_in_flight</literal> value, it means the
1827 <literal>max_mod_rpcs_in_flight</literal> value could be increased to
1828 improve the modify metadata performance.</para>
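<para>In that case, the limit could be raised on the client, for
example as shown below (the values are illustrative only; as described
above, the client <literal>max_rpcs_in_flight</literal> and the MDT
<literal>max_mod_rpcs_per_client</literal> limits must also be large
enough):</para>
<screen>mds$ echo 16 > /sys/module/mdt/parameters/max_mod_rpcs_per_client
client$ lctl set_param mdc.*.max_rpcs_in_flight=17
client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=16</screen>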
1833 <title>Configuring Timeouts in a Lustre File System</title>
1834 <para>In a Lustre file system, RPC timeouts are set using an adaptive timeouts mechanism, which
1835 is enabled by default. Servers track RPC completion times and then report back to clients
1836 estimates for completion times for future RPCs. Clients use these estimates to set RPC
1837 timeout values. If the processing of server requests slows down for any reason, the server
1838 estimates for RPC completion increase, and clients then revise RPC timeout values to allow
1839 more time for RPC completion.</para>
1840 <para>If the RPCs queued on the server approach the RPC timeout specified by the client, to
1841 avoid RPC timeouts and disconnect/reconnect cycles, the server sends an "early reply" to the
1842 client, telling the client to allow more time. Conversely, as server processing speeds up, RPC
1843 timeout values decrease, resulting in faster detection if the server becomes non-responsive
1844 and quicker connection to the failover partner of the server.</para>
1847 <primary>proc</primary>
1848 <secondary>configuring adaptive timeouts</secondary>
1849 </indexterm><indexterm>
1850 <primary>configuring</primary>
1851 <secondary>adaptive timeouts</secondary>
1852 </indexterm><indexterm>
1853 <primary>proc</primary>
1854 <secondary>adaptive timeouts</secondary>
1855 </indexterm>Configuring Adaptive Timeouts</title>
1856 <para>The adaptive timeout parameters in the table below can be set
1857 persistently system-wide using <literal>lctl set_param -P</literal>
1858 on the MGS. For example, the following command sets the
1859 <literal>at_max</literal> value for all servers and clients
1860 associated with the file systems connected to this MGS:
1863 mgs# lctl set_param -P at_max=1500
1866 <para>Clients that access multiple Lustre file systems
1867 <emphasis>must</emphasis> use the same adaptive timeout values
1868 for all file systems.</para>
1870 <para condition="l2G">
1871 Since Lustre 2.16 it is preferred to set
<literal>at_max</literal> as a per-target tunable using the
<literal>*.<replaceable>fsname</replaceable>*.at_max</literal>
parameter instead of the global <literal>at_max</literal>
parameter. This avoids issues if a single client mounts two
separate filesystems with different <literal>at_max</literal>
1880 mgs# lctl set_param -P *.testfs-*.at_max=1500
1882 <informaltable frame="all">
1884 <colspec colname="c1" colwidth="30*"/>
1885 <colspec colname="c2" colwidth="80*"/>
1889 <para><emphasis role="bold">Parameter</emphasis></para>
1892 <para><emphasis role="bold">Description</emphasis></para>
1900 <literal> at_min </literal></para>
1903 <para>Minimum adaptive timeout (in seconds). The default value
1904 is 5 (since 2.16). The <literal>at_min</literal> parameter is
1905 the minimum processing time that a server will report.
1906 Ideally, <literal>at_min</literal> should be left at its
1907 default value. Clients base their timeouts on this value,
1908 but they do not use this value directly.
1910 <para>If, for some reason (usually due to temporary network
1911 outages or sudden spikes in load immediately after mount),
1912 the adaptive timeout value is too short and clients time
1913 out their RPCs, you can increase the <literal>at_min</literal>
1914 value to compensate for this.
1916 <para condition="l2G">
1917 Since Lustre 2.16 it is preferred to set
1918 <literal>at_min</literal> as a per-target tunable using the
1919 <literal>*.<replaceable>fsname</replaceable>*.at_min</literal>
1920 parameter instead of the global <literal>at_min</literal>
1921 parameter. This avoids issues if a single client mounts two
1922 separate filesystems with different <literal>at_min</literal>
1930 <literal> at_max </literal></para>
1933 <para>Maximum adaptive timeout (in seconds). The
1934 <literal>at_max</literal> parameter is an upper-limit on the
1935 service time estimate. If <literal>at_max</literal> is
1936 reached, an RPC request times out.</para>
1937 <para>Setting <literal>at_max</literal> to 0 causes adaptive
1938 timeouts to be disabled
1939 and a fixed timeout method to be used instead (see <xref
xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_c24_nt5_dl"/>).</para>
1941 <para condition="l2G">
1942 Since Lustre 2.16 it is preferred to set
1943 <literal>at_max</literal> as a per-target tunable using the
1944 <literal>*.<replaceable>fsname</replaceable>*.at_max</literal>
1945 parameter instead of the global <literal>at_max</literal>
1946 parameter. This avoids issues if a single client mounts two
1947 separate filesystems with different <literal>at_max</literal>
1951 <para>If slow hardware causes the service estimate to
1952 increase beyond the default <literal>at_max</literal> value,
1953 increase <literal>at_max</literal> to the maximum time you
1954 are willing to wait for an RPC completion.</para>
1961 <literal> at_history </literal></para>
1964 <para>Time period (in seconds) within which adaptive timeouts
1965 remember the slowest
1966 event that occurred. The default is 600.</para>
1967 <para condition="l2G">
1968 Since Lustre 2.16 it is preferred to set
1969 <literal>at_history</literal> as a per-target tunable using the
1970 <literal>*.<replaceable>fsname</replaceable>*.at_history</literal>
1971 parameter instead of the global <literal>at_history</literal>
1972 parameter. This avoids issues if a single client mounts two
1973 filesystems with different <literal>at_history</literal>
1981 <literal> at_early_margin </literal></para>
1984 <para>Amount of time before the Lustre server sends an early
1985 reply (in seconds). Default is 5.</para>
1991 <literal> at_extra </literal></para>
1994 <para>Incremental amount of time that a server requests with
1995 each early reply (in seconds). The server does not know how
1996 much time the RPC will take, so it asks for a fixed value.
1997 The default is 30, which provides a balance between sending
1998 too many early replies for the same RPC and overestimating
1999 the actual completion time.</para>
2000 <para>When a server finds a queued request about to time out
2001 and needs to send an early reply out, the server adds the
2002 <literal>at_extra</literal> value. If the time expires, the
2003 Lustre server drops the request, and the client enters
2004 recovery status and reconnects to restore the connection to
2005 normal status.</para>
2006 <para>If you see multiple early replies for the same RPC asking
2007 for 30-second increases, change <literal>at_extra</literal>
2008 to a larger number to cut down on early replies sent and,
2009 therefore, network load.</para>
2015 <literal> ldlm_enqueue_min </literal></para>
2018 <para>Minimum lock enqueue time (in seconds). The default is
100. The time it takes to enqueue a lock, shown as the
<literal>ldlm_enqueue</literal> operation in the stats files,
is the maximum of the measured enqueue estimate (influenced
by the <literal>at_min</literal> and <literal>at_max</literal>
parameters) multiplied by a weighting factor, and the value
of <literal>ldlm_enqueue_min</literal>.</para>
2025 <para>Lustre Distributed Lock Manager (LDLM) lock enqueues
2026 have a dedicated minimum <literal>ldlm_enqueue_min</literal>.
2027 Lock enqueue timeouts increase as the measured enqueue times
2028 increase (similar to adaptive timeouts).</para>
2029 <para condition="l2G">
2030 Since Lustre 2.16 it is preferred to set
2031 <literal>ldlm_enqueue_min</literal> as a per-target tunable with
2032 <literal>*.<replaceable>fsname</replaceable>*.ldlm_enqueue_min</literal>
2033 instead of the global <literal>ldlm_enqueue_min</literal>
2034 parameter. This avoids issues if a client mounts multiple
2035 filesystems with different <literal>ldlm_enqueue_min</literal>
2044 <title>Interpreting Adaptive Timeout Information</title>
2045 <para>Adaptive timeout information can be obtained via
2046 <literal>lctl get_param {osc,mdc}.*.timeouts</literal> files on each
2047 client and <literal>lctl get_param {ost,mds}.*.*.timeouts</literal>
2048 on each server. To read information from a
2049 <literal>timeouts</literal> file, enter a command similar to:</para>
2050 <screen># lctl get_param -n ost.*.ost_io.timeouts
2051 service : cur 33 worst 34 (at 1193427052, 1600s ago) 1 1 33 2</screen>
2052 <para>In this example, the <literal>ost_io</literal> service on this
2053 node is currently reporting an estimated RPC service time of 33
2054 seconds. The worst RPC service time was 34 seconds, which occurred
2055 26 minutes ago.</para>
2056 <para>The output also provides a history of service times.
2057 Four "bins" of adaptive timeout history are shown, with the
2058 maximum RPC time in each bin reported. In both the 0-150s bin and the
2059 150-300s bin, the maximum RPC time was 1. The 300-450s bin shows the
2060 worst (maximum) RPC time at 33 seconds, and the 450-600s bin shows a
maximum RPC time of 2 seconds. The estimated service time is the
2062 maximum value in the four bins (33 seconds in this example).</para>
2063 <para>Service times (as reported by the servers) are also tracked in
2064 the client OBDs, as shown in this example:</para>
2065 <screen># lctl get_param osc.*.timeouts
2066 last reply : 1193428639, 0d0h00m00s ago
2067 network : cur 1 worst 2 (at 1193427053, 0d0h26m26s ago) 1 1 1 1
2068 portal 6 : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33 2
2069 portal 28 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 1 1 1
2070 portal 7 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 0 1 1
2071 portal 17 : cur 1 worst 1 (at 1193426177, 0d0h41m02s ago) 1 0 0 1
2073 <para>In this example, portal 6, the <literal>ost_io</literal> service
2074 portal, shows the history of service estimates reported by the portal.
2076 <para>Server statistic files also show the range of estimates including
2077 min, max, sum, and sum-squared. For example:</para>
2078 <screen># lctl get_param mdt.*.mdt.stats
2080 req_timeout 6 samples [sec] 1 10 15 105
2085 <section xml:id="section_c24_nt5_dl">
2086 <title>Setting Static Timeouts<indexterm>
2087 <primary>proc</primary>
2088 <secondary>static timeouts</secondary>
2089 </indexterm></title>
2090 <para>The Lustre software provides two sets of static (fixed) timeouts, LND timeouts and
2091 Lustre timeouts, which are used when adaptive timeouts are not enabled.</para>
2095 <para><emphasis role="italic"><emphasis role="bold">LND timeouts</emphasis></emphasis> -
2096 LND timeouts ensure that point-to-point communications across a network complete in a
finite time in the presence of failures, such as lost packets or broken connections.
2098 LND timeout parameters are set for each individual LND.</para>
2099 <para>LND timeouts are logged with the <literal>S_LND</literal> flag set. They are not
2100 printed as console messages, so check the Lustre log for <literal>D_NETERROR</literal>
2101 messages or enable printing of <literal>D_NETERROR</literal> messages to the console
2102 using:<screen>lctl set_param printk=+neterror</screen></para>
2103 <para>Congested routers can be a source of spurious LND timeouts. To avoid this
2104 situation, increase the number of LNet router buffers to reduce back-pressure and/or
2105 increase LND timeouts on all nodes on all connected networks. Also consider increasing
2106 the total number of LNet router nodes in the system so that the aggregate router
2107 bandwidth matches the aggregate server bandwidth.</para>
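<para>As a sketch only, and assuming the standard LNet router buffer
module options (<literal>tiny_router_buffers</literal>,
<literal>small_router_buffers</literal>,
<literal>large_router_buffers</literal>) are available in the LNet
version in use, the buffer counts could be increased on the router
nodes with module options such as the following (the values are
illustrative and must be sized for the actual router hardware):</para>
<screen>options lnet tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=1024</screen>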
2110 <para><emphasis role="italic"><emphasis role="bold">Lustre timeouts
2111 </emphasis></emphasis>- Lustre timeouts ensure that Lustre RPCs
2112 complete in a finite time in the presence of failures when
2113 adaptive timeouts are not enabled. Adaptive timeouts are enabled
2114 by default. To disable adaptive timeouts at run time, set
2115 <literal>at_max</literal> to 0 by running on the MGS:
2117 # lctl conf_param <replaceable>fsname</replaceable>.sys.at_max=0
2121 <para>Changing the state of adaptive timeouts at runtime may
2122 cause transient client timeouts, recovery, and reconnection.</para>
2124 <para>Lustre timeouts are always printed as console messages.
2126 <para>If Lustre timeouts are not accompanied by LND timeouts,
2127 increase the Lustre timeout on both servers and clients. Lustre
2128 timeouts are set across the whole filesystem using a command
2129 such as the following:
2131 mgs# lctl set_param -P timeout=30
2134 <para>Timeout parameters are described in the table below.</para>
2137 <informaltable frame="all">
2139 <colspec colname="c1" colnum="1" colwidth="30*"/>
2140 <colspec colname="c2" colnum="2" colwidth="70*"/>
2143 <entry>Parameter</entry>
2144 <entry>Description</entry>
2149 <entry><literal>timeout</literal></entry>
2151 <para>The time that a client waits for a server to complete
2152 an RPC (default 100s). Servers wait half this time for a
2153 normal client RPC to complete and a quarter of this time
2154 for a single bulk request (read or write of up to 4 MB)
2155 to complete. The client pings recoverable targets (MDS
2156 and OSTs) at one quarter of the timeout, and the server
2157 waits one and a half times the timeout before evicting a
2158 client for being "stale."</para>
<para>A Lustre client sends periodic 'ping' messages
2160 to servers with which it has had no communication for the
2161 specified period of time. Any network activity between a
2162 client and a server in the file system also serves as a
2167 <entry><literal>ldlm_timeout</literal></entry>
2169 <para>The time that a server waits for a client to reply to
2170 an initial AST (lock cancellation request). The default
2171 is 20s for an OST and 6s for an MDS. If the client replies
2172 to the AST, the server will give it a normal timeout (half
2173 the client timeout) to flush any dirty data and release
2178 <entry><literal>fail_loc</literal></entry>
2180 <para>An internal debugging failure hook. The default value of
2181 <literal>0</literal> means that no failure will be triggered or
2186 <entry><literal>dump_on_timeout</literal></entry>
2188 <para>Triggers a dump of the Lustre debug log when a timeout
2189 occurs. The default value of <literal>0</literal> (zero)
2190 means a dump of the Lustre debug log will not be triggered.
2195 <entry><literal>dump_on_eviction</literal></entry>
2197 <para>Triggers a dump of the Lustre debug log when an
2198 eviction occurs. The default value of <literal>0</literal>
2199 (zero) means a dump of the Lustre debug log will
2200 not be triggered. </para>
2209 <section remap="h3">
2211 <primary>proc</primary>
2212 <secondary>LNet</secondary>
2213 </indexterm><indexterm>
2214 <primary>LNet</primary>
2215 <secondary>proc</secondary>
2216 </indexterm>Monitoring LNet</title>
<para>LNet information is available via <literal>lctl get_param</literal>
in these parameters:
2221 <para><literal>peers</literal> - Shows all NIDs known to this node
2222 and provides information on the queue state.</para>
2223 <para>Example:</para>
2224 <screen># lctl get_param peers
2225 nid refs state max rtr min tx min queue
2226 0@lo 1 ~rtr 0 0 0 0 0 0
2227 192.168.10.35@tcp 1 ~rtr 8 8 8 8 6 0
2228 192.168.10.36@tcp 1 ~rtr 8 8 8 8 6 0
2229 192.168.10.37@tcp 1 ~rtr 8 8 8 8 6 0</screen>
2230 <para>The fields are explained in the table below:</para>
2231 <informaltable frame="all">
2233 <colspec colname="c1" colwidth="30*"/>
2234 <colspec colname="c2" colwidth="80*"/>
2238 <para><emphasis role="bold">Field</emphasis></para>
2241 <para><emphasis role="bold">Description</emphasis></para>
2249 <literal>refs</literal>
2253 <para>A reference count. </para>
2259 <literal>state</literal>
2263 <para>If the node is a router, indicates the state of the router. Possible
2267 <para><literal>NA</literal> - Indicates the node is not a router.</para>
2270 <para><literal>up/down</literal>- Indicates if the node (router) is up or
2279 <literal>max </literal></para>
2282 <para>Maximum number of concurrent sends from this peer.</para>
2288 <literal>rtr </literal></para>
2291 <para>Number of available routing buffer credits.</para>
2297 <literal>min </literal></para>
2300 <para>Minimum number of routing buffer credits seen.</para>
2306 <literal>tx </literal></para>
2309 <para>Number of available send credits.</para>
2315 <literal>min </literal></para>
2318 <para>Minimum number of send credits seen.</para>
2324 <literal>queue </literal></para>
2327 <para>Total bytes in active/queued sends.</para>
2333 <para>Credits are initialized to allow a certain number of operations
(in the example above, eight, as shown in the
<literal>max</literal> column). LNet keeps track of the minimum
number of credits ever seen over time, showing the peak congestion
2337 that has occurred during the time monitored. Fewer available credits
2338 indicates a more congested resource. </para>
2339 <para>The number of credits currently available is shown in the
2340 <literal>tx</literal> column. The maximum number of send credits is
2341 shown in the <literal>max</literal> column and never changes. The
2342 number of currently active transmits can be derived by
2343 <literal>(max - tx)</literal>, as long as
2344 <literal>tx</literal> is greater than or equal to 0. Once
2345 <literal>tx</literal> is less than 0, it indicates the number of
2346 transmits on that peer which have been queued for lack of credits.
2348 <para>The number of router buffer credits available for consumption
2349 by a peer is shown in <literal>rtr</literal> column. The number of
2350 routing credits can be configured separately at the LND level or at
2351 the LNet level by using the <literal>peer_buffer_credits</literal>
module parameter for the appropriate module. If the routing credits
are not set explicitly, they default to the maximum transmit credits
defined by the <literal>peer_credits</literal> module parameter.
2355 Whenever a gateway routes a message from a peer, it decrements the
2356 number of available routing credits for that peer. If that value
2357 goes to zero, then messages will be queued. Negative values show the
number of queued messages waiting to be routed. The number of
2359 messages which are currently being routed from a peer can be derived
2360 by <literal>(max_rtr_credits - rtr)</literal>.</para>
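<para>For example, assuming an InfiniBand network served by the
<literal>ko2iblnd</literal> LND (the module name and value shown are
assumptions for illustration), the routing credits could be raised on
the router nodes with a module option such as:</para>
<screen>options ko2iblnd peer_buffer_credits=128</screen>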
2361 <para>LNet also limits concurrent sends and number of router buffers
2362 allocated to a single peer so that no peer can occupy all resources.
2366 <para><literal>nis</literal> - Shows current queue health on the node.
2368 <para>Example:</para>
2369 <screen># lctl get_param nis
2370 nid refs peer max tx min
2372 192.168.10.34@tcp 4 8 256 256 252
2374 <para> The fields are explained in the table below.</para>
2375 <informaltable frame="all">
2377 <colspec colname="c1" colwidth="30*"/>
2378 <colspec colname="c2" colwidth="80*"/>
2382 <para><emphasis role="bold">Field</emphasis></para>
2385 <para><emphasis role="bold">Description</emphasis></para>
2393 <literal> nid </literal></para>
2396 <para>Network interface.</para>
2402 <literal> refs </literal></para>
2405 <para>Internal reference counter.</para>
2411 <literal> peer </literal></para>
2414 <para>Number of peer-to-peer send credits on this NID. Credits are used to size
2415 buffer pools.</para>
2421 <literal> max </literal></para>
2424 <para>Total number of send credits on this NID.</para>
2430 <literal> tx </literal></para>
2433 <para>Current number of send credits available on this NID.</para>
2439 <literal> min </literal></para>
2442 <para>Lowest number of send credits available on this NID.</para>
2448 <literal> queue </literal></para>
2451 <para>Total bytes in active/queued sends.</para>
2457 <para><emphasis role="bold"><emphasis role="italic">Analysis:</emphasis></emphasis></para>
<para>Subtracting <literal>tx</literal> from <literal>max</literal>
2459 (<literal>max</literal> - <literal>tx</literal>) yields the number of sends currently
2460 active. A large or increasing number of active sends may indicate a problem.</para>
2462 </itemizedlist></para>
2464 <section remap="h3" xml:id="balancing_free_space">
2466 <primary>proc</primary>
2467 <secondary>free space</secondary>
2468 </indexterm>Allocating Free Space on OSTs</title>
2469 <para>Free space is allocated using either a round-robin or a weighted
2470 algorithm. The allocation method is determined by the maximum amount of
2471 free-space imbalance between the OSTs. When free space is relatively
2472 balanced across OSTs, the faster round-robin allocator is used, which
2473 maximizes network balancing. The weighted allocator is used when any two
2474 OSTs are out of balance by more than a specified threshold.</para>
2475 <para>Free space distribution can be tuned using these two
2476 tunable parameters:</para>
2479 <para><literal>lod.*.qos_threshold_rr</literal> - The threshold at which
2480 the allocation method switches from round-robin to weighted is set
2481 in this file. The default is to switch to the weighted algorithm when
2482 any two OSTs are out of balance by more than 17 percent.</para>
2485 <para><literal>lod.*.qos_prio_free</literal> - The weighting priority
2486 used by the weighted allocator can be adjusted in this file. Increasing
2487 the value of <literal>qos_prio_free</literal> puts more weighting on the
2488 amount of free space available on each OST and less on how stripes are
2489 distributed across OSTs. The default value is 91 percent weighting for
2490 free space rebalancing and 9 percent for OST balancing. When the
2491 free space priority is set to 100, weighting is based entirely on free
2492 space and location is no longer used by the striping algorithm.</para>
2495 <para condition="l29"><literal>osp.*.reserved_mb_low</literal>
2496 - The low watermark used to stop object allocation if available space
2497 is less than this. The default is 0.1% of total OST size.</para>
2500 <para condition="l29"><literal>osp.*.reserved_mb_high</literal>
2501 - The high watermark used to start object allocation if available
2502 space is more than this. The default is 0.2% of total OST size.</para>
2505 <para>For more information about monitoring and managing free space, see
2506 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
2507 linkend="file_striping.managing_free_space"/>.</para>
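<para>For example, to switch to the weighted allocator only when OSTs
are more than 25 percent out of balance, and to weight allocation
almost entirely on free space once it is active (the values shown are
illustrative only), run the following on the MDS:</para>
<screen>mds# lctl set_param lod.*.qos_threshold_rr=25
mds# lctl set_param lod.*.qos_prio_free=98</screen>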
2509 <section remap="h3">
2511 <primary>proc</primary>
2512 <secondary>locking</secondary>
2513 </indexterm>Configuring Locking</title>
2514 <para>The <literal>lru_size</literal> parameter is used to control the
2515 number of client-side locks in the LRU cached locks queue. LRU size is
2516 normally dynamic, based on load to optimize the number of locks cached
2517 on nodes that have different workloads (e.g., login/build nodes vs.
2518 compute nodes vs. backup nodes).</para>
2519 <para>The total number of locks available is a function of the server RAM.
2520 The default limit is 50 locks/1 MB of RAM. If memory pressure is too high,
2521 the LRU size is shrunk. The number of locks on the server is limited to
2522 <replaceable>num_osts_per_oss * num_clients * lru_size</replaceable>
2526 <para>To enable automatic LRU sizing, set the
2527 <literal>lru_size</literal> parameter to 0. In this case, the
2528 <literal>lru_size</literal> parameter shows the current number of locks
2529 being used on the client. Dynamic LRU resizing is enabled by default.
2533 <para>To specify a maximum number of locks, set the
2534 <literal>lru_size</literal> parameter to a value other than zero.
2535 A good default value for compute nodes is around
2536 <literal>100 * <replaceable>num_cpus</replaceable></literal>.
2537 It is recommended that you only set <literal>lru_size</literal>
to be significantly larger on a few login nodes where multiple
2539 users access the file system interactively.</para>
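<para>For example, on a compute node with 32 CPU cores, a static limit
following the
<literal>100 * <replaceable>num_cpus</replaceable></literal> guideline
above could be set with (the value is illustrative only):</para>
<screen>client# lctl set_param ldlm.namespaces.*.lru_size=3200</screen>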
2542 <para>To clear the LRU on a single client, and, as a result, flush client
2543 cache without changing the <literal>lru_size</literal> value, run:</para>
2544 <screen># lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
2545 <para>If the LRU size is set lower than the number of existing locks,
2546 <emphasis>unused</emphasis> locks are canceled immediately. Use
2547 <literal>clear</literal> to cancel all locks without changing the value.
2550 <para>The <literal>lru_size</literal> parameter can only be set
temporarily using <literal>lctl set_param</literal>; it cannot be set
2554 <para>To disable dynamic LRU resizing on the clients, run for example:
2556 <screen># lctl set_param ldlm.namespaces.*osc*.lru_size=5000</screen>
2557 <para>To determine the number of locks being granted with dynamic LRU
2558 resizing, run:</para>
2559 <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
2560 <para>The <literal>lru_max_age</literal> parameter is used to control the
2561 age of client-side locks in the LRU cached locks queue. This limits how
long unused locks are cached on the client, and prevents idle clients from
2563 holding locks for an excessive time, which reduces memory usage on both
2564 the client and server, as well as reducing work during server recovery.
2566 <para>The <literal>lru_max_age</literal> is printed in milliseconds.</para>
2567 <para condition='l2B'>Since Lustre 2.11, in addition to setting the
2568 maximum lock age in milliseconds, it can also be set using a suffix of
2569 <literal>s</literal> or <literal>ms</literal> to indicate seconds or
2570 milliseconds, respectively. For example to set the client's maximum
2571 lock age to 15 minutes (900s) run:
2574 # lctl set_param ldlm.namespaces.*MDT*.lru_max_age=900s
2575 # lctl get_param ldlm.namespaces.*MDT*.lru_max_age
2576 ldlm.namespaces.myth-MDT0000-mdc-ffff8804296c2800.lru_max_age=900000
2579 <section xml:id="tuning_setting_thread_count">
2581 <primary>proc</primary>
2582 <secondary>thread counts</secondary>
2583 </indexterm>Setting MDS and OSS Thread Counts</title>
<para>MDS and OSS thread count tunables can be used to set the minimum and maximum thread counts
2585 or get the current number of running threads for the services listed in the table
2587 <informaltable frame="all">
2589 <colspec colname="c1" colwidth="50*"/>
2590 <colspec colname="c2" colwidth="50*"/>
2595 <emphasis role="bold">Service</emphasis></para>
2599 <emphasis role="bold">Description</emphasis></para>
2604 <literal> mds.MDS.mdt </literal>
2607 <para>Main metadata operations service</para>
2612 <literal> mds.MDS.mdt_readpage </literal>
2615 <para>Metadata <literal>readdir</literal> service</para>
2620 <literal> mds.MDS.mdt_setattr </literal>
2623 <para>Metadata <literal>setattr/close</literal> operations service </para>
2628 <literal> ost.OSS.ost </literal>
2631 <para>Main data operations service</para>
2636 <literal> ost.OSS.ost_io </literal>
2639 <para>Bulk data I/O services</para>
2644 <literal> ost.OSS.ost_create </literal>
2647 <para>OST object pre-creation service</para>
2652 <literal> ldlm.services.ldlm_canceld </literal>
2655 <para>DLM lock cancel service</para>
2660 <literal> ldlm.services.ldlm_cbd </literal>
2663 <para>DLM lock grant service</para>
2669 <para>For each service, tunable parameters as shown below are available.
2673 <para>To temporarily set these tunables, run:</para>
2674 <screen># lctl set_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started=num</replaceable> </screen>
2677 <para>To permanently set this tunable, run the following command on
<screen>mgs# lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable>=<replaceable>num</replaceable></screen></para>
2680 <para condition='l25'>For Lustre 2.5 or earlier, run:
<screen>mgs# lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable>=<replaceable>num</replaceable></screen>
2685 <para>The following examples show how to set thread counts and get the
2686 number of running threads for the service <literal>ost_io</literal>
2688 <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para>
2691 <para>To get the number of running threads, run:</para>
2692 <screen># lctl get_param ost.OSS.ost_io.threads_started
2693 ost.OSS.ost_io.threads_started=128</screen>
<para>To check the current maximum thread count (512 in this example), run:</para>
2697 <screen># lctl get_param ost.OSS.ost_io.threads_max
2698 ost.OSS.ost_io.threads_max=512</screen>
<para>To set the maximum thread count to 256 instead of 512 (for example, to avoid
overloading the storage array with requests), run:</para>
2703 <screen># lctl set_param ost.OSS.ost_io.threads_max=256
2704 ost.OSS.ost_io.threads_max=256</screen>
2707 <para>To set the maximum thread count to 256 instead of 512 permanently, run:</para>
2708 <screen># lctl conf_param testfs.ost.ost_io.threads_max=256</screen>
2709 <para condition='l25'>For version 2.5 or later, run:
2710 <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256
2711 ost.OSS.ost_io.threads_max=256 </screen> </para>
2714 <para> To check if the <literal>threads_max</literal> setting is active, run:</para>
2715 <screen># lctl get_param ost.OSS.ost_io.threads_max
2716 ost.OSS.ost_io.threads_max=256</screen>
2720 <para>If the number of service threads is changed while the file system is running, the change
may not take effect until the file system is stopped and restarted. If the number of service
2722 threads in use exceeds the new <literal>threads_max</literal> value setting, service threads
2723 that are already running will not be stopped.</para>
2725 <para>See also <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="lustretuning"/></para>
2727 <section xml:id="enabling_interpreting_debugging_logs">
2729 <primary>proc</primary>
2730 <secondary>debug</secondary>
2731 </indexterm>Enabling and Interpreting Debugging Logs</title>
2732 <para>By default, a detailed log of all operations is generated to aid in
2733 debugging. Flags that control debugging are found via
2734 <literal>lctl get_param debug</literal>.</para>
<para>The overhead of debugging can affect the performance of the Lustre file
system. Therefore, to minimize the impact on performance, the debug level
can be lowered, which affects the amount of debugging information kept in
the internal log buffer but does not alter the amount of information that
goes into syslog. You can raise the debug level when you need to collect
2740 logs to debug problems. </para>
2741 <para>The debugging mask can be set using "symbolic names". The
2742 symbolic format is shown in the examples below.
2745 <para>To verify the debug level used, examine the parameter that
2746 controls debugging by running:</para>
2747 <screen># lctl get_param debug
2749 ioctl neterror warning error emerg ha config console</screen>
2752 <para>To turn off debugging except for network error debugging, run
2753 the following command on all nodes concerned:</para>
2754 <screen># sysctl -w lnet.debug="neterror"
2755 debug=neterror</screen>
2760 <para>To turn off debugging completely (except for the minimum error
2761 reporting to the console), run the following command on all nodes
2763 <screen># lctl set_param debug=0
2767 <para>To set an appropriate debug level for a production environment,
2769 <screen># lctl set_param debug="warning dlmtrace error emerg ha rpctrace vfstrace"
2770 debug=warning dlmtrace error emerg ha rpctrace vfstrace</screen>
2771 <para>The flags shown in this example collect enough high-level
2772 information to aid debugging, but they do not cause any serious
2773 performance impact.</para>
2778 <para>To add new flags to flags that have already been set,
2779 precede each one with a "<literal>+</literal>":</para>
2780 <screen># lctl set_param debug="+neterror +ha"
2782 # lctl get_param debug
2783 debug=neterror warning error emerg ha console</screen>
2786 <para>To remove individual flags, precede them with a
2787 "<literal>-</literal>":</para>
2788 <screen># lctl set_param debug="-ha"
2790 # lctl get_param debug
2791 debug=neterror warning error emerg console</screen>
2795 <para>Debugging parameters include:</para>
2798 <para><literal>subsystem_debug</literal> - Controls the debug logs for subsystems.</para>
2801 <para><literal>debug_path</literal> - Indicates the location where the debug log is dumped
2802 when triggered automatically or manually. The default path is
2803 <literal>/tmp/lustre-log</literal>.</para>
2806 <para>These parameters can also be set using:<screen>sysctl -w lnet.debug={value}</screen></para>
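<para>For example, to point automatic debug dumps at a different
directory and then dump the current kernel debug buffer to a file
manually with the <literal>lctl dk</literal> (debug_kernel) command
(the paths shown are illustrative only), run:</para>
<screen># lctl set_param debug_path=/var/log/lustre/lustre-log
# lctl dk /var/log/lustre/manual-dump.txt</screen>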
2807 <para>Additional useful parameters: <itemizedlist>
2809 <para><literal>panic_on_lbug</literal> - Causes ''panic'' to be called
2810 when the Lustre software detects an internal problem (an <literal>LBUG</literal> log
2811 entry); panic crashes the node. This is particularly useful when a kernel crash dump
2812 utility is configured. The crash dump is triggered when the internal inconsistency is
2813 detected by the Lustre software. </para>
2816 <para><literal>upcall</literal> - Allows you to specify the path to the binary which will
2817 be invoked when an <literal>LBUG</literal> log entry is encountered. This binary is
2818 called with four parameters:</para>
2819 <para> - The string ''<literal>LBUG</literal>''.</para>
2820 <para> - The file where the <literal>LBUG</literal> occurred.</para>
2821 <para> - The function name.</para>
<para> - The line number in the file.</para>
2824 </itemizedlist></para>
2826 <title>Interpreting OST Statistics</title>
2829 <xref linkend="collectl"/> (<literal>collectl</literal>).</para>
2831 <para>OST <literal>stats</literal> files can be used to provide statistics showing activity
2832 for each OST. For example:</para>
2833 <screen># lctl get_param osc.testfs-OST0000-osc.stats
2834 snapshot_time 1189732762.835363
2839 obd_ping 212</screen>
2840 <para>Use the <literal>llstat</literal> utility to monitor statistics over time.</para>
2841 <para>To clear the statistics, use the <literal>-c</literal> option to
2842 <literal>llstat</literal>. To specify how frequently the statistics
2843 should be reported (in seconds), use the <literal>-i</literal> option.
2844 In the example below, the <literal>-c</literal> option clears the
2845 statistics and <literal>-i10</literal> option reports statistics every
2847 <screen role="smaller">$ llstat -c -i10 ost_io
2849 /usr/bin/llstat: STATS on 06/06/07
2850 /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
2851 snapshot_time 1181074093.276072
2853 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
2855 Count Rate Events Unit last min avg max stddev
2856 req_waittime 8 0 8 [usec] 2078 34 259.75 868 317.49
2857 req_qdepth 8 0 8 [reqs] 1 0 0.12 1 0.35
2858 req_active 8 0 8 [reqs] 11 1 1.38 2 0.52
2859 reqbuf_avail 8 0 8 [bufs] 511 63 63.88 64 0.35
2860 ost_write 8 0 8 [bytes] 169767 72914 212209.62 387579 91874.29
2862 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
2864 Count Rate Events Unit last min avg max stddev
2865 req_waittime 31 3 39 [usec] 30011 34 822.79 12245 2047.71
2866 req_qdepth 31 3 39 [reqs] 0 0 0.03 1 0.16
2867 req_active 31 3 39 [reqs] 58 1 1.77 3 0.74
2868 reqbuf_avail 31 3 39 [bufs] 1977 63 63.79 64 0.41
2869 ost_write 30 3 38 [bytes] 1028467 15019 315325.16 910694 197776.51
2871 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
2873 Count Rate Events Unit last min avg max stddev
2874 req_waittime 21 2 60 [usec] 14970 34 784.32 12245 1878.66
2875 req_qdepth 21 2 60 [reqs] 0 0 0.02 1 0.13
2876 req_active 21 2 60 [reqs] 33 1 1.70 3 0.70
2877 reqbuf_avail 21 2 60 [bufs] 1341 63 63.82 64 0.39
2878 ost_write 21 2 59 [bytes] 7648424 15019 332725.08 910694 180397.87
2880 <para>The columns in this example are described in the table below.</para>
2881 <informaltable frame="all">
2883 <colspec colname="c1" colwidth="50*"/>
2884 <colspec colname="c2" colwidth="50*"/>
2888 <para><emphasis role="bold">Parameter</emphasis></para>
2891 <para><emphasis role="bold">Description</emphasis></para>
2897 <entry><literal>Name</literal></entry>
2898 <entry>Name of the service event. See the tables below for descriptions of service
2899 events that are tracked.</entry>
2904 <literal>Cur. Count </literal></para>
2907 <para>Number of events of each type sent in the last interval.</para>
2913 <literal>Cur. Rate </literal></para>
2916 <para>Number of events per second in the last interval.</para>
2922 <literal> # Events </literal></para>
2925 <para>Total number of such events since the events have been cleared.</para>
2931 <literal> Unit </literal></para>
2934 <para>Unit of measurement for that statistic (microseconds, requests,
2941 <literal> last </literal></para>
<para>Average rate of these events (in units/event) for the last interval
during which they arrived. For instance, an
<literal>ost_destroy</literal> entry might show an average of 736 microseconds per
destroy for 400 object destroys in the previous 10 seconds.</para>
2953 <literal> min </literal></para>
2956 <para>Minimum rate (in units/events) since the service started.</para>
2962 <literal> avg </literal></para>
2965 <para>Average rate.</para>
2971 <literal> max </literal></para>
2974 <para>Maximum rate.</para>
2980 <literal> stddev </literal></para>
2983 <para>Standard deviation (not measured in some cases)</para>
2989 <para>Events common to all services are shown in the table below.</para>
2990 <informaltable frame="all">
2992 <colspec colname="c1" colwidth="50*"/>
2993 <colspec colname="c2" colwidth="50*"/>
2997 <para><emphasis role="bold">Parameter</emphasis></para>
3000 <para><emphasis role="bold">Description</emphasis></para>
3008 <literal> req_waittime </literal></para>
3011 <para>Amount of time a request waited in the queue before being handled by an
3012 available server thread.</para>
3018 <literal> req_qdepth </literal></para>
3021 <para>Number of requests waiting to be handled in the queue for this service.</para>
3027 <literal> req_active </literal></para>
3030 <para>Number of requests currently being handled.</para>
3036 <literal> reqbuf_avail </literal></para>
<para>Number of unsolicited LNet request buffers for this service.</para>
3045 <para>Some service-specific events of interest are described in the table below.</para>
3046 <informaltable frame="all">
3048 <colspec colname="c1" colwidth="50*"/>
3049 <colspec colname="c2" colwidth="50*"/>
3053 <para><emphasis role="bold">Parameter</emphasis></para>
3056 <para><emphasis role="bold">Description</emphasis></para>
3064 <literal> ldlm_enqueue </literal></para>
3067 <para>Time it takes to enqueue a lock (this includes file open on the MDS)</para>
3073 <literal> mds_reint </literal></para>
3076 <para>Time it takes to process an MDS modification record (includes
3077 <literal>create</literal>, <literal>mkdir</literal>, <literal>unlink</literal>,
3078 <literal>rename</literal> and <literal>setattr</literal>)</para>
3086 <title>Interpreting MDT Statistics</title>
3089 <xref linkend="collectl"/> (<literal>collectl</literal>).</para>
3091 <para>MDT <literal>stats</literal> files can be used to track MDT
3092 statistics for the MDS. The example below shows sample output from an
3093 MDT <literal>stats</literal> file.</para>
3094 <screen># lctl get_param mds.*-MDT0000.stats
3095 snapshot_time 1244832003.676892 secs.usecs
3096 open 2 samples [reqs]
3097 close 1 samples [reqs]
3098 getxattr 3 samples [reqs]
3099 process_config 1 samples [reqs]
3100 connect 2 samples [reqs]
3101 disconnect 2 samples [reqs]
3102 statfs 3 samples [reqs]
3103 setattr 1 samples [reqs]
3104 getattr 3 samples [reqs]
3105 llog_init 6 samples [reqs]
3106 notify 16 samples [reqs]</screen>
3111 vim:expandtab:shiftwidth=2:tabstop=8: