LustreProc.xml

   1 <?xml version='1.0' encoding='UTF-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
   3   xml:lang="en-US" xml:id="lustreproc">
   4   <title xml:id="lustreproc.title">Lustre Parameters</title>
   5   <para>The <literal>/proc</literal> and <literal>/sys</literal> file systems
   6   act as an interface to internal data structures in the kernel. This chapter
   7   describes parameters and tunables that are useful for optimizing and
   8   monitoring aspects of a Lustre file system. It includes these sections:</para>
   9   <itemizedlist>
  10     <listitem>
  11       <para><xref linkend="dbdoclet.50438271_83523"/></para>
  13     </listitem>
  14   </itemizedlist>
  15   <section>
  16     <title>Introduction to Lustre Parameters</title>
  17     <para>Lustre parameters and statistics files provide an interface to
  18     internal data structures in the kernel that enables monitoring and
  19     tuning of many aspects of Lustre file system and application performance.
  20     These data structures include settings and metrics for components such
  21     as memory, networking, file systems, and kernel housekeeping routines,
  22     which are available throughout the hierarchical file layout.
  23     </para>
  24     <para>Typically, metrics are read via <literal>lctl get_param</literal>
  25     and settings are changed via <literal>lctl set_param</literal>.
  26     While it is possible to access parameters in <literal>/proc</literal>
  27     and <literal>/sys</literal> directly, the location of these parameters may
  28     change between releases, so it is recommended to always use
  29     <literal>lctl</literal> to access the parameters from userspace scripts.
  30     Some data is server-only, some data is client-only, and some data is
  31     exported from the client to the server and is thus duplicated in both
  32     locations.</para>
  33     <note>
  34       <para>In the examples in this chapter, <literal>#</literal> indicates
  35       a command is entered as root.  Lustre servers are named according to the
  36       convention <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>.
  37         The standard UNIX wildcard designation (*) is used.</para>
  38     </note>
  39     <para>Some examples are shown below:</para>
  40     <itemizedlist>
  41       <listitem>
  42         <para> To obtain data from a Lustre client:</para>
  43         <screen># lctl list_param osc.*
  44 osc.testfs-OST0000-osc-ffff881071d5cc00
  45 osc.testfs-OST0001-osc-ffff881071d5cc00
  46 osc.testfs-OST0002-osc-ffff881071d5cc00
  47 osc.testfs-OST0003-osc-ffff881071d5cc00
  48 osc.testfs-OST0004-osc-ffff881071d5cc00
  49 osc.testfs-OST0005-osc-ffff881071d5cc00
  50 osc.testfs-OST0006-osc-ffff881071d5cc00
  51 osc.testfs-OST0007-osc-ffff881071d5cc00
  52 osc.testfs-OST0008-osc-ffff881071d5cc00</screen>
  53         <para>In this example, information about OST connections available
  54         on a client is displayed (indicated by "osc").</para>
  55       </listitem>
  56     </itemizedlist>
  57     <itemizedlist>
  58       <listitem>
  59         <para> To see multiple levels of parameters, use multiple
  60           wildcards:<screen># lctl list_param osc.*.*
  61 osc.testfs-OST0000-osc-ffff881071d5cc00.active
  62 osc.testfs-OST0000-osc-ffff881071d5cc00.blocksize
  63 osc.testfs-OST0000-osc-ffff881071d5cc00.checksum_type
  64 osc.testfs-OST0000-osc-ffff881071d5cc00.checksums
  65 osc.testfs-OST0000-osc-ffff881071d5cc00.connect_flags
  66 osc.testfs-OST0000-osc-ffff881071d5cc00.contention_seconds
  67 osc.testfs-OST0000-osc-ffff881071d5cc00.cur_dirty_bytes
  68 ...
  69 osc.testfs-OST0000-osc-ffff881071d5cc00.rpc_stats</screen></para>
  70       </listitem>
  71     </itemizedlist>
  72     <itemizedlist>
  73       <listitem>
  74         <para> To view a specific file, use <literal>lctl get_param</literal>:
  75           <screen># lctl get_param osc.lustre-OST0000*.rpc_stats</screen></para>
  76       </listitem>
  77     </itemizedlist>
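    <para>The naming convention noted earlier is visible in the parameter
    names listed above. As a rough illustration, the following Python sketch
    splits such a name into file system name, target type, and index. The
    helper name <literal>parse_target</literal> is hypothetical, and the
    four-digit hexadecimal index is an assumption made for this example, not
    something taken from <literal>lctl</literal> itself.</para>

```python
import re

# Assumed layout: fsname-MDTxxxx or fsname-OSTxxxx, where xxxx is taken to be
# a four-digit hexadecimal index. Any suffix (such as "-osc-ffff...") is ignored.
TARGET_RE = re.compile(r"^(.+)-(MDT|OST)([0-9a-fA-F]{4})")


def parse_target(name):
    """Return (fsname, target_type, index), or None if the name does not match."""
    match = TARGET_RE.match(name)
    if match is None:
        return None
    fsname, target_type, index_text = match.groups()
    return fsname, target_type, int(index_text, 16)


print(parse_target("testfs-OST0008-osc-ffff881071d5cc00"))
print(parse_target("testfs-MDT0000"))
```

    <para>A script could use a helper like this to group parameters by target
    before comparing values across OSTs.</para>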
  78     <para>For more information about using <literal>lctl</literal>, see <xref
  79         xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438194_51490"/>.</para>
  80     <para>Data can also be viewed using the <literal>cat</literal> command
  81     with the full path to the file. The form of the <literal>cat</literal>
  82     command is similar to that of the <literal>lctl get_param</literal>
  83     command with some differences.  Unfortunately, as the Linux kernel has
  84     changed over the years, the location of statistics and parameter files
  85     has also changed, which means that the Lustre parameter files may be
  86     located in the <literal>/proc</literal> directory, in the
  87     <literal>/sys</literal> directory, and/or in the
  88     <literal>/sys/kernel/debug</literal> directory, depending on the kernel
  89     version and the Lustre version being used.  The <literal>lctl</literal>
  90     command insulates scripts from these changes and is preferred over direct
  91     file access, unless as part of a high-performance monitoring system.
  92     In the <literal>cat</literal> command:</para>
  93     <itemizedlist>
  94       <listitem>
  95         <para>Replace the dots in the path with slashes.</para>
  96       </listitem>
  97       <listitem>
  98         <para>Prepend the path with the appropriate directory component:
  99           <screen>/{proc,sys}/{fs,sys}/{lustre,lnet}</screen></para>
 100       </listitem>
 101     </itemizedlist>
 102     <para>For example, an <literal>lctl get_param</literal> command may look like
 103       this:<screen># lctl get_param osc.*.uuid
 104 osc.testfs-OST0000-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
 105 osc.testfs-OST0001-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
 106 ...</screen></para>
 107     <para>The equivalent <literal>cat</literal> command may look like this:
 108      <screen># cat /proc/fs/lustre/osc/*/uuid
 109 594db456-0685-bd16-f59b-e72ee90e9819
 110 594db456-0685-bd16-f59b-e72ee90e9819
 111 ...</screen></para>
 112     <para>or like this:
 113      <screen># cat /sys/fs/lustre/osc/*/uuid
 114 594db456-0685-bd16-f59b-e72ee90e9819
 115 594db456-0685-bd16-f59b-e72ee90e9819
 116 ...</screen></para>
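    <para>The two path rules above can be sketched in Python. This is only an
    illustration: the function name <literal>candidate_paths</literal> is
    hypothetical, the function simply expands every combination of the
    directory pattern, and most of the resulting paths will not exist on any
    given system. It also assumes that no component of the parameter name
    contains a dot of its own.</para>

```python
from itertools import product

# Directory components from the pattern /{proc,sys}/{fs,sys}/{lustre,lnet}.
ROOTS = ("proc", "sys")
MIDDLES = ("fs", "sys")
SUBSYSTEMS = ("lustre", "lnet")


def candidate_paths(param):
    """Return every path the directory pattern can produce for a parameter name."""
    # Rule 1: replace the dots in the parameter name with slashes.
    relative = param.replace(".", "/")
    # Rule 2: prepend each possible directory component.
    return [
        "/" + "/".join((root, middle, subsystem, relative))
        for root, middle, subsystem in product(ROOTS, MIDDLES, SUBSYSTEMS)
    ]


for path in candidate_paths("osc.*.uuid"):
    print(path)
```

    <para>Because the correct location depends on the kernel and Lustre
    versions, a monitoring script would still need to check which candidate
    exists, which is the work that <literal>lctl</literal> does on behalf of
    the caller.</para>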
 117     <para>The <literal>llstat</literal> utility can be used to monitor some
 118     Lustre file system I/O activity over a specified time period. For more
 119     details, see
 120     <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438219_23232"/>.</para>
 121     <para>Some data is imported from attached clients and is available in a
 122     directory called <literal>exports</literal> located in the corresponding
 123     per-service directory on a Lustre server. For example:
 124     <screen>oss:/root# lctl list_param obdfilter.testfs-OST0000.exports.*
 125 # hash ldlm_stats stats uuid</screen></para>
 126     <section remap="h3">
 127       <title>Identifying Lustre File Systems and Servers</title>
 128       <para>Several parameter files on the MGS list existing
 129       Lustre file systems and file system servers. The examples below are for
 130       a Lustre file system called
 131           <literal>testfs</literal> with one MDT and three OSTs.</para>
 132       <itemizedlist>
 133         <listitem>
 134           <para> To view all known Lustre file systems, enter:</para>
 135           <screen>mgs# lctl get_param mgs.*.filesystems
 136 testfs</screen>
 137         </listitem>
 138         <listitem>
 139           <para> To view the names of the servers in a file system in which at least one server is
 140             running,
 141             enter:<screen>lctl get_param mgs.*.live.<replaceable>&lt;filesystem name></replaceable></screen></para>
 142           <para>For example:</para>
 143           <screen>mgs# lctl get_param mgs.*.live.testfs
 144 fsname: testfs
 145 flags: 0x20     gen: 45
 146 testfs-MDT0000
 147 testfs-OST0000
 148 testfs-OST0001
 149 testfs-OST0002
 150
 151 Secure RPC Config Rules:
 152
 153 imperative_recovery_state:
 154     state: startup
 155     nonir_clients: 0
 156     nidtbl_version: 6
 157     notify_duration_total: 0.001000
 158     notify_duation_max:  0.001000
 159     notify_count: 4</screen>
 160         </listitem>
 161         <listitem>
 162           <para>To list all configured devices on the local node, enter:</para>
 163           <screen># lctl device_list
 164 0 UP mgs MGS MGS 11
 165 1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a4922670 5
 166 2 UP mdt MDS MDS_uuid 3
 167 3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
 168 4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 7
 169 5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
 170 6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
 171 7 UP lov testfs-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 4
 172 8 UP mdc testfs-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
 173 9 UP osc testfs-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
 174 10 UP osc testfs-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5</screen>
 175           <para>The information provided on each line includes:</para>
 176           <para> - Device number</para>
 177           <para> - Device status (UP, INactive, or STopping)</para>
 178           <para> - Device type (for example, <literal>mgs</literal> or <literal>osc</literal>) and device name</para>
 179           <para> - Device UUID</para>
 180           <para> - Reference count (how many users this device has)</para>
 181         </listitem>
 182         <listitem>
 183           <para>To display the name of any server, view the device
 184             label:<screen>mds# e2label /dev/sda
 185 testfs-MDT0000</screen></para>
 186         </listitem>
 187       </itemizedlist>
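      <para>Each well-formed line of the <literal>lctl device_list</literal>
      output shown above has six whitespace-separated columns: number, status,
      type, name, UUID, and reference count. The following Python sketch
      splits one such line into named fields. The field names and the helper
      <literal>parse_device_line</literal> are illustrative choices, not part
      of any Lustre tool.</para>

```python
from collections import namedtuple

# Field names are illustrative labels for the six columns of a device_list line.
Device = namedtuple("Device", "number status dev_type name uuid refcount")


def parse_device_line(line):
    """Split one whitespace-separated device_list line into its six fields."""
    number, status, dev_type, name, uuid, refcount = line.split()
    return Device(int(number), status, dev_type, name, uuid, int(refcount))


device = parse_device_line("4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 7")
print(device.name, device.status, device.refcount)
```

      <para>The tuple unpacking raises <literal>ValueError</literal> when a
      line does not have exactly six columns, which makes malformed input easy
      to notice.</para>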
 188     </section>
 189   </section>
 190   <section>
 191     <title>Tuning Multi-Block Allocation (mballoc)</title>
 192     <para>Capabilities supported by <literal>mballoc</literal> include:</para>
 193     <itemizedlist>
 194       <listitem>
 195         <para> Pre-allocation for single files to help reduce fragmentation.</para>
 196       </listitem>
 197       <listitem>
 198         <para> Pre-allocation for a group of files to enable packing of small files into large,
 199           contiguous chunks.</para>
 200       </listitem>
 201       <listitem>
 202         <para> Stream allocation to help decrease the seek rate.</para>
 203       </listitem>
 204     </itemizedlist>
 205     <para>The following <literal>mballoc</literal> tunables are available:</para>
 206     <informaltable frame="all">
 207       <tgroup cols="2">
 208         <colspec colname="c1" colwidth="30*"/>
 209         <colspec colname="c2" colwidth="70*"/>
 210         <thead>
 211           <row>
 212             <entry>
 213               <para><emphasis role="bold">Field</emphasis></para>
 214             </entry>
 215             <entry>
 216               <para><emphasis role="bold">Description</emphasis></para>
 217             </entry>
 218           </row>
 219         </thead>
 220         <tbody>
 221           <row>
 222             <entry>
 223               <para>
 224                 <literal>mb_max_to_scan</literal></para>
 225             </entry>
 226             <entry>
 227               <para>Maximum number of free chunks that <literal>mballoc</literal> examines before
 228                 making a final allocation decision. This limit avoids a livelock situation.</para>
 229             </entry>
 230           </row>
 231           <row>
 232             <entry>
 233               <para>
 234                 <literal>mb_min_to_scan</literal></para>
 235             </entry>
 236             <entry>
 237               <para>Minimum number of free chunks that <literal>mballoc</literal> searches before
 238                 picking the best chunk for allocation. This is useful for small requests to reduce
 239                 fragmentation of big free chunks.</para>
 240             </entry>
 241           </row>
 242           <row>
 243             <entry>
 244               <para>
 245                 <literal>mb_order2_req</literal></para>
 246             </entry>
 247             <entry>
 248               <para>For requests equal to 2^N, where N &gt;= <literal>mb_order2_req</literal>, a
 249                 fast search is done using a base 2 buddy allocation service.</para>
 250             </entry>
 251           </row>
 252           <row>
 253             <entry>
 254               <para>
 255                 <literal>mb_small_req</literal></para>
 256             </entry>
 257             <entry morerows="1">
 258               <para><literal>mb_small_req</literal> - Defines (in MB) the upper bound of "small
 259                 requests".</para>
 260               <para><literal>mb_large_req</literal> - Defines (in MB) the lower bound of "large
 261                 requests".</para>
 262               <para>Requests are handled differently based on size:<itemizedlist>
 263                   <listitem>
 264                     <para>&lt; <literal>mb_small_req</literal> - Requests are packed together to
 265                       form large, aggregated requests.</para>
 266                   </listitem>
 267                   <listitem>
 268                     <para>> <literal>mb_small_req</literal> and &lt; <literal>mb_large_req</literal>
 269                       - Requests are primarily allocated linearly.</para>
 270                   </listitem>
 271                   <listitem>
 272                     <para>> <literal>mb_large_req</literal> - Requests are allocated since hard disk
 273                       seek time is less of a concern in this case.</para>
 274                   </listitem>
 275                 </itemizedlist></para>
 276               <para>In general, small requests are combined to create larger requests, which are
 277                 then placed close to one another to minimize the number of seeks required to access
 278                 the data.</para>
 279             </entry>
 280           </row>
 281           <row>
 282             <entry>
 283               <para>
 284                 <literal>mb_large_req</literal></para>
 285             </entry>
 286           </row>
 287           <row>
 288             <entry>
 289               <para>
 290                 <literal>prealloc_table</literal></para>
 291             </entry>
 292             <entry>
 293               <para>A table of values used to preallocate space when a new request is received. By
 294                 default, the table looks like
 295                 this:<screen>prealloc_table
 296 4 8 16 32 64 128 256 512 1024 2048 </screen></para>
 297               <para>When a new request is received, space is preallocated at the next higher
 298                 increment specified in the table. For example, for requests of less than 4 file
 299                 system blocks, 4 blocks of space are preallocated; for requests between 4 and 8, 8
 300                 blocks are preallocated; and so forth.</para>
 301               <para>Although customized values can be entered in the table, the performance of
 302                 general usage file systems will not typically be improved by modifying the table (in
 303                 fact, in ext4 systems, the table values are fixed).  However, for some specialized
 304                 workloads, tuning the <literal>prealloc_table</literal> values may result in smarter
 305                 preallocation decisions. </para>
 306             </entry>
 307           </row>
 308           <row>
 309             <entry>
 310               <para>
 311                 <literal>mb_group_prealloc</literal></para>
 312             </entry>
 313             <entry>
 314               <para>The amount of space (in kilobytes) preallocated for groups of small
 315                 requests.</para>
 316             </entry>
 317           </row>
 318         </tbody>
 319       </tgroup>
 320     </informaltable>
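    <para>The <literal>prealloc_table</literal> lookup described in the table
    above can be sketched in Python as follows. This is an illustration of
    the "next higher increment" rule only, not the allocator's actual code.
    Two points are assumptions because the description does not settle them:
    a request exactly equal to a table entry is mapped to that entry, and a
    request larger than every entry returns <literal>None</literal>.</para>

```python
from bisect import bisect_left

# Default table quoted in the description above (sizes in file system blocks).
PREALLOC_TABLE = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


def preallocation_size(request_blocks, table=PREALLOC_TABLE):
    """Return the smallest table entry that is not below the request.

    Assumption: a request equal to an entry maps to that entry, and a request
    larger than every entry returns None because that case is not described.
    """
    index = bisect_left(table, request_blocks)
    if index == len(table):
        return None
    return table[index]


for blocks in (3, 5, 100, 4096):
    print(blocks, preallocation_size(blocks))
```

    <para>Passing a different sorted list as <literal>table</literal> shows
    how customized values would change the amount preallocated for a given
    request size.</para>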
 321     <para>Buddy group cache information found in
 322           <literal>/proc/fs/ldiskfs/<replaceable>disk_device</replaceable>/mb_groups</literal> may
 323       be useful for assessing on-disk fragmentation. For
 324       example:<screen>cat /proc/fs/ldiskfs/loop0/mb_groups
 325 #group: free free frags first pa [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9
 326      2^10 2^11 2^12 2^13]
 327 #0    : 2936 2936 1     42    0  [ 0   0   0   1   1   1   1   2   0   1
 328      2    0    0    0   ]</screen></para>
 329     <para>In this example, the columns show:<itemizedlist>
 330         <listitem>
 331           <para>#group number</para>
 332         </listitem>
 333         <listitem>
 334           <para>Available blocks in the group</para>
 335         </listitem>
 336         <listitem>
 337           <para>Blocks free on a disk</para>
 338         </listitem>
 339         <listitem>
 340           <para>Number of free fragments</para>
 341         </listitem>
 342         <listitem>
 343           <para>First free block in the group</para>
 344         </listitem>
 345         <listitem>
 346           <para>Number of preallocated chunks (not blocks)</para>
 347         </listitem>
 348         <listitem>
 349           <para>A series of available chunks of different sizes</para>
 350         </listitem>
 351       </itemizedlist></para>
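    <para>As a worked example, the following Python sketch parses the data
    line shown above (with its wrapped continuation joined onto one line) and
    adds up the blocks held in the free chunks, assuming that the entry under
    <literal>2^N</literal> counts free chunks of 2^N blocks. The function and
    key names are hypothetical. For this sample line the total equals the
    free block count reported in the second column.</para>

```python
def parse_mb_groups_line(line):
    """Parse one data line of mb_groups into named fields."""
    head, _, buddy_text = line.partition("[")
    group_text, _, field_text = head.partition(":")
    free, free_on_disk, frags, first, prealloc = (
        int(token) for token in field_text.split()
    )
    counts = [int(token) for token in buddy_text.replace("]", " ").split()]
    return {
        "group": int(group_text.strip().lstrip("#")),
        "free": free,
        "free_on_disk": free_on_disk,
        "frags": frags,
        "first": first,
        "prealloc": prealloc,
        "chunk_counts": counts,
    }


def blocks_in_free_chunks(counts):
    """Sum the blocks in all chunks, assuming entry i counts chunks of 2**i blocks."""
    return sum(count * 2 ** order for order, count in enumerate(counts))


line = (
    "#0    : 2936 2936 1     42    0  "
    "[ 0   0   0   1   1   1   1   2   0   1 2    0    0    0   ]"
)
info = parse_mb_groups_line(line)
print(info["free"], blocks_in_free_chunks(info["chunk_counts"]))
```

    <para>Comparing the two numbers for each group is a quick consistency
    check, and a group whose free space sits mostly in the low-order columns
    is more fragmented than one dominated by the high-order columns.</para>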
 352   </section>
 353   <section>
 354     <title>Monitoring Lustre File System I/O</title>
 355     <para>A number of system utilities are provided to enable collection of data related to I/O
 356       activity in a Lustre file system. In general, the data collected describes:</para>
 357     <itemizedlist>
 358       <listitem>
 359         <para> Data transfer rates and throughput of inputs and outputs external to the Lustre file
 360           system, such as network requests or disk I/O operations performed.</para>
 361       </listitem>
 362       <listitem>
 363         <para> Data about the throughput or transfer rates of internal Lustre file system data, such
 364           as locks or allocations. </para>
 365       </listitem>
 366     </itemizedlist>
 367     <note>
 368       <para>It is highly recommended that you complete baseline testing for your Lustre file system
 369         to determine normal I/O activity for your hardware, network, and system workloads. Baseline
 370         data will allow you to easily determine when performance becomes degraded in your system.
 371         Two particularly useful baseline statistics are:</para>
 372       <itemizedlist>
 373         <listitem>
 374           <para><literal>brw_stats</literal> – Histogram data characterizing I/O requests to the
 375             OSTs. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 376               linkend="dbdoclet.50438271_55057"/>.</para>
 377         </listitem>
 378         <listitem>
 379           <para><literal>rpc_stats</literal> – Histogram data showing information about RPCs made by
 380             clients. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 381               linkend="MonitoringClientRCPStream"/>.</para>
 382         </listitem>
 383       </itemizedlist>
 384     </note>
 385     <section remap="h3" xml:id="MonitoringClientRCPStream">
 386       <title><indexterm>
 387           <primary>proc</primary>
 388           <secondary>watching RPC</secondary>
 389         </indexterm>Monitoring the Client RPC Stream</title>
 390       <para>The <literal>rpc_stats</literal> file contains histogram data showing information about
 391         remote procedure calls (RPCs) that have been made since this file was last cleared. The
 392         histogram data can be cleared by writing any value into the <literal>rpc_stats</literal>
 393         file.</para>
 394       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 395       <screen># lctl get_param osc.testfs-OST0000-osc-ffff810058d2f800.rpc_stats
 396 snapshot_time:            1372786692.389858 (secs.usecs)
 397 read RPCs in flight:      0
 398 write RPCs in flight:     1
 399 dio read RPCs in flight:  0
 400 dio write RPCs in flight: 0
 401 pending write pages:      256
 402 pending read pages:       0
 403
 404                      read                   write
 405 pages per rpc   rpcs   % cum % |       rpcs   % cum %
 406 1:                 0   0   0   |          0   0   0
 407 2:                 0   0   0   |          1   0   0
 408 4:                 0   0   0   |          0   0   0
 409 8:                 0   0   0   |          0   0   0
 410 16:                0   0   0   |          0   0   0
 411 32:                0   0   0   |          2   0   0
 412 64:                0   0   0   |          2   0   0
 413 128:               0   0   0   |          5   0   0
 414 256:             850 100 100   |      18346  99 100
 415
 416                      read                   write
 417 rpcs in flight  rpcs   % cum % |       rpcs   % cum %
 418 0:               691  81  81   |       1740   9   9
 419 1:                48   5  86   |        938   5  14
 420 2:                29   3  90   |       1059   5  20
 421 3:                17   2  92   |       1052   5  26
 422 4:                13   1  93   |        920   5  31
 423 5:                12   1  95   |        425   2  33
 424 6:                10   1  96   |        389   2  35
 425 7:                30   3 100   |      11373  61  97
 426 8:                 0   0 100   |        460   2 100
 427
 428                      read                   write
 429 offset          rpcs   % cum % |       rpcs   % cum %
 430 0:               850 100 100   |      18347  99  99
 431 1:                 0   0 100   |          0   0  99
 432 2:                 0   0 100   |          0   0  99
 433 4:                 0   0 100   |          0   0  99
 434 8:                 0   0 100   |          0   0  99
 435 16:                0   0 100   |          1   0  99
 436 32:                0   0 100   |          1   0  99
 437 64:                0   0 100   |          3   0  99
 438 128:               0   0 100   |          4   0 100
 439
 440 </screen>
 441       <para>The header information includes:</para>
 442       <itemizedlist>
 443         <listitem>
 444           <para><literal>snapshot_time</literal> - UNIX epoch time (in seconds and microseconds) at which the file was read.</para>
 445         </listitem>
 446         <listitem>
 447           <para><literal>read RPCs in flight</literal> - Number of read RPCs issued by the OSC, but
 448             not complete at the time of the snapshot. This value should always be less than or equal
 449             to <literal>max_rpcs_in_flight</literal>.</para>
 450         </listitem>
 451         <listitem>
 452           <para><literal>write RPCs in flight</literal> - Number of write RPCs issued by the OSC,
 453             but not complete at the time of the snapshot. This value should always be less than or
 454             equal to <literal>max_rpcs_in_flight</literal>.</para>
 455         </listitem>
 456         <listitem>
 457           <para><literal>dio read RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
 458             read RPCs issued but not completed at the time of the snapshot.</para>
 459         </listitem>
 460         <listitem>
 461           <para><literal>dio write RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
 462             write RPCs issued but not completed at the time of the snapshot.</para>
 463         </listitem>
 464         <listitem>
 465           <para><literal>pending write pages</literal>  - Number of pending write pages that have
 466             been queued for I/O in the OSC.</para>
 467         </listitem>
 468         <listitem>
 469           <para><literal>pending read pages</literal> - Number of pending read pages that have been
 470             queued for I/O in the OSC.</para>
 471         </listitem>
 472       </itemizedlist>
 473       <para>The tabular data is described in the table below. Each row in the table shows the number
 474         of read or write RPCs (<literal>rpcs</literal>) counted for the statistic, the relative
 475         percentage (<literal>%</literal>) of total reads or writes, and the cumulative percentage
 476           (<literal>cum %</literal>) to that point in the table for the statistic.</para>
 477       <informaltable frame="all">
 478         <tgroup cols="2">
 479           <colspec colname="c1" colwidth="40*"/>
 480           <colspec colname="c2" colwidth="60*"/>
 481           <thead>
 482             <row>
 483               <entry>
 484                 <para><emphasis role="bold">Field</emphasis></para>
 485               </entry>
 486               <entry>
 487                 <para><emphasis role="bold">Description</emphasis></para>
 488               </entry>
 489             </row>
 490           </thead>
 491           <tbody>
 492             <row>
 493               <entry>
 494                 <para> pages per RPC</para>
 495               </entry>
 496               <entry>
 497                 <para>Shows cumulative RPC reads and writes organized according to the number of
 498                   pages in the RPC. A single page RPC increments the <literal>1:</literal>
 499                   row.</para>
 500               </entry>
 501             </row>
 502             <row>
 503               <entry>
 504                 <para> RPCs in flight</para>
 505               </entry>
 506               <entry>
 507                 <para> Shows the number of RPCs that are pending when an RPC is sent. When the first
 508                   RPC is sent, the <literal>0:</literal> row is incremented. If an RPC is
 509                   sent while another RPC is pending, the <literal>1:</literal> row is incremented
 510                   and so on. </para>
 511               </entry>
 512             </row>
 513             <row>
 514               <entry>
 515                 <para> offset</para>
 516               </entry>
 517               <entry>
 518                 <para> The page index of the first page read from or written to the object by the
 519                   RPC. </para>
 520               </entry>
 521             </row>
 522           </tbody>
 523         </tgroup>
 524       </informaltable>
 525       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
 526       <para>This table provides a way to visualize the concurrency of the RPC stream. Ideally, you
 527         will see a large clump around the <literal>max_rpcs_in_flight</literal> value, which shows
 528         that the network is being kept busy.</para>
 529       <para>For information about optimizing the client I/O RPC stream, see <xref
 530           xmlns:xlink="http://www.w3.org/1999/xlink" linkend="TuningClientIORPCStream"/>.</para>
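      <para>The relationship between the <literal>rpcs</literal>,
      <literal>%</literal>, and <literal>cum %</literal> columns can be
      illustrated with a short Python sketch. It assumes that percentages are
      truncated to whole numbers. That assumption reproduces the read column
      of the "rpcs in flight" histogram in the example above, but it is
      inferred from the sample output rather than taken from the Lustre
      source. The function name is hypothetical.</para>

```python
def histogram_percentages(counts):
    """Return (percent, cumulative_percent) pairs for a list of RPC counts.

    Assumption: percentages are truncated to whole numbers.
    """
    total = sum(counts)
    rows = []
    running = 0
    for count in counts:
        running += count
        if total == 0:
            rows.append((0, 0))
        else:
            rows.append((count * 100 // total, running * 100 // total))
    return rows


# Read "rpcs" column of the "rpcs in flight" histogram from the example.
read_rpcs_in_flight = [691, 48, 29, 17, 13, 12, 10, 30, 0]
for row, (pct, cum) in enumerate(histogram_percentages(read_rpcs_in_flight)):
    print(f"{row}: {read_rpcs_in_flight[row]:5d} {pct:3d} {cum:3d}")
```

      <para>Recomputing the columns this way is mainly useful when histograms
      from several OSCs are summed, because the percentages in each file
      cannot simply be added together.</para>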
 531     </section>
 532     <section xml:id="lustreproc.clientstats" remap="h3">
 533       <title><indexterm>
 534           <primary>proc</primary>
 535           <secondary>client stats</secondary>
 536         </indexterm>Monitoring Client Activity</title>
 537       <para>The <literal>stats</literal> file maintains statistics accumulated during typical
 538         operation of a client across the VFS interface of the Lustre file system. Only non-zero
 539         parameters are displayed in the file. </para>
 540       <para>Client statistics are enabled by default.</para>
 541       <note>
 542         <para>Statistics for all mounted file systems can be discovered by
 543           entering:<screen>lctl get_param llite.*.stats</screen></para>
 544       </note>
 545       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 546       <screen>client# lctl get_param llite.*.stats
 547 snapshot_time          1308343279.169704 secs.usecs
 548 dirty_pages_hits       14819716 samples [regs]
 549 dirty_pages_misses     81473472 samples [regs]
 550 read_bytes             36502963 samples [bytes] 1 26843582 55488794
 551 write_bytes            22985001 samples [bytes] 0 125912 3379002
 552 brw_read               2279 samples [pages] 1 1 2270
 553 ioctl                  186749 samples [regs]
 554 open                   3304805 samples [regs]
 555 close                  3331323 samples [regs]
 556 seek                   48222475 samples [regs]
 557 fsync                  963 samples [regs]
 558 truncate               9073 samples [regs]
 559 setxattr               19059 samples [regs]
 560 getxattr               61169 samples [regs]
 561 </screen>
 562       <para> The statistics can be cleared by echoing an empty string into the
 563           <literal>stats</literal> file or by using the command:
 564         <screen>lctl set_param llite.*.stats=0</screen></para>
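      <para>The lines in the example output above follow a regular layout: a
      counter name, a sample count, the word <literal>samples</literal>, a
      unit in square brackets and, for byte and page counters, the minimum,
      maximum, and sum. The following Python sketch parses one such line. The
      function name and dictionary keys are illustrative, and the sketch
      assumes the layout shown in the example rather than a documented file
      format.</para>

```python
def parse_stats_line(line):
    """Parse one line of a stats file into a dictionary."""
    tokens = line.split()
    name = tokens[0]
    if name == "snapshot_time":
        # Layout: snapshot_time seconds.microseconds secs.usecs
        return {"name": name, "time": float(tokens[1])}
    # Layout: name count "samples" [unit], optionally followed by min max sum.
    entry = {
        "name": name,
        "samples": int(tokens[1]),
        "unit": tokens[3].strip("[]"),
    }
    extras = [int(token) for token in tokens[4:]]
    if len(extras) == 3:
        entry["min"], entry["max"], entry["sum"] = extras
    return entry


entry = parse_stats_line("read_bytes 36502963 samples [bytes] 1 26843582 55488794")
# Average bytes per read request since the counters were last cleared.
print(entry["sum"] / entry["samples"])
```

      <para>Dividing <literal>sum</literal> by the sample count gives the
      average request size, which is often easier to compare against a
      baseline than the raw totals.</para>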
 565       <para>The statistics displayed are described in the table below.</para>
 566       <informaltable frame="all">
 567         <tgroup cols="2">
 568           <colspec colname="c1" colwidth="3*"/>
 569           <colspec colname="c2" colwidth="7*"/>
 570           <thead>
 571             <row>
 572               <entry>
 573                 <para><emphasis role="bold">Entry</emphasis></para>
 574               </entry>
 575               <entry>
 576                 <para><emphasis role="bold">Description</emphasis></para>
 577               </entry>
 578             </row>
 579           </thead>
 580           <tbody>
 581             <row>
 582               <entry>
 583                 <para>
 584                   <literal>snapshot_time</literal></para>
 585               </entry>
 586               <entry>
 587                 <para>UNIX epoch time at which the stats file was read.</para>
 588               </entry>
 589             </row>
 590             <row>
 591               <entry>
 592                 <para>
 593                   <literal>dirty_pages_hits</literal></para>
 594               </entry>
 595               <entry>
 596                 <para>The number of write operations that have been satisfied by the dirty page
 597                   cache. See <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 598                     linkend="TuningClientIORPCStream"/> for more information about dirty cache
 599                   behavior in a Lustre file system.</para>
 600               </entry>
 601             </row>
 602             <row>
 603               <entry>
 604                 <para>
 605                   <literal>dirty_pages_misses</literal></para>
 606               </entry>
 607               <entry>
 608                 <para>The number of write operations that were not satisfied by the dirty page
 609                   cache.</para>
 610               </entry>
 611             </row>
 612             <row>
 613               <entry>
 614                 <para>
 615                   <literal>read_bytes</literal></para>
 616               </entry>
 617               <entry>
 618                 <para>The number of read operations that have occurred. Three additional parameters
 619                   are displayed:</para>
 620                 <variablelist>
 621                   <varlistentry>
 622                     <term>min</term>
 623                     <listitem>
 624                       <para>The minimum number of bytes read in a single request since the counter
 625                         was reset.</para>
 626                     </listitem>
 627                   </varlistentry>
 628                   <varlistentry>
 629                     <term>max</term>
 630                     <listitem>
 631                       <para>The maximum number of bytes read in a single request since the counter
 632                         was reset.</para>
 633                     </listitem>
 634                   </varlistentry>
 635                   <varlistentry>
 636                     <term>sum</term>
 637                     <listitem>
 638                       <para>The accumulated sum of bytes of all read requests since the counter was
 639                         reset.</para>
 640                     </listitem>
 641                   </varlistentry>
 642                 </variablelist>
 643               </entry>
 644             </row>
 645             <row>
 646               <entry>
 647                 <para>
 648                   <literal>write_bytes</literal></para>
 649               </entry>
 650               <entry>
 651                 <para>The number of write operations that have occurred. Three additional parameters
 652                   are displayed:</para>
 653                 <variablelist>
 654                   <varlistentry>
 655                     <term>min</term>
 656                     <listitem>
 657                       <para>The minimum number of bytes written in a single request since the
 658                         counter was reset.</para>
 659                     </listitem>
 660                   </varlistentry>
 661                   <varlistentry>
 662                     <term>max</term>
 663                     <listitem>
 664                       <para>The maximum number of bytes written in a single request since the
 665                         counter was reset.</para>
 666                     </listitem>
 667                   </varlistentry>
 668                   <varlistentry>
 669                     <term>sum</term>
 670                     <listitem>
 671                       <para>The accumulated sum of bytes of all write requests since the counter was
 672                         reset.</para>
 673                     </listitem>
 674                   </varlistentry>
 675                 </variablelist>
 676               </entry>
 677             </row>
 678             <row>
 679               <entry>
 680                 <para>
 681                   <literal>brw_read</literal></para>
 682               </entry>
 683               <entry>
 684                 <para>The number of pages that have been read. Three additional parameters are
 685                   displayed:</para>
 686                 <variablelist>
 687                   <varlistentry>
 688                     <term>min</term>
 689                     <listitem>
 690                       <para>The minimum number of bytes read in a single block read/write
 691                           (<literal>brw</literal>) read request since the counter was reset.</para>
 692                     </listitem>
 693                   </varlistentry>
 694                   <varlistentry>
 695                     <term>max</term>
 696                     <listitem>
 697                       <para>The maximum number of bytes read in a single <literal>brw</literal> read
 698                         request since the counter was reset.</para>
 699                     </listitem>
 700                   </varlistentry>
 701                   <varlistentry>
 702                     <term>sum</term>
 703                     <listitem>
 704                       <para>The accumulated sum of bytes of all <literal>brw</literal> read requests
 705                         since the counter was reset.</para>
 706                     </listitem>
 707                   </varlistentry>
 708                 </variablelist>
 709               </entry>
 710             </row>
 711             <row>
 712               <entry>
 713                 <para>
 714                   <literal>ioctl</literal></para>
 715               </entry>
 716               <entry>
 717                 <para>The number of combined file and directory <literal>ioctl</literal>
 718                   operations.</para>
 719               </entry>
 720             </row>
 721             <row>
 722               <entry>
 723                 <para>
 724                   <literal>open</literal></para>
 725               </entry>
 726               <entry>
 727                 <para>The number of open operations that have succeeded.</para>
 728               </entry>
 729             </row>
 730             <row>
 731               <entry>
 732                 <para>
 733                   <literal>close</literal></para>
 734               </entry>
 735               <entry>
 736                 <para>The number of close operations that have succeeded.</para>
 737               </entry>
 738             </row>
 739             <row>
 740               <entry>
 741                 <para>
 742                   <literal>seek</literal></para>
 743               </entry>
 744               <entry>
 745                 <para>The number of times <literal>seek</literal> has been called.</para>
 746               </entry>
 747             </row>
 748             <row>
 749               <entry>
 750                 <para>
 751                   <literal>fsync</literal></para>
 752               </entry>
 753               <entry>
 754                 <para>The number of times <literal>fsync</literal> has been called.</para>
 755               </entry>
 756             </row>
 757             <row>
 758               <entry>
 759                 <para>
 760                   <literal>truncate</literal></para>
 761               </entry>
 762               <entry>
 763                 <para>The total number of calls to both locked and lockless
 764                     <literal>truncate</literal>.</para>
 765               </entry>
 766             </row>
 767             <row>
 768               <entry>
 769                 <para>
 770                   <literal>setxattr</literal></para>
 771               </entry>
 772               <entry>
 773                 <para>The number of times extended attributes have been set. </para>
 774               </entry>
 775             </row>
 776             <row>
 777               <entry>
 778                 <para>
 779                   <literal>getxattr</literal></para>
 780               </entry>
 781               <entry>
 782                 <para>The number of times value(s) of extended attributes have been fetched.</para>
 783               </entry>
 784             </row>
 785           </tbody>
 786         </tgroup>
 787       </informaltable>
 788       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
 789       <para>This data provides information about the amount and type of I/O activity taking
 790         place on the client.</para>
 791     </section>
 792     <section remap="h3">
 793       <title><indexterm>
 794           <primary>proc</primary>
 795           <secondary>read/write survey</secondary>
 796         </indexterm>Monitoring Client Read-Write Offset Statistics</title>
 797       <para>When the <literal>offset_stats</literal> parameter is set, statistics are maintained for
 798         occurrences of a series of read or write calls from a process that did not access the next
 799         sequential location. The <literal>OFFSET</literal> field is reset to 0 (zero) whenever a
 800         different file is read or written.</para>
 801       <note>
 802         <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
 803             <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
 804           to reduce monitoring overhead when this information is not needed.  The collection of
 805           statistics in all three of these files is activated by writing
 806           anything, except for 0 (zero) and "disable", into any one of the
 807           files.</para>
 808       </note>
 809       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 810       <screen># lctl get_param llite.testfs-f57dee0.offset_stats
 811 snapshot_time: 1155748884.591028 (secs.usecs)
 812              RANGE   RANGE    SMALLEST   LARGEST
 813 R/W   PID    START   END      EXTENT     EXTENT    OFFSET
 814 R     8385   0       128      128        128       0
 815 R     8385   0       224      224        224       -128
 816 W     8385   0       250      50         100       0
 817 W     8385   100     1110     10         500       -150
 818 W     8384   0       5233     5233       5233      0
 819 R     8385   500     600      100        100       -610</screen>
 820       <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file was
 821         read. The tabular data is described in the table below.</para>
 822       <para>The <literal>offset_stats</literal> file can be cleared by
 823         entering:<screen>lctl set_param llite.*.offset_stats=0</screen></para>
 824       <informaltable frame="all">
 825         <tgroup cols="2">
 826           <colspec colname="c1" colwidth="50*"/>
 827           <colspec colname="c2" colwidth="50*"/>
 828           <thead>
 829             <row>
 830               <entry>
 831                 <para><emphasis role="bold">Field</emphasis></para>
 832               </entry>
 833               <entry>
 834                 <para><emphasis role="bold">Description</emphasis></para>
 835               </entry>
 836             </row>
 837           </thead>
 838           <tbody>
 839             <row>
 840               <entry>
 841                 <para>R/W</para>
 842               </entry>
 843               <entry>
 844                 <para>Indicates if the non-sequential call was a read or write.</para>
 845               </entry>
 846             </row>
 847             <row>
 848               <entry>
 849                 <para>PID </para>
 850               </entry>
 851               <entry>
 852                 <para>Process ID of the process that made the read/write call.</para>
 853               </entry>
 854             </row>
 855             <row>
 856               <entry>
 857                 <para>RANGE START/RANGE END</para>
 858               </entry>
 859               <entry>
 860                 <para>Range in which the read/write calls were sequential.</para>
 861               </entry>
 862             </row>
 863             <row>
 864               <entry>
 865                 <para>SMALLEST EXTENT </para>
 866               </entry>
 867               <entry>
 868                 <para>Smallest single read/write in the corresponding range (in bytes).</para>
 869               </entry>
 870             </row>
 871             <row>
 872               <entry>
 873                 <para>LARGEST EXTENT </para>
 874               </entry>
 875               <entry>
 876                 <para>Largest single read/write in the corresponding range (in bytes).</para>
 877               </entry>
 878             </row>
 879             <row>
 880               <entry>
 881                 <para>OFFSET </para>
 882               </entry>
 883               <entry>
 884                 <para>Difference between the previous range end and the current range start.</para>
 885               </entry>
 886             </row>
 887           </tbody>
 888         </tgroup>
 889       </informaltable>
 890       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
 891       <para>This data provides an indication of how contiguous or fragmented the data is. For
 892         example, the fourth entry in the example above shows that the writes for this RPC were
 893         sequential in the range 100 to 1110, with a minimum write of 10 bytes and a maximum write
 894         of 500 bytes. The range started with an offset of -150 from the <literal>RANGE END</literal>
 895         of the previous entry in the example.</para>
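           <para>As an illustration of how the fields described above can be read by a script, the
             following Python sketch parses the rows of the example output and lists the entries
             whose <literal>OFFSET</literal> is non-zero, that is, the ranges that did not start
             where the previous range ended. The names <literal>OffsetRecord</literal> and
             <literal>parse_offset_stats</literal> are hypothetical helpers and not part of any
             Lustre tool, and the sample text is copied from the example above rather than read
             from a live client.</para>
           <programlisting language="python"><![CDATA[
```python
from typing import NamedTuple

# Sample text copied from the example above; on a live client the text
# would come from `lctl get_param llite.*.offset_stats` instead.
SAMPLE = """\
snapshot_time: 1155748884.591028 (secs.usecs)
             RANGE   RANGE    SMALLEST   LARGEST
R/W   PID    START   END      EXTENT     EXTENT    OFFSET
R     8385   0       128      128        128       0
R     8385   0       224      224        224       -128
W     8385   0       250      50         100       0
W     8385   100     1110     10         500       -150
W     8384   0       5233     5233       5233      0
R     8385   500     600      100        100       -610
"""


class OffsetRecord(NamedTuple):
    """One data row of offset_stats (hypothetical helper type)."""
    rw: str
    pid: int
    range_start: int
    range_end: int
    smallest_extent: int
    largest_extent: int
    offset: int


def parse_offset_stats(text):
    """Return one OffsetRecord per data row, skipping the header lines."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly seven fields and start with R or W.
        if len(fields) != 7 or fields[0] not in ("R", "W"):
            continue
        records.append(OffsetRecord(fields[0], *map(int, fields[1:])))
    return records


records = parse_offset_stats(SAMPLE)
# A non-zero OFFSET marks a jump away from the end of the previous range.
jumps = [r for r in records if r.offset != 0]
for r in jumps:
    print(f"{r.rw} pid={r.pid} range={r.range_start}-{r.range_end} offset={r.offset}")
```
]]></programlisting>
           <para>For the example data, three of the six entries are reported as jumps, including
             the fourth entry discussed above.</para>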
 896     </section>
 897     <section remap="h3">
 898       <title><indexterm>
 899           <primary>proc</primary>
 900           <secondary>read/write survey</secondary>
 901         </indexterm>Monitoring Client Read-Write Extent Statistics</title>
 902       <para>For in-depth troubleshooting, client read-write extent statistics can be accessed to
 903         obtain more detail about read/write I/O extents for the file system or for a particular
 904         process.</para>
 905       <note>
 906         <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
 907             <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
 908           to reduce monitoring overhead when this information is not needed.  The collection of
 909           statistics in all three of these files is activated by writing
 910           anything, except for 0 (zero) and "disable", into any one of the
 911           files.</para>
 912       </note>
 913       <section remap="h3">
 914         <title>Client-Based I/O Extent Size Survey</title>
 915         <para>The <literal>extents_stats</literal> histogram in the
 916           <literal>llite</literal> directory shows the statistics for the sizes
 917           of the read/write I/O extents. This file does not maintain
 918           per-process statistics.</para>
 919         <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 920         <screen># lctl get_param llite.testfs-*.extents_stats
 921 snapshot_time:                     1213828728.348516 (secs.usecs)
 922                        read           |            write
 923 extents          calls  %      cum%   |     calls  %     cum%
 924
 925 0K - 4K :        0      0      0      |     2      2     2
 926 4K - 8K :        0      0      0      |     0      0     2
 927 8K - 16K :       0      0      0      |     0      0     2
 928 16K - 32K :      0      0      0      |     20     23    26
 929 32K - 64K :      0      0      0      |     0      0     26
 930 64K - 128K :     0      0      0      |     51     60    86
 931 128K - 256K :    0      0      0      |     0      0     86
 932 256K - 512K :    0      0      0      |     0      0     86
 933 512K - 1024K :   0      0      0      |     0      0     86
 934 1M - 2M :        0      0      0      |     11     13    100</screen>
 935         <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file
 936           was read. The table shows cumulative extents organized according to size with statistics
 937           provided separately for reads and writes. Each row in the table shows the number of RPCs
 938           for reads and writes respectively (<literal>calls</literal>), the relative percentage of
 939           total calls (<literal>%</literal>), and the cumulative percentage to
 940           that point in the table of calls (<literal>cum %</literal>). </para>
 941         <para> The file can be cleared by issuing the following command:
 942         <screen># lctl set_param llite.testfs-*.extents_stats=1</screen></para>
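             <para>The relationship between the <literal>calls</literal>, <literal>%</literal>,
               and cumulative percentage columns can be sketched in a few lines of Python. The
               sketch below recomputes both percentage columns from the write-side call counts in
               the example above. It assumes that percentages are truncated to whole numbers, which
               reproduces the values shown in the example but is not confirmed to be how the kernel
               computes them. The function name <literal>histogram_percentages</literal> is
               hypothetical.</para>
             <programlisting language="python"><![CDATA[
```python
# Write-side call counts per extent-size bucket, taken from the example above.
WRITE_CALLS = [
    ("0K - 4K", 2), ("4K - 8K", 0), ("8K - 16K", 0), ("16K - 32K", 20),
    ("32K - 64K", 0), ("64K - 128K", 51), ("128K - 256K", 0),
    ("256K - 512K", 0), ("512K - 1024K", 0), ("1M - 2M", 11),
]


def histogram_percentages(buckets):
    """Return (label, calls, pct, cum_pct) tuples.

    Assumption: percentages are truncated to integers, which matches the
    numbers printed in the example output.
    """
    total = sum(calls for _, calls in buckets)
    rows, running = [], 0
    for label, calls in buckets:
        running += calls
        pct = calls * 100 // total if total else 0
        cum_pct = running * 100 // total if total else 0
        rows.append((label, calls, pct, cum_pct))
    return rows


rows = histogram_percentages(WRITE_CALLS)
for label, calls, pct, cum_pct in rows:
    print(f"{label:<14} {calls:>6} {pct:>4} {cum_pct:>5}")
```
]]></programlisting>
             <para>Because each percentage is truncated independently, the <literal>%</literal>
               column does not necessarily add up to 100, while the cumulative column still reaches
               100 in the last populated row.</para>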
 943       </section>
 944       <section>
 945         <title>Per-Process Client I/O Statistics</title>
 946         <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size
 947           statistics on a per-process basis.</para>
 948         <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 949         <screen># lctl get_param llite.testfs-*.extents_stats_per_process
 950 snapshot_time:                     1213828762.204440 (secs.usecs)
 951                           read            |             write
 952 extents            calls   %      cum%    |      calls   %       cum%
 953
 954 PID: 11488
 955    0K - 4K :       0       0       0      |      0       0       0
 956    4K - 8K :       0       0       0      |      0       0       0
 957    8K - 16K :      0       0       0      |      0       0       0
 958    16K - 32K :     0       0       0      |      0       0       0
 959    32K - 64K :     0       0       0      |      0       0       0
 960    64K - 128K :    0       0       0      |      0       0       0
 961    128K - 256K :   0       0       0      |      0       0       0
 962    256K - 512K :   0       0       0      |      0       0       0
 963    512K - 1024K :  0       0       0      |      0       0       0
 964    1M - 2M :       0       0       0      |      10      100     100
 965
 966 PID: 11491
 967    0K - 4K :       0       0       0      |      0       0       0
 968    4K - 8K :       0       0       0      |      0       0       0
 969    8K - 16K :      0       0       0      |      0       0       0
 970    16K - 32K :     0       0       0      |      20      100     100
 971
 972 PID: 11424
 973    0K - 4K :       0       0       0      |      0       0       0
 974    4K - 8K :       0       0       0      |      0       0       0
 975    8K - 16K :      0       0       0      |      0       0       0
 976    16K - 32K :     0       0       0      |      0       0       0
 977    32K - 64K :     0       0       0      |      0       0       0
 978    64K - 128K :    0       0       0      |      16      100     100
 979
 980 PID: 11426
 981    0K - 4K :       0       0       0      |      1       100     100
 982
 983 PID: 11429
 984    0K - 4K :       0       0       0      |      1       100     100
 985
 986 </screen>
 987         <para>This table shows cumulative extents organized according to size for each process ID
 988           (PID) with statistics provided separately for reads and writes. Each row in the table
 989           shows the number of RPCs for reads and writes respectively (<literal>calls</literal>), the
 990           relative percentage of total calls (<literal>%</literal>), and the cumulative percentage
 991           to that point in the table of calls (<literal>cum %</literal>). </para>
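             <para>When many processes are active, it can be convenient to reduce this output to
               the buckets in which each process actually issued I/O. The following Python sketch
               shows one way this might be done. It parses an excerpt of the example above (three of
               the five PIDs); the regular expression and the function name
               <literal>calls_per_pid</literal> are assumptions based on the layout shown in the
               example, not a documented format.</para>
             <programlisting language="python"><![CDATA[
```python
import re
from collections import defaultdict

# Excerpt of the example output above (three of the five PIDs).
SAMPLE = """\
PID: 11491
   0K - 4K :       0       0       0      |      0       0       0
   4K - 8K :       0       0       0      |      0       0       0
   8K - 16K :      0       0       0      |      0       0       0
   16K - 32K :     0       0       0      |      20      100     100

PID: 11424
   0K - 4K :       0       0       0      |      0       0       0
   4K - 8K :       0       0       0      |      0       0       0
   8K - 16K :      0       0       0      |      0       0       0
   16K - 32K :     0       0       0      |      0       0       0
   32K - 64K :     0       0       0      |      0       0       0
   64K - 128K :    0       0       0      |      16      100     100

PID: 11426
   0K - 4K :       0       0       0      |      1       100     100
"""

# "<range> : <read calls> <%> <cum%> | <write calls> <%> <cum%>"
ROW = re.compile(
    r"^\s*(\S+ - \S+)\s*:\s*(\d+)\s+\d+\s+\d+\s*\|\s*(\d+)\s+\d+\s+\d+\s*$"
)


def calls_per_pid(text):
    """Map each PID to {extent range: (read calls, write calls)}."""
    result = defaultdict(dict)
    pid = None
    for line in text.splitlines():
        if line.startswith("PID:"):
            pid = int(line.split(":", 1)[1])
            continue
        match = ROW.match(line)
        if match and pid is not None:
            extent, reads, writes = match.groups()
            result[pid][extent] = (int(reads), int(writes))
    return dict(result)


stats = calls_per_pid(SAMPLE)
for pid, buckets in stats.items():
    # Report only the buckets in which the process actually issued I/O.
    busy = {k: v for k, v in buckets.items() if v != (0, 0)}
    print(pid, busy)
```
]]></programlisting>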
 992       </section>
 993     </section>
 994     <section xml:id="dbdoclet.50438271_55057">
 995       <title><indexterm>
 996           <primary>proc</primary>
 997           <secondary>block I/O</secondary>
 998         </indexterm>Monitoring the OST Block I/O Stream</title>
 999       <para>The <literal>brw_stats</literal> file in the <literal>obdfilter</literal> directory
1000         contains histogram data showing statistics for the number of I/O requests sent to the
1001         disk, their size, and whether they are contiguous on the disk or not.</para>
1002       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
1003       <para>Enter on the OSS:</para>
1004       <screen># lctl get_param obdfilter.testfs-OST0000.brw_stats
1005 snapshot_time:         1372775039.769045 (secs.usecs)
1006                            read      |      write
1007 pages per bulk r/w     rpcs  % cum % |  rpcs   % cum %
1008 1:                     108 100 100   |    39   0   0
1009 2:                       0   0 100   |     6   0   0
1010 4:                       0   0 100   |     1   0   0
1011 8:                       0   0 100   |     0   0   0
1012 16:                      0   0 100   |     4   0   0
1013 32:                      0   0 100   |    17   0   0
1014 64:                      0   0 100   |    12   0   0
1015 128:                     0   0 100   |    24   0   0
1016 256:                     0   0 100   | 23142  99 100
1017
1018                            read      |      write
1019 discontiguous pages    rpcs  % cum % |  rpcs   % cum %
1020 0:                     108 100 100   | 23245 100 100
1021
1022                            read      |      write
1023 discontiguous blocks   rpcs  % cum % |  rpcs   % cum %
1024 0:                     108 100 100   | 23243  99  99
1025 1:                       0   0 100   |     2   0 100
1026
1027                            read      |      write
1028 disk fragmented I/Os   ios   % cum % |   ios   % cum %
1029 0:                      94  87  87   |     0   0   0
1030 1:                      14  12 100   | 23243  99  99
1031 2:                       0   0 100   |     2   0 100
1032
1033                            read      |      write
1034 disk I/Os in flight    ios   % cum % |   ios   % cum %
1035 1:                      14 100 100   | 20896  89  89
1036 2:                       0   0 100   |  1071   4  94
1037 3:                       0   0 100   |   573   2  96
1038 4:                       0   0 100   |   300   1  98
1039 5:                       0   0 100   |   166   0  98
1040 6:                       0   0 100   |   108   0  99
1041 7:                       0   0 100   |    81   0  99
1042 8:                       0   0 100   |    47   0  99
1043 9:                       0   0 100   |     5   0 100
1044
1045                            read      |      write
1046 I/O time (1/1000s)     ios   % cum % |   ios   % cum %
1047 1:                      94  87  87   |     0   0   0
1048 2:                       0   0  87   |     7   0   0
1049 4:                      14  12 100   |    27   0   0
1050 8:                       0   0 100   |    14   0   0
1051 16:                      0   0 100   |    31   0   0
1052 32:                      0   0 100   |    38   0   0
1053 64:                      0   0 100   | 18979  81  82
1054 128:                     0   0 100   |   943   4  86
1055 256:                     0   0 100   |  1233   5  91
1056 512:                     0   0 100   |  1825   7  99
1057 1K:                      0   0 100   |    99   0  99
1058 2K:                      0   0 100   |     0   0  99
1059 4K:                      0   0 100   |     0   0  99
1060 8K:                      0   0 100   |    49   0 100
1061
1062                            read      |      write
1063 disk I/O size          ios   % cum % |   ios   % cum %
1064 4K:                     14 100 100   |    41   0   0
1065 8K:                      0   0 100   |     6   0   0
1066 16K:                     0   0 100   |     1   0   0
1067 32K:                     0   0 100   |     0   0   0
1068 64K:                     0   0 100   |     4   0   0
1069 128K:                    0   0 100   |    17   0   0
1070 256K:                    0   0 100   |    12   0   0
1071 512K:                    0   0 100   |    24   0   0
1072 1M:                      0   0 100   | 23142  99 100
1073 </screen>
1074       <para>The tabular data is described in the table below. Each row in the table shows the number
1075         of reads and writes occurring for the statistic (<literal>ios</literal>), the relative
1076         percentage of total reads or writes (<literal>%</literal>), and the cumulative percentage to
1077         that point in the table for the statistic (<literal>cum %</literal>). </para>
1078       <informaltable frame="all">
1079         <tgroup cols="2">
1080           <colspec colname="c1" colwidth="40*"/>
1081           <colspec colname="c2" colwidth="60*"/>
1082           <thead>
1083             <row>
1084               <entry>
1085                 <para><emphasis role="bold">Field</emphasis></para>
1086               </entry>
1087               <entry>
1088                 <para><emphasis role="bold">Description</emphasis></para>
1089               </entry>
1090             </row>
1091           </thead>
1092           <tbody>
1093             <row>
1094               <entry>
1095                 <para>
1096                   <literal>pages per bulk r/w</literal></para>
1097               </entry>
1098               <entry>
1099                 <para>Number of pages per RPC request, which should match aggregate client
1100                     <literal>rpc_stats</literal> (see <xref
1101                     xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"
1102                   />).</para>
1103               </entry>
1104             </row>
1105             <row>
1106               <entry>
1107                 <para>
1108                   <literal>discontiguous pages</literal></para>
1109               </entry>
1110               <entry>
1111                 <para>Number of discontinuities in the logical file offset of each page in a single
1112                   RPC.</para>
1113               </entry>
1114             </row>
1115             <row>
1116               <entry>
1117                 <para>
1118                   <literal>discontiguous blocks</literal></para>
1119               </entry>
1120               <entry>
1121                 <para>Number of discontinuities in the physical block allocation in the file system
1122                   for a single RPC.</para>
1123               </entry>
1124             </row>
1125             <row>
1126               <entry>
1127                 <para><literal>disk fragmented I/Os</literal></para>
1128               </entry>
1129               <entry>
1130                 <para>Number of I/Os that were not written entirely sequentially.</para>
1131               </entry>
1132             </row>
1133             <row>
1134               <entry>
1135                 <para><literal>disk I/Os in flight</literal></para>
1136               </entry>
1137               <entry>
1138                 <para>Number of disk I/Os currently pending.</para>
1139               </entry>
1140             </row>
1141             <row>
1142               <entry>
1143                 <para><literal>I/O time (1/1000s)</literal></para>
1144               </entry>
1145               <entry>
1146                 <para>Amount of time for each I/O operation to complete.</para>
1147               </entry>
1148             </row>
1149             <row>
1150               <entry>
1151                 <para><literal>disk I/O size</literal></para>
1152               </entry>
1153               <entry>
1154                 <para>Size of each I/O operation.</para>
1155               </entry>
1156             </row>
1157           </tbody>
1158         </tgroup>
1159       </informaltable>
1160       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
1161       <para>This data provides an indication of extent size and distribution in the file
1162         system.</para>
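           <para>One common use of this data is to check how many RPCs arrive at the OST at full
             size. The following Python sketch parses an excerpt of the example above (two of the
             seven histograms) and reports the share of write RPCs that carried 256 pages, which
             corresponds to 1 MiB when 4 KiB pages are assumed. The function name
             <literal>parse_brw_stats</literal> is hypothetical, and the parsing rules are inferred
             from the layout of the example rather than from a documented format.</para>
           <programlisting language="python"><![CDATA[
```python
# Excerpt of the brw_stats example above (two of the seven histograms).
SAMPLE = """\
                           read      |      write
pages per bulk r/w     rpcs  % cum % |  rpcs   % cum %
1:                     108 100 100   |    39   0   0
2:                       0   0 100   |     6   0   0
4:                       0   0 100   |     1   0   0
8:                       0   0 100   |     0   0   0
16:                      0   0 100   |     4   0   0
32:                      0   0 100   |    17   0   0
64:                      0   0 100   |    12   0   0
128:                     0   0 100   |    24   0   0
256:                     0   0 100   | 23142  99 100

                           read      |      write
discontiguous blocks   rpcs  % cum % |  rpcs   % cum %
0:                     108 100 100   | 23243  99  99
1:                       0   0 100   |     2   0 100
"""


def parse_brw_stats(text):
    """Return {histogram name: {bucket: (read count, write count)}}."""
    histograms = {}
    current = None
    for line in text.splitlines():
        left, sep, right = line.partition("|")
        if not sep:
            continue  # blank lines and snapshot_time carry no histogram data
        head = left.split()
        if head and head[0].endswith(":"):
            # Bucket row: "<bucket>: <count> <%> <cum %> | <count> <%> <cum %>"
            if current is not None:
                bucket = head[0].rstrip(":")
                current[bucket] = (int(head[1]), int(right.split()[0]))
        elif "cum" in left:
            # Title row: the name is everything before "<rpcs|ios> % cum %".
            name = " ".join(head[:-4])
            current = histograms.setdefault(name, {})
    return histograms


stats = parse_brw_stats(SAMPLE)
pages = stats["pages per bulk r/w"]
total_writes = sum(writes for _, writes in pages.values())
full_writes = pages["256"][1]
print(f"{full_writes} of {total_writes} write RPCs carried 256 pages "
      f"({full_writes * 100 // total_writes}%)")
```
]]></programlisting>
           <para>A low share of full-size RPCs in this histogram may indicate that clients are not
             aggregating enough data per RPC, which is the subject of the tuning parameters
             described in <xref linkend="TuningClientIORPCStream"/>.</para>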
1163     </section>
1164   </section>
1165   <section>
1166     <title>Tuning Lustre File System I/O</title>
1167     <para>Each OSC has its own tree of tunables. For example:</para>
1168     <screen>$ lctl list_param osc.*.*
1169 osc.myth-OST0000-osc-ffff8804296c2800.active
1170 osc.myth-OST0000-osc-ffff8804296c2800.blocksize
1171 osc.myth-OST0000-osc-ffff8804296c2800.checksum_dump
1172 osc.myth-OST0000-osc-ffff8804296c2800.checksum_type
1173 osc.myth-OST0000-osc-ffff8804296c2800.checksums
1174 osc.myth-OST0000-osc-ffff8804296c2800.connect_flags
1175 :
1176 :
1177 osc.myth-OST0000-osc-ffff8804296c2800.state
1178 osc.myth-OST0000-osc-ffff8804296c2800.stats
1179 osc.myth-OST0000-osc-ffff8804296c2800.timeouts
1180 osc.myth-OST0000-osc-ffff8804296c2800.unstable_stats
1181 osc.myth-OST0000-osc-ffff8804296c2800.uuid
1182 osc.myth-OST0001-osc-ffff8804296c2800.active
1183 osc.myth-OST0001-osc-ffff8804296c2800.blocksize
1184 osc.myth-OST0001-osc-ffff8804296c2800.checksum_dump
1185 osc.myth-OST0001-osc-ffff8804296c2800.checksum_type
1186 :
1187 :
1188 </screen>
1189     <para>The following sections describe some of the parameters that can
1190       be tuned in a Lustre file system.</para>
1191     <section remap="h3" xml:id="TuningClientIORPCStream">
1192       <title><indexterm>
1193           <primary>proc</primary>
1194           <secondary>RPC tunables</secondary>
1195         </indexterm>Tuning the Client I/O RPC Stream</title>
1196       <para>Ideally, an optimal amount of data is packed into each I/O RPC
1197         and a consistent number of issued RPCs are in progress at any time.
1198         To help optimize the client I/O RPC stream, several tuning variables
1199         are provided to adjust behavior according to network conditions and
1200         cluster size. For information about monitoring the client I/O RPC
1201         stream, see <xref
1202           xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"/>.</para>
1203       <para>RPC stream tunables include:</para>
1204       <para>
1205         <itemizedlist>
1206           <listitem>
1207             <para><literal>osc.<replaceable>osc_instance</replaceable>.checksums</literal>
1208               - Controls whether the client will calculate data integrity
1209               checksums for the bulk data transferred to the OST.  Data
1210               integrity checksums are enabled by default.  The algorithm used
1211               can be set using the <literal>checksum_type</literal> parameter.
1212             </para>
1213           </listitem>
1214           <listitem>
1215             <para><literal>osc.<replaceable>osc_instance</replaceable>.checksum_type</literal>
1216               - Controls the data integrity checksum algorithm used by the
1217               client.  The available algorithms are those supported by both the
1218               client and the OST.  The checksum algorithm used by default is determined
1219               by first selecting the fastest algorithms available on the OST,
1220               and then selecting the fastest of those algorithms on the client,
1221               which depends on available optimizations in the CPU hardware and
1222               kernel.  The default algorithm can be overridden by writing the
1223               algorithm name into the <literal>checksum_type</literal>
1224               parameter.  Available checksum types can be seen on the client by
1225               reading the <literal>checksum_type</literal> parameter. Currently
1226               supported checksum types are:
1227               <literal>adler</literal>,
1228               <literal>crc32</literal>,
1229               <literal>crc32c</literal>
1230             </para>
1231             <para condition="l2C">
1232               In Lustre release 2.12 additional checksum types were added to
1233               allow end-to-end checksum integration with T10-PI capable
1234               hardware.  The client will compute the appropriate checksum
1235               type, based on the checksum type used by the storage, for the
1236               RPC checksum, which will be verified by the server and passed
1237               on to the storage.  The T10-PI checksum types are:
1238               <literal>t10ip512</literal>,
1239               <literal>t10ip4K</literal>,
1240               <literal>t10crc512</literal>,
1241               <literal>t10crc4K</literal>
1242             </para>
1243           </listitem>
1244           <listitem>
1245             <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal>
1246               - Controls how many MiB of dirty data can be written into the
1247               client pagecache for writes by <emphasis>each</emphasis> OSC.
1248               When this limit is reached, additional writes block until
1249               previously-cached data is written to the server. This may be
1250               changed by the <literal>lctl set_param</literal> command. Only
1251               values larger than 0 and smaller than the lesser of 2048 MiB or
1252               1/4 of client RAM are valid. Performance can suffer if the
1253               client cannot aggregate enough data per OSC to form a full RPC
1254               (as set by the <literal>max_pages_per_rpc</literal> parameter),
1255               unless the application is doing very large writes itself.
1256             </para>
1257             <para>To maximize performance, the value for
1258               <literal>max_dirty_mb</literal> is recommended to be at least
1259               4 * <literal>max_pages_per_rpc</literal> *
1260               <literal>max_rpcs_in_flight</literal>.
1261             </para>
1262           </listitem>
1263           <listitem>
1264             <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal>
1265               - A read-only value that returns the current number of bytes
1266               written and cached by this OSC.
1267             </para>
1268           </listitem>
1269           <listitem>
1270             <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal>
1271               - The maximum number of pages that will be sent in a single RPC
1272               request to the OST. The minimum value is one page and the maximum
1273               value is 16 MiB (4096 pages on systems with <literal>PAGE_SIZE</literal>
1274               of 4 KiB), with a default value of 4 MiB in one RPC.  The upper
1275               limit may also be constrained by the <literal>ofd.*.brw_size</literal>
1276               setting on the OSS, which applies to all clients connected to that
1277               OST.  It is also possible to specify a units suffix (e.g.
1278               <literal>max_pages_per_rpc=4M</literal>), so the RPC size can be
1279               set independently of the client <literal>PAGE_SIZE</literal>.
1280             </para>
1281           </listitem>
1282           <listitem>
1283             <para><literal>osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight</literal>
1284               - The maximum number of concurrent RPCs in flight from an OSC to
1285               its OST. If the OSC tries to initiate an RPC but finds that it
1286               already has this many RPCs outstanding, it will wait to
1287               issue further RPCs until some complete. The minimum setting is 1
1288               and maximum setting is 256. The default value is 8 RPCs.
1289             </para>
1290             <para>To improve small file I/O performance, increase the
1291               <literal>max_rpcs_in_flight</literal> value.
1292             </para>
1293           </listitem>
1294           <listitem>
1295             <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_cache_mb</literal>
1296               - Maximum amount of inactive data cached by the client.  The
1297               default value is 3/4 of the client RAM.
1298             </para>
1299           </listitem>
1300         </itemizedlist>
1301       </para>
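      <para>As a rough illustration of the <literal>max_dirty_mb</literal>
        sizing rule above, the following shell sketch converts the RPC size
        to MiB and applies the factor of 4. The function name
        <literal>recommended_max_dirty_mb</literal> is hypothetical, and the
        sample inputs (1024 pages per RPC, 4096-byte pages, 8 RPCs in flight)
        are assumptions chosen for illustration, not values read from a real
        client:</para>
      <screen># Hypothetical helper: prints 4 * RPC size (MiB) * max_rpcs_in_flight.
# Arguments: pages per RPC, page size in bytes, RPCs in flight.
recommended_max_dirty_mb() {
    echo $(( 4 * $1 * $2 * $3 / 1048576 ))
}

# Assumed sample values; read the real ones with lctl get_param.
recommended_max_dirty_mb 1024 4096 8</screen>
      <para>With these sample inputs the sketch prints 128. A value like this
        could then be applied with
        <literal>lctl set_param osc.*.max_dirty_mb=128</literal>, subject to
        the limits described for <literal>max_dirty_mb</literal>.</para>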
1302       <note>
1303         <para>The values of <literal><replaceable>osc_instance</replaceable></literal>
1304           and <literal><replaceable>fsname_instance</replaceable></literal>
1305           are unique to each mount point to allow associating osc, mdc, lov,
1306           lmv, and llite parameters with the same mount point.  However, it is
1307           common for scripts to use a wildcard <literal>*</literal> or a
1308           filesystem-specific wildcard
1309           <literal><replaceable>fsname-*</replaceable></literal> to specify
1310           the parameter settings uniformly on all clients. For example:
1311 <screen>
1312 client$ lctl get_param osc.testfs-OST0000*.rpc_stats
1313 osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats=
1314 snapshot_time:         1375743284.337839 (secs.usecs)
1315 read RPCs in flight:  0
1316 write RPCs in flight: 0
1317 </screen></para>
1318       </note>
1319     </section>
1320     <section remap="h3" xml:id="TuningClientReadahead">
1321       <title><indexterm>
1322           <primary>proc</primary>
1323           <secondary>readahead</secondary>
1324         </indexterm>Tuning File Readahead and Directory Statahead</title>
1325       <para>File readahead and directory statahead enable reading of data
1326       into memory before a process requests the data. File readahead prefetches
1327       file content data into memory for <literal>read()</literal> related
1328       calls, while directory statahead fetches file metadata into memory for
1329       <literal>readdir()</literal> and <literal>stat()</literal> related
1330       calls.  When readahead and statahead work well, a process that accesses
1331       data finds that the information it needs is available immediately in
1332       memory on the client when requested without the delay of network I/O.
1333       </para>
1334       <section remap="h4">
1335         <title>Tuning File Readahead</title>
1336         <para>File readahead is triggered when two or more sequential reads
1337           by an application fail to be satisfied by data in the Linux buffer
1338           cache. The size of the initial readahead is determined by the RPC
1339           size and the file stripe size, but will typically be at least 1 MiB.
1340           Additional readaheads grow linearly and increment until the per-file
1341           or per-system readahead cache limit on the client is reached.</para>
1342         <para>Readahead tunables include:</para>
1343         <itemizedlist>
1344           <listitem>
1345             <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_mb</literal>
1346               - Controls the maximum amount of data readahead on a file.
1347               Files are read ahead in RPC-sized chunks (4 MiB, or the size of
1348               the <literal>read()</literal> call, if larger) after the second
1349               sequential read on a file descriptor. Random reads are done at
1350               the size of the <literal>read()</literal> call only (no
1351               readahead). Reads to non-contiguous regions of the file reset
1352               the readahead algorithm, and readahead is not triggered until
1353               sequential reads take place again.
1354             </para>
1355             <para>
1356               This is the global limit for all files and cannot be larger than
1357               1/2 of the client RAM.  To disable readahead, set
1358               <literal>max_read_ahead_mb=0</literal>.
1359             </para>
1360           </listitem>
1361           <listitem>
1362             <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_per_file_mb</literal>
1363               - Controls the maximum number of megabytes (MiB) of data that
1364               should be prefetched by the client when sequential reads are
1365               detected on a file.  This is the per-file readahead limit and
1366               cannot be larger than <literal>max_read_ahead_mb</literal>.
1367             </para>
1368           </listitem>
1369           <listitem>
1370             <para><literal>llite.<replaceable>fsname_instance</replaceable>.max_read_ahead_whole_mb</literal>
1371               - Controls the maximum size of a file in MiB that is read in its
1372               entirety upon access, regardless of the size of the
1373               <literal>read()</literal> call.  This avoids multiple small read
1374               RPCs on relatively small files, when it is not possible to
1375               efficiently detect a sequential read pattern before the whole
1376               file has been read.
1377             </para>
1378             <para>The default value is the greater of 2 MiB or the size of one
1379               RPC, as given by <literal>max_pages_per_rpc</literal>.
1380             </para>
1381           </listitem>
1382         </itemizedlist>
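        <para>A minimal sketch of inspecting and adjusting these tunables with
          <literal>lctl</literal>, using the parameter names listed above. The
          values 256 and 64 are arbitrary illustrations rather than
          recommendations. The global limit is raised first because the
          per-file limit cannot exceed it:</para>
        <screen>client$ lctl get_param llite.*.max_read_ahead_mb llite.*.max_read_ahead_per_file_mb
client$ lctl set_param llite.*.max_read_ahead_mb=256
client$ lctl set_param llite.*.max_read_ahead_per_file_mb=64</screen>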
1383       </section>
1384       <section>
1385         <title>Tuning Directory Statahead and AGL</title>
1386         <para>Many system commands, such as <literal>ls -l</literal>,
1387         <literal>du</literal>, and <literal>find</literal>, traverse a
1388         directory sequentially. To make these commands run efficiently, the
1389         directory statahead can be enabled to improve the performance of
1390         directory traversal.</para>
1391         <para>The statahead tunables are:</para>
1392         <itemizedlist>
1393           <listitem>
1394             <para><literal>statahead_max</literal> -
1395             Controls the maximum number of file attributes that will be
1396             prefetched by the statahead thread. By default, statahead is
1397             enabled and <literal>statahead_max</literal> is 32 files.</para>
1398             <para>To disable statahead, set <literal>statahead_max</literal>
1399             to zero via the following command on the client:</para>
1400             <screen>lctl set_param llite.*.statahead_max=0</screen>
1401             <para>To change the maximum statahead window size on a client:</para>
1402             <screen>lctl set_param llite.*.statahead_max=<replaceable>n</replaceable></screen>
1403             <para>The maximum <literal>statahead_max</literal> is 8192 files.
1404             </para>
1405             <para>The directory statahead thread will also prefetch the file
1406             size/block attributes from the OSTs, so that all file attributes
1407             are available on the client when requested by an application.
1408             This is controlled by the asynchronous glimpse lock (AGL) setting.
1409             The AGL behaviour can be disabled by setting:</para>
1410             <screen>lctl set_param llite.*.statahead_agl=0</screen>
1411           </listitem>
1412           <listitem>
1413             <para><literal>statahead_stats</literal> -
1414             A read-only interface that provides current statahead and AGL
1415             statistics, such as how many times statahead/AGL has been triggered
1416             since the last mount and how many statahead/AGL failures have occurred
1417             due to an incorrect prediction or other causes.</para>
1418             <note>
1419               <para>AGL behaviour is affected by statahead since the inodes
1420               processed by AGL are built by the statahead thread.  If
1421               statahead is disabled, then AGL is also disabled.</para>
1422             </note>
1423           </listitem>
1424         </itemizedlist>
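        <para>One way to judge whether statahead is helping a workload is to
          read the statistics before and after a directory traversal and
          compare them. This is only a sketch: the directory
          <literal>/mnt/testfs/dir</literal> is a hypothetical path, and the
          output is omitted because its fields vary between releases:</para>
        <screen>client$ lctl get_param llite.*.statahead_stats
client$ ls -l /mnt/testfs/dir > /dev/null
client$ lctl get_param llite.*.statahead_stats</screen>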
1425       </section>
1426     </section>
1427     <section remap="h3">
1428       <title><indexterm>
1429           <primary>proc</primary>
1430           <secondary>read cache</secondary>
1431         </indexterm>Tuning OSS Read Cache</title>
1432       <para>The OSS read cache feature provides read-only caching of data on an OSS. This
1433         functionality uses the Linux page cache to store the data and uses as much physical memory
1434         as is allocated.</para>
1435       <para>OSS read cache improves Lustre file system performance in these situations:</para>
1436       <itemizedlist>
1437         <listitem>
1438           <para>Many clients are accessing the same data set (as in HPC applications or when
1439             diskless clients boot from the Lustre file system).</para>
1440         </listitem>
1441         <listitem>
1442           <para>One client is storing data while another client is reading it (i.e., clients are
1443             exchanging data via the OST).</para>
1444         </listitem>
1445         <listitem>
1446           <para>A client has very limited caching of its own.</para>
1447         </listitem>
1448       </itemizedlist>
1449       <para>OSS read cache offers these benefits:</para>
1450       <itemizedlist>
1451         <listitem>
1452           <para>Allows OSTs to cache read data more frequently.</para>
1453         </listitem>
1454         <listitem>
1455           <para>Improves repeated reads to match network speeds instead of disk speeds.</para>
1456         </listitem>
1457         <listitem>
1458           <para>Provides the building blocks for OST write cache (small-write aggregation).</para>
1459         </listitem>
1460       </itemizedlist>
1461       <section remap="h4">
1462         <title>Using OSS Read Cache</title>
1463         <para>OSS read cache is implemented on the OSS, and does not require any special support on
1464           the client side. Since OSS read cache uses the memory available in the Linux page cache,
1465           the appropriate amount of memory for the cache should be determined based on I/O patterns;
1466           if the data is mostly reads, then more cache is required than would be needed for mostly
1467           writes.</para>
1468         <para>OSS read cache is managed using the following tunables:</para>
1469         <itemizedlist>
1470           <listitem>
1471             <para><literal>read_cache_enable</literal> - Controls whether data read from disk during
1472               a read request is kept in memory and available for later read requests for the same
1473               data, without having to re-read it from disk. By default, read cache is enabled
1474                 (<literal>read_cache_enable=1</literal>).</para>
1475             <para>When the OSS receives a read request from a client, it reads data from disk into
1476               its memory and sends the data as a reply to the request. If read cache is enabled,
1477               this data stays in memory after the request from the client has been fulfilled. When
1478               subsequent read requests for the same data are received, the OSS skips reading data
1479               from disk and the request is fulfilled from the cached data. The read cache is managed
1480               by the Linux kernel globally across all OSTs on that OSS so that the least recently
1481               used cache pages are dropped from memory when the amount of free memory is running
1482               low.</para>
1483             <para>If read cache is disabled (<literal>read_cache_enable=0</literal>), the OSS
1484               discards the data after a read request from the client is serviced and, for subsequent
1485               read requests, the OSS again reads the data from disk.</para>
1486             <para>To disable read cache on all the OSTs of an OSS, run:</para>
1487             <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0</screen>
1488             <para>To re-enable read cache on one OST, run:</para>
1489             <screen>root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1</screen>
1490             <para>To check if read cache is enabled on all OSTs on an OSS, run:</para>
1491             <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable</screen>
1492           </listitem>
1493           <listitem>
1494             <para><literal>writethrough_cache_enable</literal> - Controls whether data sent to the
1495               OSS as a write request is kept in the read cache and available for later reads, or if
1496               it is discarded from cache when the write is completed. By default, the writethrough
1497               cache is enabled (<literal>writethrough_cache_enable=1</literal>).</para>
1498             <para>When the OSS receives write requests from a client, it receives data from the
1499               client into its memory and writes the data to disk. If the writethrough cache is
1500               enabled, this data stays in memory after the write request is completed, allowing the
1501               OSS to skip reading this data from disk if a later read request, or partial-page write
1502               request, for the same data is received.</para>
1503             <para>If the writethrough cache is disabled
1504                 (<literal>writethrough_cache_enable=0</literal>), the OSS discards the data after
1505               the write request from the client is completed. For subsequent read requests, or
1506               partial-page write requests, the OSS must re-read the data from disk.</para>
1507             <para>Enabling writethrough cache is advisable if clients are doing small or unaligned
1508               writes that would cause partial-page updates, or if the files written by one node are
1509               immediately being accessed by other nodes. Some examples where enabling writethrough
1510               cache might be useful include producer-consumer I/O models or shared-file writes with
1511               a different node doing I/O not aligned on 4096-byte boundaries. </para>
1512             <para>Disabling the writethrough cache is advisable when files are mostly written to the
1513               file system but are not re-read within a short time period, or files are only written
1514               and re-read by the same node, regardless of whether the I/O is aligned or not.</para>
1515             <para>To disable the writethrough cache on all OSTs of an OSS, run:</para>
1516             <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0</screen>
1517             <para>To re-enable the writethrough cache on one OST, run:</para>
1518             <screen>root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
1519             <para>To check if the writethrough cache is enabled, run:</para>
1520             <screen>root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable</screen>
1521           </listitem>
1522           <listitem>
1523             <para><literal>readcache_max_filesize</literal> - Controls the maximum size of a file
1524               that both the read cache and writethrough cache will try to keep in memory. Files
1525               larger than <literal>readcache_max_filesize</literal> will not be kept in cache for
1526               either reads or writes.</para>
1527             <para>Setting this tunable can be useful for workloads where relatively small files are
1528               repeatedly accessed by many clients, such as job startup files, executables, log
1529               files, etc., but large files are read or written only once. By not putting the larger
1530               files into the cache, it is much more likely that more of the smaller files will
1531               remain in cache for a longer time.</para>
1532             <para>When setting <literal>readcache_max_filesize</literal>, the input value can be
1533               specified in bytes, or can have a suffix to indicate other binary units such as
1534                 <literal>K</literal> (kilobytes), <literal>M</literal> (megabytes),
1535                 <literal>G</literal> (gigabytes), <literal>T</literal> (terabytes), or
1536                 <literal>P</literal> (petabytes).</para>
1537             <para>To limit the maximum cached file size to 32 MiB on all OSTs of an OSS, run:</para>
1538             <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M</screen>
1539             <para>To disable the maximum cached file size on an OST, run:</para>
1540             <screen>root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
1541             <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
1542             <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize</screen>
1543           </listitem>
1544         </itemizedlist>
1545       </section>
1546     </section>
1547     <section>
1548       <title><indexterm>
1549           <primary>proc</primary>
1550           <secondary>OSS journal</secondary>
1551         </indexterm>Enabling OSS Asynchronous Journal Commit</title>
1552       <para>The OSS asynchronous journal commit feature asynchronously writes data to disk without
1553         forcing a journal flush. This reduces the number of seeks and significantly improves
1554         performance on some hardware.</para>
1555       <note>
1556         <para>Asynchronous journal commit cannot work with direct I/O-originated writes
1557             (<literal>O_DIRECT</literal> flag set). In this case, a journal flush is forced. </para>
1558       </note>
1559       <para>When the asynchronous journal commit feature is enabled, client nodes keep data in the
1560         page cache (a page reference). Lustre clients monitor the last committed transaction number
1561           (<literal>transno</literal>) in messages sent from the OSS to the clients. When a client
1562         sees that the last committed <literal>transno</literal> reported by the OSS is at least
1563         equal to the bulk write <literal>transno</literal>, it releases the reference on the
1564         corresponding pages. To avoid page references being held for too long on clients after a
1565         bulk write, a 7-second ping request is scheduled (the default OSS file system commit time
1566         interval is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity
1567         to report the last committed <literal>transno</literal>.</para>
1568       <para>If the OSS crashes before the journal commit occurs, then intermediate data is lost.
1569         However, OSS recovery functionality incorporated into the asynchronous journal commit
1570         feature causes clients to replay their write requests and compensate for the missing disk
1571         updates by restoring the state of the file system.</para>
1572       <para>By default, <literal>sync_journal</literal> is enabled
1573           (<literal>sync_journal=1</literal>), so that journal entries are committed synchronously.
1574         To enable asynchronous journal commit, set the <literal>sync_journal</literal> parameter to
1575           <literal>0</literal> by entering: </para>
1576       <screen>$ lctl set_param obdfilter.*.sync_journal=0
1577 obdfilter.lol-OST0001.sync_journal=0</screen>
1578       <para>An associated <literal>sync-on-lock-cancel</literal> feature (enabled by default)
1579         addresses a data consistency issue that can result if an OSS crashes after multiple clients
1580         have written data into intersecting regions of an object, and then one of the clients also
1581         crashes. A condition is created in which the POSIX requirement for continuous writes is
1582         violated along with a potential for corrupted data. With
1583           <literal>sync-on-lock-cancel</literal> enabled, if a cancelled lock has any volatile
1584         writes attached to it, the OSS synchronously writes the journal to disk on lock
1585         cancellation. Disabling the <literal>sync-on-lock-cancel</literal> feature may enhance
1586         performance for concurrent write workloads, but it is recommended that you not disable this
1587         feature.</para>
1588       <para> The <literal>sync_on_lock_cancel</literal> parameter can be set to the following
1589         values:</para>
1590       <itemizedlist>
1591         <listitem>
1592           <para><literal>always</literal> - Always force a journal flush on lock cancellation
1593             (default when <literal>async_journal</literal> is enabled).</para>
1594         </listitem>
1595         <listitem>
1596           <para><literal>blocking</literal> - Force a journal flush only when the local cancellation
1597             is due to a blocking callback.</para>
1598         </listitem>
1599         <listitem>
1600           <para><literal>never</literal> - Do not force any journal flush (default when
1601               <literal>async_journal</literal> is disabled).</para>
1602         </listitem>
1603       </itemizedlist>
1604       <para>For example, to set <literal>sync_on_lock_cancel</literal> so that it does not force a journal
1605         flush, use a command similar to:</para>
1606       <screen>$ lctl set_param obdfilter.*.sync_on_lock_cancel=never
1607 obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
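      <para>The current values of both journal-related settings can be checked
        in one step, since <literal>lctl get_param</literal> accepts several
        parameter names. This is a minimal sketch; the output is omitted
        because it depends on the OST names on the OSS:</para>
      <screen>$ lctl get_param obdfilter.*.sync_journal obdfilter.*.sync_on_lock_cancel</screen>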
1608     </section>
1609     <section xml:id="dbdoclet.TuningModRPCs" condition='l28'>
1610       <title>
1611         <indexterm>
1612           <primary>proc</primary>
1613           <secondary>client metadata performance</secondary>
1614         </indexterm>
1615         Tuning the Client Metadata RPC Stream
1616       </title>
1617       <para>The client metadata RPC stream represents the metadata RPCs issued
1618         in parallel by a client to an MDT target. The metadata RPCs can be split
1619         into two categories: the requests that do not modify the file system
1620         (such as the getattr operation), and the requests that do modify the file
1621         system (such as create, unlink, and setattr operations). To help optimize the client
1622         metadata RPC stream, several tuning variables are provided to adjust
1623         behavior according to network conditions and cluster size.</para>
1624       <para>Note that increasing the number of metadata RPCs issued in parallel
1625         might improve the performance of metadata-intensive parallel applications,
1626         but as a consequence it will consume more memory on the client and on
1627         the MDS.</para>
1628       <section>
1629         <title>Configuring the Client Metadata RPC Stream</title>
1630         <para>The MDC <literal>max_rpcs_in_flight</literal> parameter defines
1631           the maximum number of metadata RPCs, both modifying and
1632           non-modifying RPCs, that can be sent in parallel by a client to an MDT
1633           target. This includes all file system metadata operations, such as
1634           file or directory stat, creation, and unlink. The default setting is 8,
1635           the minimum setting is 1, and the maximum setting is 256.</para>
1636         <para>To set the <literal>max_rpcs_in_flight</literal> parameter, run
1637           the following command on the Lustre client:</para>
1638         <screen>client$ lctl set_param mdc.*.max_rpcs_in_flight=16</screen>
1639         <para>The MDC <literal>max_mod_rpcs_in_flight</literal> parameter
1640           defines the maximum number of file system modifying RPCs that can be
1641           sent in parallel by a client to an MDT target. For example, the Lustre
1642           client sends modify RPCs when it performs file or directory creation,
1643           unlink, access permission modification, or ownership modification. The
1644           default setting is 7, the minimum setting is 1, and the maximum setting is
1645           256.</para>
1646         <para>To set the <literal>max_mod_rpcs_in_flight</literal> parameter,
1647           run the following command on the Lustre client:</para>
1648         <screen>client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=12</screen>
1649         <para>The <literal>max_mod_rpcs_in_flight</literal> value must be
1650           strictly less than the <literal>max_rpcs_in_flight</literal> value.
1651           It must also be less than or equal to the MDT
1652           <literal>max_mod_rpcs_per_client</literal> value. If either of these
1653           conditions is not met, the setting fails and an explicit message
1654           is written in the Lustre log.</para>
1655         <para>The MDT <literal>max_mod_rpcs_per_client</literal> parameter is a
1656           tunable of the kernel module <literal>mdt</literal> that defines the
1657           maximum number of file system modifying RPCs in flight allowed per
1658           client. The parameter can be updated at runtime, but the change takes
1659           effect for new client connections only. The default setting is 8.
1660         </para>
1661         <para>To set the <literal>max_mod_rpcs_per_client</literal> parameter,
1662           run the following command on the MDS:</para>
1663         <screen>mds$ echo 12 > /sys/module/mdt/parameters/max_mod_rpcs_per_client</screen>
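        <para>The two constraints on
          <literal>max_mod_rpcs_in_flight</literal> described above can be
          summarised as a small shell check. This is only a sketch: the
          function name <literal>check_mod_rpcs</literal> and its messages are
          hypothetical, and the real validation is performed by Lustre when
          the parameter is set:</para>
        <screen># Hypothetical helper: succeeds only if max_mod_rpcs_in_flight is strictly
# below max_rpcs_in_flight and not above the MDT max_mod_rpcs_per_client.
# Arguments: max_mod_rpcs_in_flight, max_rpcs_in_flight, max_mod_rpcs_per_client.
check_mod_rpcs() {
    [ "$1" -lt "$2" ] || { echo "invalid: not below max_rpcs_in_flight"; return 1; }
    [ "$1" -le "$3" ] || { echo "invalid: exceeds max_mod_rpcs_per_client"; return 1; }
    echo "ok"
}

check_mod_rpcs 12 16 12            # values used in the examples above
check_mod_rpcs 12 16 8 || true     # MDT still at its default of 8</screen>
        <para>The second call shows why, with the MDT default of 8, a client
          request for <literal>max_mod_rpcs_in_flight=12</literal> would be
          expected to fail until <literal>max_mod_rpcs_per_client</literal>
          has been raised on the MDS and the client has established a new
          connection.</para>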
1664       </section>
1665       <section>
1666         <title>Monitoring the Client Metadata RPC Stream</title>
1667         <para>The <literal>rpc_stats</literal> file contains histogram data
1668           showing information about modify metadata RPCs. It can be helpful to
1669           identify the level of parallelism achieved by an application doing
1670           modify metadata operations.</para>
1671         <para><emphasis role="bold">Example:</emphasis></para>
1672         <screen>client$ lctl get_param mdc.*.rpc_stats
1673 snapshot_time:         1441876896.567070 (secs.usecs)
1674 modify_RPCs_in_flight:  0
1675
1676                         modify
1677 rpcs in flight        rpcs   % cum %
1678 0:                       0   0   0
1679 1:                      56   0   0
1680 2:                      40   0   0
1681 3:                      70   0   0
1682 4:                      41   0   0
1683 5:                      51   0   1
1684 6:                      88   0   1
1685 7:                     366   1   2
1686 8:                    1321   5   8
1687 9:                    3624  15  23
1688 10:                   6482  27  50
1689 11:                   7321  30  81
1690 12:                   4540  18 100</screen>
1691         <para>The file information includes:</para>
1692         <itemizedlist>
1693           <listitem>
1694             <para><literal>snapshot_time</literal> - UNIX epoch instant the
1695               file was read.</para>
1696           </listitem>
1697           <listitem>
1698             <para><literal>modify_RPCs_in_flight</literal> - Number of modify
1699               RPCs issued by the MDC, but not completed at the time of the
1700               snapshot. This value should always be less than or equal to
1701               <literal>max_mod_rpcs_in_flight</literal>.</para>
1702           </listitem>
1703           <listitem>
1704             <para><literal>rpcs in flight</literal> - Number of modify RPCs
1705               that are pending when an RPC is sent, the relative percentage
1706               (<literal>%</literal>) of total modify RPCs, and the cumulative
1707               percentage (<literal>cum %</literal>) to that point.</para>
1708           </listitem>
1709         </itemizedlist>
1710         <para>If a large proportion of modify metadata RPCs are issued with a
1711           number of pending metadata RPCs close to the
1712           <literal>max_mod_rpcs_in_flight</literal> value, it means the
1713           <literal>max_mod_rpcs_in_flight</literal> value could be increased to
1714           improve the modify metadata performance.</para>
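        <para>As a sketch of how this judgement might be automated, the
          following shell function reads histogram rows in the format shown
          above on standard input and prints the percentage of modify RPCs
          that were issued with at least a given number already pending. The
          function name <literal>pct_at_or_above</literal> is hypothetical,
          and the sample rows are copied from the example output above:</para>
        <screen># Hypothetical helper: percentage of RPCs in histogram rows whose
# "rpcs in flight" bucket is at least the threshold passed as $1.
pct_at_or_above() {
    awk -v min="$1" '
        /^[0-9]+:/ {
            bucket = $1 + 0      # "10:" converts to the number 10
            total += $2
            if (bucket >= min) high += $2
        }
        END { if (total > 0) printf "%.1f\n", 100 * high / total }
    '
}

printf '%s\n' "0: 0 0 0" "1: 56 0 0" "2: 40 0 0" "3: 70 0 0" "4: 41 0 0" \
    "5: 51 0 1" "6: 88 0 1" "7: 366 1 2" "8: 1321 5 8" "9: 3624 15 23" \
    "10: 6482 27 50" "11: 7321 30 81" "12: 4540 18 100" | pct_at_or_above 10</screen>
        <para>For the sample data this prints 76.4, meaning roughly three
          quarters of the modify RPCs were sent with 10 or more already
          pending against a limit of 12. On a live client the rows would
          instead come from
          <literal>lctl get_param -n mdc.*.rpc_stats</literal>, ideally for
          one MDC at a time so that histograms are not mixed.</para>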
1715       </section>
1716     </section>
1717   </section>
1718   <section>
1719     <title>Configuring Timeouts in a Lustre File System</title>
1720     <para>In a Lustre file system, RPC timeouts are set using an adaptive timeouts mechanism, which
1721       is enabled by default. Servers track RPC completion times and then report back to clients
1722       estimates for completion times for future RPCs. Clients use these estimates to set RPC
1723       timeout values. If the processing of server requests slows down for any reason, the server
1724       estimates for RPC completion increase, and clients then revise RPC timeout values to allow
1725       more time for RPC completion.</para>
1726     <para>If the RPCs queued on the server approach the RPC timeout specified by the client, to
1727       avoid RPC timeouts and disconnect/reconnect cycles, the server sends an "early reply" to the
1728       client, telling the client to allow more time. Conversely, as server processing speeds up, RPC
1729       timeout values decrease, resulting in faster detection if the server becomes non-responsive
1730       and quicker connection to the failover partner of the server.</para>
1731     <section>
1732       <title><indexterm>
1733           <primary>proc</primary>
1734           <secondary>configuring adaptive timeouts</secondary>
1735         </indexterm><indexterm>
1736           <primary>configuring</primary>
1737           <secondary>adaptive timeouts</secondary>
1738         </indexterm><indexterm>
1739           <primary>proc</primary>
1740           <secondary>adaptive timeouts</secondary>
1741         </indexterm>Configuring Adaptive Timeouts</title>
1742       <para>The adaptive timeout parameters in the table below can be set persistently system-wide
1743         using <literal>lctl conf_param</literal> on the MGS. For example, the following command sets
1744         the <literal>at_max</literal> value for all servers and clients associated with the file
1745         system
1746         <literal>testfs</literal>:<screen>lctl conf_param testfs.sys.at_max=1500</screen></para>
1747       <note>
1748         <para>Clients that access multiple Lustre file systems must use the same parameter values
1749           for all file systems.</para>
1750       </note>
1751       <informaltable frame="all">
1752         <tgroup cols="2">
1753           <colspec colname="c1" colwidth="30*"/>
1754           <colspec colname="c2" colwidth="80*"/>
1755           <thead>
1756             <row>
1757               <entry>
1758                 <para><emphasis role="bold">Parameter</emphasis></para>
1759               </entry>
1760               <entry>
1761                 <para><emphasis role="bold">Description</emphasis></para>
1762               </entry>
1763             </row>
1764           </thead>
1765           <tbody>
1766             <row>
1767               <entry>
1768                 <para>
1769                   <literal> at_min </literal></para>
1770               </entry>
1771               <entry>
1772                 <para>Minimum adaptive timeout (in seconds). The default value is 0. The
1773                     <literal>at_min</literal> parameter is the minimum processing time that a server
1774                   will report. Ideally, <literal>at_min</literal> should be set to its default
1775                   value. Clients base their timeouts on this value, but they do not use this value
1776                   directly. </para>
1777                 <para>If, for unknown reasons (usually due to temporary network outages), the
1778                   adaptive timeout value is too short and clients time out their RPCs, you can
1779                   increase the <literal>at_min</literal> value to compensate for this.</para>
1780               </entry>
1781             </row>
1782             <row>
1783               <entry>
1784                 <para>
1785                   <literal> at_max </literal></para>
1786               </entry>
1787               <entry>
1788                 <para>Maximum adaptive timeout (in seconds). The <literal>at_max</literal> parameter
1789                   is an upper-limit on the service time estimate. If <literal>at_max</literal> is
1790                   reached, an RPC request times out.</para>
1791                 <para>Setting <literal>at_max</literal> to 0 causes adaptive timeouts to be disabled
1792                   and a fixed timeout method to be used instead (see <xref
                    xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_c24_nt5_dl"/>).</para>
1794                 <note>
1795                   <para>If slow hardware causes the service estimate to increase beyond the default
1796                     value of <literal>at_max</literal>, increase <literal>at_max</literal> to the
1797                     maximum time you are willing to wait for an RPC completion.</para>
1798                 </note>
1799               </entry>
1800             </row>
1801             <row>
1802               <entry>
1803                 <para>
1804                   <literal> at_history </literal></para>
1805               </entry>
1806               <entry>
1807                 <para>Time period (in seconds) within which adaptive timeouts remember the slowest
1808                   event that occurred. The default is 600.</para>
1809               </entry>
1810             </row>
1811             <row>
1812               <entry>
1813                 <para>
1814                   <literal> at_early_margin </literal></para>
1815               </entry>
1816               <entry>
1817                 <para>Amount of time before the Lustre server sends an early reply (in seconds).
1818                   Default is 5.</para>
1819               </entry>
1820             </row>
1821             <row>
1822               <entry>
1823                 <para>
1824                   <literal> at_extra </literal></para>
1825               </entry>
1826               <entry>
1827                 <para>Incremental amount of time that a server requests with each early reply (in
1828                   seconds). The server does not know how much time the RPC will take, so it asks for
1829                   a fixed value. The default is 30, which provides a balance between sending too
1830                   many early replies for the same RPC and overestimating the actual completion
1831                   time.</para>
1832                 <para>When a server finds a queued request about to time out and needs to send an
1833                   early reply out, the server adds the <literal>at_extra</literal> value. If the
1834                   time expires, the Lustre server drops the request, and the client enters recovery
1835                   status and reconnects to restore the connection to normal status.</para>
1836                 <para>If you see multiple early replies for the same RPC asking for 30-second
1837                   increases, change the <literal>at_extra</literal> value to a larger number to cut
1838                   down on early replies sent and, therefore, network load.</para>
1839               </entry>
1840             </row>
1841             <row>
1842               <entry>
1843                 <para>
1844                   <literal> ldlm_enqueue_min </literal></para>
1845               </entry>
1846               <entry>
                <para>Minimum lock enqueue time (in seconds). The default is 100. The time it takes
                  to enqueue a lock, <literal>ldlm_enqueue</literal>, is the larger of two values:
                  the measured enqueue estimate (influenced by the <literal>at_min</literal> and
                  <literal>at_max</literal> parameters) multiplied by a weighting factor, and the
                  value of <literal>ldlm_enqueue_min</literal>.</para>
                <para>Lustre Distributed Lock Manager (LDLM) lock enqueues have a dedicated minimum
                  value, <literal>ldlm_enqueue_min</literal>. Lock enqueue timeouts increase as
                  the measured enqueue times increase (similar to adaptive timeouts).</para>
1855               </entry>
1856             </row>
1857           </tbody>
1858         </tgroup>
1859       </informaltable>
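      <para>As a hedged illustration of the tunables above, the following sketch checks the
        values in effect on a node and then raises <literal>at_extra</literal> persistently,
        as might be done after seeing repeated early replies for the same RPC. It reuses the
        <literal>testfs</literal> file system name from the earlier example. The value
        <literal>60</literal> is arbitrary, and it is an assumption that the tunables are
        readable under their bare names with <literal>lctl get_param</literal> on your
        release:</para>
      <screen># On any node: show the adaptive timeout tunables currently in effect
lctl get_param at_min at_max at_history at_early_margin at_extra

# On the MGS: request 60 seconds per early reply instead of the default 30
lctl conf_param testfs.sys.at_extra=60</screen>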
1860       <section>
1861         <title>Interpreting Adaptive Timeout Information</title>
1862         <para>Adaptive timeout information can be obtained via
1863           <literal>lctl get_param {osc,mdc}.*.timeouts</literal> files on each
1864           client and <literal>lctl get_param {ost,mds}.*.*.timeouts</literal>
1865           on each server.  To read information from a
1866           <literal>timeouts</literal> file, enter a command similar to:</para>
1867         <screen># lctl get_param -n ost.*.ost_io.timeouts
1868 service : cur 33  worst 34 (at 1193427052, 1600s ago) 1 1 33 2</screen>
1869         <para>In this example, the <literal>ost_io</literal> service on this
1870           node is currently reporting an estimated RPC service time of 33
          seconds. The worst RPC service time was 34 seconds, which occurred
          about 26 minutes (1600 seconds) ago.</para>
1873         <para>The output also provides a history of service times.
1874           Four &quot;bins&quot; of adaptive timeout history are shown, with the
1875           maximum RPC time in each bin reported. In both the 0-150s bin and the
1876           150-300s bin, the maximum RPC time was 1. The 300-450s bin shows the
          worst (maximum) RPC time at 33 seconds, and the 450-600s bin shows a
          maximum RPC time of 2 seconds. The estimated service time is the
1879           maximum value in the four bins (33 seconds in this example).</para>
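        <para>The same reading can be scripted. The sketch below is illustrative only and
          assumes the <literal>timeouts</literal> output keeps the layout shown above, with
          the four history bins as the last four fields of each line. For the example line
          it prints <literal>cur=33 worst=34 estimate=33</literal>:</para>
        <screen>lctl get_param -n ost.*.ost_io.timeouts | awk '
/cur/ {
    # The estimate is the largest of the four history bins (the last four fields)
    est = 0
    for (i = NF - 3; i &lt;= NF; i++)
        if ($i + 0 &gt; est) est = $i + 0
    # Field 4 is the current estimate and field 6 the worst observed time
    print "cur=" $4, "worst=" $6, "estimate=" est
}'</screen>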
1880         <para>Service times (as reported by the servers) are also tracked in
1881           the client OBDs, as shown in this example:</para>
1882         <screen># lctl get_param osc.*.timeouts
1883 last reply : 1193428639, 0d0h00m00s ago
1884 network    : cur  1 worst  2 (at 1193427053, 0d0h26m26s ago)  1  1  1  1
1885 portal 6   : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33  2
1886 portal 28  : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  1  1  1
1887 portal 7   : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  0  1  1
1888 portal 17  : cur  1 worst  1 (at 1193426177, 0d0h41m02s ago)  1  0  0  1
1889 </screen>
1890         <para>In this example, portal 6, the <literal>ost_io</literal> service
1891           portal, shows the history of service estimates reported by the portal.
1892         </para>
1893         <para>Server statistic files also show the range of estimates including
1894           min, max, sum, and sum-squared. For example:</para>
1895         <screen># lctl get_param mdt.*.mdt.stats
1896 ...
1897 req_timeout               6 samples [sec] 1 10 15 105
1898 ...
1899 </screen>
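        <para>Because the stats line carries the sample count, sum, and sum of squares, the
          mean and standard deviation can be derived from it. A minimal sketch, assuming the
          field order shown above (name, count, the word <literal>samples</literal>, unit,
          min, max, sum, sum-squared):</para>
        <screen>lctl get_param -n mdt.*.mdt.stats | awk '
$1 == "req_timeout" {
    n = $2; sum = $7; sumsq = $8
    mean = sum / n
    # Population variance: E[x^2] - (E[x])^2, clamped at zero against rounding
    var = sumsq / n - mean * mean
    if (var &lt; 0) var = 0
    printf "mean=%.2f stddev=%.2f\n", mean, sqrt(var)
}'</screen>
        <para>For the sample line above (6 samples, sum 15, sum-squared 105), this gives a
          mean of 2.50 seconds and a standard deviation of about 3.35 seconds.</para>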
1900       </section>
1901     </section>
1902     <section xml:id="section_c24_nt5_dl">
1903       <title>Setting Static Timeouts<indexterm>
1904           <primary>proc</primary>
1905           <secondary>static timeouts</secondary>
1906         </indexterm></title>
1907       <para>The Lustre software provides two sets of static (fixed) timeouts, LND timeouts and
1908         Lustre timeouts, which are used when adaptive timeouts are not enabled.</para>
1909       <para>
1910         <itemizedlist>
1911           <listitem>
1912             <para><emphasis role="italic"><emphasis role="bold">LND timeouts</emphasis></emphasis> -
1913               LND timeouts ensure that point-to-point communications across a network complete in a
              finite time in the presence of failures, such as lost packets or broken connections.
1915               LND timeout parameters are set for each individual LND.</para>
1916             <para>LND timeouts are logged with the <literal>S_LND</literal> flag set. They are not
1917               printed as console messages, so check the Lustre log for <literal>D_NETERROR</literal>
1918               messages or enable printing of <literal>D_NETERROR</literal> messages to the console
1919               using:<screen>lctl set_param printk=+neterror</screen></para>
1920             <para>Congested routers can be a source of spurious LND timeouts. To avoid this
1921               situation, increase the number of LNet router buffers to reduce back-pressure and/or
1922               increase LND timeouts on all nodes on all connected networks. Also consider increasing
1923               the total number of LNet router nodes in the system so that the aggregate router
1924               bandwidth matches the aggregate server bandwidth.</para>
1925           </listitem>
1926           <listitem>
1927             <para><emphasis role="italic"><emphasis role="bold">Lustre timeouts
1928                 </emphasis></emphasis>- Lustre timeouts ensure that Lustre RPCs complete in a finite
1929               time in the presence of failures when adaptive timeouts are not enabled. Adaptive
1930               timeouts are enabled by default. To disable adaptive timeouts at run time, set
1931                 <literal>at_max</literal> to 0 by running on the
1932               MGS:<screen># lctl conf_param <replaceable>fsname</replaceable>.sys.at_max=0</screen></para>
1933             <note>
1934               <para>Changing the status of adaptive timeouts at runtime may cause a transient client
1935                 timeout, recovery, and reconnection.</para>
1936             </note>
1937             <para>Lustre timeouts are always printed as console messages. </para>
1938             <para>If Lustre timeouts are not accompanied by LND timeouts, increase the Lustre
1939               timeout on both servers and clients. Lustre timeouts are set using a command such as
1940               the following:<screen># lctl set_param timeout=30</screen></para>
1941             <para>Lustre timeout parameters are described in the table below.</para>
1942           </listitem>
1943         </itemizedlist>
1944         <informaltable frame="all">
1945           <tgroup cols="2">
1946             <colspec colname="c1" colnum="1" colwidth="30*"/>
1947             <colspec colname="c2" colnum="2" colwidth="70*"/>
1948             <thead>
1949               <row>
1950                 <entry>Parameter</entry>
1951                 <entry>Description</entry>
1952               </row>
1953             </thead>
1954             <tbody>
1955               <row>
1956                 <entry><literal>timeout</literal></entry>
1957                 <entry>
1958                   <para>The time that a client waits for a server to complete an RPC (default 100s).
1959                     Servers wait half this time for a normal client RPC to complete and a quarter of
1960                     this time for a single bulk request (read or write of up to 4 MB) to complete.
1961                     The client pings recoverable targets (MDS and OSTs) at one quarter of the
1962                     timeout, and the server waits one and a half times the timeout before evicting a
1963                     client for being &quot;stale.&quot;</para>
                  <para>A Lustre client sends periodic &apos;ping&apos; messages to servers with which
1965                     it has had no communication for the specified period of time. Any network
1966                     activity between a client and a server in the file system also serves as a
1967                     ping.</para>
1968                 </entry>
1969               </row>
1970               <row>
1971                 <entry><literal>ldlm_timeout</literal></entry>
1972                 <entry>
1973                   <para>The time that a server waits for a client to reply to an initial AST (lock
1974                     cancellation request). The default is 20s for an OST and 6s for an MDS. If the
1975                     client replies to the AST, the server will give it a normal timeout (half the
1976                     client timeout) to flush any dirty data and release the lock.</para>
1977                 </entry>
1978               </row>
1979               <row>
1980                 <entry><literal>fail_loc</literal></entry>
1981                 <entry>
1982                   <para>An internal debugging failure hook. The default value of
1983                       <literal>0</literal> means that no failure will be triggered or
1984                     injected.</para>
1985                 </entry>
1986               </row>
1987               <row>
1988                 <entry><literal>dump_on_timeout</literal></entry>
1989                 <entry>
1990                   <para>Triggers a dump of the Lustre debug log when a timeout occurs. The default
1991                     value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
1992                     not be triggered.</para>
1993                 </entry>
1994               </row>
1995               <row>
1996                 <entry><literal>dump_on_eviction</literal></entry>
1997                 <entry>
1998                   <para>Triggers a dump of the Lustre debug log when an eviction occurs. The default
1999                     value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
2000                     not be triggered. </para>
2001                 </entry>
2002               </row>
2003             </tbody>
2004           </tgroup>
2005         </informaltable>
2006       </para>
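      <para>The intervals derived from <literal>timeout</literal> in the table above can be
        computed directly. The following sketch is illustrative only. It reads the current
        value with <literal>lctl get_param -n timeout</literal> and applies the fractions
        stated in the table using integer shell arithmetic:</para>
      <screen>timeout=$(lctl get_param -n timeout)

echo "client RPC wait:        ${timeout}s"
echo "server RPC wait:        $((timeout / 2))s"
echo "server bulk wait:       $((timeout / 4))s"
echo "client ping interval:   $((timeout / 4))s"
echo "eviction of stale node: $((timeout * 3 / 2))s"</screen>
      <para>With the default of 100 seconds, this prints 100, 50, 25, 25, and 150 seconds
        respectively.</para>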
2007     </section>
2008   </section>
2009   <section remap="h3">
2010     <title><indexterm>
2011         <primary>proc</primary>
2012         <secondary>LNet</secondary>
2013       </indexterm><indexterm>
2014         <primary>LNet</primary>
2015         <secondary>proc</secondary>
2016       </indexterm>Monitoring LNet</title>
    <para>LNet information is available via <literal>lctl get_param</literal>
2018       in these parameters:
2019       <itemizedlist>
2020         <listitem>
2021           <para><literal>peers</literal> - Shows all NIDs known to this node
2022             and provides information on the queue state.</para>
2023           <para>Example:</para>
2024           <screen># lctl get_param peers
2025 nid                refs   state  max  rtr  min   tx    min   queue
2026 0@lo               1      ~rtr   0    0    0     0     0     0
2027 192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
2028 192.168.10.36@tcp  1      ~rtr   8    8    8     8     6     0
2029 192.168.10.37@tcp  1      ~rtr   8    8    8     8     6     0</screen>
2030           <para>The fields are explained in the table below:</para>
2031           <informaltable frame="all">
2032             <tgroup cols="2">
2033               <colspec colname="c1" colwidth="30*"/>
2034               <colspec colname="c2" colwidth="80*"/>
2035               <thead>
2036                 <row>
2037                   <entry>
2038                     <para><emphasis role="bold">Field</emphasis></para>
2039                   </entry>
2040                   <entry>
2041                     <para><emphasis role="bold">Description</emphasis></para>
2042                   </entry>
2043                 </row>
2044               </thead>
2045               <tbody>
2046                 <row>
2047                   <entry>
2048                     <para>
2049                       <literal>refs</literal>
2050                     </para>
2051                   </entry>
2052                   <entry>
2053                     <para>A reference count. </para>
2054                   </entry>
2055                 </row>
2056                 <row>
2057                   <entry>
2058                     <para>
2059                       <literal>state</literal>
2060                     </para>
2061                   </entry>
2062                   <entry>
2063                     <para>If the node is a router, indicates the state of the router. Possible
2064                       values are:</para>
2065                     <itemizedlist>
2066                       <listitem>
2067                         <para><literal>NA</literal> - Indicates the node is not a router.</para>
2068                       </listitem>
2069                       <listitem>
                        <para><literal>up/down</literal> - Indicates if the node (router) is up or
2071                           down.</para>
2072                       </listitem>
2073                     </itemizedlist>
2074                   </entry>
2075                 </row>
2076                 <row>
2077                   <entry>
2078                     <para>
2079                       <literal>max </literal></para>
2080                   </entry>
2081                   <entry>
2082                     <para>Maximum number of concurrent sends from this peer.</para>
2083                   </entry>
2084                 </row>
2085                 <row>
2086                   <entry>
2087                     <para>
2088                       <literal>rtr </literal></para>
2089                   </entry>
2090                   <entry>
2091                     <para>Number of routing buffer credits.</para>
2092                   </entry>
2093                 </row>
2094                 <row>
2095                   <entry>
2096                     <para>
2097                       <literal>min </literal></para>
2098                   </entry>
2099                   <entry>
2100                     <para>Minimum number of routing buffer credits seen.</para>
2101                   </entry>
2102                 </row>
2103                 <row>
2104                   <entry>
2105                     <para>
2106                       <literal>tx </literal></para>
2107                   </entry>
2108                   <entry>
2109                     <para>Number of send credits.</para>
2110                   </entry>
2111                 </row>
2112                 <row>
2113                   <entry>
2114                     <para>
2115                       <literal>min </literal></para>
2116                   </entry>
2117                   <entry>
2118                     <para>Minimum number of send credits seen.</para>
2119                   </entry>
2120                 </row>
2121                 <row>
2122                   <entry>
2123                     <para>
2124                       <literal>queue </literal></para>
2125                   </entry>
2126                   <entry>
2127                     <para>Total bytes in active/queued sends.</para>
2128                   </entry>
2129                 </row>
2130               </tbody>
2131             </tgroup>
2132           </informaltable>
          <para>Credits are initialized to allow a certain number of operations (eight in the
            example above the table, as shown in the <literal>max</literal> column). LNet keeps
            track of the minimum number of credits ever seen over time, showing the peak
            congestion that has occurred during the time monitored. Fewer available credits
            indicate a more congested resource.</para>
2138           <para>The number of credits currently in flight (number of transmit credits) is shown in
2139             the <literal>tx</literal> column. The maximum number of send credits available is shown
2140             in the <literal>max</literal> column and never changes. The number of router buffers
2141             available for consumption by a peer is shown in the <literal>rtr</literal>
2142             column.</para>
          <para>Therefore, <literal>rtr</literal> - <literal>tx</literal> is the number of transmits
            in flight. Typically, <literal>rtr == max</literal>, although a configuration can be set
            such that <literal>max &gt;= rtr</literal>. A ratio of routing buffer credits to send
            credits (<literal>rtr/tx</literal>) that is less than <literal>max</literal> indicates
2147             operations are in progress. If the ratio <literal>rtr/tx</literal> is greater than
2148               <literal>max</literal>, operations are blocking.</para>
2149           <para>LNet also limits concurrent sends and number of router buffers allocated to a single
2150             peer so that no peer can occupy all these resources.</para>
2151         </listitem>
2152         <listitem>
2153           <para><literal>nis</literal> - Shows the current queue health on this node.</para>
2154           <para>Example:</para>
2155           <screen># lctl get_param nis
2156 nid                    refs   peer    max   tx    min
2157 0@lo                   3      0       0     0     0
2158 192.168.10.34@tcp      4      8       256   256   252
2159 </screen>
2160           <para> The fields are explained in the table below.</para>
2161           <informaltable frame="all">
2162             <tgroup cols="2">
2163               <colspec colname="c1" colwidth="30*"/>
2164               <colspec colname="c2" colwidth="80*"/>
2165               <thead>
2166                 <row>
2167                   <entry>
2168                     <para><emphasis role="bold">Field</emphasis></para>
2169                   </entry>
2170                   <entry>
2171                     <para><emphasis role="bold">Description</emphasis></para>
2172                   </entry>
2173                 </row>
2174               </thead>
2175               <tbody>
2176                 <row>
2177                   <entry>
2178                     <para>
2179                       <literal> nid </literal></para>
2180                   </entry>
2181                   <entry>
2182                     <para>Network interface.</para>
2183                   </entry>
2184                 </row>
2185                 <row>
2186                   <entry>
2187                     <para>
2188                       <literal> refs </literal></para>
2189                   </entry>
2190                   <entry>
2191                     <para>Internal reference counter.</para>
2192                   </entry>
2193                 </row>
2194                 <row>
2195                   <entry>
2196                     <para>
2197                       <literal> peer </literal></para>
2198                   </entry>
2199                   <entry>
2200                     <para>Number of peer-to-peer send credits on this NID. Credits are used to size
2201                       buffer pools.</para>
2202                   </entry>
2203                 </row>
2204                 <row>
2205                   <entry>
2206                     <para>
2207                       <literal> max </literal></para>
2208                   </entry>
2209                   <entry>
2210                     <para>Total number of send credits on this NID.</para>
2211                   </entry>
2212                 </row>
2213                 <row>
2214                   <entry>
2215                     <para>
2216                       <literal> tx </literal></para>
2217                   </entry>
2218                   <entry>
2219                     <para>Current number of send credits available on this NID.</para>
2220                   </entry>
2221                 </row>
2222                 <row>
2223                   <entry>
2224                     <para>
2225                       <literal> min </literal></para>
2226                   </entry>
2227                   <entry>
2228                     <para>Lowest number of send credits available on this NID.</para>
2229                   </entry>
2230                 </row>
2231                 <row>
2232                   <entry>
2233                     <para>
2234                       <literal> queue </literal></para>
2235                   </entry>
2236                   <entry>
2237                     <para>Total bytes in active/queued sends.</para>
2238                   </entry>
2239                 </row>
2240               </tbody>
2241             </tgroup>
2242           </informaltable>
2243           <para><emphasis role="bold"><emphasis role="italic">Analysis:</emphasis></emphasis></para>
          <para>Subtracting <literal>tx</literal> from <literal>max</literal>
              (<literal>max</literal> - <literal>tx</literal>) yields the number of sends currently
2246             active. A large or increasing number of active sends may indicate a problem.</para>
2247         </listitem>
2248       </itemizedlist></para>
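    <para>The <literal>max</literal> - <literal>tx</literal> calculation can be applied to
      every interface at once. The sketch below is illustrative only. It assumes the column
      order of the <literal>nis</literal> example above
      (<literal>nid refs peer max tx min</literal>) and skips the header line:</para>
    <screen>lctl get_param -n nis | awk '
NF &gt;= 6 &amp;&amp; $1 != "nid" {
    # Columns: nid refs peer max tx min, so sends in use = max - tx
    printf "%-24s active_sends=%d lowest_credits=%d\n", $1, $4 - $5, $6
}'</screen>
    <para>For the example output above, <literal>192.168.10.34@tcp</literal> reports
      <literal>active_sends=0</literal> and <literal>lowest_credits=252</literal>.</para>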
2249   </section>
2250   <section remap="h3" xml:id="dbdoclet.balancing_free_space">
2251     <title><indexterm>
2252         <primary>proc</primary>
2253         <secondary>free space</secondary>
2254       </indexterm>Allocating Free Space on OSTs</title>
2255     <para>Free space is allocated using either a round-robin or a weighted
2256     algorithm. The allocation method is determined by the maximum amount of
2257     free-space imbalance between the OSTs. When free space is relatively
2258     balanced across OSTs, the faster round-robin allocator is used, which
2259     maximizes network balancing. The weighted allocator is used when any two
2260     OSTs are out of balance by more than a specified threshold.</para>
    <para>Free space distribution can be tuned using the following
    parameters:</para>
2263     <itemizedlist>
2264       <listitem>
2265         <para><literal>lod.*.qos_threshold_rr</literal> - The threshold at which
2266         the allocation method switches from round-robin to weighted is set
2267         in this file. The default is to switch to the weighted algorithm when
2268         any two OSTs are out of balance by more than 17 percent.</para>
2269       </listitem>
2270       <listitem>
2271         <para><literal>lod.*.qos_prio_free</literal> - The weighting priority
2272         used by the weighted allocator can be adjusted in this file. Increasing
2273         the value of <literal>qos_prio_free</literal> puts more weighting on the
2274         amount of free space available on each OST and less on how stripes are
2275         distributed across OSTs. The default value is 91 percent weighting for
2276         free space rebalancing and 9 percent for OST balancing. When the
2277         free space priority is set to 100, weighting is based entirely on free
2278         space and location is no longer used by the striping algorithm.</para>
2279       </listitem>
2280       <listitem>
2281         <para condition="l29"><literal>osp.*.reserved_mb_low</literal>
2282           - The low watermark used to stop object allocation if available space
2283           is less than this. The default is 0.1% of total OST size.</para>
2284       </listitem>
2285        <listitem>
2286         <para condition="l29"><literal>osp.*.reserved_mb_high</literal>
2287           - The high watermark used to start object allocation if available
2288           space is more than this. The default is 0.2% of total OST size.</para>
2289       </listitem>
2290     </itemizedlist>
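    <para>As an illustrative sketch, the allocator tunables can be inspected and adjusted
      with <literal>lctl</literal> on a node where the <literal>lod</literal> devices are
      present (assumed here to be the MDS). The values <literal>25</literal> and
      <literal>100</literal> are arbitrary examples, not recommendations, and a change
      made with <literal>set_param</literal> is assumed not to persist across a
      restart:</para>
    <screen># Show the current threshold and free-space weighting
lctl get_param lod.*.qos_threshold_rr lod.*.qos_prio_free

# Stay with round-robin until OSTs differ by more than 25 percent
lctl set_param lod.*.qos_threshold_rr=25

# Weight allocation entirely by free space
lctl set_param lod.*.qos_prio_free=100</screen>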
2291     <para>For more information about monitoring and managing free space, see <xref
2292         xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438209_10424"/>.</para>
2293   </section>
2294   <section remap="h3">
2295     <title><indexterm>
2296         <primary>proc</primary>
2297         <secondary>locking</secondary>
2298       </indexterm>Configuring Locking</title>
2299     <para>The <literal>lru_size</literal> parameter is used to control the
2300       number of client-side locks in the LRU cached locks queue. LRU size is
2301       normally dynamic, based on load to optimize the number of locks cached
2302       on nodes that have different workloads (e.g., login/build nodes vs.
2303       compute nodes vs. backup nodes).</para>
2304     <para>The total number of locks available is a function of the server RAM.
2305       The default limit is 50 locks/1 MB of RAM. If memory pressure is too high,
2306       the LRU size is shrunk. The number of locks on the server is limited to
2307       <replaceable>num_osts_per_oss * num_clients * lru_size</replaceable>
2308       as follows: </para>
2309     <itemizedlist>
2310       <listitem>
2311         <para>To enable automatic LRU sizing, set the
2312         <literal>lru_size</literal> parameter to 0. In this case, the
2313         <literal>lru_size</literal> parameter shows the current number of locks
2314         being used on the client. Dynamic LRU resizing is enabled by default.
2315         </para>
2316       </listitem>
2317       <listitem>
2318         <para>To specify a maximum number of locks, set the
2319         <literal>lru_size</literal> parameter to a value other than zero.
2320         A good default value for compute nodes is around
2321         <literal>100 * <replaceable>num_cpus</replaceable></literal>.
2322         It is recommended that you only set <literal>lru_size</literal>
        to be significantly larger on a few login nodes where multiple
2324         users access the file system interactively.</para>
2325       </listitem>
2326     </itemizedlist>
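    <para>The server-side limit given above can be estimated with simple arithmetic. The
      figures in this sketch (8 OSTs per OSS, 1000 clients, an <literal>lru_size</literal>
      of 400, and 64 GB of server RAM) are hypothetical and chosen only to show the
      calculation:</para>
    <screen>num_osts_per_oss=8
num_clients=1000
lru_size=400
server_ram_mb=$((64 * 1024))

# Upper bound on locks held on one OSS: num_osts_per_oss * num_clients * lru_size
echo "lock upper bound:  $((num_osts_per_oss * num_clients * lru_size))"

# Default server limit of 50 locks per 1 MB of RAM
echo "default RAM limit: $((server_ram_mb * 50))"</screen>
    <para>Here the bound works out to 3200000 locks, slightly below the 3276800 locks
      allowed by the default limit for 64 GB of RAM.</para>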
2327     <para>To clear the LRU on a single client, and, as a result, flush client
2328       cache without changing the <literal>lru_size</literal> value, run:</para>
2329     <screen># lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
2330     <para>If the LRU size is set lower than the number of existing locks,
2331       <emphasis>unused</emphasis> locks are canceled immediately. Use
2332       <literal>clear</literal> to cancel all locks without changing the value.
2333     </para>
2334     <note>
2335       <para>The <literal>lru_size</literal> parameter can only be set
        temporarily using <literal>lctl set_param</literal>; it cannot be set
2337         permanently.</para>
2338     </note>
2339     <para>To disable dynamic LRU resizing on the clients, run for example:
2340     </para>
2341     <screen># lctl set_param ldlm.namespaces.*osc*.lru_size=5000</screen>
2342     <para>To determine the number of locks being granted with dynamic LRU
2343       resizing, run:</para>
2344     <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
2345     <para>The <literal>lru_max_age</literal> parameter is used to control the
2346       age of client-side locks in the LRU cached locks queue. This limits how
      long unused locks are cached on the client, and prevents idle clients from
2348       holding locks for an excessive time, which reduces memory usage on both
2349       the client and server, as well as reducing work during server recovery.
2350     </para>
2351     <para>The <literal>lru_max_age</literal> is set and printed in milliseconds,
2352       and by default is 3900000 ms (65 minutes).</para>
2353     <para condition='l2B'>Since Lustre 2.11, in addition to setting the
2354       maximum lock age in milliseconds, it can also be set using a suffix of
2355       <literal>s</literal> or <literal>ms</literal> to indicate seconds or
      milliseconds, respectively.  For example, to set the client's maximum
      lock age to 15 minutes (900s), run:
2358     </para>
2359     <screen>
2360 # lctl set_param ldlm.namespaces.*MDT*.lru_max_age=900s
2361 # lctl get_param ldlm.namespaces.*MDT*.lru_max_age
2362 ldlm.namespaces.myth-MDT0000-mdc-ffff8804296c2800.lru_max_age=900000
2363     </screen>
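    <para>The unit conversions involved can be sketched in shell as follows.
      Both function names are hypothetical and exist only for this
      illustration. The first builds a value with the <literal>s</literal>
      suffix from a number of minutes, and the second converts the
      millisecond value printed by <literal>lctl get_param</literal> back to
      whole minutes.</para>
    <screen>#!/bin/sh
# Hypothetical helpers for converting lru_max_age values.

# Minutes to a value with the "s" (seconds) suffix.
minutes_to_lru_max_age() {
    echo "$(( $1 * 60 ))s"
}

# Milliseconds, as printed by get_param, to whole minutes.
lru_max_age_ms_to_minutes() {
    echo $(( $1 / 60000 ))
}

minutes_to_lru_max_age 15            # prints 900s
lru_max_age_ms_to_minutes 3900000    # prints 65</screen>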
2364   </section>
2365   <section xml:id="dbdoclet.50438271_87260">
2366     <title><indexterm>
2367         <primary>proc</primary>
2368         <secondary>thread counts</secondary>
2369       </indexterm>Setting MDS and OSS Thread Counts</title>
    <para>The MDS and OSS thread count tunables can be used to set the minimum and maximum thread
      counts or to get the current number of running threads for the services listed in the table
      below.</para>
2373     <informaltable frame="all">
2374       <tgroup cols="2">
2375         <colspec colname="c1" colwidth="50*"/>
2376         <colspec colname="c2" colwidth="50*"/>
2377         <tbody>
2378           <row>
2379             <entry>
2380               <para>
2381                 <emphasis role="bold">Service</emphasis></para>
2382             </entry>
2383             <entry>
2384               <para>
2385                 <emphasis role="bold">Description</emphasis></para>
2386             </entry>
2387           </row>
2388           <row>
2389             <entry>
2390               <literal> mds.MDS.mdt </literal>
2391             </entry>
2392             <entry>
2393               <para>Main metadata operations service</para>
2394             </entry>
2395           </row>
2396           <row>
2397             <entry>
2398               <literal> mds.MDS.mdt_readpage </literal>
2399             </entry>
2400             <entry>
2401               <para>Metadata <literal>readdir</literal> service</para>
2402             </entry>
2403           </row>
2404           <row>
2405             <entry>
2406               <literal> mds.MDS.mdt_setattr </literal>
2407             </entry>
2408             <entry>
2409               <para>Metadata <literal>setattr/close</literal> operations service </para>
2410             </entry>
2411           </row>
2412           <row>
2413             <entry>
2414               <literal> ost.OSS.ost </literal>
2415             </entry>
2416             <entry>
2417               <para>Main data operations service</para>
2418             </entry>
2419           </row>
2420           <row>
2421             <entry>
2422               <literal> ost.OSS.ost_io </literal>
2423             </entry>
2424             <entry>
2425               <para>Bulk data I/O services</para>
2426             </entry>
2427           </row>
2428           <row>
2429             <entry>
2430               <literal> ost.OSS.ost_create </literal>
2431             </entry>
2432             <entry>
2433               <para>OST object pre-creation service</para>
2434             </entry>
2435           </row>
2436           <row>
2437             <entry>
2438               <literal> ldlm.services.ldlm_canceld </literal>
2439             </entry>
2440             <entry>
2441               <para>DLM lock cancel service</para>
2442             </entry>
2443           </row>
2444           <row>
2445             <entry>
2446               <literal> ldlm.services.ldlm_cbd </literal>
2447             </entry>
2448             <entry>
2449               <para>DLM lock grant service</para>
2450             </entry>
2451           </row>
2452         </tbody>
2453       </tgroup>
2454     </informaltable>
    <para>For each service, the tunable parameters shown below are available.
    </para>
2457     <itemizedlist>
2458       <listitem>
2459         <para>To temporarily set these tunables, run:</para>
2460         <screen># lctl set_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started=num</replaceable> </screen>
2461         </listitem>
2462       <listitem>
        <para>To permanently set these tunables, run:</para>
2464         <screen># lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
2465         <para condition='l25'>For version 2.5 or later, run:
2466                 <screen># lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></screen></para>
2467       </listitem>
2468     </itemizedlist>
2469       <para>The following examples show how to set thread counts and get the number of running threads
2470         for the service <literal>ost_io</literal>  using the tunable
2471         <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para>
2472     <itemizedlist>
2473       <listitem>
2474         <para>To get the number of running threads, run:</para>
2475         <screen># lctl get_param ost.OSS.ost_io.threads_started
2476 ost.OSS.ost_io.threads_started=128</screen>
2477       </listitem>
2478       <listitem>
        <para>To get the maximum number of threads (512), run:</para>
2480         <screen># lctl get_param ost.OSS.ost_io.threads_max
2481 ost.OSS.ost_io.threads_max=512</screen>
2482       </listitem>
2483       <listitem>
        <para>To set the maximum thread count to 256 instead of 512 (for example, to avoid
          overloading the storage array with requests), run:</para>
2486         <screen># lctl set_param ost.OSS.ost_io.threads_max=256
2487 ost.OSS.ost_io.threads_max=256</screen>
2488       </listitem>
2489       <listitem>
2490         <para>To set the maximum thread count to 256 instead of 512 permanently, run:</para>
2491         <screen># lctl conf_param testfs.ost.ost_io.threads_max=256</screen>
2492         <para condition='l25'>For version 2.5 or later, run:
2493         <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256
2494 ost.OSS.ost_io.threads_max=256 </screen> </para>
2495       </listitem>
2496       <listitem>
2497         <para> To check if the <literal>threads_max</literal> setting is active, run:</para>
2498         <screen># lctl get_param ost.OSS.ost_io.threads_max
2499 ost.OSS.ost_io.threads_max=256</screen>
2500       </listitem>
2501     </itemizedlist>
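    <para>To inspect all three thread tunables of a service at once, the
      parameter names can be generated from the service name. The sketch below
      assumes a POSIX shell; <literal>thread_params</literal> is a
      hypothetical helper name, not a Lustre command.</para>
    <screen>#!/bin/sh
# Hypothetical helper: print the three thread tunables for one service.
thread_params() {
    for suffix in min max started; do
        echo "$1.threads_$suffix"
    done
}

# Example use on an OSS:
#   lctl get_param $(thread_params ost.OSS.ost_io)</screen>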
2502     <note>
2503       <para>If the number of service threads is changed while the file system is running, the change
        may not take effect until the file system is stopped and restarted. If the number of service
2505         threads in use exceeds the new <literal>threads_max</literal> value setting, service threads
2506         that are already running will not be stopped.</para>
2507     </note>
    <para>See also <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="lustretuning"/>.</para>
2509   </section>
2510   <section xml:id="dbdoclet.50438271_83523">
2511     <title><indexterm>
2512         <primary>proc</primary>
2513         <secondary>debug</secondary>
2514       </indexterm>Enabling and Interpreting Debugging Logs</title>
2515     <para>By default, a detailed log of all operations is generated to aid in
2516       debugging. Flags that control debugging are found via
2517       <literal>lctl get_param debug</literal>.</para>
    <para>The overhead of debugging can affect the performance of the Lustre file
      system. Therefore, to minimize the impact on performance, the debug level
      can be lowered, which affects the amount of debugging information kept in
      the internal log buffer but does not alter the amount of information that
      goes into syslog. You can raise the debug level when you need to collect
      logs to debug problems.</para>
2524     <para>The debugging mask can be set using &quot;symbolic names&quot;. The
2525       symbolic format is shown in the examples below.
2526       <itemizedlist>
2527         <listitem>
2528           <para>To verify the debug level used, examine the parameter that
2529             controls debugging by running:</para>
2530           <screen># lctl get_param debug
2531 debug=
2532 ioctl neterror warning error emerg ha config console</screen>
2533         </listitem>
2534         <listitem>
2535           <para>To turn off debugging except for network error debugging, run
2536           the following command on all nodes concerned:</para>
          <screen># lctl set_param debug=&quot;neterror&quot;
2538 debug=neterror</screen>
2539         </listitem>
2540       </itemizedlist>
2541       <itemizedlist>
2542         <listitem>
2543           <para>To turn off debugging completely (except for the minimum error
2544             reporting to the console), run the following command on all nodes
2545             concerned:</para>
2546           <screen># lctl set_param debug=0
2547 debug=0</screen>
2548         </listitem>
2549         <listitem>
2550           <para>To set an appropriate debug level for a production environment,
2551             run:</para>
2552           <screen># lctl set_param debug=&quot;warning dlmtrace error emerg ha rpctrace vfstrace&quot;
2553 debug=warning dlmtrace error emerg ha rpctrace vfstrace</screen>
2554           <para>The flags shown in this example collect enough high-level
2555             information to aid debugging, but they do not cause any serious
2556             performance impact.</para>
2557         </listitem>
2558       </itemizedlist>
2559       <itemizedlist>
2560         <listitem>
2561           <para>To add new flags to flags that have already been set,
2562             precede each one with a &quot;<literal>+</literal>&quot;:</para>
2563           <screen># lctl set_param debug=&quot;+neterror +ha&quot;
2564 debug=+neterror +ha
2565 # lctl get_param debug
2566 debug=neterror warning error emerg ha console</screen>
2567         </listitem>
2568         <listitem>
2569           <para>To remove individual flags, precede them with a
2570             &quot;<literal>-</literal>&quot;:</para>
2571           <screen># lctl set_param debug=&quot;-ha&quot;
2572 debug=-ha
2573 # lctl get_param debug
2574 debug=neterror warning error emerg console</screen>
2575         </listitem>
2576       </itemizedlist>
2577     </para>
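    <para>When extra flags are needed only for the duration of a test, the
      current mask can be saved and restored afterwards. The following is a
      minimal sketch, assuming a POSIX shell with <literal>lctl</literal> in
      the path; <literal>with_debug_flags</literal> is a hypothetical function
      name. It relies on the <literal>-n</literal> option of
      <literal>lctl get_param</literal>, which prints only the value.</para>
    <screen>#!/bin/sh
# Sketch: run a command with extra debug flags, then restore the old mask.
# with_debug_flags is a hypothetical name chosen for this example.
with_debug_flags() {
    flags=$1
    shift
    saved=$(lctl get_param -n debug)    # current mask, value only
    lctl set_param debug="$flags"       # for example "+rpctrace +dlmtrace"
    "$@"                                # command to run while the flags are set
    status=$?
    lctl set_param debug="$saved"       # put the original mask back
    return $status
}

# Example use:
#   with_debug_flags "+rpctrace" sleep 30</screen>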
2578     <para>Debugging parameters include:</para>
2579     <itemizedlist>
2580       <listitem>
2581         <para><literal>subsystem_debug</literal> - Controls the debug logs for subsystems.</para>
2582       </listitem>
2583       <listitem>
2584         <para><literal>debug_path</literal> - Indicates the location where the debug log is dumped
2585           when triggered automatically or manually. The default path is
2586             <literal>/tmp/lustre-log</literal>.</para>
2587       </listitem>
2588     </itemizedlist>
2589     <para>These parameters can also be set using:<screen>sysctl -w lnet.debug={value}</screen></para>
2590     <para>Additional useful parameters: <itemizedlist>
2591         <listitem>
          <para><literal>panic_on_lbug</literal> - Causes <literal>panic</literal> to be called
            when the Lustre software detects an internal problem (an <literal>LBUG</literal> log
            entry); the panic crashes the node. This is particularly useful when a kernel crash dump
2595             utility is configured. The crash dump is triggered when the internal inconsistency is
2596             detected by the Lustre software. </para>
2597         </listitem>
2598         <listitem>
2599           <para><literal>upcall</literal> - Allows you to specify the path to the binary which will
2600             be invoked when an <literal>LBUG</literal> log entry is encountered. This binary is
2601             called with four parameters:</para>
          <para> - The string <literal>LBUG</literal>.</para>
          <para> - The file where the <literal>LBUG</literal> occurred.</para>
          <para> - The function name.</para>
          <para> - The line number in the file.</para>
2606         </listitem>
2607       </itemizedlist></para>
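    <para>A minimal upcall could simply record its four arguments in the
      system log. The script below is a sketch under stated assumptions: the
      message format, the <literal>lustre-upcall</literal> log tag, and the
      function name <literal>format_lbug_message</literal> are all
      hypothetical, and the standard <literal>logger</literal> utility is
      assumed to be installed.</para>
    <screen>#!/bin/sh
# Hypothetical LBUG upcall sketch. Arguments, in the order listed above:
#   $1 = the string LBUG, $2 = source file, $3 = function name, $4 = line number
format_lbug_message() {
    echo "Lustre $1 in $3() at $2:$4"
}

# Only log when invoked with exactly the four expected arguments.
if [ "$#" -eq 4 ]; then
    logger -t lustre-upcall "$(format_lbug_message "$@")"
fi</screen>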
2608     <section>
2609       <title>Interpreting OST Statistics</title>
2610       <note>
2611         <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
2612             <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2613       </note>
2614       <para>OST <literal>stats</literal> files can be used to provide statistics showing activity
2615         for each OST. For example:</para>
2616       <screen># lctl get_param osc.testfs-OST0000-osc.stats
2617 snapshot_time                      1189732762.835363
2618 ost_create                 1
2619 ost_get_info               1
2620 ost_connect                1
2621 ost_set_info               1
2622 obd_ping                   212</screen>
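      <para>For scripting, a single counter can be pulled out of this output
        with <literal>awk</literal>. The sketch below assumes the two-column
        layout shown above; <literal>stat_value</literal> is a hypothetical
        helper name.</para>
      <screen>#!/bin/sh
# Hypothetical helper: read stats text on stdin, print the value of one counter.
stat_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Example use on a client:
#   lctl get_param -n osc.testfs-OST0000-osc.stats | stat_value obd_ping</screen>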
2623       <para>Use the <literal>llstat</literal> utility to monitor statistics over time.</para>
2624       <para>To clear the statistics, use the <literal>-c</literal> option to
2625         <literal>llstat</literal>. To specify how frequently the statistics
2626         should be reported (in seconds), use the <literal>-i</literal> option.
2627         In the example below, the <literal>-c</literal> option clears the
2628         statistics and <literal>-i10</literal> option reports statistics every
2629         10 seconds:</para>
2630 <screen role="smaller">$ llstat -c -i10 ost_io
2631
2632 /usr/bin/llstat: STATS on 06/06/07
2633         /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
2634 snapshot_time                              1181074093.276072
2635
2636 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
2637 Name        Cur.  Cur. #
2638             Count Rate Events Unit  last   min    avg       max    stddev
2639 req_waittime 8    0    8    [usec]  2078   34     259.75    868    317.49
2640 req_qdepth   8    0    8    [reqs]  1      0      0.12      1      0.35
2641 req_active   8    0    8    [reqs]  11     1      1.38      2      0.52
2642 reqbuf_avail 8    0    8    [bufs]  511    63     63.88     64     0.35
2643 ost_write    8    0    8    [bytes] 169767 72914  212209.62 387579 91874.29
2644
2645 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
2646 Name        Cur.  Cur. #
2647             Count Rate Events Unit  last    min   avg       max    stddev
2648 req_waittime 31   3    39   [usec]  30011   34    822.79    12245  2047.71
2649 req_qdepth   31   3    39   [reqs]  0       0     0.03      1      0.16
2650 req_active   31   3    39   [reqs]  58      1     1.77      3      0.74
2651 reqbuf_avail 31   3    39   [bufs]  1977    63    63.79     64     0.41
2652 ost_write    30   3    38   [bytes] 1028467 15019 315325.16 910694 197776.51
2653
2654 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
2655 Name        Cur.  Cur. #
2656             Count Rate Events Unit  last    min    avg       max    stddev
2657 req_waittime 21   2    60   [usec]  14970   34     784.32    12245  1878.66
2658 req_qdepth   21   2    60   [reqs]  0       0      0.02      1      0.13
2659 req_active   21   2    60   [reqs]  33      1      1.70      3      0.70
2660 reqbuf_avail 21   2    60   [bufs]  1341    63     63.82     64     0.39
2661 ost_write    21   2    59   [bytes] 7648424 15019  332725.08 910694 180397.87
2662 </screen>
2663       <para>The columns in this example are described in the table below.</para>
2664       <informaltable frame="all">
2665         <tgroup cols="2">
2666           <colspec colname="c1" colwidth="50*"/>
2667           <colspec colname="c2" colwidth="50*"/>
2668           <thead>
2669             <row>
2670               <entry>
2671                 <para><emphasis role="bold">Parameter</emphasis></para>
2672               </entry>
2673               <entry>
2674                 <para><emphasis role="bold">Description</emphasis></para>
2675               </entry>
2676             </row>
2677           </thead>
2678           <tbody>
2679             <row>
2680               <entry><literal>Name</literal></entry>
2681               <entry>Name of the service event.  See the tables below for descriptions of service
2682                 events that are tracked.</entry>
2683             </row>
2684             <row>
2685               <entry>
2686                 <para>
2687                   <literal>Cur. Count </literal></para>
2688               </entry>
2689               <entry>
2690                 <para>Number of events of each type sent in the last interval.</para>
2691               </entry>
2692             </row>
2693             <row>
2694               <entry>
2695                 <para>
2696                   <literal>Cur. Rate </literal></para>
2697               </entry>
2698               <entry>
2699                 <para>Number of events per second in the last interval.</para>
2700               </entry>
2701             </row>
2702             <row>
2703               <entry>
2704                 <para>
2705                   <literal> # Events </literal></para>
2706               </entry>
2707               <entry>
2708                 <para>Total number of such events since the events have been cleared.</para>
2709               </entry>
2710             </row>
2711             <row>
2712               <entry>
2713                 <para>
2714                   <literal> Unit </literal></para>
2715               </entry>
2716               <entry>
                <para>Unit of measurement for that statistic (for example, microseconds,
                  requests, buffers, or bytes).</para>
2719               </entry>
2720             </row>
2721             <row>
2722               <entry>
2723                 <para>
2724                   <literal> last </literal></para>
2725               </entry>
2726               <entry>
2727                 <para>Average rate of these events (in units/event) for the last interval during
2728                   which they arrived. For instance, in the above mentioned case of
2729                     <literal>ost_destroy</literal> it took an average of 736 microseconds per
2730                   destroy for the 400 object destroys in the previous 10 seconds.</para>
2731               </entry>
2732             </row>
2733             <row>
2734               <entry>
2735                 <para>
2736                   <literal> min </literal></para>
2737               </entry>
2738               <entry>
2739                 <para>Minimum rate (in units/events) since the service started.</para>
2740               </entry>
2741             </row>
2742             <row>
2743               <entry>
2744                 <para>
2745                   <literal> avg </literal></para>
2746               </entry>
2747               <entry>
2748                 <para>Average rate.</para>
2749               </entry>
2750             </row>
2751             <row>
2752               <entry>
2753                 <para>
2754                   <literal> max </literal></para>
2755               </entry>
2756               <entry>
2757                 <para>Maximum rate.</para>
2758               </entry>
2759             </row>
2760             <row>
2761               <entry>
2762                 <para>
2763                   <literal> stddev </literal></para>
2764               </entry>
2765               <entry>
                <para>Standard deviation (not measured in some cases).</para>
2767               </entry>
2768             </row>
2769           </tbody>
2770         </tgroup>
2771       </informaltable>
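      <para>As a worked example of the <literal>Cur. Rate</literal> column, the
        sample output above was collected with <literal>-i10</literal>, and its
        rates are consistent with dividing <literal>Cur. Count</literal> by the
        interval length and discarding the fraction. The truncation is an
        inference from the sample values, not documented behavior, and
        <literal>cur_rate</literal> is a hypothetical function name.</para>
      <screen>#!/bin/sh
# Events per second for one interval, using integer division
# (an assumption that matches the sample output above).
cur_rate() {
    echo $(( $1 / $2 ))
}

cur_rate 8 10     # prints 0
cur_rate 31 10    # prints 3
cur_rate 21 10    # prints 2</screen>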
2772       <para>Events common to all services are shown in the table below.</para>
2773       <informaltable frame="all">
2774         <tgroup cols="2">
2775           <colspec colname="c1" colwidth="50*"/>
2776           <colspec colname="c2" colwidth="50*"/>
2777           <thead>
2778             <row>
2779               <entry>
2780                 <para><emphasis role="bold">Parameter</emphasis></para>
2781               </entry>
2782               <entry>
2783                 <para><emphasis role="bold">Description</emphasis></para>
2784               </entry>
2785             </row>
2786           </thead>
2787           <tbody>
2788             <row>
2789               <entry>
2790                 <para>
2791                   <literal> req_waittime </literal></para>
2792               </entry>
2793               <entry>
2794                 <para>Amount of time a request waited in the queue before being handled by an
2795                   available server thread.</para>
2796               </entry>
2797             </row>
2798             <row>
2799               <entry>
2800                 <para>
2801                   <literal> req_qdepth </literal></para>
2802               </entry>
2803               <entry>
2804                 <para>Number of requests waiting to be handled in the queue for this service.</para>
2805               </entry>
2806             </row>
2807             <row>
2808               <entry>
2809                 <para>
2810                   <literal> req_active </literal></para>
2811               </entry>
2812               <entry>
2813                 <para>Number of requests currently being handled.</para>
2814               </entry>
2815             </row>
2816             <row>
2817               <entry>
2818                 <para>
2819                   <literal> reqbuf_avail </literal></para>
2820               </entry>
2821               <entry>
                <para>Number of unsolicited LNet request buffers for this service.</para>
2823               </entry>
2824             </row>
2825           </tbody>
2826         </tgroup>
2827       </informaltable>
2828       <para>Some service-specific events of interest are described in the table below.</para>
2829       <informaltable frame="all">
2830         <tgroup cols="2">
2831           <colspec colname="c1" colwidth="50*"/>
2832           <colspec colname="c2" colwidth="50*"/>
2833           <thead>
2834             <row>
2835               <entry>
2836                 <para><emphasis role="bold">Parameter</emphasis></para>
2837               </entry>
2838               <entry>
2839                 <para><emphasis role="bold">Description</emphasis></para>
2840               </entry>
2841             </row>
2842           </thead>
2843           <tbody>
2844             <row>
2845               <entry>
2846                 <para>
2847                   <literal> ldlm_enqueue </literal></para>
2848               </entry>
2849               <entry>
2850                 <para>Time it takes to enqueue a lock (this includes file open on the MDS)</para>
2851               </entry>
2852             </row>
2853             <row>
2854               <entry>
2855                 <para>
2856                   <literal> mds_reint </literal></para>
2857               </entry>
2858               <entry>
2859                 <para>Time it takes to process an MDS modification record (includes
2860                     <literal>create</literal>, <literal>mkdir</literal>, <literal>unlink</literal>,
2861                     <literal>rename</literal> and <literal>setattr</literal>)</para>
2862               </entry>
2863             </row>
2864           </tbody>
2865         </tgroup>
2866       </informaltable>
2867     </section>
2868     <section>
2869       <title>Interpreting MDT Statistics</title>
2870       <note>
2871         <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
2872             <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2873       </note>
2874       <para>MDT <literal>stats</literal> files can be used to track MDT
2875       statistics for the MDS. The example below shows sample output from an
2876       MDT <literal>stats</literal> file.</para>
2877       <screen># lctl get_param mds.*-MDT0000.stats
2878 snapshot_time                   1244832003.676892 secs.usecs
2879 open                            2 samples [reqs]
2880 close                           1 samples [reqs]
2881 getxattr                        3 samples [reqs]
2882 process_config                  1 samples [reqs]
2883 connect                         2 samples [reqs]
2884 disconnect                      2 samples [reqs]
2885 statfs                          3 samples [reqs]
2886 setattr                         1 samples [reqs]
2887 getattr                         3 samples [reqs]
2888 llog_init                       6 samples [reqs]
2889 notify                          16 samples [reqs]</screen>
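      <para>To see which operations dominate, the sample counts can be sorted.
        The sketch below assumes the
        <literal>name count samples [unit]</literal> layout shown above;
        <literal>busiest_ops</literal> is a hypothetical helper name.</para>
      <screen>#!/bin/sh
# Hypothetical helper: read MDT stats text on stdin and print
# "count name" pairs, highest count first.
busiest_ops() {
    awk '$3 == "samples" { print $2, $1 }' | sort -rn
}

# Example use on an MDS:
#   lctl get_param -n mds.*-MDT0000.stats | busiest_ops | head -n 5</screen>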
2890     </section>
2891   </section>
2892 </chapter>