LustreProc.xml

   1 <?xml version='1.0' encoding='UTF-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
   3   xml:lang="en-US" xml:id="lustreproc">
   4   <title xml:id="lustreproc.title">LustreProc</title>
   5   <para>The <literal>/proc</literal> file system acts as an interface to internal data structures in
   6     the kernel. This chapter describes entries in <literal>/proc</literal> that are useful for
   7     tuning and monitoring aspects of a Lustre file system. It includes these sections:</para>
   8   <itemizedlist>
   9     <listitem>
  10       <para><xref linkend="dbdoclet.50438271_83523"/></para>
  12     </listitem>
  13   </itemizedlist>
  14   <section>
  15     <title>Introduction to <literal>/proc</literal></title>
  16     <para>The <literal>/proc</literal> directory provides an interface to internal data structures
  17       in the kernel that enables monitoring and tuning of many aspects of Lustre file system and
  18       application performance. These data structures include settings and metrics for components such
  19       as memory, networking, file systems, and kernel housekeeping routines, which are available
  20       throughout the hierarchical file layout in <literal>/proc</literal>.
  21     </para>
  22     <para>Typically, metrics are accessed by reading from <literal>/proc</literal> files and
  23       settings are changed by writing to <literal>/proc</literal> files. Some data is server-only,
  24       some data is client-only, and some data is exported from the client to the server and is thus
  25       duplicated in both locations.</para>
  26     <note>
  27       <para>In the examples in this chapter, <literal>#</literal> indicates a command is entered as
  28         root.  Servers are named according to the convention
  29             <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>.
  30         The standard UNIX wildcard designation (*) is used.</para>
  31     </note>
  32     <para>In most cases, information is accessed using the <literal>lctl get_param</literal> command
  33       and settings are changed using the <literal>lctl set_param</literal> command. Some examples
  34       are shown below:</para>
  35     <itemizedlist>
  36       <listitem>
  37         <para> To obtain data from a Lustre client:</para>
  38         <screen># lctl list_param osc.*
  39 osc.testfs-OST0000-osc-ffff881071d5cc00
  40 osc.testfs-OST0001-osc-ffff881071d5cc00
  41 osc.testfs-OST0002-osc-ffff881071d5cc00
  42 osc.testfs-OST0003-osc-ffff881071d5cc00
  43 osc.testfs-OST0004-osc-ffff881071d5cc00
  44 osc.testfs-OST0005-osc-ffff881071d5cc00
  45 osc.testfs-OST0006-osc-ffff881071d5cc00
  46 osc.testfs-OST0007-osc-ffff881071d5cc00
  47 osc.testfs-OST0008-osc-ffff881071d5cc00</screen>
  48         <para>In this example, information about OST connections available on a client is displayed
  49           (indicated by "osc").</para>
  50       </listitem>
  51     </itemizedlist>
  52     <itemizedlist>
  53       <listitem>
  54         <para> To see multiple levels of parameters, use multiple
  55           wildcards:<screen># lctl list_param osc.*.*
  56 osc.testfs-OST0000-osc-ffff881071d5cc00.active
  57 osc.testfs-OST0000-osc-ffff881071d5cc00.blocksize
  58 osc.testfs-OST0000-osc-ffff881071d5cc00.checksum_type
  59 osc.testfs-OST0000-osc-ffff881071d5cc00.checksums
  60 osc.testfs-OST0000-osc-ffff881071d5cc00.connect_flags
  61 osc.testfs-OST0000-osc-ffff881071d5cc00.contention_seconds
  62 osc.testfs-OST0000-osc-ffff881071d5cc00.cur_dirty_bytes
  63 ...
  64 osc.testfs-OST0000-osc-ffff881071d5cc00.rpc_stats</screen></para>
  65       </listitem>
  66     </itemizedlist>
  67     <itemizedlist>
  68       <listitem>
  69         <para> To view a specific file, use <literal>lctl get_param</literal>:
  70           <screen># lctl get_param osc.testfs-OST0000-osc-ffff881071d5cc00.rpc_stats</screen></para>
  71       </listitem>
  72     </itemizedlist>
  73     <para>For more information about using <literal>lctl</literal>, see <xref
  74         xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438194_51490"/>.</para>
  75     <para>Data can also be viewed using the <literal>cat</literal> command with the full path to the
  76       file. The form of the <literal>cat</literal> command is similar to that of the <literal>lctl
  77         get_param</literal> command with these differences. In the <literal>cat</literal> command: </para>
  78     <itemizedlist>
  79       <listitem>
  80         <para> Replace the dots in the path with slashes.</para>
  81       </listitem>
  82       <listitem>
  83         <para> Prepend the path with the following as
  84           appropriate:<screen>/proc/{fs,sys}/{lustre,lnet}</screen></para>
  85       </listitem>
  86     </itemizedlist>
  87     <para>For example, an <literal>lctl get_param</literal> command may look like
  88       this:<screen># lctl get_param osc.*.uuid
  89 osc.testfs-OST0000-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
  90 osc.testfs-OST0001-osc-ffff881071d5cc00.uuid=594db456-0685-bd16-f59b-e72ee90e9819
  91 ...</screen></para>
  92     <para>The equivalent <literal>cat</literal> command looks like
  93       this:<screen># cat /proc/fs/lustre/osc/*/uuid
  94 594db456-0685-bd16-f59b-e72ee90e9819
  95 594db456-0685-bd16-f59b-e72ee90e9819
  96 ...</screen></para>
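     <para>The two conversion rules above can be sketched in Python. This is an illustration only:
       the helper name <literal>param_to_proc_paths</literal> is hypothetical, and the sketch assumes
       that no component of the parameter name itself contains a dot (a NID such as the one found
       under <literal>exports</literal> would not convert correctly). It returns every candidate
       path; which one exists depends on the parameter.</para>
     <programlisting language="python">
```python
def param_to_proc_paths(param):
    """Return candidate /proc paths for an lctl parameter name."""
    # Rule 1: replace the dots in the parameter name with slashes.
    relative = param.replace(".", "/")
    # Rule 2: prepend each possible root, /proc/{fs,sys}/{lustre,lnet}.
    roots = [f"/proc/{top}/{sub}" for top in ("fs", "sys") for sub in ("lustre", "lnet")]
    return [f"{root}/{relative}" for root in roots]


for path in param_to_proc_paths("osc.*.uuid"):
    print(path)
```
     </programlisting>
     <para>The first path printed, <literal>/proc/fs/lustre/osc/*/uuid</literal>, is the one used in
       the <literal>cat</literal> example above.</para>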
  97     <para>The <literal>llstat</literal> utility can be used to monitor some Lustre file system I/O
  98       activity over a specified time period. For more details, see <xref
  99         xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438219_23232"/>.</para>
 100     <para>Some data is imported from attached clients and is available in a directory called
 101         <literal>exports</literal> located in the corresponding per-service directory on a Lustre
 102       server. For
 103       example:<screen># ls /proc/fs/lustre/obdfilter/testfs-OST0000/exports/192.168.124.9\@o2ib1/
 104 hash ldlm_stats stats uuid</screen></para>
 105     <section remap="h3">
 106       <title>Identifying Lustre File Systems and Servers</title>
 107       <para>Several <literal>/proc</literal> files on the MGS list existing Lustre file systems and
 108         file system servers. The examples below are for a Lustre file system called
 109           <literal>testfs</literal> with one MDT and three OSTs.</para>
 110       <itemizedlist>
 111         <listitem>
 112           <para> To view all known Lustre file systems, enter:</para>
 113           <screen>mgs# lctl get_param mgs.*.filesystems
 114 testfs</screen>
 115         </listitem>
 116         <listitem>
 117           <para> To view the names of the servers in a file system in which at least one server is
 118             running,
 119             enter:<screen>lctl get_param mgs.*.live.<replaceable>&lt;filesystem name></replaceable></screen></para>
 120           <para>For example:</para>
 121           <screen>mgs# lctl get_param mgs.*.live.testfs
 122 fsname: testfs
 123 flags: 0x20     gen: 45
 124 testfs-MDT0000
 125 testfs-OST0000
 126 testfs-OST0001
 127 testfs-OST0002
 128
 129 Secure RPC Config Rules:
 130
 131 imperative_recovery_state:
 132     state: startup
 133     nonir_clients: 0
 134     nidtbl_version: 6
 135     notify_duration_total: 0.001000
 136     notify_duation_max:  0.001000
 137     notify_count: 4</screen>
 138         </listitem>
 139         <listitem>
 140           <para>To view the names of all live servers in the file system as listed in
 141               <literal>/proc/fs/lustre/devices</literal>, enter:</para>
 142           <screen># lctl device_list
 143 0 UP mgs MGS MGS 11
 144 1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a4922670 5
 145 2 UP mdt MDS MDS_uuid 3
 146 3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
 147 4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 7
 148 5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
 149 6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
 150 7 UP lov testfs-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 4
 151 8 UP mdc testfs-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
 152 9 UP osc testfs-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
 153 10 UP osc testfs-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5</screen>
 154           <para>The information provided on each line includes:</para>
 155           <para> -  Device number</para>
 156           <para> - Device status (UP, INactive, or STopping) </para> <para> -  Device type (for example, <literal>mgs</literal> or <literal>osc</literal>)</para>
 157           <para> -  Device name</para>
 158           <para> -  Device UUID</para>
 159           <para> -  Reference count (how many users this device has)</para>
 160         </listitem>
 161         <listitem>
 162           <para>To display the name of any server, view the device
 163             label:<screen>mds# e2label /dev/sda
 164 testfs-MDT0000</screen></para>
 165         </listitem>
 166       </itemizedlist>
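       <para>A line of <literal>lctl device_list</literal> output can be split into its fields with
         a few lines of Python. This is a sketch only: the names <literal>Device</literal> and
         <literal>parse_device_line</literal> are hypothetical, and the sketch assumes six
         whitespace-separated fields per line, with the third column (for example
         <literal>mds</literal>) treated as the device type based on the sample output above.</para>
       <programlisting language="python">
```python
from collections import namedtuple

# Field names are illustrative labels for the six columns in the sample output.
Device = namedtuple("Device", "number status dev_type name uuid refcount")


def parse_device_line(line):
    """Split one device_list line into a Device record."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    number, status, dev_type, name, uuid, refcount = fields
    return Device(int(number), status, dev_type, name, uuid, int(refcount))


dev = parse_device_line("4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 7")
print(dev.name, dev.status, dev.refcount)
```
       </programlisting>
       <para>Raising an error on an unexpected field count makes malformed or truncated lines
         visible instead of silently misassigning columns.</para>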
 167     </section>
 168   </section>
 169   <section>
 170     <title>Tuning Multi-Block Allocation (mballoc)</title>
 171     <para>Capabilities supported by <literal>mballoc</literal> include:</para>
 172     <itemizedlist>
 173       <listitem>
 174         <para> Pre-allocation for single files to help reduce fragmentation.</para>
 175       </listitem>
 176       <listitem>
 177         <para> Pre-allocation for a group of files to enable packing of small files into large,
 178           contiguous chunks.</para>
 179       </listitem>
 180       <listitem>
 181         <para> Stream allocation to help decrease the seek rate.</para>
 182       </listitem>
 183     </itemizedlist>
 184     <para>The following <literal>mballoc</literal> tunables are available:</para>
 185     <informaltable frame="all">
 186       <tgroup cols="2">
 187         <colspec colname="c1" colwidth="30*"/>
 188         <colspec colname="c2" colwidth="70*"/>
 189         <thead>
 190           <row>
 191             <entry>
 192               <para><emphasis role="bold">Field</emphasis></para>
 193             </entry>
 194             <entry>
 195               <para><emphasis role="bold">Description</emphasis></para>
 196             </entry>
 197           </row>
 198         </thead>
 199         <tbody>
 200           <row>
 201             <entry>
 202               <para>
 203                 <literal>mb_max_to_scan</literal></para>
 204             </entry>
 205             <entry>
 206               <para>Maximum number of free chunks that <literal>mballoc</literal> examines before
 207                 making a final allocation decision. This limit avoids a livelock situation.</para>
 208             </entry>
 209           </row>
 210           <row>
 211             <entry>
 212               <para>
 213                 <literal>mb_min_to_scan</literal></para>
 214             </entry>
 215             <entry>
 216               <para>Minimum number of free chunks that <literal>mballoc</literal> searches before
 217                 picking the best chunk for allocation. This is useful for small requests to reduce
 218                 fragmentation of big free chunks.</para>
 219             </entry>
 220           </row>
 221           <row>
 222             <entry>
 223               <para>
 224                 <literal>mb_order2_req</literal></para>
 225             </entry>
 226             <entry>
 227               <para>For requests equal to 2^N, where N &gt;= <literal>mb_order2_req</literal>, a
 228                 fast search is done using a base 2 buddy allocation service.</para>
 229             </entry>
 230           </row>
 231           <row>
 232             <entry>
 233               <para>
 234                 <literal>mb_small_req</literal></para>
 235             </entry>
 236             <entry morerows="1">
 237               <para><literal>mb_small_req</literal> - Defines (in MB) the upper bound of "small
 238                 requests".</para>
 239               <para><literal>mb_large_req</literal> - Defines (in MB) the lower bound of "large
 240                 requests".</para>
 241               <para>Requests are handled differently based on size:<itemizedlist>
 242                   <listitem>
 243                     <para>&lt; <literal>mb_small_req</literal> - Requests are packed together to
 244                       form large, aggregated requests.</para>
 245                   </listitem>
 246                   <listitem>
 247                     <para>> <literal>mb_small_req</literal> and &lt; <literal>mb_large_req</literal>
 248                       - Requests are primarily allocated linearly.</para>
 249                   </listitem>
 250                   <listitem>
 251                     <para>> <literal>mb_large_req</literal> - Requests are allocated since hard disk
 252                       seek time is less of a concern in this case.</para>
 253                   </listitem>
 254                 </itemizedlist></para>
 255               <para>In general, small requests are combined to create larger requests, which are
 256                 then placed close to one another to minimize the number of seeks required to access
 257                 the data.</para>
 258             </entry>
 259           </row>
 260           <row>
 261             <entry>
 262               <para>
 263                 <literal>mb_large_req</literal></para>
 264             </entry>
 265           </row>
 266           <row>
 267             <entry>
 268               <para>
 269                 <literal>mb_prealloc_table</literal></para>
 270             </entry>
 271             <entry>
 272               <para>A table of values used to preallocate space when a new request is received. By
 273                 default, the table looks like
 274                 this:<screen>prealloc_table
 275 4 8 16 32 64 128 256 512 1024 2048 </screen></para>
 276               <para>When a new request is received, space is preallocated at the next higher
 277                 increment specified in the table. For example, for requests of less than 4 file
 278                 system blocks, 4 blocks of space are preallocated; for requests between 4 and 8, 8
 279                 blocks are preallocated; and so forth.</para>
 280               <para>Although customized values can be entered in the table, the performance of
 281                 general usage file systems will not typically be improved by modifying the table (in
 282                 fact, in ext4 systems, the table values are fixed).  However, for some specialized
 283                 workloads, tuning the <literal>prealloc_table</literal> values may result in smarter
 284                 preallocation decisions. </para>
 285             </entry>
 286           </row>
 287           <row>
 288             <entry>
 289               <para>
 290                 <literal>mb_group_prealloc</literal></para>
 291             </entry>
 292             <entry>
 293               <para>The amount of space (in kilobytes) preallocated for groups of small
 294                 requests.</para>
 295             </entry>
 296           </row>
 297         </tbody>
 298       </tgroup>
 299     </informaltable>
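     <para>The next-higher-increment rule described for <literal>mb_prealloc_table</literal> can be
       sketched in Python. This is an illustration, not the ldiskfs implementation: the function
       name <literal>prealloc_blocks</literal> is hypothetical, and the sketch assumes that a
       request exactly equal to a table entry uses that entry and that requests larger than the
       last entry are not covered by the table.</para>
     <programlisting language="python">
```python
import bisect

# Default table values as listed in the description of mb_prealloc_table.
DEFAULT_TABLE = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


def prealloc_blocks(request_blocks, table=DEFAULT_TABLE):
    """Return the table entry used for a request, or None if none applies."""
    # Index of the smallest entry that is at least as large as the request.
    index = bisect.bisect_left(table, request_blocks)
    if index == len(table):
        return None  # assumption: the table does not cover larger requests
    return table[index]


print(prealloc_blocks(3))  # prints 4
print(prealloc_blocks(5))  # prints 8
```
     </programlisting>
     <para>Passing a different sorted list as <literal>table</literal> shows how a customized table
       would change the preallocation size chosen for a given request.</para>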
 300     <para>Buddy group cache information found in
 301           <literal>/proc/fs/ldiskfs/<replaceable>disk_device</replaceable>/mb_groups</literal> may
 302       be useful for assessing on-disk fragmentation. For
 303       example:<screen>cat /proc/fs/ldiskfs/loop0/mb_groups
 304 #group: free free frags first pa [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9
 305      2^10 2^11 2^12 2^13]
 306 #0    : 2936 2936 1     42    0  [ 0   0   0   1   1   1   1   2   0   1
 307      2    0    0    0   ]</screen></para>
 308     <para>In this example, the columns show:<itemizedlist>
 309         <listitem>
 310           <para>#group number</para>
 311         </listitem>
 312         <listitem>
 313           <para>Available blocks in the group</para>
 314         </listitem>
 315         <listitem>
 316           <para>Blocks free on a disk</para>
 317         </listitem>
 318         <listitem>
 319           <para>Number of free fragments</para>
 320         </listitem>
 321         <listitem>
 322           <para>First free block in the group</para>
 323         </listitem>
 324         <listitem>
 325           <para>Number of preallocated chunks (not blocks)</para>
 326         </listitem>
 327         <listitem>
 328           <para>A series of available chunks of different sizes</para>
 329         </listitem>
 330       </itemizedlist></para>
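     <para>The sample <literal>mb_groups</literal> line can be checked with a short Python sketch.
       The function and key names below are hypothetical, and the sketch assumes that the bracketed
       values count free chunks of 2^0, 2^1, ... 2^13 blocks in that order. Under that assumption,
       the chunk counts in the sample add up to the free block count reported for the group.</para>
     <programlisting language="python">
```python
# The sample line from above, joined onto a single line.
SAMPLE = "#0    : 2936 2936 1     42    0  [ 0   0   0   1   1   1   1   2   0   1 2    0    0    0   ]"


def parse_mb_groups_line(line):
    """Split one mb_groups data line into named values."""
    tokens = line.replace("[", " ").replace("]", " ").replace(":", " ").split()
    available, free, frags, first, pa = (int(t) for t in tokens[1:6])
    return {
        "group": int(tokens[0].lstrip("#")),
        "available": available,
        "free": free,
        "frags": frags,
        "first": first,
        "pa": pa,
        "buddy": [int(t) for t in tokens[6:]],
    }


def blocks_in_buddy_counts(buddy):
    """Total blocks represented by per-order chunk counts (order n is 2^n blocks)."""
    return sum(count * 2 ** order for order, count in enumerate(buddy))


info = parse_mb_groups_line(SAMPLE)
print(info["free"], blocks_in_buddy_counts(info["buddy"]))  # prints 2936 2936
```
     </programlisting>
     <para>A group whose free blocks are concentrated in the low-order columns is more fragmented
       than one whose free blocks sit in a few high-order chunks.</para>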
 331   </section>
 332   <section>
 333     <title>Monitoring Lustre File System I/O</title>
 334     <para>A number of system utilities are provided to enable collection of data related to I/O
 335       activity in a Lustre file system. In general, the data collected describes:</para>
 336     <itemizedlist>
 337       <listitem>
 338         <para> Data transfer rates and throughput of inputs and outputs external to the Lustre file
 339           system, such as network requests or disk I/O operations performed.</para>
 340       </listitem>
 341       <listitem>
 342         <para> Data about the throughput or transfer rates of internal Lustre file system data, such
 343           as locks or allocations. </para>
 344       </listitem>
 345     </itemizedlist>
 346     <note>
 347       <para>It is highly recommended that you complete baseline testing for your Lustre file system
 348         to determine normal I/O activity for your hardware, network, and system workloads. Baseline
 349         data will allow you to easily determine when performance becomes degraded in your system.
 350         Two particularly useful baseline statistics are:</para>
 351       <itemizedlist>
 352         <listitem>
 353           <para><literal>brw_stats</literal> – Histogram data characterizing I/O requests to the
 354             OSTs. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 355               linkend="dbdoclet.50438271_55057"/>.</para>
 356         </listitem>
 357         <listitem>
 358           <para><literal>rpc_stats</literal> – Histogram data showing information about RPCs made by
 359             clients. For more details, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 360               linkend="MonitoringClientRCPStream"/>.</para>
 361         </listitem>
 362       </itemizedlist>
 363     </note>
 364     <section remap="h3" xml:id="MonitoringClientRCPStream">
 365       <title><indexterm>
 366           <primary>proc</primary>
 367           <secondary>watching RPC</secondary>
 368         </indexterm>Monitoring the Client RPC Stream</title>
 369       <para>The <literal>rpc_stats</literal> file contains histogram data showing information about
 370         remote procedure calls (RPCs) that have been made since this file was last cleared. The
 371         histogram data can be cleared by writing any value into the <literal>rpc_stats</literal>
 372         file.</para>
 373       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 374       <screen># lctl get_param osc.testfs-OST0000-osc-ffff810058d2f800.rpc_stats
 375 snapshot_time:            1372786692.389858 (secs.usecs)
 376 read RPCs in flight:      0
 377 write RPCs in flight:     1
 378 dio read RPCs in flight:  0
 379 dio write RPCs in flight: 0
 380 pending write pages:      256
 381 pending read pages:       0
 382
 383                      read                   write
 384 pages per rpc   rpcs   % cum % |       rpcs   % cum %
 385 1:                 0   0   0   |          0   0   0
 386 2:                 0   0   0   |          1   0   0
 387 4:                 0   0   0   |          0   0   0
 388 8:                 0   0   0   |          0   0   0
 389 16:                0   0   0   |          0   0   0
 390 32:                0   0   0   |          2   0   0
 391 64:                0   0   0   |          2   0   0
 392 128:               0   0   0   |          5   0   0
 393 256:             850 100 100   |      18346  99 100
 394
 395                      read                   write
 396 rpcs in flight  rpcs   % cum % |       rpcs   % cum %
 397 0:               691  81  81   |       1740   9   9
 398 1:                48   5  86   |        938   5  14
 399 2:                29   3  90   |       1059   5  20
 400 3:                17   2  92   |       1052   5  26
 401 4:                13   1  93   |        920   5  31
 402 5:                12   1  95   |        425   2  33
 403 6:                10   1  96   |        389   2  35
 404 7:                30   3 100   |      11373  61  97
 405 8:                 0   0 100   |        460   2 100
 406
 407                      read                   write
 408 offset          rpcs   % cum % |       rpcs   % cum %
 409 0:               850 100 100   |      18347  99  99
 410 1:                 0   0 100   |          0   0  99
 411 2:                 0   0 100   |          0   0  99
 412 4:                 0   0 100   |          0   0  99
 413 8:                 0   0 100   |          0   0  99
 414 16:                0   0 100   |          1   0  99
 415 32:                0   0 100   |          1   0  99
 416 64:                0   0 100   |          3   0  99
 417 128:               0   0 100   |          4   0 100
 418
 419 </screen>
 420       <para>The header information includes:</para>
 421       <itemizedlist>
 422         <listitem>
 423           <para><literal>snapshot_time</literal> - UNIX epoch instant the file was read.</para>
 424         </listitem>
 425         <listitem>
 426           <para><literal>read RPCs in flight</literal> - Number of read RPCs issued by the OSC, but
 427             not complete at the time of the snapshot. This value should always be less than or equal
 428             to <literal>max_rpcs_in_flight</literal>.</para>
 429         </listitem>
 430         <listitem>
 431           <para><literal>write RPCs in flight</literal> - Number of write RPCs issued by the OSC,
 432             but not complete at the time of the snapshot. This value should always be less than or
 433             equal to <literal>max_rpcs_in_flight</literal>.</para>
 434         </listitem>
 435         <listitem>
 436           <para><literal>dio read RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
 437             read RPCs issued but not completed at the time of the snapshot.</para>
 438         </listitem>
 439         <listitem>
 440           <para><literal>dio write RPCs in flight</literal> - Direct I/O (as opposed to block I/O)
 441             write RPCs issued but not completed at the time of the snapshot.</para>
 442         </listitem>
 443         <listitem>
 444           <para><literal>pending write pages</literal>  - Number of pending write pages that have
 445             been queued for I/O in the OSC.</para>
 446         </listitem>
 447         <listitem>
 448           <para><literal>pending read pages</literal> - Number of pending read pages that have been
 449             queued for I/O in the OSC.</para>
 450         </listitem>
 451       </itemizedlist>
 452       <para>The tabular data is described in the table below. Each row in the table shows the number
 453         of read or write RPCs (<literal>rpcs</literal>) occurring for the statistic, the relative
 454         percentage (<literal>%</literal>) of total read or write RPCs, and the cumulative percentage
 455           (<literal>cum %</literal>) to that point in the table for the statistic.</para>
 456       <informaltable frame="all">
 457         <tgroup cols="2">
 458           <colspec colname="c1" colwidth="40*"/>
 459           <colspec colname="c2" colwidth="60*"/>
 460           <thead>
 461             <row>
 462               <entry>
 463                 <para><emphasis role="bold">Field</emphasis></para>
 464               </entry>
 465               <entry>
 466                 <para><emphasis role="bold">Description</emphasis></para>
 467               </entry>
 468             </row>
 469           </thead>
 470           <tbody>
 471             <row>
 472               <entry>
 473                 <para> pages per RPC</para>
 474               </entry>
 475               <entry>
 476                 <para>Shows cumulative RPC reads and writes organized according to the number of
 477                   pages in the RPC. A single page RPC increments the <literal>1:</literal>
 478                   row.</para>
 479               </entry>
 480             </row>
 481             <row>
 482               <entry>
 483                 <para> RPCs in flight</para>
 484               </entry>
 485               <entry>
 486                 <para> Shows the number of RPCs that are pending when an RPC is sent. When the first
 487                   RPC is sent, the <literal>0:</literal> row is incremented. If an RPC is
 488                   sent while another RPC is pending, the <literal>1:</literal> row is incremented
 489                   and so on. </para>
 490               </entry>
 491             </row>
 492             <row>
 493               <entry>
 494                 <para> offset</para>
 495               </entry>
 496               <entry>
 497                 <para> The page index of the first page read from or written to the object by the
 498                   RPC. </para>
 499               </entry>
 500             </row>
 501           </tbody>
 502         </tgroup>
 503       </informaltable>
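       <para>The <literal>%</literal> and <literal>cum %</literal> columns can be recomputed from
         the <literal>rpcs</literal> column. The Python sketch below uses a hypothetical helper,
         <literal>histogram_percentages</literal>, and truncating integer division. Truncation is
         an observation from the sample output, which it reproduces, rather than a statement about
         how the Lustre software computes the values.</para>
       <programlisting language="python">
```python
def histogram_percentages(counts):
    """Return (percent, cumulative percent) pairs for a list of row counts."""
    total = sum(counts)
    rows = []
    running = 0
    for count in counts:
        running += count
        if total == 0:
            rows.append((0, 0))  # empty histogram: avoid dividing by zero
        else:
            rows.append((count * 100 // total, running * 100 // total))
    return rows


# Read column of the "rpcs in flight" histogram from the example above.
read_rpcs_in_flight = [691, 48, 29, 17, 13, 12, 10, 30, 0]
for row, (pct, cum) in enumerate(histogram_percentages(read_rpcs_in_flight)):
    print(f"{row}: {pct} {cum}")
```
       </programlisting>
       <para>Because each percentage is truncated independently, the values in the
         <literal>%</literal> column do not necessarily add up to the <literal>cum %</literal>
         value on the same row.</para>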
 504       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
 505       <para>This table provides a way to visualize the concurrency of the RPC stream. Ideally, you
 506         will see a large clump around the <literal>max_rpcs_in_flight</literal> value, which shows
 507         that the network is being kept busy.</para>
 508       <para>For information about optimizing the client I/O RPC stream, see <xref
 509           xmlns:xlink="http://www.w3.org/1999/xlink" linkend="TuningClientIORPCStream"/>.</para>
 510     </section>
 511     <section xml:id="lustreproc.clientstats" remap="h3">
 512       <title><indexterm>
 513           <primary>proc</primary>
 514           <secondary>client stats</secondary>
 515         </indexterm>Monitoring Client Activity</title>
 516       <para>The <literal>stats</literal> file maintains statistics accumulated during typical
 517         operation of a client across the VFS interface of the Lustre file system. Only non-zero
 518         parameters are displayed in the file. </para>
 519       <para>Client statistics are enabled by default.</para>
 520       <note>
 521         <para>Statistics for all mounted file systems can be discovered by
 522           entering:<screen>lctl get_param llite.*.stats</screen></para>
 523       </note>
 524       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 525       <screen>client# lctl get_param llite.*.stats
 526 snapshot_time          1308343279.169704 secs.usecs
 527 dirty_pages_hits       14819716 samples [regs]
 528 dirty_pages_misses     81473472 samples [regs]
 529 read_bytes             36502963 samples [bytes] 1 26843582 55488794
 530 write_bytes            22985001 samples [bytes] 0 125912 3379002
 531 brw_read               2279 samples [pages] 1 1 2270
 532 ioctl                  186749 samples [regs]
 533 open                   3304805 samples [regs]
 534 close                  3331323 samples [regs]
 535 seek                   48222475 samples [regs]
 536 fsync                  963 samples [regs]
 537 truncate               9073 samples [regs]
 538 setxattr               19059 samples [regs]
 539 getxattr               61169 samples [regs]
 540 </screen>
 541       <para> The statistics can be cleared by echoing an empty string into the
 542           <literal>stats</literal> file or by using the command:
 543         <screen>lctl set_param llite.*.stats=0</screen></para>
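       <para>Each counter line in the example output above has the form name, sample count, the
         word <literal>samples</literal>, a unit in brackets and, for some counters, three further
         values (min, max and sum). A minimal Python sketch for reading such lines follows. The
         function name <literal>parse_stats_line</literal> and the dictionary keys are
         hypothetical, and the layout is inferred from the example rather than from a format
         specification.</para>
       <programlisting language="python">
```python
def parse_stats_line(line):
    """Parse one counter line of a stats file; return None for other lines."""
    tokens = line.split()
    # Counter lines have the literal word "samples" in the third position.
    if len(tokens) not in (4, 7) or tokens[2] != "samples":
        return None  # for example, the snapshot_time line
    entry = {
        "name": tokens[0],
        "samples": int(tokens[1]),
        "unit": tokens[3].strip("[]"),
    }
    if len(tokens) == 7:
        entry["min"], entry["max"], entry["sum"] = (int(t) for t in tokens[4:7])
    return entry


line = "read_bytes             36502963 samples [bytes] 1 26843582 55488794"
print(parse_stats_line(line))
```
       </programlisting>
       <para>Returning <literal>None</literal> for lines that do not match lets a caller skip the
         <literal>snapshot_time</literal> header without special handling.</para>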
 544       <para>The statistics displayed are described in the table below.</para>
 545       <informaltable frame="all">
 546         <tgroup cols="2">
 547           <colspec colname="c1" colwidth="3*"/>
 548           <colspec colname="c2" colwidth="7*"/>
 549           <thead>
 550             <row>
 551               <entry>
 552                 <para><emphasis role="bold">Entry</emphasis></para>
 553               </entry>
 554               <entry>
 555                 <para><emphasis role="bold">Description</emphasis></para>
 556               </entry>
 557             </row>
 558           </thead>
 559           <tbody>
 560             <row>
 561               <entry>
 562                 <para>
 563                   <literal>snapshot_time</literal></para>
 564               </entry>
 565               <entry>
 566                 <para>UNIX epoch instant the stats file was read.</para>
 567               </entry>
 568             </row>
 569             <row>
 570               <entry>
 571                 <para>
 572                   <literal>dirty_pages_hits</literal></para>
 573               </entry>
 574               <entry>
 575                 <para>The number of write operations that have been satisfied by the dirty page
 576                   cache. See <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 577                     linkend="TuningClientIORPCStream"/> for more information about dirty cache
 578                   behavior in a Lustre file system.</para>
 579               </entry>
 580             </row>
 581             <row>
 582               <entry>
 583                 <para>
 584                   <literal>dirty_pages_misses</literal></para>
 585               </entry>
 586               <entry>
 587                 <para>The number of write operations that were not satisfied by the dirty page
 588                   cache.</para>
 589               </entry>
 590             </row>
 591             <row>
 592               <entry>
 593                 <para>
 594                   <literal>read_bytes</literal></para>
 595               </entry>
 596               <entry>
 597                 <para>The number of read operations that have occurred. Three additional parameters
 598                   are displayed:</para>
 599                 <variablelist>
 600                   <varlistentry>
 601                     <term>min</term>
 602                     <listitem>
 603                       <para>The minimum number of bytes read in a single request since the counter
 604                         was reset.</para>
 605                     </listitem>
 606                   </varlistentry>
 607                   <varlistentry>
 608                     <term>max</term>
 609                     <listitem>
 610                       <para>The maximum number of bytes read in a single request since the counter
 611                         was reset.</para>
 612                     </listitem>
 613                   </varlistentry>
 614                   <varlistentry>
 615                     <term>sum</term>
 616                     <listitem>
 617                       <para>The accumulated sum of bytes of all read requests since the counter was
 618                         reset.</para>
 619                     </listitem>
 620                   </varlistentry>
 621                 </variablelist>
 622               </entry>
 623             </row>
 624             <row>
 625               <entry>
 626                 <para>
 627                   <literal>write_bytes</literal></para>
 628               </entry>
 629               <entry>
 630                 <para>The number of write operations that have occurred. Three additional parameters
 631                   are displayed:</para>
 632                 <variablelist>
 633                   <varlistentry>
 634                     <term>min</term>
 635                     <listitem>
 636                       <para>The minimum number of bytes written in a single request since the
 637                         counter was reset.</para>
 638                     </listitem>
 639                   </varlistentry>
 640                   <varlistentry>
 641                     <term>max</term>
 642                     <listitem>
 643                       <para>The maximum number of bytes written in a single request since the
 644                         counter was reset.</para>
 645                     </listitem>
 646                   </varlistentry>
 647                   <varlistentry>
 648                     <term>sum</term>
 649                     <listitem>
 650                       <para>The accumulated sum of bytes of all write requests since the counter was
 651                         reset.</para>
 652                     </listitem>
 653                   </varlistentry>
 654                 </variablelist>
 655               </entry>
 656             </row>
 657             <row>
 658               <entry>
 659                 <para>
 660                   <literal>brw_read</literal></para>
 661               </entry>
 662               <entry>
 663                 <para>The number of pages that have been read. Three additional parameters are
 664                   displayed:</para>
 665                 <variablelist>
 666                   <varlistentry>
 667                     <term>min</term>
 668                     <listitem>
 669                       <para>The minimum number of bytes read in a single block read/write
 670                           (<literal>brw</literal>) read request since the counter was reset.</para>
 671                     </listitem>
 672                   </varlistentry>
 673                   <varlistentry>
 674                     <term>max</term>
 675                     <listitem>
                      <para>The maximum number of bytes read in a single <literal>brw</literal> read
                        request since the counter was reset.</para>
 678                     </listitem>
 679                   </varlistentry>
 680                   <varlistentry>
 681                     <term>sum</term>
 682                     <listitem>
 683                       <para>The accumulated sum of bytes of all <literal>brw</literal> read requests
 684                         since the counter was reset.</para>
 685                     </listitem>
 686                   </varlistentry>
 687                 </variablelist>
 688               </entry>
 689             </row>
 690             <row>
 691               <entry>
 692                 <para>
 693                   <literal>ioctl</literal></para>
 694               </entry>
 695               <entry>
 696                 <para>The number of combined file and directory <literal>ioctl</literal>
 697                   operations.</para>
 698               </entry>
 699             </row>
 700             <row>
 701               <entry>
 702                 <para>
 703                   <literal>open</literal></para>
 704               </entry>
 705               <entry>
 706                 <para>The number of open operations that have succeeded.</para>
 707               </entry>
 708             </row>
 709             <row>
 710               <entry>
 711                 <para>
 712                   <literal>close</literal></para>
 713               </entry>
 714               <entry>
 715                 <para>The number of close operations that have succeeded.</para>
 716               </entry>
 717             </row>
 718             <row>
 719               <entry>
 720                 <para>
 721                   <literal>seek</literal></para>
 722               </entry>
 723               <entry>
 724                 <para>The number of times <literal>seek</literal> has been called.</para>
 725               </entry>
 726             </row>
 727             <row>
 728               <entry>
 729                 <para>
 730                   <literal>fsync</literal></para>
 731               </entry>
 732               <entry>
 733                 <para>The number of times <literal>fsync</literal> has been called.</para>
 734               </entry>
 735             </row>
 736             <row>
 737               <entry>
 738                 <para>
 739                   <literal>truncate</literal></para>
 740               </entry>
 741               <entry>
 742                 <para>The total number of calls to both locked and lockless
 743                     <literal>truncate</literal>.</para>
 744               </entry>
 745             </row>
 746             <row>
 747               <entry>
 748                 <para>
 749                   <literal>setxattr</literal></para>
 750               </entry>
 751               <entry>
 752                 <para>The number of times extended attributes have been set. </para>
 753               </entry>
 754             </row>
 755             <row>
 756               <entry>
 757                 <para>
 758                   <literal>getxattr</literal></para>
 759               </entry>
 760               <entry>
 761                 <para>The number of times value(s) of extended attributes have been fetched.</para>
 762               </entry>
 763             </row>
 764           </tbody>
 765         </tgroup>
 766       </informaltable>
 767       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
      <para>Information is provided about the amount and type of I/O activity taking place on the
        client.</para>
 770     </section>
 771     <section remap="h3">
 772       <title><indexterm>
 773           <primary>proc</primary>
 774           <secondary>read/write survey</secondary>
 775         </indexterm>Monitoring Client Read-Write Offset Statistics</title>
 776       <para>When the <literal>offset_stats</literal> parameter is set, statistics are maintained for
 777         occurrences of a series of read or write calls from a process that did not access the next
 778         sequential location. The <literal>OFFSET</literal> field is reset to 0 (zero) whenever a
 779         different file is read or written.</para>
 780       <note>
 781         <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
 782             <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
 783           to reduce monitoring overhead when this information is not needed.  The collection of
 784           statistics in all three of these files is activated by writing anything into any one of
 785           the files.</para>
 786       </note>
 787       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 788       <screen># lctl get_param llite.testfs-f57dee0.offset_stats
 789 snapshot_time: 1155748884.591028 (secs.usecs)
 790              RANGE   RANGE    SMALLEST   LARGEST
 791 R/W   PID    START   END      EXTENT     EXTENT    OFFSET
 792 R     8385   0       128      128        128       0
 793 R     8385   0       224      224        224       -128
 794 W     8385   0       250      50         100       0
 795 W     8385   100     1110     10         500       -150
 796 W     8384   0       5233     5233       5233      0
 797 R     8385   500     600      100        100       -610</screen>
 798       <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file was
 799         read. The tabular data is described in the table below.</para>
 800       <para>The <literal>offset_stats</literal> file can be cleared by
 801         entering:<screen>lctl set_param llite.*.offset_stats=0</screen></para>
 802       <informaltable frame="all">
 803         <tgroup cols="2">
 804           <colspec colname="c1" colwidth="50*"/>
 805           <colspec colname="c2" colwidth="50*"/>
 806           <thead>
 807             <row>
 808               <entry>
 809                 <para><emphasis role="bold">Field</emphasis></para>
 810               </entry>
 811               <entry>
 812                 <para><emphasis role="bold">Description</emphasis></para>
 813               </entry>
 814             </row>
 815           </thead>
 816           <tbody>
 817             <row>
 818               <entry>
 819                 <para>R/W</para>
 820               </entry>
 821               <entry>
                <para>Indicates whether the non-sequential call was a read or a write.</para>
 823               </entry>
 824             </row>
 825             <row>
 826               <entry>
 827                 <para>PID </para>
 828               </entry>
 829               <entry>
 830                 <para>Process ID of the process that made the read/write call.</para>
 831               </entry>
 832             </row>
 833             <row>
 834               <entry>
 835                 <para>RANGE START/RANGE END</para>
 836               </entry>
 837               <entry>
 838                 <para>Range in which the read/write calls were sequential.</para>
 839               </entry>
 840             </row>
 841             <row>
 842               <entry>
 843                 <para>SMALLEST EXTENT </para>
 844               </entry>
 845               <entry>
 846                 <para>Smallest single read/write in the corresponding range (in bytes).</para>
 847               </entry>
 848             </row>
 849             <row>
 850               <entry>
 851                 <para>LARGEST EXTENT </para>
 852               </entry>
 853               <entry>
 854                 <para>Largest single read/write in the corresponding range (in bytes).</para>
 855               </entry>
 856             </row>
 857             <row>
 858               <entry>
 859                 <para>OFFSET </para>
 860               </entry>
 861               <entry>
 862                 <para>Difference between the previous range end and the current range start.</para>
 863               </entry>
 864             </row>
 865           </tbody>
 866         </tgroup>
 867       </informaltable>
 868       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
      <para>This data provides an indication of how contiguous or fragmented the data is. For
        example, the fourth entry in the example above shows the writes for this RPC were sequential
        in the range 100 to 1110, with a minimum write of 10 bytes and a maximum write of 500
        bytes. The range started with an offset of -150 from the <literal>RANGE END</literal> of
        the previous entry in the example.</para>
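      <para>As a rough illustration of how this output can be read, the following shell sketch
        counts, for each PID and direction, how many ranges were recorded and how many of them
        began at a non-zero <literal>OFFSET</literal>. The function name
        <literal>count_offset_jumps</literal> is hypothetical, and the column layout is assumed to
        match the sample output shown in this section.</para>
      <screen># Hypothetical helper: summarize offset_stats text read from standard input.
# Assumes data rows look like "R 8385 0 128 128 128 0", as in the sample output.
count_offset_jumps() {
    awk '
        # Data rows start with R or W followed by a numeric PID; header rows do not.
        /^[RW] +[0-9]+ / {
            key = $2 " " $1              # PID, then R or W
            ranges[key]++                # one row per sequential range
            if ($7 != 0) jumps[key]++    # OFFSET column is not zero
        }
        END {
            for (key in ranges)
                printf "%s ranges=%d nonzero_offset=%d\n", key, ranges[key], jumps[key]
        }
    ' | sort
}</screen>
      <para>On a client, the function could be fed with
        <literal>lctl get_param llite.testfs-f57dee0.offset_stats | count_offset_jumps</literal>.
        For the sample output in this section, PID 8385 has three read ranges, two of which have a
        non-zero <literal>OFFSET</literal>.</para>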
 874     </section>
 875     <section remap="h3">
 876       <title><indexterm>
 877           <primary>proc</primary>
 878           <secondary>read/write survey</secondary>
 879         </indexterm>Monitoring Client Read-Write Extent Statistics</title>
 880       <para>For in-depth troubleshooting, client read-write extent statistics can be accessed to
 881         obtain more detail about read/write I/O extents for the file system or for a particular
 882         process.</para>
 883       <note>
 884         <para>By default, statistics are not collected in the <literal>offset_stats</literal>,
 885             <literal>extents_stats</literal>, and <literal>extents_stats_per_process</literal> files
 886           to reduce monitoring overhead when this information is not needed.  The collection of
 887           statistics in all three of these files is activated by writing anything into any one of
 888           the files.</para>
 889       </note>
 890       <section remap="h3">
 891         <title>Client-Based I/O Extent Size Survey</title>
        <para>The <literal>extents_stats</literal> histogram in the <literal>llite</literal>
          directory shows the statistics for the sizes of the read/write I/O extents. This file does
          not maintain per-process statistics.</para>
 895         <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 896         <screen># lctl get_param llite.testfs-*.extents_stats
 897 snapshot_time:                     1213828728.348516 (secs.usecs)
 898                        read           |            write
 899 extents          calls  %      cum%   |     calls  %     cum%
 900
 901 0K - 4K :        0      0      0      |     2      2     2
 902 4K - 8K :        0      0      0      |     0      0     2
 903 8K - 16K :       0      0      0      |     0      0     2
 904 16K - 32K :      0      0      0      |     20     23    26
 905 32K - 64K :      0      0      0      |     0      0     26
 906 64K - 128K :     0      0      0      |     51     60    86
 907 128K - 256K :    0      0      0      |     0      0     86
 908 256K - 512K :    0      0      0      |     0      0     86
 909 512K - 1024K :   0      0      0      |     0      0     86
 910 1M - 2M :        0      0      0      |     11     13    100</screen>
        <para>In this example, <literal>snapshot_time</literal> is the UNIX epoch instant the file
          was read. The table shows cumulative extents organized according to size with statistics
          provided separately for reads and writes. Each row in the table shows the number of RPCs
          for reads and writes respectively (<literal>calls</literal>), the relative percentage of
          total calls (<literal>%</literal>), and the cumulative percentage of calls up to that
          row of the table (<literal>cum%</literal>).</para>
        <para>The file can be cleared by issuing the following
 918           command:<screen># lctl set_param llite.testfs-*.extents_stats=0</screen></para>
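        <para>Judging only from the sample figures above, the percentage columns appear to use
          truncating integer arithmetic, and <literal>cum%</literal> appears to be derived from
          the running total of calls rather than by adding up the <literal>%</literal> column.
          The following shell sketch reproduces the write-side columns of the example under that
          assumption. The function name <literal>pct_columns</literal> is hypothetical and is not
          part of the Lustre software.</para>
        <screen># Hypothetical helper: print "calls % cum%" for each call count given as an argument.
# Assumes truncating integer division, inferred from the sample output only.
pct_columns() {
    total=0
    for calls in "$@"; do
        total=$((total + calls))
    done
    [ "$total" -ne 0 ] || return 0      # nothing recorded, nothing to print
    running=0
    for calls in "$@"; do
        running=$((running + calls))
        echo "$calls $((100 * calls / total)) $((100 * running / total))"
    done
}

# Write-side call counts from the extents_stats example, smallest bucket first.
pct_columns 2 0 0 20 0 51 0 0 0 11</screen>
        <para>For the 16K - 32K row this prints <literal>20 23 26</literal>: 20 of 84 calls
          truncates to 23 percent, and the running total of 22 calls truncates to 26 percent,
          which is why <literal>cum%</literal> can differ from the sum of the
          <literal>%</literal> column.</para>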
 919       </section>
 920       <section>
 921         <title>Per-Process Client I/O Statistics</title>
 922         <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size
 923           statistics on a per-process basis.</para>
 924         <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 925         <screen># lctl get_param llite.testfs-*.extents_stats_per_process
 926 snapshot_time:                     1213828762.204440 (secs.usecs)
 927                           read            |             write
 928 extents            calls   %      cum%    |      calls   %       cum%
 929
 930 PID: 11488
 931    0K - 4K :       0       0       0      |      0       0       0
 932    4K - 8K :       0       0       0      |      0       0       0
 933    8K - 16K :      0       0       0      |      0       0       0
 934    16K - 32K :     0       0       0      |      0       0       0
 935    32K - 64K :     0       0       0      |      0       0       0
 936    64K - 128K :    0       0       0      |      0       0       0
 937    128K - 256K :   0       0       0      |      0       0       0
 938    256K - 512K :   0       0       0      |      0       0       0
 939    512K - 1024K :  0       0       0      |      0       0       0
 940    1M - 2M :       0       0       0      |      10      100     100
 941
 942 PID: 11491
 943    0K - 4K :       0       0       0      |      0       0       0
 944    4K - 8K :       0       0       0      |      0       0       0
 945    8K - 16K :      0       0       0      |      0       0       0
 946    16K - 32K :     0       0       0      |      20      100     100
 947
 948 PID: 11424
 949    0K - 4K :       0       0       0      |      0       0       0
 950    4K - 8K :       0       0       0      |      0       0       0
 951    8K - 16K :      0       0       0      |      0       0       0
 952    16K - 32K :     0       0       0      |      0       0       0
 953    32K - 64K :     0       0       0      |      0       0       0
 954    64K - 128K :    0       0       0      |      16      100     100
 955
 956 PID: 11426
 957    0K - 4K :       0       0       0      |      1       100     100
 958
 959 PID: 11429
 960    0K - 4K :       0       0       0      |      1       100     100
 961
 962 </screen>
        <para>This table shows cumulative extents organized according to size for each process ID
          (PID) with statistics provided separately for reads and writes. Each row in the table
          shows the number of RPCs for reads and writes respectively (<literal>calls</literal>), the
          relative percentage of total calls (<literal>%</literal>), and the cumulative percentage
          of calls up to that row of the table (<literal>cum%</literal>).</para>
 968       </section>
 969     </section>
 970     <section xml:id="dbdoclet.50438271_55057">
 971       <title><indexterm>
 972           <primary>proc</primary>
 973           <secondary>block I/O</secondary>
 974         </indexterm>Monitoring the OST Block I/O Stream</title>
      <para>The <literal>brw_stats</literal> file in the <literal>obdfilter</literal> directory
        contains histogram data showing statistics for the number of I/O requests sent to the disk,
        their size, and whether they are contiguous on the disk or not.</para>
 978       <para><emphasis role="italic"><emphasis role="bold">Example:</emphasis></emphasis></para>
 979       <para>Enter on the OSS:</para>
 980       <screen># lctl get_param obdfilter.testfs-OST0000.brw_stats
 981 snapshot_time:         1372775039.769045 (secs.usecs)
 982                            read      |      write
 983 pages per bulk r/w     rpcs  % cum % |  rpcs   % cum %
 984 1:                     108 100 100   |    39   0   0
 985 2:                       0   0 100   |     6   0   0
 986 4:                       0   0 100   |     1   0   0
 987 8:                       0   0 100   |     0   0   0
 988 16:                      0   0 100   |     4   0   0
 989 32:                      0   0 100   |    17   0   0
 990 64:                      0   0 100   |    12   0   0
 991 128:                     0   0 100   |    24   0   0
 992 256:                     0   0 100   | 23142  99 100
 993
 994                            read      |      write
 995 discontiguous pages    rpcs  % cum % |  rpcs   % cum %
 996 0:                     108 100 100   | 23245 100 100
 997
 998                            read      |      write
 999 discontiguous blocks   rpcs  % cum % |  rpcs   % cum %
1000 0:                     108 100 100   | 23243  99  99
1001 1:                       0   0 100   |     2   0 100
1002
1003                            read      |      write
1004 disk fragmented I/Os   ios   % cum % |   ios   % cum %
1005 0:                      94  87  87   |     0   0   0
1006 1:                      14  12 100   | 23243  99  99
1007 2:                       0   0 100   |     2   0 100
1008
1009                            read      |      write
1010 disk I/Os in flight    ios   % cum % |   ios   % cum %
1011 1:                      14 100 100   | 20896  89  89
1012 2:                       0   0 100   |  1071   4  94
1013 3:                       0   0 100   |   573   2  96
1014 4:                       0   0 100   |   300   1  98
1015 5:                       0   0 100   |   166   0  98
1016 6:                       0   0 100   |   108   0  99
1017 7:                       0   0 100   |    81   0  99
1018 8:                       0   0 100   |    47   0  99
1019 9:                       0   0 100   |     5   0 100
1020
1021                            read      |      write
1022 I/O time (1/1000s)     ios   % cum % |   ios   % cum %
1023 1:                      94  87  87   |     0   0   0
1024 2:                       0   0  87   |     7   0   0
1025 4:                      14  12 100   |    27   0   0
1026 8:                       0   0 100   |    14   0   0
1027 16:                      0   0 100   |    31   0   0
1028 32:                      0   0 100   |    38   0   0
1029 64:                      0   0 100   | 18979  81  82
1030 128:                     0   0 100   |   943   4  86
1031 256:                     0   0 100   |  1233   5  91
1032 512:                     0   0 100   |  1825   7  99
1033 1K:                      0   0 100   |   99   0  99
1034 2K:                      0   0 100   |     0   0  99
1035 4K:                      0   0 100   |     0   0  99
1036 8K:                      0   0 100   |    49   0 100
1037
1038                            read      |      write
1039 disk I/O size          ios   % cum % |   ios   % cum %
1040 4K:                     14 100 100   |    41   0   0
1041 8K:                      0   0 100   |     6   0   0
1042 16K:                     0   0 100   |     1   0   0
1043 32K:                     0   0 100   |     0   0   0
1044 64K:                     0   0 100   |     4   0   0
1045 128K:                    0   0 100   |    17   0   0
1046 256K:                    0   0 100   |    12   0   0
1047 512K:                    0   0 100   |    24   0   0
1048 1M:                      0   0 100   | 23142  99 100
1049 </screen>
1050       <para>The tabular data is described in the table below. Each row in the table shows the number
1051         of reads and writes occurring for the statistic (<literal>ios</literal>), the relative
1052         percentage of total reads or writes (<literal>%</literal>), and the cumulative percentage to
1053         that point in the table for the statistic (<literal>cum %</literal>). </para>
1054       <informaltable frame="all">
1055         <tgroup cols="2">
1056           <colspec colname="c1" colwidth="40*"/>
1057           <colspec colname="c2" colwidth="60*"/>
1058           <thead>
1059             <row>
1060               <entry>
1061                 <para><emphasis role="bold">Field</emphasis></para>
1062               </entry>
1063               <entry>
1064                 <para><emphasis role="bold">Description</emphasis></para>
1065               </entry>
1066             </row>
1067           </thead>
1068           <tbody>
1069             <row>
1070               <entry>
1071                 <para>
1072                   <literal>pages per bulk r/w</literal></para>
1073               </entry>
1074               <entry>
1075                 <para>Number of pages per RPC request, which should match aggregate client
1076                     <literal>rpc_stats</literal> (see <xref
1077                     xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"
1078                   />).</para>
1079               </entry>
1080             </row>
1081             <row>
1082               <entry>
1083                 <para>
1084                   <literal>discontiguous pages</literal></para>
1085               </entry>
1086               <entry>
1087                 <para>Number of discontinuities in the logical file offset of each page in a single
1088                   RPC.</para>
1089               </entry>
1090             </row>
1091             <row>
1092               <entry>
1093                 <para>
1094                   <literal>discontiguous blocks</literal></para>
1095               </entry>
1096               <entry>
1097                 <para>Number of discontinuities in the physical block allocation in the file system
1098                   for a single RPC.</para>
1099               </entry>
1100             </row>
1101             <row>
1102               <entry>
1103                 <para><literal>disk fragmented I/Os</literal></para>
1104               </entry>
1105               <entry>
1106                 <para>Number of I/Os that were not written entirely sequentially.</para>
1107               </entry>
1108             </row>
1109             <row>
1110               <entry>
1111                 <para><literal>disk I/Os in flight</literal></para>
1112               </entry>
1113               <entry>
1114                 <para>Number of disk I/Os currently pending.</para>
1115               </entry>
1116             </row>
1117             <row>
1118               <entry>
1119                 <para><literal>I/O time (1/1000s)</literal></para>
1120               </entry>
1121               <entry>
1122                 <para>Amount of time for each I/O operation to complete.</para>
1123               </entry>
1124             </row>
1125             <row>
1126               <entry>
1127                 <para><literal>disk I/O size</literal></para>
1128               </entry>
1129               <entry>
1130                 <para>Size of each I/O operation.</para>
1131               </entry>
1132             </row>
1133           </tbody>
1134         </tgroup>
1135       </informaltable>
1136       <para><emphasis role="italic"><emphasis role="bold">Analysis:</emphasis></emphasis></para>
1137       <para>This data provides an indication of extent size and distribution in the file
1138         system.</para>
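      <para>Because <literal>brw_stats</literal> packs several histograms into one file, it can
        help to pull out one histogram at a time. The following shell sketch prints the bucket
        label, read count, and write count for a single named section. The function name
        <literal>brw_counts</literal> is hypothetical, and the layout is assumed to match the
        sample output above, with a blank line ending each section.</para>
      <screen># Hypothetical helper: print "bucket read_count write_count" for one brw_stats section.
# The section name is matched against the start of its header row, for example
# "discontiguous blocks". Input is brw_stats text on standard input.
brw_counts() {
    awk -v name="$1" '
        index($0, name) == 1 { wanted = 1; next }   # header row of the requested section
        wanted == 1 {
            if (NF == 0) exit                       # a blank line ends the section
            sub(/:$/, "", $1)                       # drop the colon after the bucket label
            print $1, $2, $6                        # bucket, read count, write count
        }
    '
}</screen>
      <para>On an OSS, the function could be used as
        <literal>lctl get_param obdfilter.testfs-OST0000.brw_stats | brw_counts "discontiguous
        blocks"</literal>. For the sample output above, this yields <literal>0 108 23243</literal>
        and <literal>1 0 2</literal>, showing that nearly all write RPCs had no physical
        discontinuity.</para>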
1139     </section>
1140   </section>
1141   <section>
1142     <title>Tuning Lustre File System I/O</title>
    <para>Each OSC has its own tree of tunables. For example:</para>
1144     <screen>$ ls -d /proc/fs/testfs/osc/OSC_client_ost1_MNT_client_2 /localhost
1145 /proc/fs/testfs/osc/OSC_uml0_ost1_MNT_localhost
1146 /proc/fs/testfs/osc/OSC_uml0_ost2_MNT_localhost
1147 /proc/fs/testfs/osc/OSC_uml0_ost3_MNT_localhost
1148
1149 $ ls /proc/fs/testfs/osc/OSC_uml0_ost1_MNT_localhost
blocksize filesfree max_dirty_mb ost_server_uuid stats
1151
1152 ...</screen>
1153     <para>The following sections describe some of the parameters that can be tuned in a Lustre file
1154       system.</para>
1155     <section remap="h3" xml:id="TuningClientIORPCStream">
1156       <title><indexterm>
1157           <primary>proc</primary>
1158           <secondary>RPC tunables</secondary>
1159         </indexterm>Tuning the Client I/O RPC Stream</title>
1160       <para>Ideally, an optimal amount of data is packed into each I/O RPC and a consistent number
1161         of issued RPCs are in progress at any time. To help optimize the client I/O RPC stream,
1162         several tuning variables are provided to adjust behavior according to network conditions and
1163         cluster size. For information about monitoring the client I/O RPC stream, see <xref
1164           xmlns:xlink="http://www.w3.org/1999/xlink" linkend="MonitoringClientRCPStream"/>.</para>
1165       <para>RPC stream tunables include:</para>
1166       <para>
1167         <itemizedlist>
1168           <listitem>
            <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal> -
              Controls how many MBs of dirty data can be written and queued up in the OSC. POSIX
              file writes that are cached contribute to this count. When the limit is reached,
              additional writes stall until previously-cached writes are written to the server. This
              may be changed by writing a single ASCII integer to the file. Only values between 0
              and 2048 or 1/4 of RAM are allowable. If 0 is specified, no writes are cached, and
              performance suffers noticeably unless you use large writes (1 MB or more).</para>
            <para>To maximize performance, the recommended value for
                <literal>max_dirty_mb</literal> is 4 * <literal>max_pages_per_rpc</literal> *
                <literal>max_rpcs_in_flight</literal>.</para>
1179           </listitem>
1180           <listitem>
1181             <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal> - A
1182               read-only value that returns the current number of bytes written and cached on this
1183               OSC.</para>
1184           </listitem>
1185           <listitem>
1186             <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal> -
1187               The maximum number of pages that will undergo I/O in a single RPC to the OST. The
1188               minimum setting is a single page and the maximum setting is 1024 (for systems with a
1189                 <literal>PAGE_SIZE</literal> of 4 KB), with the default maximum of 1 MB in the RPC.
1190               It is also possible to specify a units suffix (e.g. <literal>4M</literal>), so that
1191               the RPC size can be specified independently of the client
1192               <literal>PAGE_SIZE</literal>.</para>
1193           </listitem>
1194           <listitem>
            <para><literal>osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight</literal>
              - The maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC
              tries to initiate an RPC but finds that it already has this many RPCs
              outstanding, it will wait to issue further RPCs until some complete. The minimum
              setting is 1 and the maximum setting is 256.</para>
1200             <para>To improve small file I/O performance, increase the
1201                 <literal>max_rpcs_in_flight</literal> value.</para>
1202           </listitem>
1203           <listitem>
            <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_cached_mb</literal> -
              Maximum amount of inactive data cached by the client (default is 3/4 of RAM). For
              example:</para>
1207             <screen># lctl get_param llite.testfs-ce63ca00.max_cached_mb
1208 128</screen>
1209           </listitem>
1210         </itemizedlist>
1211       </para>
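      <para>The recommendation for <literal>max_dirty_mb</literal> above can be worked through
        with a small shell sketch. It assumes that the page count is converted to MB through the
        client <literal>PAGE_SIZE</literal>. The function name
        <literal>recommended_max_dirty_mb</literal> is hypothetical, and the input values are
        illustrative rather than read from a live system.</para>
      <screen># Hypothetical helper: recommended max_dirty_mb for one OSC, in MB.
# Arguments: max_pages_per_rpc, max_rpcs_in_flight, PAGE_SIZE in bytes.
# Assumes the page count is converted to MB through PAGE_SIZE.
recommended_max_dirty_mb() {
    rpc_bytes=$(($1 * $3))                    # size of one full RPC in bytes
    echo $((4 * rpc_bytes * $2 / 1048576))    # 4 * RPC size * RPCs in flight, in MB
}

# Illustrative values: 256 pages per RPC (1 MB with 4 KB pages), 8 RPCs in flight.
recommended_max_dirty_mb 256 8 4096</screen>
      <para>With these illustrative values the function prints 32, that is, 4 * 1 MB * 8. The
        result still has to fall within the range of values that
        <literal>max_dirty_mb</literal> accepts.</para>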
1212       <note>
1213         <para>The value for <literal><replaceable>osc_instance</replaceable></literal> is typically
1214               <literal><replaceable>fsname</replaceable>-OST<replaceable>ost_index</replaceable>-osc-<replaceable>mountpoint_instance</replaceable></literal>,
1215           where the value for <literal><replaceable>mountpoint_instance</replaceable></literal> is
1216           unique to each mount point to allow associating osc, mdc, lov, lmv, and llite parameters
1217           with the same mount point. For
1218           example:<screen>lctl get_param osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats
1219 osc.testfs-OST0000-osc-ffff88107412f400.rpc_stats=
1220 snapshot_time:         1375743284.337839 (secs.usecs)
1221 read RPCs in flight:  0
1222 write RPCs in flight: 0
1223 </screen></para>
1224       </note>
1225     </section>
1226     <section remap="h3">
1227       <title><indexterm>
1228           <primary>proc</primary>
1229           <secondary>readahead</secondary>
1230         </indexterm>Tuning File Readahead and Directory Statahead</title>
      <para>File readahead and directory statahead enable reading of data into memory before a
        process requests the data. File readahead reads file content data into memory and directory
        statahead reads metadata into memory. When readahead and statahead work well, a process that
        accesses data finds that the information it needs is already available in memory when
        requested, without the delay of network I/O.</para>
1236       <para condition="l22">In Lustre software release 2.2.0, the directory statahead feature was
1237         improved to enhance directory traversal performance. The improvements primarily addressed
1238         two issues: <orderedlist>
1239           <listitem>
1240             <para>A race condition existed between the statahead thread and other VFS operations
1241               while processing asynchronous <literal>getattr</literal> RPC replies, causing
1242               duplicate entries in dcache. This issue was resolved by using statahead local dcache.
1243             </para>
1244           </listitem>
1245           <listitem>
1246             <para>File size/block attributes pre-fetching was not supported, so the traversing
1247               thread had to send synchronous glimpse size RPCs to OST(s). This issue was resolved by
1248               using asynchronous glimpse lock (AGL) RPCs to pre-fetch file size/block attributes
1249               from OST(s).</para>
1250           </listitem>
1251         </orderedlist>
1252       </para>
1253       <section remap="h4">
1254         <title>Tuning File Readahead</title>
1255         <para>File readahead is triggered when two or more sequential reads by an application fail
1256           to be satisfied by data in the Linux buffer cache. The size of the initial readahead is 1
1257           MB. Subsequent readaheads grow linearly until the readahead cache on the client is full
1258           at 40 MB.</para>
1259         <para>Readahead tunables include:</para>
1260         <itemizedlist>
1261           <listitem>
1262             <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_mb</literal>
1263               - Controls the maximum amount of data readahead on a file. Files are read ahead in
1264               RPC-sized chunks (1 MB or the size of the <literal>read()</literal> call, if larger)
1265               after the second sequential read on a file descriptor. Random reads are done at the
1266               size of the <literal>read()</literal> call only (no readahead). Reads to
1267               non-contiguous regions of the file reset the readahead algorithm, and readahead is not
1268               triggered again until sequential reads take place again. </para>
1269             <para>To disable readahead, set this tunable to 0. The default value is 40 MB.</para>
1270           </listitem>
1271           <listitem>
1272             <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_whole_mb</literal>
1273               - Controls the maximum size of a file that is read in its entirety, regardless of the
1274               size of the <literal>read()</literal>.</para>
1275           </listitem>
1276         </itemizedlist>
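        <para>As a minimal sketch, the readahead tunables above might be inspected and changed on a
          client with <literal>lctl</literal>. The <literal>llite.*</literal> wildcard (which
          applies the command to every Lustre file system mounted on the client) and the value
          <literal>64</literal> are assumptions chosen only for illustration; setting
            <literal>max_read_ahead_mb</literal> to <literal>0</literal> disables readahead, as
          described above.</para>
        <screen># Show the current readahead limits
lctl get_param llite.*.max_read_ahead_mb
lctl get_param llite.*.max_read_ahead_whole_mb

# Raise the per-file readahead limit to 64 MB (illustrative value only)
lctl set_param llite.*.max_read_ahead_mb=64

# Disable readahead entirely
lctl set_param llite.*.max_read_ahead_mb=0</screen>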
1277       </section>
1278       <section>
1279         <title>Tuning Directory Statahead and AGL</title>
1280         <para>Many system commands, such as <literal>ls -l</literal>, <literal>du</literal>, and
1281             <literal>find</literal>, traverse a directory sequentially. To make these commands run
1282           efficiently, directory statahead and asynchronous glimpse lock (AGL) can be enabled to
1283           improve directory traversal performance.</para>
1284         <para>The statahead tunables are:</para>
1285         <itemizedlist>
1286           <listitem>
1287             <para><literal>statahead_max</literal> - Controls whether directory statahead is enabled
1288               and the maximum statahead window size (i.e., how many files can be pre-fetched by the
1289               statahead thread). By default, statahead is enabled and the value of
1290                 <literal>statahead_max</literal> is 32.</para>
1291             <para>To disable statahead, run:</para>
1292             <screen>lctl set_param llite.*.statahead_max=0</screen>
1293             <para>To set the maximum statahead window size (<replaceable>n</replaceable>),
1294               run:</para>
1295             <screen>lctl set_param llite.*.statahead_max=<replaceable>n</replaceable></screen>
1296             <para>The maximum value of <replaceable>n</replaceable> is 8192.</para>
1297             <para>The AGL can be controlled by entering:</para>
1298             <screen>lctl set_param llite.*.statahead_agl=<replaceable>n</replaceable></screen>
1299             <para>The default value for <replaceable>n</replaceable> is 1, which enables the AGL. If
1300                 <replaceable>n</replaceable> is 0, the AGL is disabled.</para>
1301           </listitem>
1302           <listitem>
1303             <para><literal>statahead_stats</literal> - A read-only interface that indicates the
1304               current statahead and AGL statistics, such as how many times statahead/AGL has been
1305               triggered since the last mount and how many statahead/AGL failures have occurred due
1306               to an incorrect prediction or other causes.</para>
1307             <note>
1308               <para>The AGL is affected by statahead because the inodes processed by AGL are built
1309                 by the statahead thread, which means the statahead thread is the input of the AGL
1310                 pipeline. If statahead is disabled, the AGL is therefore disabled as well.</para>
1311             </note>
1312           </listitem>
1313         </itemizedlist>
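        <para>A brief sketch of how the current statahead state might be checked on a client before
          and after tuning. It assumes that <literal>statahead_stats</literal> is exposed under
            <literal>llite.*</literal> in the same way as <literal>statahead_max</literal> and
            <literal>statahead_agl</literal>; the output is omitted here because its exact fields
          vary between releases.</para>
        <screen># Current statahead window size and AGL setting
lctl get_param llite.*.statahead_max
lctl get_param llite.*.statahead_agl

# Read-only statahead and AGL counters accumulated since the last mount
lctl get_param llite.*.statahead_stats</screen>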
1314       </section>
1315     </section>
1316     <section remap="h3">
1317       <title><indexterm>
1318           <primary>proc</primary>
1319           <secondary>read cache</secondary>
1320         </indexterm>Tuning OSS Read Cache</title>
1321       <para>The OSS read cache feature provides read-only caching of data on an OSS. This
1322         functionality uses the Linux page cache to store the data and uses as much physical memory
1323         as is allocated.</para>
1324       <para>OSS read cache improves Lustre file system performance in these situations:</para>
1325       <itemizedlist>
1326         <listitem>
1327           <para>Many clients are accessing the same data set (as in HPC applications or when
1328             diskless clients boot from the Lustre file system).</para>
1329         </listitem>
1330         <listitem>
1331           <para>One client is storing data while another client is reading it (i.e., clients are
1332             exchanging data via the OST).</para>
1333         </listitem>
1334         <listitem>
1335           <para>A client has very limited caching of its own.</para>
1336         </listitem>
1337       </itemizedlist>
1338       <para>OSS read cache offers these benefits:</para>
1339       <itemizedlist>
1340         <listitem>
1341           <para>Allows OSTs to cache read data more frequently.</para>
1342         </listitem>
1343         <listitem>
1344           <para>Improves repeated reads to match network speeds instead of disk speeds.</para>
1345         </listitem>
1346         <listitem>
1347           <para>Provides the building blocks for OST write cache (small-write aggregation).</para>
1348         </listitem>
1349       </itemizedlist>
1350       <section remap="h4">
1351         <title>Using OSS Read Cache</title>
1352         <para>OSS read cache is implemented on the OSS, and does not require any special support on
1353           the client side. Since OSS read cache uses the memory available in the Linux page cache,
1354           the appropriate amount of memory for the cache should be determined based on I/O patterns;
1355           if the data is mostly reads, then more cache is required than would be needed for mostly
1356           writes.</para>
1357         <para>OSS read cache is managed using the following tunables:</para>
1358         <itemizedlist>
1359           <listitem>
1360             <para><literal>read_cache_enable</literal> - Controls whether data read from disk during
1361               a read request is kept in memory and available for later read requests for the same
1362               data, without having to re-read it from disk. By default, read cache is enabled
1363                 (<literal>read_cache_enable=1</literal>).</para>
1364             <para>When the OSS receives a read request from a client, it reads data from disk into
1365               its memory and sends the data as a reply to the request. If read cache is enabled,
1366               this data stays in memory after the request from the client has been fulfilled. When
1367               subsequent read requests for the same data are received, the OSS skips reading data
1368               from disk and the request is fulfilled from the cached data. The read cache is managed
1369               by the Linux kernel globally across all OSTs on that OSS so that the least recently
1370               used cache pages are dropped from memory when the amount of free memory is running
1371               low.</para>
1372             <para>If read cache is disabled (<literal>read_cache_enable=0</literal>), the OSS
1373               discards the data after a read request from the client is serviced and, for subsequent
1374               read requests, the OSS again reads the data from disk.</para>
1375             <para>To disable read cache on all the OSTs of an OSS, run:</para>
1376             <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0</screen>
1377             <para>To re-enable read cache on one OST, run:</para>
1378             <screen>root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1</screen>
1379             <para>To check if read cache is enabled on all OSTs on an OSS, run:</para>
1380             <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable</screen>
1381           </listitem>
1382           <listitem>
1383             <para><literal>writethrough_cache_enable</literal> - Controls whether data sent to the
1384               OSS as a write request is kept in the read cache and available for later reads, or if
1385               it is discarded from cache when the write is completed. By default, the writethrough
1386               cache is enabled (<literal>writethrough_cache_enable=1</literal>).</para>
1387             <para>When the OSS receives write requests from a client, it receives data from the
1388               client into its memory and writes the data to disk. If the writethrough cache is
1389               enabled, this data stays in memory after the write request is completed, allowing the
1390               OSS to skip reading this data from disk if a later read request, or partial-page write
1391               request, for the same data is received.</para>
1392             <para>If the writethrough cache is disabled
1393                 (<literal>writethrough_cache_enable=0</literal>), the OSS discards the data after
1394               the write request from the client is completed. For subsequent read requests, or
1395               partial-page write requests, the OSS must re-read the data from disk.</para>
1396             <para>Enabling writethrough cache is advisable if clients are doing small or unaligned
1397               writes that would cause partial-page updates, or if the files written by one node are
1398               immediately being accessed by other nodes. Some examples where enabling writethrough
1399               cache might be useful include producer-consumer I/O models or shared-file writes with
1400               a different node doing I/O not aligned on 4096-byte boundaries. </para>
1401             <para>Disabling the writethrough cache is advisable when files are mostly written to the
1402               file system but are not re-read within a short time period, or files are only written
1403               and re-read by the same node, regardless of whether the I/O is aligned or not.</para>
1404             <para>To disable the writethrough cache on all OSTs of an OSS, run:</para>
1405             <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0</screen>
1406             <para>To re-enable the writethrough cache on one OST, run:</para>
1407             <screen>root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
1408             <para>To check if the writethrough cache is enabled, run:</para>
1409             <screen>root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable</screen>
1410           </listitem>
1411           <listitem>
1412             <para><literal>readcache_max_filesize</literal> - Controls the maximum size of a file
1413               that both the read cache and writethrough cache will try to keep in memory. Files
1414               larger than <literal>readcache_max_filesize</literal> will not be kept in cache for
1415               either reads or writes.</para>
1416             <para>Setting this tunable can be useful for workloads where relatively small files are
1417               repeatedly accessed by many clients, such as job startup files, executables, log
1418               files, etc., but large files are read or written only once. By not putting the larger
1419               files into the cache, it is much more likely that more of the smaller files will
1420               remain in cache for a longer time.</para>
1421             <para>When setting <literal>readcache_max_filesize</literal>, the input value can be
1422               specified in bytes, or can have a suffix to indicate other binary units such as
1423                 <literal>K</literal> (kilobytes), <literal>M</literal> (megabytes),
1424                 <literal>G</literal> (gigabytes), <literal>T</literal> (terabytes), or
1425                 <literal>P</literal> (petabytes).</para>
1426             <para>To limit the maximum cached file size to 32 MB on all OSTs of an OSS, run:</para>
1427             <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M</screen>
1428             <para>To disable the maximum cached file size on an OST, run:</para>
1429             <screen>root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
1430             <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
1431             <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize</screen>
1432           </listitem>
1433         </itemizedlist>
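        <para>To make the unit suffixes accepted by <literal>readcache_max_filesize</literal>
          concrete, the following sketch converts a suffixed value into bytes. The
            <literal>to_bytes</literal> helper is hypothetical and is not part of
            <literal>lctl</literal>; it simply assumes that each suffix is a power of 1024, as the
          description of binary units above implies.</para>
        <screen># to_bytes VALUE
# Convert a size such as 32M into bytes, treating K, M, G, T, and P as
# binary (power-of-1024) units. Values without a suffix are returned as is.
to_bytes() {
    value=$1
    case "$value" in
        *K) factor=1024 ;;
        *M) factor=$((1024 * 1024)) ;;
        *G) factor=$((1024 * 1024 * 1024)) ;;
        *T) factor=$((1024 * 1024 * 1024 * 1024)) ;;
        *P) factor=$((1024 * 1024 * 1024 * 1024 * 1024)) ;;
        *)  factor=1 ;;
    esac
    number=${value%[KMGTP]}    # strip a trailing unit suffix, if present
    echo $((number * factor))
}

to_bytes 32M    # prints 33554432</screen>
        <para>With this interpretation, the <literal>32M</literal> limit used in the example above
          corresponds to 33554432 bytes.</para>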
1434       </section>
1435     </section>
1436     <section>
1437       <title><indexterm>
1438           <primary>proc</primary>
1439           <secondary>OSS journal</secondary>
1440         </indexterm>Enabling OSS Asynchronous Journal Commit</title>
1441       <para>The OSS asynchronous journal commit feature asynchronously writes data to disk without
1442         forcing a journal flush. This reduces the number of seeks and significantly improves
1443         performance on some hardware.</para>
1444       <note>
1445         <para>Asynchronous journal commit cannot work with direct I/O-originated writes
1446             (<literal>O_DIRECT</literal> flag set). In this case, a journal flush is forced. </para>
1447       </note>
1448       <para>When the asynchronous journal commit feature is enabled, client nodes keep data in the
1449         page cache (a page reference). Lustre clients monitor the last committed transaction number
1450           (<literal>transno</literal>) in messages sent from the OSS to the clients. When a client
1451         sees that the last committed <literal>transno</literal> reported by the OSS is at least
1452         equal to the bulk write <literal>transno</literal>, it releases the reference on the
1453         corresponding pages. To avoid page references being held for too long on clients after a
1454         bulk write, a 7 second ping request is scheduled (the default OSS file system commit time
1455         interval is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity
1456         to report the last committed <literal>transno</literal>.</para>
1457       <para>If the OSS crashes before the journal commit occurs, then intermediate data is lost.
1458         However, OSS recovery functionality incorporated into the asynchronous journal commit
1459         feature causes clients to replay their write requests and compensate for the missing disk
1460         updates by restoring the state of the file system.</para>
1461       <para>By default, <literal>sync_journal</literal> is enabled
1462           (<literal>sync_journal=1</literal>), so that journal entries are committed synchronously.
1463         To enable asynchronous journal commit, set the <literal>sync_journal</literal> parameter to
1464           <literal>0</literal> by entering: </para>
1465       <screen>$ lctl set_param obdfilter.*.sync_journal=0
1466 obdfilter.lol-OST0001.sync_journal=0</screen>
1467       <para>An associated <literal>sync-on-lock-cancel</literal> feature (enabled by default)
1468         addresses a data consistency issue that can result if an OSS crashes after multiple clients
1469         have written data into intersecting regions of an object, and then one of the clients also
1470         crashes. A condition is created in which the POSIX requirement for continuous writes is
1471         violated along with a potential for corrupted data. With
1472           <literal>sync-on-lock-cancel</literal> enabled, if a cancelled lock has any volatile
1473         writes attached to it, the OSS synchronously writes the journal to disk on lock
1474         cancellation. Disabling the <literal>sync-on-lock-cancel</literal> feature may enhance
1475         performance for concurrent write workloads, but it is recommended that you not disable this
1476         feature.</para>
1477       <para> The <literal>sync_on_lock_cancel</literal> parameter can be set to the following
1478         values:</para>
1479       <itemizedlist>
1480         <listitem>
1481           <para><literal>always</literal> - Always force a journal flush on lock cancellation
1482             (default when <literal>async_journal</literal> is enabled).</para>
1483         </listitem>
1484         <listitem>
1485           <para><literal>blocking</literal> - Force a journal flush only when the local cancellation
1486             is due to a blocking callback.</para>
1487         </listitem>
1488         <listitem>
1489           <para><literal>never</literal> - Do not force any journal flush (default when
1490               <literal>async_journal</literal> is disabled).</para>
1491         </listitem>
1492       </itemizedlist>
1493       <para>For example, to set <literal>sync_on_lock_cancel</literal> so that it does not force a
1494         journal flush, use a command similar to:</para>
1495       <screen>$ lctl set_param obdfilter.*.sync_on_lock_cancel=never
1496 obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
1497     </section>
1498   </section>
1499   <section>
1500     <title>Configuring Timeouts in a Lustre File System</title>
1501     <para>In a Lustre file system, RPC timeouts are set using an adaptive timeouts mechanism, which
1502       is enabled by default. Servers track RPC completion times and then report back to clients
1503       estimates for completion times for future RPCs. Clients use these estimates to set RPC
1504       timeout values. If the processing of server requests slows down for any reason, the server
1505       estimates for RPC completion increase, and clients then revise RPC timeout values to allow
1506       more time for RPC completion.</para>
1507     <para>If the RPCs queued on the server approach the RPC timeout specified by the client, to
1508       avoid RPC timeouts and disconnect/reconnect cycles, the server sends an "early reply" to the
1509       client, telling the client to allow more time. Conversely, as server processing speeds up, RPC
1510       timeout values decrease, resulting in faster detection if the server becomes non-responsive
1511       and quicker connection to the failover partner of the server.</para>
1512     <section>
1513       <title><indexterm>
1514           <primary>proc</primary>
1515           <secondary>configuring adaptive timeouts</secondary>
1516         </indexterm><indexterm>
1517           <primary>configuring</primary>
1518           <secondary>adaptive timeouts</secondary>
1519         </indexterm><indexterm>
1520           <primary>proc</primary>
1521           <secondary>adaptive timeouts</secondary>
1522         </indexterm>Configuring Adaptive Timeouts</title>
1523       <para>The adaptive timeout parameters in the table below can be set persistently system-wide
1524         using <literal>lctl conf_param</literal> on the MGS. For example, the following command sets
1525         the <literal>at_max</literal> value for all servers and clients associated with the file
1526         system
1527         <literal>testfs</literal>:<screen>lctl conf_param testfs.sys.at_max=1500</screen></para>
1528       <note>
1529         <para>Clients that access multiple Lustre file systems must use the same parameter values
1530           for all file systems.</para>
1531       </note>
1532       <informaltable frame="all">
1533         <tgroup cols="2">
1534           <colspec colname="c1" colwidth="30*"/>
1535           <colspec colname="c2" colwidth="80*"/>
1536           <thead>
1537             <row>
1538               <entry>
1539                 <para><emphasis role="bold">Parameter</emphasis></para>
1540               </entry>
1541               <entry>
1542                 <para><emphasis role="bold">Description</emphasis></para>
1543               </entry>
1544             </row>
1545           </thead>
1546           <tbody>
1547             <row>
1548               <entry>
1549                 <para>
1550                   <literal> at_min </literal></para>
1551               </entry>
1552               <entry>
1553                 <para>Minimum adaptive timeout (in seconds). The default value is 0. The
1554                     <literal>at_min</literal> parameter is the minimum processing time that a server
1555                   will report. Ideally, <literal>at_min</literal> should be set to its default
1556                   value. Clients base their timeouts on this value, but they do not use this value
1557                   directly. </para>
1558                 <para>If, for unknown reasons (usually due to temporary network outages), the
1559                   adaptive timeout value is too short and clients time out their RPCs, you can
1560                   increase the <literal>at_min</literal> value to compensate for this.</para>
1561               </entry>
1562             </row>
1563             <row>
1564               <entry>
1565                 <para>
1566                   <literal> at_max </literal></para>
1567               </entry>
1568               <entry>
1569                 <para>Maximum adaptive timeout (in seconds). The <literal>at_max</literal> parameter
1570                   is an upper-limit on the service time estimate. If <literal>at_max</literal> is
1571                   reached, an RPC request times out.</para>
1572                 <para>Setting <literal>at_max</literal> to 0 causes adaptive timeouts to be disabled
1573                   and a fixed timeout method to be used instead (see <xref
1574                     xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_c24_nt5_dl"/>).</para>
1575                 <note>
1576                   <para>If slow hardware causes the service estimate to increase beyond the default
1577                     value of <literal>at_max</literal>, increase <literal>at_max</literal> to the
1578                     maximum time you are willing to wait for an RPC completion.</para>
1579                 </note>
1580               </entry>
1581             </row>
1582             <row>
1583               <entry>
1584                 <para>
1585                   <literal> at_history </literal></para>
1586               </entry>
1587               <entry>
1588                 <para>Time period (in seconds) within which adaptive timeouts remember the slowest
1589                   event that occurred. The default is 600.</para>
1590               </entry>
1591             </row>
1592             <row>
1593               <entry>
1594                 <para>
1595                   <literal> at_early_margin </literal></para>
1596               </entry>
1597               <entry>
1598                 <para>Amount of time (in seconds) before an RPC deadline at which the Lustre server
1599                   sends an early reply. The default is 5.</para>
1600               </entry>
1601             </row>
1602             <row>
1603               <entry>
1604                 <para>
1605                   <literal> at_extra </literal></para>
1606               </entry>
1607               <entry>
1608                 <para>Incremental amount of time that a server requests with each early reply (in
1609                   seconds). The server does not know how much time the RPC will take, so it asks for
1610                   a fixed value. The default is 30, which provides a balance between sending too
1611                   many early replies for the same RPC and overestimating the actual completion
1612                   time.</para>
1613                 <para>When a server finds a queued request about to time out and needs to send an
1614                   early reply out, the server adds the <literal>at_extra</literal> value. If the
1615                   time expires, the Lustre server drops the request, and the client enters recovery
1616                   status and reconnects to restore the connection to normal status.</para>
1617                 <para>If you see multiple early replies for the same RPC asking for 30-second
1618                   increases, change the <literal>at_extra</literal> value to a larger number to cut
1619                   down on early replies sent and, therefore, network load.</para>
1620               </entry>
1621             </row>
1622             <row>
1623               <entry>
1624                 <para>
1625                   <literal> ldlm_enqueue_min </literal></para>
1626               </entry>
1627               <entry>
1628                 <para>Minimum lock enqueue time (in seconds). The default is 100. The time it takes
1629                   to enqueue a lock, <literal>ldlm_enqueue</literal>, is the maximum of two values: the
1630                   measured enqueue estimate (influenced by the <literal>at_min</literal> and
1631                     <literal>at_max</literal> parameters) multiplied by a weighting factor, and the
1632                   value of <literal>ldlm_enqueue_min</literal>. </para>
1633                 <para>Lustre Distributed Lock Manager (LDLM) lock enqueues have a dedicated minimum
1634                   value for <literal>ldlm_enqueue_min</literal>. Lock enqueue timeouts increase as
1635                   the measured enqueue times increase (similar to adaptive timeouts).</para>
1636               </entry>
1637             </row>
1638           </tbody>
1639         </tgroup>
1640       </informaltable>
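      <para>The relationship described for <literal>ldlm_enqueue_min</literal> in the table above
        can be sketched as a small calculation. The <literal>ldlm_enqueue_timeout</literal> helper
        and the 150% weighting factor are hypothetical and serve only to illustrate the
        "maximum of two values" rule; the weighting factor actually used is not stated in this
        chapter.</para>
      <screen># ldlm_enqueue_timeout ESTIMATE WEIGHT_PERCENT ENQUEUE_MIN
# Print the larger of (ESTIMATE * WEIGHT_PERCENT / 100) and ENQUEUE_MIN,
# using integer arithmetic on whole seconds.
ldlm_enqueue_timeout() {
    estimate=$1
    weight_percent=$2    # hypothetical weighting factor, as a percentage
    enqueue_min=$3
    weighted=$((estimate * weight_percent / 100))
    if [ "$weighted" -gt "$enqueue_min" ]; then
        echo "$weighted"
    else
        echo "$enqueue_min"
    fi
}

ldlm_enqueue_timeout 33 150 100    # prints 100: the minimum dominates
ldlm_enqueue_timeout 90 150 100    # prints 135: the weighted estimate dominates</screen>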
1641       <section>
1642         <title>Interpreting Adaptive Timeout Information</title>
1643         <para>Adaptive timeout information can be obtained from the <literal>timeouts</literal>
1644           files in <literal>/proc/fs/lustre/*/</literal> on each server and client using the
1645             <literal>lctl</literal> command. To read information from a <literal>timeouts</literal>
1646           file, enter a command similar to:</para>
1647         <screen># lctl get_param -n ost.*.ost_io.timeouts
1648 service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2</screen>
1649         <para>In this example, the <literal>ost_io</literal> service on this node is currently
1650           reporting an estimated RPC service time of 33 seconds. The worst RPC service time was 34
1651           seconds, which occurred 26 minutes ago.</para>
1652         <para>The output also provides a history of service times. Four &quot;bins&quot; of adaptive
1653           timeout history are shown, with the maximum RPC time in each bin reported. In both the
1654           0-150s bin and the 150-300s bin, the maximum RPC time was 1 second. The 300-450s bin shows
1655           the worst (maximum) RPC time at 33 seconds, and the 450-600s bin shows a maximum RPC time
1656           of 2 seconds. The estimated service time is the maximum value across the four bins (33
1657           seconds in this example).</para>
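        <para>The interpretation above can be checked mechanically. The following sketch defines a
          hypothetical <literal>summarize_timeouts</literal> helper (not part of the Lustre
          software) that reads <literal>timeouts</literal> lines on standard input and prints the
          current estimate, the worst time, and the maximum across the four history bins. It
          assumes that the line layout matches the example shown above, with the four bins as the
          last four fields.</para>
        <screen># summarize_timeouts
# For each input line that contains a "cur" field, print the current
# estimate, the worst time, and the maximum of the last four fields (bins).
summarize_timeouts() {
    awk '/ cur / {
        cur = ""; worst = ""
        # The value of each keyword is the field that follows it
        for (i = NF; i > 1; i--) {
            if ($(i - 1) == "cur")   cur = $i
            if ($(i - 1) == "worst") worst = $i
        }
        # The history bins are assumed to be the last four fields
        max = 0
        for (i = NF; i > NF - 4; i--) {
            if ($i + 0 > max) max = $i + 0
        }
        printf "cur=%s worst=%s bin_max=%s\n", cur, worst, max
    }'
}

echo 'service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2' | summarize_timeouts
# prints: cur=33 worst=34 bin_max=33</screen>
        <para>On a server, the same helper could be fed directly from
            <literal>lctl get_param -n ost.*.ost_io.timeouts</literal>.</para>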
1658         <para>Service times (as reported by the servers) are also tracked in the client OBDs, as
1659           shown in this example:</para>
1660         <screen># lctl get_param osc.*.timeouts
1661 last reply : 1193428639, 0d0h00m00s ago
1662 network    : cur  1 worst  2 (at 1193427053, 0d0h26m26s ago)  1  1  1  1
1663 portal 6   : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33  2
1664 portal 28  : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  1  1  1
1665 portal 7   : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  0  1  1
1666 portal 17  : cur  1 worst  1 (at 1193426177, 0d0h41m02s ago)  1  0  0  1
1667 </screen>
1668         <para>In this example, portal 6, the <literal>ost_io</literal> service portal, shows the
1669           history of service estimates reported by the portal.</para>
1670         <para>Server statistic files also show the range of estimates including min, max, sum, and
1671           sumsq. For example:</para>
1672         <screen># lctl get_param mdt.*.mdt.stats
1673 ...
1674 req_timeout               6 samples [sec] 1 10 15 105
1675 ...
1676 </screen>
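        <para>Because the statistics line reports the sample count together with the sum and the sum
          of squares, the mean and standard deviation can be derived from it. The sketch below uses
          a hypothetical <literal>stats_mean_stddev</literal> helper; it assumes that the four
          trailing numbers appear in the order min, max, sum, sumsq, as listed above, and it
          computes the population standard deviation.</para>
        <screen># stats_mean_stddev
# For each "N samples [unit] min max sum sumsq" line on standard input,
# print the statistic name with its mean and population standard deviation.
stats_mean_stddev() {
    awk '$3 == "samples" {
        if (NF > 7) {
            n = $2; sum = $7; sumsq = $8
            if (n > 0) {
                mean = sum / n
                var = sumsq / n - mean * mean
                if (0 > var) var = 0    # guard against rounding error
                printf "%s mean=%.2f stddev=%.2f\n", $1, mean, sqrt(var)
            }
        }
    }'
}

echo 'req_timeout               6 samples [sec] 1 10 15 105' | stats_mean_stddev
# prints: req_timeout mean=2.50 stddev=3.35</screen>
        <para>For the example line above, the mean is 15 / 6 = 2.5 seconds, and the variance is
          105 / 6 - 2.5 * 2.5 = 11.25, giving a standard deviation of about 3.35 seconds.</para>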
1677       </section>
1678     </section>
1679     <section xml:id="section_c24_nt5_dl">
1680       <title>Setting Static Timeouts<indexterm>
1681           <primary>proc</primary>
1682           <secondary>static timeouts</secondary>
1683         </indexterm></title>
1684       <para>The Lustre software provides two sets of static (fixed) timeouts, LND timeouts and
1685         Lustre timeouts, which are used when adaptive timeouts are not enabled.</para>
1686       <para>
1687         <itemizedlist>
1688           <listitem>
1689             <para><emphasis role="italic"><emphasis role="bold">LND timeouts</emphasis></emphasis> -
1690               LND timeouts ensure that point-to-point communications across a network complete in a
1691               finite time in the presence of failures, such as lost packets or broken connections.
1692               LND timeout parameters are set for each individual LND.</para>
1693             <para>LND timeouts are logged with the <literal>S_LND</literal> flag set. They are not
1694               printed as console messages, so check the Lustre log for <literal>D_NETERROR</literal>
1695               messages or enable printing of <literal>D_NETERROR</literal> messages to the console
1696               using:<screen>lctl set_param printk=+neterror</screen></para>
1697             <para>Congested routers can be a source of spurious LND timeouts. To avoid this
1698               situation, increase the number of LNET router buffers to reduce back-pressure and/or
1699               increase LND timeouts on all nodes on all connected networks. Also consider increasing
1700               the total number of LNET router nodes in the system so that the aggregate router
1701               bandwidth matches the aggregate server bandwidth.</para>
1702           </listitem>
1703           <listitem>
1704             <para><emphasis role="italic"><emphasis role="bold">Lustre timeouts
1705                 </emphasis></emphasis>- Lustre timeouts ensure that Lustre RPCs complete in a finite
1706               time in the presence of failures when adaptive timeouts are not enabled. Adaptive
1707               timeouts are enabled by default. To disable adaptive timeouts at run time, set
1708                 <literal>at_max</literal> to 0 by running on the
1709               MGS:<screen># lctl conf_param <replaceable>fsname</replaceable>.sys.at_max=0</screen></para>
1710             <note>
1711               <para>Changing the status of adaptive timeouts at runtime may cause a transient client
1712                 timeout, recovery, and reconnection.</para>
1713             </note>
1714             <para>Lustre timeouts are always printed as console messages. </para>
1715             <para>If Lustre timeouts are not accompanied by LND timeouts, increase the Lustre
1716               timeout on both servers and clients. Lustre timeouts are set using a command such as
1717               the following:<screen># lctl set_param timeout=30</screen></para>
1718             <para>Lustre timeout parameters are described in the table below.</para>
1719           </listitem>
1720         </itemizedlist>
1721         <informaltable frame="all">
1722           <tgroup cols="2">
1723             <colspec colname="c1" colnum="1" colwidth="30*"/>
1724             <colspec colname="c2" colnum="2" colwidth="70*"/>
1725             <thead>
1726               <row>
1727                 <entry>Parameter</entry>
1728                 <entry>Description</entry>
1729               </row>
1730             </thead>
1731             <tbody>
1732               <row>
1733                 <entry><literal>timeout</literal></entry>
1734                 <entry>
1735                   <para>The time that a client waits for a server to complete an RPC (default 100s).
1736                     Servers wait half this time for a normal client RPC to complete and a quarter of
1737                     this time for a single bulk request (read or write of up to 4 MB) to complete.
1738                     The client pings recoverable targets (MDS and OSTs) at one quarter of the
1739                     timeout, and the server waits one and a half times the timeout before evicting a
1740                     client for being &quot;stale.&quot;</para>
                  <para>A Lustre client sends periodic &apos;ping&apos; messages to servers with which
1742                     it has had no communication for the specified period of time. Any network
1743                     activity between a client and a server in the file system also serves as a
1744                     ping.</para>
1745                 </entry>
1746               </row>
1747               <row>
1748                 <entry><literal>ldlm_timeout</literal></entry>
1749                 <entry>
1750                   <para>The time that a server waits for a client to reply to an initial AST (lock
1751                     cancellation request). The default is 20s for an OST and 6s for an MDS. If the
1752                     client replies to the AST, the server will give it a normal timeout (half the
1753                     client timeout) to flush any dirty data and release the lock.</para>
1754                 </entry>
1755               </row>
1756               <row>
1757                 <entry><literal>fail_loc</literal></entry>
1758                 <entry>
1759                   <para>An internal debugging failure hook. The default value of
1760                       <literal>0</literal> means that no failure will be triggered or
1761                     injected.</para>
1762                 </entry>
1763               </row>
1764               <row>
1765                 <entry><literal>dump_on_timeout</literal></entry>
1766                 <entry>
1767                   <para>Triggers a dump of the Lustre debug log when a timeout occurs. The default
1768                     value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
1769                     not be triggered.</para>
1770                 </entry>
1771               </row>
1772               <row>
1773                 <entry><literal>dump_on_eviction</literal></entry>
1774                 <entry>
1775                   <para>Triggers a dump of the Lustre debug log when an eviction occurs. The default
1776                     value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
1777                     not be triggered. </para>
1778                 </entry>
1779               </row>
1780             </tbody>
1781           </tgroup>
1782         </informaltable>
1783       </para>
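      <para>The relationships that the table above gives for the <literal>timeout</literal>
        parameter can be worked through numerically. The following is only an illustrative sketch:
        the function name and dictionary keys are hypothetical labels chosen here, and the factors
        are taken directly from the table description rather than from the Lustre source
        code.</para>

```python
def derived_timeouts(timeout=100):
    """Return the intervals (in seconds) that the table derives from `timeout`.

    The keys are hypothetical labels for this sketch, not Lustre parameter names.
    """
    return {
        "client_rpc_wait": timeout,             # client waits for a server to complete an RPC
        "server_rpc_wait": timeout / 2,         # server waits for a normal client RPC
        "server_bulk_wait": timeout / 4,        # server waits for a single bulk request
        "client_ping_interval": timeout / 4,    # client pings recoverable targets
        "server_eviction_wait": timeout * 1.5,  # server waits before evicting a stale client
    }


for name, seconds in derived_timeouts(100).items():
    print(f"{name}: {seconds:g}s")
```

      <para>With the default of 100 seconds, a server therefore waits 50 seconds for a normal
        client RPC and 150 seconds before evicting a client.</para>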
1784     </section>
1785   </section>
1786   <section remap="h3">
1787     <title><indexterm>
1788         <primary>proc</primary>
1789         <secondary>LNET</secondary>
1790       </indexterm><indexterm>
1791         <primary>LNET</primary>
1792         <secondary>proc</secondary>
1793       </indexterm>Monitoring LNET</title>
1794     <para>LNET information is located in <literal>/proc/sys/lnet</literal> in these files:<itemizedlist>
1795         <listitem>
1796           <para><literal>peers</literal> - Shows all NIDs known to this node and provides
1797             information on the queue state.</para>
1798           <para>Example:</para>
1799           <screen># lctl get_param peers
1800 nid                refs   state  max  rtr  min   tx    min   queue
1801 0@lo               1      ~rtr   0    0    0     0     0     0
1802 192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
1803 192.168.10.36@tcp  1      ~rtr   8    8    8     8     6     0
1804 192.168.10.37@tcp  1      ~rtr   8    8    8     8     6     0</screen>
1805           <para>The fields are explained in the table below:</para>
1806           <informaltable frame="all">
1807             <tgroup cols="2">
1808               <colspec colname="c1" colwidth="30*"/>
1809               <colspec colname="c2" colwidth="80*"/>
1810               <thead>
1811                 <row>
1812                   <entry>
1813                     <para><emphasis role="bold">Field</emphasis></para>
1814                   </entry>
1815                   <entry>
1816                     <para><emphasis role="bold">Description</emphasis></para>
1817                   </entry>
1818                 </row>
1819               </thead>
1820               <tbody>
1821                 <row>
1822                   <entry>
1823                     <para>
1824                       <literal>refs</literal>
1825                     </para>
1826                   </entry>
1827                   <entry>
1828                     <para>A reference count. </para>
1829                   </entry>
1830                 </row>
1831                 <row>
1832                   <entry>
1833                     <para>
1834                       <literal>state</literal>
1835                     </para>
1836                   </entry>
1837                   <entry>
1838                     <para>If the node is a router, indicates the state of the router. Possible
1839                       values are:</para>
1840                     <itemizedlist>
1841                       <listitem>
1842                         <para><literal>NA</literal> - Indicates the node is not a router.</para>
1843                       </listitem>
1844                       <listitem>
                        <para><literal>up/down</literal> - Indicates whether the node (router) is
                          up or down.</para>
1847                       </listitem>
1848                     </itemizedlist>
1849                   </entry>
1850                 </row>
1851                 <row>
1852                   <entry>
1853                     <para>
1854                       <literal>max </literal></para>
1855                   </entry>
1856                   <entry>
1857                     <para>Maximum number of concurrent sends from this peer.</para>
1858                   </entry>
1859                 </row>
1860                 <row>
1861                   <entry>
1862                     <para>
1863                       <literal>rtr </literal></para>
1864                   </entry>
1865                   <entry>
1866                     <para>Number of routing buffer credits.</para>
1867                   </entry>
1868                 </row>
1869                 <row>
1870                   <entry>
1871                     <para>
1872                       <literal>min </literal></para>
1873                   </entry>
1874                   <entry>
1875                     <para>Minimum number of routing buffer credits seen.</para>
1876                   </entry>
1877                 </row>
1878                 <row>
1879                   <entry>
1880                     <para>
1881                       <literal>tx </literal></para>
1882                   </entry>
1883                   <entry>
1884                     <para>Number of send credits.</para>
1885                   </entry>
1886                 </row>
1887                 <row>
1888                   <entry>
1889                     <para>
1890                       <literal>min </literal></para>
1891                   </entry>
1892                   <entry>
1893                     <para>Minimum number of send credits seen.</para>
1894                   </entry>
1895                 </row>
1896                 <row>
1897                   <entry>
1898                     <para>
1899                       <literal>queue </literal></para>
1900                   </entry>
1901                   <entry>
1902                     <para>Total bytes in active/queued sends.</para>
1903                   </entry>
1904                 </row>
1905               </tbody>
1906             </tgroup>
1907           </informaltable>
          <para>Credits are initialized to allow a certain number of operations (eight in the
            example above, as shown in the <literal>max</literal> column). LNET keeps track
            of the minimum number of credits ever seen over time, which shows the peak congestion
            that has occurred during the time monitored. Fewer available credits indicate a more
            congested resource. </para>
1913           <para>The number of credits currently in flight (number of transmit credits) is shown in
1914             the <literal>tx</literal> column. The maximum number of send credits available is shown
1915             in the <literal>max</literal> column and never changes. The number of router buffers
1916             available for consumption by a peer is shown in the <literal>rtr</literal>
1917             column.</para>
1918           <para>Therefore, <literal>rtr</literal> – <literal>tx</literal> is the number of transmits
1919             in flight. Typically, <literal>rtr == max</literal>, although a configuration can be set
            such that <literal>max &gt;= rtr</literal>. A ratio of routing buffer credits to send
            credits (<literal>rtr/tx</literal>) that is less than <literal>max</literal> indicates
            that operations are in progress. If the ratio <literal>rtr/tx</literal> is greater than
1923               <literal>max</literal>, operations are blocking.</para>
1924           <para>LNET also limits concurrent sends and number of router buffers allocated to a single
1925             peer so that no peer can occupy all these resources.</para>
1926         </listitem>
1927         <listitem>
1928           <para><literal>nis</literal> - Shows the current queue health on this node.</para>
1929           <para>Example:</para>
1930           <screen># lctl get_param nis
1931 nid                    refs   peer    max   tx    min
1932 0@lo                   3      0       0     0     0
1933 192.168.10.34@tcp      4      8       256   256   252
1934 </screen>
1935           <para> The fields are explained in the table below.</para>
1936           <informaltable frame="all">
1937             <tgroup cols="2">
1938               <colspec colname="c1" colwidth="30*"/>
1939               <colspec colname="c2" colwidth="80*"/>
1940               <thead>
1941                 <row>
1942                   <entry>
1943                     <para><emphasis role="bold">Field</emphasis></para>
1944                   </entry>
1945                   <entry>
1946                     <para><emphasis role="bold">Description</emphasis></para>
1947                   </entry>
1948                 </row>
1949               </thead>
1950               <tbody>
1951                 <row>
1952                   <entry>
1953                     <para>
1954                       <literal> nid </literal></para>
1955                   </entry>
1956                   <entry>
1957                     <para>Network interface.</para>
1958                   </entry>
1959                 </row>
1960                 <row>
1961                   <entry>
1962                     <para>
1963                       <literal> refs </literal></para>
1964                   </entry>
1965                   <entry>
1966                     <para>Internal reference counter.</para>
1967                   </entry>
1968                 </row>
1969                 <row>
1970                   <entry>
1971                     <para>
1972                       <literal> peer </literal></para>
1973                   </entry>
1974                   <entry>
1975                     <para>Number of peer-to-peer send credits on this NID. Credits are used to size
1976                       buffer pools.</para>
1977                   </entry>
1978                 </row>
1979                 <row>
1980                   <entry>
1981                     <para>
1982                       <literal> max </literal></para>
1983                   </entry>
1984                   <entry>
1985                     <para>Total number of send credits on this NID.</para>
1986                   </entry>
1987                 </row>
1988                 <row>
1989                   <entry>
1990                     <para>
1991                       <literal> tx </literal></para>
1992                   </entry>
1993                   <entry>
1994                     <para>Current number of send credits available on this NID.</para>
1995                   </entry>
1996                 </row>
1997                 <row>
1998                   <entry>
1999                     <para>
2000                       <literal> min </literal></para>
2001                   </entry>
2002                   <entry>
2003                     <para>Lowest number of send credits available on this NID.</para>
2004                   </entry>
2005                 </row>
2006                 <row>
2007                   <entry>
2008                     <para>
2009                       <literal> queue </literal></para>
2010                   </entry>
2011                   <entry>
2012                     <para>Total bytes in active/queued sends.</para>
2013                   </entry>
2014                 </row>
2015               </tbody>
2016             </tgroup>
2017           </informaltable>
2018           <para><emphasis role="bold"><emphasis role="italic">Analysis:</emphasis></emphasis></para>
          <para>Subtracting <literal>tx</literal> from <literal>max</literal>
              (<literal>max</literal> - <literal>tx</literal>) yields the number of sends currently
            active. A large or increasing number of active sends may indicate a problem.</para>
2022         </listitem>
2023       </itemizedlist></para>
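    <para>The credit arithmetic described for <literal>peers</literal> and
      <literal>nis</literal> can be illustrated by parsing the sample output shown above. This is
      a rough sketch, not a supported tool. The names <literal>rtr_min</literal> and
      <literal>tx_min</literal> are labels invented here to tell the two <literal>min</literal>
      columns apart, and reading <literal>max</literal> minus the lowest value seen as the peak
      number of credits in use is an interpretation of the description above.</para>

```python
# Sample output copied from the examples above (peers output shortened to two rows).
PEERS_OUTPUT = """\
nid                refs   state  max  rtr  min   tx    min   queue
0@lo               1      ~rtr   0    0    0     0     0     0
192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
"""

NIS_OUTPUT = """\
nid                    refs   peer    max   tx    min
0@lo                   3      0       0     0     0
192.168.10.34@tcp      4      8       256   256   252
"""

# Column names by position; rtr_min and tx_min are hypothetical labels.
PEERS_FIELDS = ["nid", "refs", "state", "max", "rtr", "rtr_min", "tx", "tx_min", "queue"]
NIS_FIELDS = ["nid", "refs", "peer", "max", "tx", "min"]


def parse_table(text, fields, text_fields=("nid", "state")):
    """Turn whitespace-separated rows into dictionaries, skipping the header line."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        row = {}
        for name, value in zip(fields, line.split()):
            row[name] = value if name in text_fields else int(value)
        rows.append(row)
    return rows


def active_sends(row):
    """Sends currently active on a NID: max - tx."""
    return row["max"] - row["tx"]


def peak_credits_used(row, min_field):
    """Largest number of credits in use at once, assuming max - lowest value seen."""
    return row["max"] - row[min_field]


for row in parse_table(NIS_OUTPUT, NIS_FIELDS):
    print(row["nid"], "active sends:", active_sends(row),
          "peak credits used:", peak_credits_used(row, "min"))

for row in parse_table(PEERS_OUTPUT, PEERS_FIELDS):
    print(row["nid"], "peak send credits used:", peak_credits_used(row, "tx_min"))
```

    <para>For the sample <literal>nis</literal> output, no sends are active on
      <literal>192.168.10.34@tcp</literal>, and at most four of its 256 send credits have been in
      use at once.</para>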
2024   </section>
2025   <section remap="h3">
2026     <title><indexterm>
2027         <primary>proc</primary>
2028         <secondary>free space</secondary>
2029       </indexterm>Allocating Free Space on OSTs</title>
2030     <para>Free space is allocated using either a round-robin or a weighted algorithm. The allocation
2031       method is determined by the maximum amount of free-space imbalance between the OSTs. When free
2032       space is relatively balanced across OSTs, the faster round-robin allocator is used, which
2033       maximizes network balancing. The weighted allocator is used when any two OSTs are out of
2034       balance by more than a specified threshold.</para>
2035     <para>Free space distribution can be tuned using these two <literal>/proc</literal>
2036       tunables:</para>
2037     <itemizedlist>
2038       <listitem>
2039         <para><literal>qos_threshold_rr</literal> - The threshold at which the allocation method
2040           switches from round-robin to weighted is set in this file. The default is to switch to the
2041           weighted algorithm when any two OSTs are out of balance by more than 17 percent.</para>
2042       </listitem>
2043       <listitem>
2044         <para><literal>qos_prio_free</literal> - The weighting priority used by the weighted
2045           allocator can be adjusted in this file. Increasing the value of
2046             <literal>qos_prio_free</literal> puts more weighting on the amount of free space
2047           available on each OST and less on how stripes are distributed across OSTs. The default
2048           value is 91 percent. When the free space priority is set to 100, weighting is based
          entirely on free space and location is no longer used by the striping algorithm.</para>
2050       </listitem>
2051     </itemizedlist>
2052     <para>For more information about monitoring and managing free space, see <xref
2053         xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438209_10424"/>.</para>
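    <para>The switch between the two allocators can be pictured with a small sketch. This is not
      the Lustre implementation: the function names are hypothetical, and the imbalance formula
      (the gap between the most and least free OST, as a percentage of the most free OST) is an
      assumption made only to illustrate how a threshold such as the default of 17 percent might
      be applied.</para>

```python
def imbalance_percent(free_space):
    """Assumed imbalance measure: gap between the most and least free OST, in percent."""
    largest = max(free_space)
    smallest = min(free_space)
    if largest == 0:
        return 0.0
    return (largest - smallest) * 100.0 / largest


def choose_allocator(free_space, qos_threshold_rr=17):
    """Pick 'weighted' once the imbalance exceeds the threshold, else 'round-robin'."""
    if imbalance_percent(free_space) > qos_threshold_rr:
        return "weighted"
    return "round-robin"


# Free space per OST in arbitrary units (made-up values for illustration).
print(choose_allocator([900, 880, 910]))
print(choose_allocator([900, 500, 910]))
```

    <para>In this sketch, the first set of OSTs is nearly balanced and stays on the round-robin
      allocator, while the second set contains one much fuller OST and switches to the weighted
      allocator.</para>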
2054   </section>
2055   <section remap="h3">
2056     <title><indexterm>
2057         <primary>proc</primary>
2058         <secondary>locking</secondary>
2059       </indexterm>Configuring Locking</title>
2060     <para>The <literal>lru_size</literal> parameter is used to control the number of client-side
2061       locks in an LRU cached locks queue. LRU size is dynamic, based on load to optimize the number
2062       of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute
2063       nodes vs. backup nodes).</para>
2064     <para>The total number of locks available is a function of the server RAM. The default limit is
2065       50 locks/1 MB of RAM. If memory pressure is too high, the LRU size is shrunk. The number of
2066       locks on the server is limited to <emphasis role="italic">the number of OSTs per
2067         server</emphasis> * <emphasis role="italic">the number of clients</emphasis> * <emphasis
2068         role="italic">the value of the</emphasis>
2069       <literal>lru_size</literal>
2070       <emphasis role="italic">setting on the client</emphasis> as follows: </para>
2071     <itemizedlist>
2072       <listitem>
2073         <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0. In
2074           this case, the <literal>lru_size</literal> parameter shows the current number of locks
2075           being used on the export. LRU sizing is enabled by default.</para>
2076       </listitem>
2077       <listitem>
2078         <para>To specify a maximum number of locks, set the <literal>lru_size</literal> parameter to
2079           a value other than zero but, normally, less than 100 * <emphasis role="italic">number of
2080             CPUs in client</emphasis>. It is recommended that you only increase the LRU size on a
2081           few login nodes where users access the file system interactively.</para>
2082       </listitem>
2083     </itemizedlist>
2084     <para>To clear the LRU on a single client, and, as a result, flush client cache without changing
2085       the <literal>lru_size</literal> value, run:</para>
2086     <screen>$ lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
2087     <para>If the LRU size is set to be less than the number of existing unused locks, the unused
2088       locks are canceled immediately. Use <literal>echo clear</literal> to cancel all locks without
2089       changing the value.</para>
2090     <note>
2091       <para>The <literal>lru_size</literal> parameter can only be set temporarily using
2092           <literal>lctl set_param</literal>; it cannot be set permanently.</para>
2093     </note>
    <para>To disable dynamic LRU sizing, run the following on the Lustre clients:</para>
2095     <screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((<replaceable>NR_CPU</replaceable>*100))</screen>
2096     <para>Replace <literal><replaceable>NR_CPU</replaceable></literal> with the number of CPUs on
2097       the node.</para>
2098     <para>To determine the number of locks being granted, run:</para>
2099     <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
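    <para>The lock arithmetic in this section can be worked through with a short sketch. The
      function names and the example node counts are hypothetical. The formulas are the ones
      stated above: 50 locks per 1 MB of server RAM, 100 locks per client CPU for a fixed LRU
      size, and OSTs per server multiplied by clients multiplied by
      <literal>lru_size</literal> for the server-side bound.</para>

```python
def default_lock_limit(server_ram_mb, locks_per_mb=50):
    """Default total lock limit for a server with the given RAM in MB."""
    return server_ram_mb * locks_per_mb


def static_lru_size(nr_cpu):
    """Fixed lru_size value suggested above: 100 locks per client CPU."""
    return nr_cpu * 100


def server_lock_bound(osts_per_server, clients, lru_size):
    """Upper bound on server locks: OSTs per server * clients * client lru_size."""
    return osts_per_server * clients * lru_size


# Hypothetical cluster: 16-CPU clients, 100 clients, 4 OSTs per server, 16 GB of server RAM.
lru = static_lru_size(16)
print("lru_size:", lru)
print("server lock bound:", server_lock_bound(4, 100, lru))
print("default limit from RAM:", default_lock_limit(16 * 1024))
```

    <para>Comparing the last two numbers gives a rough idea of whether a fixed
      <literal>lru_size</literal> on every client could approach the default limit derived from
      server RAM.</para>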
2100   </section>
2101   <section xml:id="dbdoclet.50438271_87260">
2102     <title><indexterm>
2103         <primary>proc</primary>
2104         <secondary>thread counts</secondary>
2105       </indexterm>Setting MDS and OSS Thread Counts</title>
    <para>The MDS and OSS thread count tunables can be used to set the minimum and maximum thread
      counts or to get the current number of running threads for the services listed in the table
      below.</para>
2109     <informaltable frame="all">
2110       <tgroup cols="2">
2111         <colspec colname="c1" colwidth="50*"/>
2112         <colspec colname="c2" colwidth="50*"/>
2113         <tbody>
2114           <row>
2115             <entry>
2116               <para>
2117                 <emphasis role="bold">Service</emphasis></para>
2118             </entry>
2119             <entry>
2120               <para>
2121                 <emphasis role="bold">Description</emphasis></para>
2122             </entry>
2123           </row>
2124           <row>
2125             <entry>
2126               <literal> mds.MDS.mdt </literal>
2127             </entry>
2128             <entry>
2129               <para>Main metadata operations service</para>
2130             </entry>
2131           </row>
2132           <row>
2133             <entry>
2134               <literal> mds.MDS.mdt_readpage </literal>
2135             </entry>
2136             <entry>
2137               <para>Metadata <literal>readdir</literal> service</para>
2138             </entry>
2139           </row>
2140           <row>
2141             <entry>
2142               <literal> mds.MDS.mdt_setattr </literal>
2143             </entry>
2144             <entry>
2145               <para>Metadata <literal>setattr/close</literal> operations service </para>
2146             </entry>
2147           </row>
2148           <row>
2149             <entry>
2150               <literal> ost.OSS.ost </literal>
2151             </entry>
2152             <entry>
2153               <para>Main data operations service</para>
2154             </entry>
2155           </row>
2156           <row>
2157             <entry>
2158               <literal> ost.OSS.ost_io </literal>
2159             </entry>
2160             <entry>
2161               <para>Bulk data I/O services</para>
2162             </entry>
2163           </row>
2164           <row>
2165             <entry>
2166               <literal> ost.OSS.ost_create </literal>
2167             </entry>
2168             <entry>
2169               <para>OST object pre-creation service</para>
2170             </entry>
2171           </row>
2172           <row>
2173             <entry>
2174               <literal> ldlm.services.ldlm_canceld </literal>
2175             </entry>
2176             <entry>
2177               <para>DLM lock cancel service</para>
2178             </entry>
2179           </row>
2180           <row>
2181             <entry>
2182               <literal> ldlm.services.ldlm_cbd </literal>
2183             </entry>
2184             <entry>
2185               <para>DLM lock grant service</para>
2186             </entry>
2187           </row>
2188         </tbody>
2189       </tgroup>
2190     </informaltable>
2191     <para>For each service, an entry as shown below is
2192       created:<screen>/proc/fs/lustre/<replaceable>service</replaceable>/*/threads_<replaceable>min|max|started</replaceable></screen></para>
2193     <itemizedlist>
2194       <listitem>
2195         <para>To temporarily set this tunable, run:</para>
2196         <screen># lctl <replaceable>get|set</replaceable>_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
2197         </listitem>
2198       <listitem>
2199         <para>To permanently set this tunable, run:</para>
2200         <screen># lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
2201         <para condition='l25'>For version 2.5 or later, run:
2202                 <screen># lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></screen></para>
2203       </listitem>
2204     </itemizedlist>
2205       <para>The following examples show how to set thread counts and get the number of running threads
2206         for the service <literal>ost_io</literal>  using the tunable
2207         <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para>
2208     <itemizedlist>
2209       <listitem>
2210         <para>To get the number of running threads, run:</para>
2211         <screen># lctl get_param ost.OSS.ost_io.threads_started
2212 ost.OSS.ost_io.threads_started=128</screen>
2213       </listitem>
2214       <listitem>
        <para>To get the maximum number of threads (512), run:</para>
2216         <screen># lctl get_param ost.OSS.ost_io.threads_max
2217 ost.OSS.ost_io.threads_max=512</screen>
2218       </listitem>
2219       <listitem>
        <para>To set the maximum thread count to 256 instead of 512 (to avoid overloading the
          storage or an array with requests), run:</para>
2222         <screen># lctl set_param ost.OSS.ost_io.threads_max=256
2223 ost.OSS.ost_io.threads_max=256</screen>
2224       </listitem>
2225       <listitem>
2226         <para>To set the maximum thread count to 256 instead of 512 permanently, run:</para>
2227         <screen># lctl conf_param testfs.ost.ost_io.threads_max=256</screen>
2228         <para condition='l25'>For version 2.5 or later, run:
2229         <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256
2230 ost.OSS.ost_io.threads_max=256 </screen> </para>
2231       </listitem>
2232       <listitem>
2233         <para> To check if the <literal>threads_max</literal> setting is active, run:</para>
2234         <screen># lctl get_param ost.OSS.ost_io.threads_max
2235 ost.OSS.ost_io.threads_max=256</screen>
2236       </listitem>
2237     </itemizedlist>
2238     <note>
      <para>If the number of service threads is changed while the file system is running, the change
        may not take effect until the file system is stopped and restarted. If the number of service
        threads in use exceeds the new <literal>threads_max</literal> value setting, service threads
        that are already running will not be stopped.</para>
2243     </note>
    <para>See also <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="lustretuning"/>.</para>
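    <para>When checking thread counts across several services, the
      <literal>name=value</literal> lines printed by <literal>lctl get_param</literal> in the
      examples above can be collected and compared. The sketch below is illustrative only: it
      works on text copied from those examples instead of calling <literal>lctl</literal>, and
      the helper names are hypothetical.</para>

```python
# Output lines copied from the get_param examples above.
SAMPLE_OUTPUT = """\
ost.OSS.ost_io.threads_started=128
ost.OSS.ost_io.threads_max=512
"""


def parse_params(output):
    """Parse 'name=value' lines into a dictionary of integer values."""
    params = {}
    for line in output.strip().splitlines():
        name, _, value = line.partition("=")
        params[name.strip()] = int(value)
    return params


def thread_headroom(params, service):
    """Threads that can still be started before threads_max is reached."""
    return params[service + ".threads_max"] - params[service + ".threads_started"]


params = parse_params(SAMPLE_OUTPUT)
print("ost_io headroom:", thread_headroom(params, "ost.OSS.ost_io"))
```

    <para>A negative result would correspond to the case described in the note above, where more
      threads are already running than the new <literal>threads_max</literal> value
      allows.</para>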
2245   </section>
2246   <section xml:id="dbdoclet.50438271_83523">
2247     <title><indexterm>
2248         <primary>proc</primary>
2249         <secondary>debug</secondary>
2250       </indexterm>Enabling and Interpreting Debugging Logs</title>
2251     <para>By default, a detailed log of all operations is generated to aid in debugging. Flags that
2252       control debugging are found in <literal>/proc/sys/lnet/debug</literal>. </para>
    <para>The overhead of debugging can affect the performance of a Lustre file system. Therefore, to
      minimize the impact on performance, the debug level can be lowered, which affects the amount
      of debugging information kept in the internal log buffer but does not alter the amount of
      information that goes into syslog. You can raise the debug level when you need to collect logs
      to debug problems. </para>
2258     <para>The debugging mask can be set using &quot;symbolic names&quot;. The symbolic format is
2259       shown in the examples below.<itemizedlist>
2260         <listitem>
2261           <para>To verify the debug level used, examine the <literal>sysctl</literal> that controls
2262             debugging by running:</para>
2263           <screen># sysctl lnet.debug
2264 lnet.debug = ioctl neterror warning error emerg ha config console</screen>
2265         </listitem>
2266         <listitem>
2267           <para>To turn off debugging (except for network error debugging), run the following
2268             command on all nodes concerned:</para>
2269           <screen># sysctl -w lnet.debug=&quot;neterror&quot;
2270 lnet.debug = neterror</screen>
2271         </listitem>
2272       </itemizedlist><itemizedlist>
2273         <listitem>
2274           <para>To turn off debugging completely, run the following command on all nodes
2275             concerned:</para>
2276           <screen># sysctl -w lnet.debug=0
2277 lnet.debug = 0</screen>
2278         </listitem>
2279         <listitem>
2280           <para>To set an appropriate debug level for a production environment, run:</para>
2281           <screen># sysctl -w lnet.debug=&quot;warning dlmtrace error emerg ha rpctrace vfstrace&quot;
2282 lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace</screen>
2283           <para>The flags shown in this example collect enough high-level information to aid
2284             debugging, but they do not cause any serious performance impact.</para>
2285         </listitem>
2286       </itemizedlist><itemizedlist>
2287         <listitem>
2288           <para>To clear all flags and set new flags, run:</para>
2289           <screen># sysctl -w lnet.debug=&quot;warning&quot;
2290 lnet.debug = warning</screen>
2291         </listitem>
2292       </itemizedlist><itemizedlist>
2293         <listitem>
2294           <para>To add new flags to flags that have already been set, precede each one with a
2295               &quot;<literal>+</literal>&quot;:</para>
2296           <screen># sysctl -w lnet.debug=&quot;+neterror +ha&quot;
2297 lnet.debug = +neterror +ha
2298 # sysctl lnet.debug
2299 lnet.debug = neterror warning ha</screen>
2300         </listitem>
2301         <listitem>
2302           <para>To remove individual flags, precede them with a
2303             &quot;<literal>-</literal>&quot;:</para>
2304           <screen># sysctl -w lnet.debug=&quot;-ha&quot;
2305 lnet.debug = -ha
2306 # sysctl lnet.debug
2307 lnet.debug = neterror warning</screen>
2308         </listitem>
2309         <listitem>
          <para>To verify or change the debug level, run commands such as the following:</para>
2311           <screen># lctl get_param debug
2312 debug=
2313 neterror warning
2314 # lctl set_param debug=+ha
2315 # lctl get_param debug
2316 debug=
2317 neterror warning ha
2318 # lctl set_param debug=-warning
2319 # lctl get_param debug
2320 debug=
2321 neterror ha</screen>
2322         </listitem>
2323       </itemizedlist></para>
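    <para>The way the symbolic mask reacts to plain, <literal>+</literal>-prefixed, and
      <literal>-</literal>-prefixed flags in the examples above can be modeled with a set. This
      is a simplified sketch of the behavior shown in those examples, not the kernel code. The
      function name is hypothetical, and the handling of a setting that mixes plain and prefixed
      flags is an assumption.</para>

```python
def apply_debug_setting(current, setting):
    """Return the new set of debug flags after writing `setting`.

    Plain flags replace the mask, '+flag' adds to it, '-flag' removes from it,
    and '0' clears it. Mixed settings start from an empty mask (an assumption).
    """
    if setting.strip() == "0":
        return set()
    tokens = setting.split()
    relative = all(token[0] in "+-" for token in tokens)
    flags = set(current) if relative else set()
    for token in tokens:
        if token.startswith("+"):
            flags.add(token[1:])
        elif token.startswith("-"):
            flags.discard(token[1:])
        else:
            flags.add(token)
    return flags


flags = apply_debug_setting(set(), "warning")        # clear all flags and set new ones
flags = apply_debug_setting(flags, "+neterror +ha")  # add flags to the existing mask
print(sorted(flags))
flags = apply_debug_setting(flags, "-ha")            # remove one flag
print(sorted(flags))
```

    <para>The two printed sets contain the same flags as the <literal>sysctl
      lnet.debug</literal> output in the corresponding examples, although the real output lists
      them in its own order.</para>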
2324     <para>Debugging parameters include:</para>
2325     <itemizedlist>
2326       <listitem>
2327         <para><literal>subsystem_debug</literal> - Controls the debug logs for subsystems.</para>
2328       </listitem>
2329       <listitem>
2330         <para><literal>debug_path</literal> - Indicates the location where the debug log is dumped
2331           when triggered automatically or manually. The default path is
2332             <literal>/tmp/lustre-log</literal>.</para>
2333       </listitem>
2334     </itemizedlist>
2335     <para>These parameters are also set using:<screen>sysctl -w lnet.debug={value}</screen></para>
2336     <para>Additional useful parameters: <itemizedlist>
2337         <listitem>
2338           <para><literal>panic_on_lbug</literal> - Causes &apos;&apos;panic&apos;&apos; to be called
2339             when the Lustre software detects an internal problem (an <literal>LBUG</literal> log
2340             entry); panic crashes the node. This is particularly useful when a kernel crash dump
2341             utility is configured. The crash dump is triggered when the internal inconsistency is
2342             detected by the Lustre software. </para>
2343         </listitem>
2344         <listitem>
          <para><literal>upcall</literal> - Specifies the path to the binary that is invoked
            when an <literal>LBUG</literal> log entry is encountered. This binary is
            called with four parameters:</para>
          <itemizedlist>
            <listitem>
              <para>The string <literal>LBUG</literal>.</para>
            </listitem>
            <listitem>
              <para>The file where the <literal>LBUG</literal> occurred.</para>
            </listitem>
            <listitem>
              <para>The function name.</para>
            </listitem>
            <listitem>
              <para>The line number in the file.</para>
            </listitem>
          </itemizedlist>
2352         </listitem>
2353       </itemizedlist></para>
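    <para>As an illustration of the <literal>upcall</literal> interface described above, the
      following is a minimal sketch of a handler script. It relies only on the four positional
      parameters listed above. The function name <literal>handle_lbug</literal>, the
        <literal>LBUG_UPCALL_LOG</literal> variable, the log file location, and the message format
      are hypothetical choices made for this example and are not part of the Lustre
      software.</para>
    <screen>#!/bin/sh
# Hypothetical LBUG upcall handler (illustrative sketch only).
# Positional parameters, as listed above:
#   $1 - the string "LBUG"
#   $2 - the source file where the LBUG occurred
#   $3 - the function name
#   $4 - the line number in the file

# Assumed log location; set LBUG_UPCALL_LOG to override it.
LOGFILE="${LBUG_UPCALL_LOG:-/tmp/lbug-upcall.log}"

handle_lbug() {
    if [ "$#" -ne 4 ]; then
        echo "usage: handle_lbug LBUG file function line"
        return 1
    fi
    # Append one timestamped line per event to the log file.
    printf '%s event=%s file=%s function=%s line=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" "$1" "$2" "$3" "$4" >> "$LOGFILE"
}

# When run as the upcall binary, pass the script arguments through.
if [ "$#" -gt 0 ]; then
    handle_lbug "$@"
fi</screen>
    <para>Run by hand with four sample arguments, the script appends a single line to the log
      file, which makes it possible to check the handler before relying on it.</para>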
2354     <section>
2355       <title>Interpreting OST Statistics</title>
2356       <note>
2357         <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
2358             <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2359       </note>
2360       <para>OST <literal>stats</literal> files can be used to provide statistics showing activity
2361         for each OST. For example:</para>
      <screen># lctl get_param osc.testfs-OST0000-osc.stats
snapshot_time                      1189732762.835363
ost_create                 1
ost_get_info               1
ost_connect                1
ost_set_info               1
obd_ping                   212</screen>
2369       <para>Use the <literal>llstat</literal> utility to monitor statistics over time.</para>
      <para>To clear the statistics, use the <literal>-c</literal> option to
          <literal>llstat</literal>. To specify how frequently the statistics should be reported (in
        seconds), use the <literal>-i</literal> option. In the example below, the
          <literal>-c</literal> option clears the statistics and the <literal>-i10</literal> option
        reports statistics every 10 seconds:</para>
      <screen role="smaller">$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats

/usr/bin/llstat: STATS on 06/06/07
        /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
snapshot_time                              1181074093.276072

/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
Name        Cur.  Cur. #
            Count Rate Events Unit  last   min    avg       max    stddev
req_waittime 8    0    8    [usec]  2078   34     259.75    868    317.49
req_qdepth   8    0    8    [reqs]  1      0      0.12      1      0.35
req_active   8    0    8    [reqs]  11     1      1.38      2      0.52
reqbuf_avail 8    0    8    [bufs]  511    63     63.88     64     0.35
ost_write    8    0    8    [bytes] 169767 72914  212209.62 387579 91874.29

/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
Name        Cur.  Cur. #
            Count Rate Events Unit  last    min   avg       max    stddev
req_waittime 31   3    39   [usec]  30011   34    822.79    12245  2047.71
req_qdepth   31   3    39   [reqs]  0       0     0.03      1      0.16
req_active   31   3    39   [reqs]  58      1     1.77      3      0.74
reqbuf_avail 31   3    39   [bufs]  1977    63    63.79     64     0.41
ost_write    30   3    38   [bytes] 1028467 15019 315325.16 910694 197776.51

/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
Name        Cur.  Cur. #
            Count Rate Events Unit  last    min    avg       max    stddev
req_waittime 21   2    60   [usec]  14970   34     784.32    12245  1878.66
req_qdepth   21   2    60   [reqs]  0       0      0.02      1      0.13
req_active   21   2    60   [reqs]  33      1      1.70      3      0.70
reqbuf_avail 21   2    60   [bufs]  1341    63     63.82     64     0.39
ost_write    21   2    59   [bytes] 7648424 15019  332725.08 910694 180397.87
</screen>
2408       <para>The columns in this example are described in the table below.</para>
2409       <informaltable frame="all">
2410         <tgroup cols="2">
2411           <colspec colname="c1" colwidth="50*"/>
2412           <colspec colname="c2" colwidth="50*"/>
2413           <thead>
2414             <row>
2415               <entry>
2416                 <para><emphasis role="bold">Parameter</emphasis></para>
2417               </entry>
2418               <entry>
2419                 <para><emphasis role="bold">Description</emphasis></para>
2420               </entry>
2421             </row>
2422           </thead>
2423           <tbody>
2424             <row>
2425               <entry><literal>Name</literal></entry>
2426               <entry>Name of the service event.  See the tables below for descriptions of service
2427                 events that are tracked.</entry>
2428             </row>
2429             <row>
2430               <entry>
2431                 <para>
2432                   <literal>Cur. Count </literal></para>
2433               </entry>
2434               <entry>
2435                 <para>Number of events of each type sent in the last interval.</para>
2436               </entry>
2437             </row>
2438             <row>
2439               <entry>
2440                 <para>
2441                   <literal>Cur. Rate </literal></para>
2442               </entry>
2443               <entry>
2444                 <para>Number of events per second in the last interval.</para>
2445               </entry>
2446             </row>
2447             <row>
2448               <entry>
2449                 <para>
2450                   <literal> # Events </literal></para>
2451               </entry>
2452               <entry>
                <para>Total number of such events since the statistics were last cleared.</para>
2454               </entry>
2455             </row>
2456             <row>
2457               <entry>
2458                 <para>
2459                   <literal> Unit </literal></para>
2460               </entry>
2461               <entry>
                <para>Unit of measurement for that statistic (for example, microseconds, requests,
                  buffers, or bytes).</para>
2464               </entry>
2465             </row>
2466             <row>
2467               <entry>
2468                 <para>
2469                   <literal> last </literal></para>
2470               </entry>
2471               <entry>
                <para>Average rate of these events (in units/event) for the last interval during
                  which they arrived. For instance, for an <literal>ost_destroy</literal> event
                  (not shown in the example above), a value of 736 with a unit of
                    <literal>[usec]</literal> would mean that it took an average of 736
                  microseconds per destroy for the 400 object destroys counted in the previous 10
                  seconds.</para>
2476               </entry>
2477             </row>
2478             <row>
2479               <entry>
2480                 <para>
2481                   <literal> min </literal></para>
2482               </entry>
2483               <entry>
                <para>Minimum rate (in units/event) since the service started.</para>
2485               </entry>
2486             </row>
2487             <row>
2488               <entry>
2489                 <para>
2490                   <literal> avg </literal></para>
2491               </entry>
2492               <entry>
2493                 <para>Average rate.</para>
2494               </entry>
2495             </row>
2496             <row>
2497               <entry>
2498                 <para>
2499                   <literal> max </literal></para>
2500               </entry>
2501               <entry>
2502                 <para>Maximum rate.</para>
2503               </entry>
2504             </row>
2505             <row>
2506               <entry>
2507                 <para>
2508                   <literal> stddev </literal></para>
2509               </entry>
2510               <entry>
                <para>Standard deviation (not measured in some cases).</para>
2512               </entry>
2513             </row>
2514           </tbody>
2515         </tgroup>
2516       </informaltable>
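      <para>The relationship between the <literal># Events</literal>, <literal>Cur.
          Count</literal>, and <literal>Cur. Rate</literal> columns can be sketched as a difference
        between two successive readings divided by the reporting interval. The script below is an
        illustration only and is not how <literal>llstat</literal> is implemented. The function name
          <literal>stats_delta</literal> and the simplified two-column snapshot layout are
        assumptions made for this example. The counts are copied from the <literal>#
          Events</literal> column of the first two intervals in the example above, and truncating
        the rate to a whole number is an assumption that happens to be consistent with that
        output.</para>
      <screen>#!/bin/sh
# Sketch: derive per-interval counts and rates from two snapshots.
INTERVAL=10

# Simplified "name cumulative_count" snapshots (assumed layout).
# Values are the "# Events" column of the first two intervals above.
first='req_waittime 8
req_qdepth 8
req_active 8
reqbuf_avail 8
ost_write 8'

second='req_waittime 39
req_qdepth 39
req_active 39
reqbuf_avail 39
ost_write 38'

stats_delta() {
    # $1 = earlier snapshot, $2 = later snapshot, $3 = interval in seconds
    { printf '%s\n' "$1"; echo '--'; printf '%s\n' "$2"; } |
    awk -v interval="$3" '
        $1 == "--" { later = 1; next }        # separator between snapshots
        !later     { before[$1] = $2; next }  # remember the earlier counts
        {
            delta = $2 - before[$1]
            # int() truncates the per-second rate to a whole number
            printf "%-13s count=%d rate=%d\n", $1, delta, int(delta / interval)
        }'
}

stats_delta "$first" "$second" "$INTERVAL"</screen>
      <para>For <literal>req_waittime</literal>, the script reports a count of 31 and a rate of 3,
        which agrees with the <literal>Cur. Count</literal> and <literal>Cur. Rate</literal> values
        shown for the second interval in the example.</para>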
2517       <para>Events common to all services are shown in the table below.</para>
2518       <informaltable frame="all">
2519         <tgroup cols="2">
2520           <colspec colname="c1" colwidth="50*"/>
2521           <colspec colname="c2" colwidth="50*"/>
2522           <thead>
2523             <row>
2524               <entry>
2525                 <para><emphasis role="bold">Parameter</emphasis></para>
2526               </entry>
2527               <entry>
2528                 <para><emphasis role="bold">Description</emphasis></para>
2529               </entry>
2530             </row>
2531           </thead>
2532           <tbody>
2533             <row>
2534               <entry>
2535                 <para>
2536                   <literal> req_waittime </literal></para>
2537               </entry>
2538               <entry>
2539                 <para>Amount of time a request waited in the queue before being handled by an
2540                   available server thread.</para>
2541               </entry>
2542             </row>
2543             <row>
2544               <entry>
2545                 <para>
2546                   <literal> req_qdepth </literal></para>
2547               </entry>
2548               <entry>
2549                 <para>Number of requests waiting to be handled in the queue for this service.</para>
2550               </entry>
2551             </row>
2552             <row>
2553               <entry>
2554                 <para>
2555                   <literal> req_active </literal></para>
2556               </entry>
2557               <entry>
2558                 <para>Number of requests currently being handled.</para>
2559               </entry>
2560             </row>
2561             <row>
2562               <entry>
2563                 <para>
2564                   <literal> reqbuf_avail </literal></para>
2565               </entry>
2566               <entry>
                <para>Number of unsolicited LNET request buffers for this service.</para>
2568               </entry>
2569             </row>
2570           </tbody>
2571         </tgroup>
2572       </informaltable>
2573       <para>Some service-specific events of interest are described in the table below.</para>
2574       <informaltable frame="all">
2575         <tgroup cols="2">
2576           <colspec colname="c1" colwidth="50*"/>
2577           <colspec colname="c2" colwidth="50*"/>
2578           <thead>
2579             <row>
2580               <entry>
2581                 <para><emphasis role="bold">Parameter</emphasis></para>
2582               </entry>
2583               <entry>
2584                 <para><emphasis role="bold">Description</emphasis></para>
2585               </entry>
2586             </row>
2587           </thead>
2588           <tbody>
2589             <row>
2590               <entry>
2591                 <para>
2592                   <literal> ldlm_enqueue </literal></para>
2593               </entry>
2594               <entry>
                <para>Time it takes to enqueue a lock (this includes file open on the MDS).</para>
2596               </entry>
2597             </row>
2598             <row>
2599               <entry>
2600                 <para>
2601                   <literal> mds_reint </literal></para>
2602               </entry>
2603               <entry>
                <para>Time it takes to process an MDS modification record (includes
                    <literal>create</literal>, <literal>mkdir</literal>, <literal>unlink</literal>,
                    <literal>rename</literal>, and <literal>setattr</literal>).</para>
2607               </entry>
2608             </row>
2609           </tbody>
2610         </tgroup>
2611       </informaltable>
2612     </section>
2613     <section>
2614       <title>Interpreting MDT Statistics</title>
2615       <note>
2616         <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>) and
2617             <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2618       </note>
2619       <para>MDT <literal>stats</literal> files can be used to track MDT statistics for the MDS. The
2620         example below shows sample output from an MDT <literal>stats</literal> file.</para>
      <screen># lctl get_param mds.*-MDT0000.stats
snapshot_time                   1244832003.676892 secs.usecs
open                            2 samples [reqs]
close                           1 samples [reqs]
getxattr                        3 samples [reqs]
process_config                  1 samples [reqs]
connect                         2 samples [reqs]
disconnect                      2 samples [reqs]
statfs                          3 samples [reqs]
setattr                         1 samples [reqs]
getattr                         3 samples [reqs]
llog_init                       6 samples [reqs]
notify                          16 samples [reqs]</screen>
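      <para>Output in this form is straightforward to post-process. The sketch below lists the
        counters with the most samples first. The function name <literal>top_counters</literal> is
        hypothetical, and the input is a hard-coded copy of the sample output above so that the
        script can be run without a Lustre file system. On a live system the same function could be
        fed from <literal>lctl get_param</literal> instead.</para>
      <screen>#!/bin/sh
# Sketch: list counters from a stats dump, busiest first.

top_counters() {
    # Reads "name N samples [unit]" lines on stdin and prints "N name".
    # $1 = how many counters to print (defaults to 3).
    # Lines such as snapshot_time lack the "samples" field and are skipped.
    awk '$3 == "samples" { print $2, $1 }' | sort -rn | head -n "${1:-3}"
}

# Sample input copied from the output above.
mdt_stats='snapshot_time 1244832003.676892 secs.usecs
open 2 samples [reqs]
close 1 samples [reqs]
getxattr 3 samples [reqs]
process_config 1 samples [reqs]
connect 2 samples [reqs]
disconnect 2 samples [reqs]
statfs 3 samples [reqs]
setattr 1 samples [reqs]
getattr 3 samples [reqs]
llog_init 6 samples [reqs]
notify 16 samples [reqs]'

printf '%s\n' "$mdt_stats" | top_counters 2</screen>
      <para>With the sample input, the two busiest counters are <literal>notify</literal> (16
        samples) and <literal>llog_init</literal> (6 samples).</para>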
2634     </section>
2635   </section>
2636 </chapter>