+ <section xml:id="section_c24_nt5_dl">
+ <title>Setting Static Timeouts<indexterm>
+ <primary>proc</primary>
+ <secondary>static timeouts</secondary>
+ </indexterm></title>
+ <para>The Lustre software provides two sets of static (fixed) timeouts, LND timeouts and
+ Lustre timeouts, which are used when adaptive timeouts are not enabled.</para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para><emphasis role="italic"><emphasis role="bold">LND timeouts</emphasis></emphasis> -
+ LND timeouts ensure that point-to-point communications across a network complete in a
+ finite time in the presence of failures, such as lost packets or broken connections.
+ LND timeout parameters are set for each individual LND.</para>
+ <para>LND timeouts are logged with the <literal>S_LND</literal> flag set. They are not
+ printed as console messages, so check the Lustre log for <literal>D_NETERROR</literal>
+ messages or enable printing of <literal>D_NETERROR</literal> messages to the console
+ using:<screen>lctl set_param printk=+neterror</screen></para>
+ <para>Congested routers can be a source of spurious LND timeouts. To avoid this
+ situation, increase the number of LNet router buffers to reduce back-pressure and/or
+ increase LND timeouts on all nodes on all connected networks. Also consider increasing
+ the total number of LNet router nodes in the system so that the aggregate router
+ bandwidth matches the aggregate server bandwidth.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis role="italic"><emphasis role="bold">Lustre timeouts
+ </emphasis></emphasis> - Lustre timeouts ensure that Lustre RPCs complete in a finite
+ time in the presence of failures when adaptive timeouts are not enabled. Adaptive
+ timeouts are enabled by default. To disable adaptive timeouts at run time, set
+ <literal>at_max</literal> to 0 by running on the
+ MGS:<screen># lctl conf_param <replaceable>fsname</replaceable>.sys.at_max=0</screen></para>
+ <note>
+ <para>Changing the status of adaptive timeouts at runtime may cause a transient client
+ timeout, recovery, and reconnection.</para>
+ </note>
+ <para>Lustre timeouts are always printed as console messages. </para>
+ <para>If Lustre timeouts are not accompanied by LND timeouts, increase the Lustre
+ timeout on both servers and clients. Lustre timeouts are set using a command such as
+ the following:<screen># lctl set_param timeout=30</screen></para>
+ <para>Lustre timeout parameters are described in the table below.</para>
+ </listitem>
+ </itemizedlist>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colnum="1" colwidth="30*"/>
+ <colspec colname="c2" colnum="2" colwidth="70*"/>
+ <thead>
+ <row>
+ <entry>Parameter</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>timeout</literal></entry>
+ <entry>
+ <para>The time that a client waits for a server to complete an RPC (default 100s).
+ Servers wait half this time for a normal client RPC to complete and a quarter of
+ this time for a single bulk request (read or write of up to 4 MB) to complete.
+ The client pings recoverable targets (MDS and OSTs) at one quarter of the
+ timeout, and the server waits one and a half times the timeout before evicting a
+ client for being "stale."</para>
+ <para>A Lustre client sends periodic 'ping' messages to servers with which
+ it has had no communication for the specified period of time. Any network
+ activity between a client and a server in the file system also serves as a
+ ping.</para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>ldlm_timeout</literal></entry>
+ <entry>
+ <para>The time that a server waits for a client to reply to an initial AST (lock
+ cancellation request). The default is 20s for an OST and 6s for an MDS. If the
+ client replies to the AST, the server will give it a normal timeout (half the
+ client timeout) to flush any dirty data and release the lock.</para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>fail_loc</literal></entry>
+ <entry>
+ <para>An internal debugging failure hook. The default value of
+ <literal>0</literal> means that no failure will be triggered or
+ injected.</para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>dump_on_timeout</literal></entry>
+ <entry>
+ <para>Triggers a dump of the Lustre debug log when a timeout occurs. The default
+ value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
+ not be triggered.</para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>dump_on_eviction</literal></entry>
+ <entry>
+ <para>Triggers a dump of the Lustre debug log when an eviction occurs. The default
+ value of <literal>0</literal> (zero) means a dump of the Lustre debug log will
+ not be triggered. </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ </para>
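<para>The fixed ratios described in the table (server RPC wait, bulk wait, ping
interval, and eviction time as fractions of <literal>timeout</literal>) can be
sketched with a little shell arithmetic. This is only an illustration of the
arithmetic above, not a supported tool; the 100 s value is the documented
default.</para>

```shell
#!/bin/sh
# Derived intervals for a given Lustre "timeout" value, following the
# fractions described in the table above. Illustration only.
TIMEOUT=100                      # client RPC timeout (default 100s)
SERVER_RPC=$((TIMEOUT / 2))      # server wait for a normal client RPC
BULK=$((TIMEOUT / 4))            # wait for a single bulk read/write
PING=$((TIMEOUT / 4))            # client ping interval for idle targets
EVICT=$((TIMEOUT * 3 / 2))       # server wait before evicting a stale client
echo "rpc=$SERVER_RPC bulk=$BULK ping=$PING evict=$EVICT"
```

<para>With the 100 s default this prints <literal>rpc=50 bulk=25 ping=25
evict=150</literal>, matching the fractions given in the table.</para>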
+ </section>
+ </section>
+ <section remap="h3">
+ <title><indexterm>
+ <primary>proc</primary>
+ <secondary>LNet</secondary>
+ </indexterm><indexterm>
+ <primary>LNet</primary>
+ <secondary>proc</secondary>
+ </indexterm>Monitoring LNet</title>
+ <para>LNet information is located in <literal>/proc/sys/lnet</literal> in these files:<itemizedlist>
+ <listitem>
+ <para><literal>peers</literal> - Shows all NIDs known to this node and provides
+ information on the queue state.</para>
+ <para>Example:</para>
+ <screen># lctl get_param peers
+nid refs state max rtr min tx min queue
+0@lo 1 ~rtr 0 0 0 0 0 0
+192.168.10.35@tcp 1 ~rtr 8 8 8 8 6 0
+192.168.10.36@tcp 1 ~rtr 8 8 8 8 6 0
+192.168.10.37@tcp 1 ~rtr 8 8 8 8 6 0</screen>
+ <para>The fields are explained in the table below:</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="30*"/>
+ <colspec colname="c2" colwidth="80*"/>
+ <thead>
+ <row>
+ <entry>
+ <para><emphasis role="bold">Field</emphasis></para>
+ </entry>
+ <entry>
+ <para><emphasis role="bold">Description</emphasis></para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>refs</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>A reference count. </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>state</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>If the node is a router, indicates the state of the router. Possible
+ values are:</para>
+ <itemizedlist>
+ <listitem>
+ <para><literal>NA</literal> - Indicates the node is not a router.</para>
+ </listitem>
+ <listitem>
+ <para><literal>up/down</literal> - Indicates if the node (router) is up or
+ down.</para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>max </literal></para>
+ </entry>
+ <entry>
+ <para>Maximum number of concurrent sends from this peer.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>rtr </literal></para>
+ </entry>
+ <entry>
+ <para>Number of routing buffer credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>min </literal></para>
+ </entry>
+ <entry>
+ <para>Minimum number of routing buffer credits seen.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>tx </literal></para>
+ </entry>
+ <entry>
+ <para>Number of send credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>min </literal></para>
+ </entry>
+ <entry>
+ <para>Minimum number of send credits seen.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>queue </literal></para>
+ </entry>
+ <entry>
+ <para>Total bytes in active/queued sends.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ <para>Credits are initialized to allow a certain number of operations (eight in the
+ example above, as shown in the <literal>max</literal> column). LNet keeps track
+ of the minimum number of credits ever seen over time, showing the peak congestion
+ that has occurred during the time monitored. Fewer available credits indicate a
+ more congested resource. </para>
+ <para>The number of credits currently in flight (number of transmit credits) is shown in
+ the <literal>tx</literal> column. The maximum number of send credits available is shown
+ in the <literal>max</literal> column and never changes. The number of router buffers
+ available for consumption by a peer is shown in the <literal>rtr</literal>
+ column.</para>
+ <para>Therefore, <literal>rtr</literal> – <literal>tx</literal> is the number of transmits
+ in flight. Typically, <literal>rtr == max</literal>, although a configuration can be set
+ such that <literal>max >= rtr</literal>. The ratio of routing buffer credits to send
+ credits (<literal>rtr/tx</literal>) that is less than <literal>max</literal> indicates
+ operations are in progress. If the ratio <literal>rtr/tx</literal> is greater than
+ <literal>max</literal>, operations are blocking.</para>
+ <para>LNet also limits concurrent sends and the number of router buffers allocated
+ to a single peer so that no peer can occupy all of these resources.</para>
+ </listitem>
+ <listitem>
+ <para><literal>nis</literal> - Shows the current queue health on this node.</para>
+ <para>Example:</para>
+ <screen># lctl get_param nis
+nid refs peer max tx min
+0@lo 3 0 0 0 0
+192.168.10.34@tcp 4 8 256 256 252
+</screen>
+ <para> The fields are explained in the table below.</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="30*"/>
+ <colspec colname="c2" colwidth="80*"/>
+ <thead>
+ <row>
+ <entry>
+ <para><emphasis role="bold">Field</emphasis></para>
+ </entry>
+ <entry>
+ <para><emphasis role="bold">Description</emphasis></para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal> nid </literal></para>
+ </entry>
+ <entry>
+ <para>Network interface.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal> refs </literal></para>
+ </entry>
+ <entry>
+ <para>Internal reference counter.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal> peer </literal></para>
+ </entry>
+ <entry>
+ <para>Number of peer-to-peer send credits on this NID. Credits are used to size
+ buffer pools.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal> max </literal></para>
+ </entry>
+ <entry>
+ <para>Total number of send credits on this NID.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal> tx </literal></para>
+ </entry>
+ <entry>
+ <para>Current number of send credits available on this NID.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal> min </literal></para>
+ </entry>
+ <entry>
+ <para>Lowest number of send credits available on this NID.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal> queue </literal></para>
+ </entry>
+ <entry>
+ <para>Total bytes in active/queued sends.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ <para><emphasis role="bold"><emphasis role="italic">Analysis:</emphasis></emphasis></para>
+ <para>Subtracting <literal>tx</literal> from <literal>max</literal>
+ (<literal>max</literal> - <literal>tx</literal>) yields the number of sends currently
+ active. A large or increasing number of active sends may indicate a problem.</para>
+ </listitem>
+ </itemizedlist></para>
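<para>The <literal>max</literal> - <literal>tx</literal> analysis above is easy to
script against saved output. The following sketch parses the sample
<literal>nis</literal> listing shown earlier with <literal>awk</literal>; the NIDs
and numbers are copied from that example, not taken from a live system.</para>

```shell
#!/bin/sh
# Compute active sends (max - tx) per NID from saved "lctl get_param nis"
# output. Sample data copied from the example above; on a real node,
# pipe the live command output through the same awk program instead.
nis_output='nid refs peer max tx min
0@lo 3 0 0 0 0
192.168.10.34@tcp 4 8 256 256 252'

# Skip the header line, then print each NID with max - tx.
active=$(printf '%s\n' "$nis_output" | awk 'NR > 1 { print $1, $4 - $5 }')
printf '%s\n' "$active"
```

<para>In this sample both NIDs show zero active sends; a large or growing value for a
NID would warrant a closer look, as described above.</para>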
+ </section>
+ <section remap="h3" xml:id="dbdoclet.balancing_free_space">
+ <title><indexterm>
+ <primary>proc</primary>
+ <secondary>free space</secondary>
+ </indexterm>Allocating Free Space on OSTs</title>
+ <para>Free space is allocated using either a round-robin or a weighted
+ algorithm. The allocation method is determined by the maximum amount of
+ free-space imbalance between the OSTs. When free space is relatively
+ balanced across OSTs, the faster round-robin allocator is used, which
+ maximizes network balancing. The weighted allocator is used when any two
+ OSTs are out of balance by more than a specified threshold.</para>
+ <para>Free space distribution can be tuned using these two
+ <literal>/proc</literal> tunables:</para>
+ <itemizedlist>
+ <listitem>
+ <para><literal>qos_threshold_rr</literal> - The threshold at which
+ the allocation method switches from round-robin to weighted is set
+ in this file. The default is to switch to the weighted algorithm when
+ any two OSTs are out of balance by more than 17 percent.</para>
+ </listitem>
+ <listitem>
+ <para><literal>qos_prio_free</literal> - The weighting priority used
+ by the weighted allocator can be adjusted in this file. Increasing the
+ value of <literal>qos_prio_free</literal> puts more weighting on the
+ amount of free space available on each OST and less on how stripes are
+ distributed across OSTs. The default value is 91 percent weighting for
+ free space rebalancing and 9 percent for OST balancing. When the
+ free space priority is set to 100, weighting is based entirely on free
+ space and location is no longer used by the striping algorithm.</para>
+ </listitem>
+ <listitem>
+ <para condition="l29"><literal>reserved_mb_low</literal> - The low
+ watermark, in MB, below which object allocation on the OST is stopped.
+ The default is 0.1 percent of total OST size.</para>
+ </listitem>
+ <listitem>
+ <para condition="l29"><literal>reserved_mb_high</literal> - The high watermark, in MB,
+ above which object allocation resumes once available space exceeds it. The default is
+ 0.2 percent of total OST size.</para>
+ </listitem>
+ </itemizedlist>
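<para>A simplified sketch of the allocator-selection rule described above: compare the
best- and worst-provisioned OSTs and switch to the weighted allocator when they differ
by more than <literal>qos_threshold_rr</literal> percent. The free-space figures below
are hypothetical, and the in-kernel calculation is more involved; this only mirrors the
threshold idea.</para>

```shell
#!/bin/sh
# Decide round-robin vs. weighted allocation from per-OST free space,
# mimicking the qos_threshold_rr rule described above (simplified;
# the free-space figures are hypothetical, not from a real file system).
THRESHOLD=17                    # default qos_threshold_rr, in percent
free_kb='900000 870000 600000'  # hypothetical free KB on each OST

# Find the most and least free OSTs.
max=0
min=''
for f in $free_kb; do
  if [ "$f" -gt "$max" ]; then max=$f; fi
  if [ -z "$min" ] || [ "$f" -lt "$min" ]; then min=$f; fi
done

# Imbalance as a percentage of the fullest OST's free space.
imbalance=$(( (max - min) * 100 / max ))
if [ "$imbalance" -gt "$THRESHOLD" ]; then
  allocator=weighted
else
  allocator=round-robin
fi
echo "$allocator (imbalance ${imbalance}%)"
```

<para>With these sample figures the imbalance is 33 percent, above the 17 percent
default, so the weighted allocator would be selected.</para>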
+ <para>For more information about monitoring and managing free space, see <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438209_10424"/>.</para>
+ </section>
+ <section remap="h3">
+ <title><indexterm>
+ <primary>proc</primary>
+ <secondary>locking</secondary>
+ </indexterm>Configuring Locking</title>
+ <para>The <literal>lru_size</literal> parameter is used to control the number of client-side
+ locks in an LRU cached locks queue. LRU size is dynamic, based on load to optimize the number
+ of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute
+ nodes vs. backup nodes).</para>
+ <para>The total number of locks available is a function of the server RAM. The default limit is
+ 50 locks/1 MB of RAM. If memory pressure is too high, the LRU size is shrunk. The number of
+ locks on the server is limited to <emphasis role="italic">the number of OSTs per
+ server</emphasis> * <emphasis role="italic">the number of clients</emphasis> * <emphasis
+ role="italic">the value of the</emphasis>
+ <literal>lru_size</literal>
+ <emphasis role="italic">setting on the client</emphasis> as follows: </para>
+ <itemizedlist>
+ <listitem>
+ <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0. In
+ this case, the <literal>lru_size</literal> parameter shows the current number of locks
+ being used on the export. LRU sizing is enabled by default.</para>
+ </listitem>
+ <listitem>
+ <para>To specify a maximum number of locks, set the <literal>lru_size</literal> parameter to
+ a value other than zero but, normally, less than 100 * <emphasis role="italic">number of
+ CPUs in client</emphasis>. It is recommended that you only increase the LRU size on a
+ few login nodes where users access the file system interactively.</para>
+ </listitem>
+ </itemizedlist>
+ <para>To clear the LRU on a single client, and, as a result, flush client cache without changing
+ the <literal>lru_size</literal> value, run:</para>
+ <screen>$ lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
+ <para>If the LRU size is set to be less than the number of existing unused locks, the unused
+ locks are canceled immediately. Use <literal>echo clear</literal> to cancel all locks without
+ changing the value.</para>
+ <note>
+ <para>The <literal>lru_size</literal> parameter can only be set temporarily using
+ <literal>lctl set_param</literal>; it cannot be set permanently.</para>
+ </note>
+ <para>To disable automatic LRU sizing on the Lustre clients, run:</para>
+ <screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((<replaceable>NR_CPU</replaceable>*100))</screen>
+ <para>Replace <literal><replaceable>NR_CPU</replaceable></literal> with the number of CPUs on
+ the node.</para>
+ <para>To determine the number of locks being granted, run:</para>
+ <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
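<para>The <replaceable>NR_CPU</replaceable>*100 arithmetic used above can be expanded
as follows. On a real client <literal>nproc</literal> would normally supply the CPU
count; a hypothetical value of 8 is hard-coded here so the sketch is
self-contained.</para>

```shell
#!/bin/sh
# Build the lru_size command from the client CPU count, following the
# NR_CPU * 100 guideline above. NR_CPU is hard-coded for illustration;
# on a real client use: NR_CPU=$(nproc)
NR_CPU=8
LRU_SIZE=$((NR_CPU * 100))
echo "lctl set_param ldlm.namespaces.*osc*.lru_size=$LRU_SIZE"
```

<para>With 8 CPUs this yields a fixed LRU size of 800 locks per namespace.</para>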
+ </section>
+ <section xml:id="dbdoclet.50438271_87260">
+ <title><indexterm>
+ <primary>proc</primary>
+ <secondary>thread counts</secondary>
+ </indexterm>Setting MDS and OSS Thread Counts</title>
+ <para>The MDS and OSS thread count tunables can be used to set the minimum and maximum
+ thread counts, or to get the current number of running threads, for the services listed
+ in the table below.</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*"/>
+ <colspec colname="c2" colwidth="50*"/>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Service</emphasis></para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis></para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> mds.MDS.mdt </literal>
+ </entry>
+ <entry>
+ <para>Main metadata operations service</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> mds.MDS.mdt_readpage </literal>
+ </entry>
+ <entry>
+ <para>Metadata <literal>readdir</literal> service</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> mds.MDS.mdt_setattr </literal>
+ </entry>
+ <entry>
+ <para>Metadata <literal>setattr/close</literal> operations service </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> ost.OSS.ost </literal>
+ </entry>
+ <entry>
+ <para>Main data operations service</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> ost.OSS.ost_io </literal>
+ </entry>
+ <entry>
+ <para>Bulk data I/O services</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> ost.OSS.ost_create </literal>
+ </entry>
+ <entry>
+ <para>OST object pre-creation service</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> ldlm.services.ldlm_canceld </literal>
+ </entry>
+ <entry>
+ <para>DLM lock cancel service</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <literal> ldlm.services.ldlm_cbd </literal>
+ </entry>
+ <entry>
+ <para>DLM lock grant service</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ <para>For each service, an entry as shown below is
+ created:<screen>/proc/fs/lustre/<replaceable>service</replaceable>/*/threads_<replaceable>min|max|started</replaceable></screen></para>
+ <itemizedlist>
+ <listitem>
+ <para>To temporarily set this tunable, run:</para>
+ <screen># lctl <replaceable>get|set</replaceable>_param <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
+ </listitem>
+ <listitem>
+ <para>To permanently set this tunable, run:</para>
+ <screen># lctl conf_param <replaceable>obdname|fsname.obdtype</replaceable>.threads_<replaceable>min|max|started</replaceable> </screen>
+ <para condition='l25'>For version 2.5 or later, run:
+ <screen># lctl set_param -P <replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></screen></para>
+ </listitem>
+ </itemizedlist>
+ <para>The following examples show how to set thread counts and get the number of running threads
+ for the service <literal>ost_io</literal> using the tunable
+ <literal><replaceable>service</replaceable>.threads_<replaceable>min|max|started</replaceable></literal>.</para>
+ <itemizedlist>
+ <listitem>
+ <para>To get the number of running threads, run:</para>
+ <screen># lctl get_param ost.OSS.ost_io.threads_started
+ost.OSS.ost_io.threads_started=128</screen>
+ </listitem>
+ <listitem>
+ <para>To get the maximum thread count (512 in this example), run:</para>
+ <screen># lctl get_param ost.OSS.ost_io.threads_max
+ost.OSS.ost_io.threads_max=512</screen>
+ </listitem>
+ <listitem>
+ <para>To set the maximum thread count to 256 instead of 512 (for example, to avoid
+ overloading the storage array), run:</para>
+ <screen># lctl set_param ost.OSS.ost_io.threads_max=256
+ost.OSS.ost_io.threads_max=256</screen>
+ </listitem>
+ <listitem>
+ <para>To set the maximum thread count to 256 instead of 512 permanently, run:</para>
+ <screen># lctl conf_param testfs.ost.ost_io.threads_max=256</screen>
+ <para condition='l25'>For version 2.5 or later, run:
+ <screen># lctl set_param -P ost.OSS.ost_io.threads_max=256
+ost.OSS.ost_io.threads_max=256 </screen> </para>
+ </listitem>
+ <listitem>
+ <para> To check if the <literal>threads_max</literal> setting is active, run:</para>
+ <screen># lctl get_param ost.OSS.ost_io.threads_max
+ost.OSS.ost_io.threads_max=256</screen>
+ </listitem>
+ </itemizedlist>
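<para>When the same cap should be applied to several services, the per-service commands
shown above can be generated in a loop. This sketch only prints the commands rather than
running them; the 256 cap and the service names mirror the examples and the table
above.</para>

```shell
#!/bin/sh
# Print (do not run) lctl commands capping threads_max for several OSS
# services at once. The cap and service names follow the examples above;
# pipe the output to sh on a real OSS node to apply it.
CAP=256
cmds=$(for svc in ost.OSS.ost ost.OSS.ost_io ost.OSS.ost_create; do
  echo "lctl set_param ${svc}.threads_max=${CAP}"
done)
printf '%s\n' "$cmds"
```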
+ <note>
+ <para>If the number of service threads is changed while the file system is running, the change
+ may not take effect until the file system is stopped and restarted. If the number of service
+ threads in use exceeds the new <literal>threads_max</literal> value setting, service threads
+ that are already running will not be stopped.</para>
+ </note>
+ <para>See also <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="lustretuning"/>.</para>
+ </section>
+ <section xml:id="dbdoclet.50438271_83523">
+ <title><indexterm>
+ <primary>proc</primary>
+ <secondary>debug</secondary>
+ </indexterm>Enabling and Interpreting Debugging Logs</title>
+ <para>By default, a detailed log of all operations is generated to aid in debugging. Flags that
+ control debugging are found in <literal>/proc/sys/lnet/debug</literal>. </para>
+ <para>The overhead of debugging can affect the performance of a Lustre file system. Therefore, to
+ minimize the impact on performance, the debug level can be lowered, which reduces the amount
+ of debugging information kept in the internal log buffer but does not alter the amount of
+ information that goes into syslog. You can raise the debug level when you need to collect logs
+ to debug problems. </para>
+ <para>The debugging mask can be set using "symbolic names". The symbolic format is
+ shown in the examples below.<itemizedlist>