service immediately and disables automatic thread creation behavior.
</para>
</note>
- <para condition='l23'>Lustre software release 2.3 introduced new
- parameters to provide more control to administrators.</para>
+ <para>Parameters are available to give administrators control
+ over the number of service threads.</para>
<itemizedlist>
<listitem>
<para>
</itemizedlist>
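+ <para>As a minimal illustration (using the OSS I/O service as an
+ example; the parameters for the other services follow the same
+ pattern), the running thread count can be inspected and capped at
+ runtime with <literal>lctl</literal>:</para>
+ <screen>oss# lctl get_param ost.OSS.ost_io.threads_started
+oss# lctl set_param ost.OSS.ost_io.threads_max=512</screen>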
</section>
</section>
- <section xml:id="dbdoclet.mdsbinding" condition='l23'>
+ <section xml:id="dbdoclet.mdsbinding">
<title>
<indexterm>
<primary>tuning</primary>
<secondary>MDS binding</secondary>
</indexterm>Binding MDS Service Thread to CPU Partitions</title>
- <para>With the introduction of Node Affinity (
- <xref linkend="nodeaffdef" />) in Lustre software release 2.3, MDS threads
- can be bound to particular CPU partitions (CPTs) to improve CPU cache
- usage and memory locality. Default values for CPT counts and CPU core
+ <para>With the Node Affinity (<xref linkend="nodeaffdef" />) feature,
+ MDS threads can be bound to particular CPU partitions (CPTs) to improve CPU
+ cache usage and memory locality. Default values for CPT counts and CPU core
bindings are selected automatically to provide good overall performance for
a given CPU count. However, an administrator can deviate from these settings
if they choose. For details on specifying the mapping of CPU cores to
<para>By default, this parameter is off. As always, you should test the
performance to compare the impact of changing this parameter.</para>
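+ <para>As a rough sketch of how such a binding could be expressed (the
+ <literal>mds_num_cpts</literal> option of the <literal>mdt</literal>
+ module is assumed here; consult the module parameters reference for
+ the authoritative syntax), the MDS service threads could be restricted
+ to CPT 0 at module load time:</para>
+ <screen>options mdt mds_num_cpts=[0]</screen>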
</section>
- <section condition='l23'>
+ <section>
<title>
<indexterm>
<primary>tuning</primary>
<secondary>Network interface binding</secondary>
</indexterm>Binding Network Interface Against CPU Partitions</title>
- <para>Lustre software release 2.3 and beyond provide enhanced network
- interface control. The enhancement means that an administrator can bind
- an interface to one or more CPU partitions. Bindings are specified as
- options to the LNet modules. For more information on specifying module
- options, see
+ <para>Lustre provides enhanced network interface control, allowing an
+ administrator to bind an interface to one or more CPU partitions.
+ Bindings are specified as options to the LNet modules. For more
+ information on specifying module options, see
<xref linkend="dbdoclet.50438293_15350" /></para>
<para>For example,
<literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
<screen>
ko2iblnd credits=256
</screen>
- <note condition="l23">
- <para>In Lustre software release 2.3 and beyond, LNet may revalidate
- the NI credits, so the administrator's request may not persist.</para>
+ <note>
+ <para>LNet may revalidate the NI credits, so the administrator's
+ request may not persist.</para>
</note>
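+ <para>Because LNet may adjust the configured value, it can be useful
+ to read back the credits actually in effect, for example with
+ <literal>lnetctl</literal>:</para>
+ <screen># lnetctl net show -v</screen>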
</section>
<section>
<screen>
lnet large_router_buffers=8192
</screen>
- <note condition="l23">
- <para>In Lustre software release 2.3 and beyond, LNet may revalidate
- the router buffer setting, so the administrator's request may not
- persist.</para>
+ <note>
+ <para>LNet may revalidate the router buffer setting, so the
+ administrator's request may not persist.</para>
</note>
</section>
<section>
interface. The default setting is 1. (For more information about the
LNet routes parameter, see
<xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="dbdoclet.50438216_71227" /></para>
+ linkend="lnet_module_routes" /></para>
<para>A router is considered down if any of its NIDs are down. For
example, router X has three NIDs:
<literal>Xnid1</literal>,
be MAX.</para>
</section>
</section>
- <section xml:id="dbdoclet.libcfstuning" condition='l23'>
+ <section xml:id="dbdoclet.libcfstuning">
<title>
<indexterm>
<primary>tuning</primary>
<secondary>libcfs</secondary>
</indexterm>libcfs Tuning</title>
- <para>Lustre software release 2.3 introduced binding service threads via
- CPU Partition Tables (CPTs). This allows the system administrator to
- fine-tune on which CPU cores the Lustre service threads are run, for both
- OSS and MDS services, as well as on the client.
+ <para>Lustre allows binding service threads via CPU Partition Tables
+ (CPTs). This allows the system administrator to fine-tune on which CPU
+ cores the Lustre service threads are run, for both OSS and MDS services,
+ as well as on the client.
</para>
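+ <para>As an illustrative sketch (normally only one of the two standard
+ <literal>libcfs</literal> module options shown would be used), either
+ the number of CPU partitions, or an explicit mapping of cores to
+ partitions, can be set at module load time:</para>
+ <screen>options libcfs cpu_npartitions=4
+options libcfs cpu_pattern="0[0,2,4,6] 1[1,3,5,7]"</screen>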
<para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
system functions such as system monitoring, HA heartbeat, or similar
</entry>
<entry>
<para>Introduced in 2.10. Number of connections to each peer. Messages
- are sent round-robin over the connection pool. Provides signifiant
+ are sent round-robin over the connection pool. Provides significant
improvement with OmniPath.</para>
</entry>
</row>
<title>The internal structure of TBF policy</title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="100%"
- fileref="figures/TBF_policy.svg" />
+ <imagedata scalefit="1" width="50%"
+ fileref="figures/TBF_policy.png" />
</imageobject>
<textobject>
<phrase>The internal structure of TBF policy</phrase>
<screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf <<replaceable>policy</replaceable>>"
</screen>
<para>For now, the RPCs can be classified into the different types
- according to their NID, JOBID and OPCode. (UID/GID will be supported
- soon.) When enabling TBF policy, you can specify one of the types, or
- just use "tbf" to enable all of them to do a fine-grained RPC requests
- classification.</para>
+ according to their NID, JOBID, OPCode and UID/GID. When enabling TBF
+ policy, you can specify one of these types, or just use "tbf" to enable
+ all of them for fine-grained classification of RPC requests.</para>
<para>Example:</para>
<screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
-$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"</screen>
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
</section>
<section remap="h4">
<title>Start a TBF rule</title>
linkend="dbdoclet.jobstats" /> for more details.</para>
<para>Command:</para>
<screen>lctl set_param x.x.x.nrs_tbf_rule=
-"[reg|hp] start <replaceable>name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
+"[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
</screen>
<para>Wildcard is supported in
{<replaceable>jobid_list</replaceable>}.</para>
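+ <para>For example (the rule name <literal>dd_runas</literal> is
+ arbitrary), a wildcard can be used to match all <literal>dd</literal>
+ jobs regardless of the user running them:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start dd_runas jobid={dd.*} rate=50"</screen>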
<para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
<para>Command:</para>
<screen>$ lctl set_param x.x.x.nrs_tbf_rule=
-"[reg|hp] start <replaceable>name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
+"[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
</screen>
<para>Example:</para>
<screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"reg start iozone_user1 opcode={ost_read} rate=100"</screen>
</listitem>
<listitem>
+ <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
+$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
+ <para>Example:</para>
+ <para>Limit the rate of RPC requests of the uid 500</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ <para>Limit the rate of RPC requests of the gid 500</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name gid={500} rate=100"</screen>
+ <para>The following rules can also be used to control all requests
+ to the MDS.</para>
+ <para>Start the TBF uid QoS policy on the MDS:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
+ <para>Limit the rate of RPC requests of the uid 500</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ </listitem>
+ <listitem>
<para><emphasis role="bold">Policy combination</emphasis></para>
- <para>To support rules with complex expressions of NID/JOBID/OPCode
- conditions, TBF classifier is extented to classify RPC in a more
- fine-grained way. This feature supports logical conditional
- conjunction and disjunction operations among different types.
+ <para>To support TBF rules with complex expressions of conditions,
+ the TBF classifier is extended to classify RPCs in a more fine-grained
+ way. This feature supports logical conditional conjunction and
+ disjunction operations among different types.
In the rule:
"&" represents the conditional conjunction and
"," represents the conditional disjunction.</para>
CPT 1:
comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
default * 10000, ref 0</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500}&gid={500} rate=100"</screen>
+ <para>In this example, those RPC requests whose uid is 500 and
+ gid is 500 will be processed at the rate of 100 req/sec.</para>
</listitem>
</itemizedlist>
</section>
computes nid={192.168.1.[2-128]@tcp} 500, ref 0
default * 10000, ref 0</screen>
</listitem>
+ <listitem>
+ <para><emphasis role="bold">TBF realtime policies under congestion
+ </emphasis></para>
+ <para>During TBF evaluation, we find that when the sum of the I/O
+ bandwidth requirements of all classes exceeds the system capacity,
+ classes configured with the same rate limit can receive less than an
+ even share of the bandwidth. The reason is that the heavy load on a
+ congested server causes some classes to miss their deadlines, so the
+ number of calculated tokens may be larger than 1 during dequeuing. In
+ the original implementation, all classes are handled equally and the
+ excess tokens are simply discarded.</para>
+ <para>Thus, a Hard Token Compensation (HTC) strategy has been
+ implemented. A class is configured with the HTC feature through the
+ rule it matches. The feature indicates that requests in such class
+ queues have high real-time requirements and that the bandwidth
+ assignment must be satisfied as well as possible. When deadline
+ misses happen, the class keeps the deadline unchanged and the time
+ residue (the remainder of the elapsed time divided by 1/r) is
+ compensated to the next round. This ensures that the next idle I/O
+ thread will always select this class to serve until all accumulated
+ excess tokens are handled or there are no pending requests in the
+ class queue.</para>
+ <para>Command:</para>
+ <para>A new command format is added to enable the realtime feature
+ for a rule:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=\
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start realjob jobid={dd.0} rate=100 realtime=1"</screen>
+ <para>This example rule means that RPC requests whose JobID is dd.0
+ will be processed at a rate of 100 req/sec in realtime.</para>
+ </listitem>
</itemizedlist>
</section>
</section>
Server-Side Advice and Hinting
</title>
<section><title>Overview</title>
- <para>Use the <literal>lfs ladvise</literal> command give file access
+ <para>Use the <literal>lfs ladvise</literal> command to give file access
    advice or hints to servers.</para>
<screen>lfs ladvise [--advice|-a ADVICE ] [--background|-b]
[--start|-s START[kMGT]]
cache</para>
<para><literal>dontneed</literal> to cleanup data cache on
server</para>
+ <para><literal>lockahead</literal> Request an LDLM extent lock
+ of the given mode on the given byte range </para>
+ <para><literal>noexpand</literal> Disable extent lock expansion
+ behavior for I/O to this file descriptor</para>
</entry>
</row>
<row>
<literal>-e</literal> option.</para>
</entry>
</row>
+ <row>
+ <entry>
+ <para><literal>-m</literal>, <literal>--mode=</literal>
+ <literal>MODE</literal></para>
+ </entry>
+ <entry>
+ <para>Lockahead request mode <literal>{READ,WRITE}</literal>.
+ Request a lock with this mode.</para>
+ </entry>
+ </row>
</tbody>
</tgroup>
</informaltable>
random IO is a net benefit. Fetching that data into each client cache with
fadvise() may not be, due to much more data being sent to the client.
</para>
+ <para>
+ <literal>ladvise lockahead</literal> is different in that it attempts to
+ control LDLM locking behavior by explicitly requesting LDLM locks in
+ advance of use. This does not directly affect caching behavior; instead,
+ it is used in special cases to avoid pathological results (lock exchange)
+ from the normal LDLM locking behavior.
+ </para>
+ <para>
+ Note that the <literal>noexpand</literal> advice works on a specific
+ file descriptor, so using it via <literal>lfs</literal> has no effect;
+ it must be applied to the particular file descriptor used for I/O in
+ order to have any effect.
+ </para>
<para>The main difference between the Linux <literal>fadvise()</literal>
system call and <literal>lfs ladvise</literal> is that
<literal>fadvise()</literal> is only a client side mechanism that does
cache of the file in the memory.</para>
<screen>client1$ lfs ladvise -a dontneed -s 0 -e 1048576000 /mnt/lustre/file1
</screen>
+ <para>The following example requests an LDLM read lock on the first
+ 1 MiB of <literal>/mnt/lustre/file1</literal>. This will attempt to
+ request a lock from the OST holding that region of the file.</para>
+ <screen>client1$ lfs ladvise -a lockahead -m READ -s 0 -e 1M /mnt/lustre/file1
+ </screen>
+ <para>The following example requests an LDLM write lock on
+ [3 MiB, 10 MiB] of <literal>/mnt/lustre/file1</literal>. This will
+ attempt to request a lock from the OST holding that region of the
+ file.</para>
+ <screen>client1$ lfs ladvise -a lockahead -m WRITE -s 3M -e 10M /mnt/lustre/file1
+ </screen>
</section>
</section>
<section condition="l29">
16MB. To temporarily change <literal>brw_size</literal>, the
following command should be run on the OSS:</para>
<screen>oss# lctl set_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
- <para>To persistently change <literal>brw_size</literal>, one of the following
- commands should be run on the OSS:</para>
+ <para>To persistently change <literal>brw_size</literal>, the
+ following command should be run:</para>
<screen>oss# lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
- <screen>oss# lctl conf_param <replaceable>fsname</replaceable>-OST*.obdfilter.brw_size=16</screen>
<para>When a client connects to an OST target, it will fetch
<literal>brw_size</literal> from the target and pick the maximum value
of <literal>brw_size</literal> and its local setting for
<screen>client$ lctl set_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
<para>To persistently make this change, the following command should
be run:</para>
- <screen>client$ lctl conf_param <replaceable>fsname</replaceable>-OST*.osc.max_pages_per_rpc=16M</screen>
+ <screen>client$ lctl set_param -P osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
<caution><para>The <literal>brw_size</literal> of an OST can be
changed on the fly. However, clients have to be remounted to
renegotiate the new maximum RPC size.</para></caution>
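+ <para>After clients have been remounted, the RPC size actually in use
+ can be confirmed by reading the parameter back on a client:</para>
+ <screen>client$ lctl get_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc</screen>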