+ <title>Hardware Interrupts (
+ <literal>enable_irq_affinity</literal>)</title>
+ <para>The hardware interrupts that are generated by network adapters may
+ be handled by any CPU in the system. In some cases, we would like network
+ traffic to remain local to a single CPU to help keep the processor cache
+ warm and minimize the impact of context switches. This is helpful when an
+ SMP system has more than one network interface and ideal when the number
+ of interfaces equals the number of CPUs. To enable the
+ <literal>enable_irq_affinity</literal> parameter, enter:</para>
+ <screen>
+options ksocklnd enable_irq_affinity=1
+</screen>
+ <para>In other cases, if you have an SMP platform with a single fast
+ interface such as 10 Gb Ethernet and more than two CPUs, you may see
+ performance improve by turning this parameter off.</para>
+ <screen>
+options ksocklnd enable_irq_affinity=0
+</screen>
+ <para>By default, this parameter is off. As always, you should test the
+ performance to compare the impact of changing this parameter.</para>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network interface binding</secondary>
+ </indexterm>Binding Network Interface Against CPU Partitions</title>
+ <para>Lustre allows enhanced network interface control. This means that
+ an administrator can bind an interface to one or more CPU partitions.
+ Bindings are specified as options to the LNet modules. For more
+ information on specifying module options, see
+ <xref linkend="dbdoclet.50438293_15350" /></para>
+ <para>For example,
+ <literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
+ <literal>o2ib0</literal> will be handled by LND threads executing on
+ <literal>CPT0</literal> and
+ <literal>CPT1</literal>. An additional example might be:
+ <literal>tcp1(eth0)[0]</literal>. Messages for
+ <literal>tcp1</literal> are handled by threads on
+ <literal>CPT0</literal>.</para>
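+ <para>For instance, assuming the interfaces
+ <literal>ib0</literal> and
+ <literal>eth0</literal> exist, both bindings could be expressed together
+ as a hypothetical LNet module option in
+ <literal>/etc/modprobe.d/lustre.conf</literal>:</para>
+ <screen>
+options lnet networks="o2ib0(ib0)[0,1],tcp1(eth0)[0]"
+</screen>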
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network interface credits</secondary>
+ </indexterm>Network Interface Credits</title>
+ <para>Network interface (NI) credits are shared across all CPU partitions
+ (CPT). For example, if a machine has four CPTs and the number of NI
+ credits is 512, then each partition has 128 credits. If a large number of
+ CPTs exist on the system, LNet checks and validates the NI credits for
+ each CPT to ensure each CPT has a workable number of credits. For
+ example, if a machine has 16 CPTs and the number of NI credits is 256,
+ then each partition only has 16 credits. 16 NI credits is low and could
+ negatively impact performance. As a result, LNet automatically adjusts
+ the credits to 8 *
+ <literal>peer_credits</literal>
+ (<literal>peer_credits</literal> is 8 by default), so each partition has
+ 64 credits.</para>
+ <para>Increasing the number of
+ <literal>credits</literal>/
+ <literal>peer_credits</literal> can improve the performance of high
+ latency networks (at the cost of consuming more memory) by enabling LNet
+ to send more inflight messages to a specific network/peer and keep the
+ pipeline saturated.</para>
+ <para>An administrator can modify the NI credit count using the
+ <literal>ksocklnd</literal> or
+ <literal>ko2iblnd</literal> module parameters. In the example below, 256
+ credits are applied to TCP connections.</para>
+ <screen>
+options ksocklnd credits=256
+</screen>
+ <para>Applying 256 credits to IB connections can be achieved with:</para>
+ <screen>
+options ko2iblnd credits=256
+</screen>
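+ <para>Since NI credits interact with
+ <literal>peer_credits</literal>, the two are often raised together on
+ high-latency networks. The following modprobe entry is an illustrative
+ sketch only; the values shown are assumptions, not recommendations:</para>
+ <screen>
+options ko2iblnd credits=1024 peer_credits=16
+</screen>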
+ <note>
+ <para>LNet may revalidate the NI credits, so the administrator's
+ request may not persist.</para>
+ </note>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>router buffers</secondary>
+ </indexterm>Router Buffers</title>
+ <para>When a node is set up as an LNet router, three pools of buffers are
+ allocated: tiny, small and large. These pools are allocated per CPU
+ partition and are used to buffer messages that arrive at the router to be
+ forwarded to the next hop. The three different buffer sizes accommodate
+ different size messages.</para>
+ <para>If a message arrives that can fit in a tiny buffer, then a tiny
+ buffer is used. If a message does not fit in a tiny buffer but fits in a
+ small buffer, then a small buffer is used. Finally, if a message does not
+ fit in either a tiny buffer or a small buffer, a large buffer is
+ used.</para>
+ <para>Router buffers are shared by all CPU partitions. For a machine with
+ a large number of CPTs, the router buffer number may need to be specified
+ manually for best performance. A low number of router buffers risks
+ starving the CPU partitions of resources.</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>tiny_router_buffers</literal>: Zero payload buffers used for
+ signals and acknowledgements.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>small_router_buffers</literal>: 4 KB payload buffers for
+ small messages</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>large_router_buffers</literal>: 1 MB maximum payload
+ buffers, corresponding to the recommended RPC size of 1 MB.</para>
+ </listitem>
+ </itemizedlist>
+ <para>The default setting for router buffers typically results in
+ acceptable performance. LNet automatically sets a default value to reduce
+ the likelihood of resource starvation. The size of a router buffer can be
+ modified as shown in the example below. In this example, the size of the
+ large buffer is modified using the
+ <literal>large_router_buffers</literal> parameter.</para>
+ <screen>
+options lnet large_router_buffers=8192
+</screen>
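+ <para>All three pools can be sized together on the
+ <literal>lnet</literal> module. A hypothetical
+ <literal>/etc/modprobe.d/lustre.conf</literal> entry, with illustrative
+ values only, might be:</para>
+ <screen>
+options lnet tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=8192
+</screen>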
+ <note>
+ <para>LNet may revalidate the router buffer setting, so the
+ administrator's request may not persist.</para>
+ </note>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>portal round-robin</secondary>
+ </indexterm>Portal Round-Robin</title>
+ <para>Portal round-robin defines the policy LNet applies to deliver
+ events and messages to the upper layers. The upper layers are the PTLRPC
+ service and LNet selftest.</para>
+ <para>If portal round-robin is disabled, LNet will deliver messages to
+ CPTs based on a hash of the source NID. Hence, all messages from a
+ specific peer will be handled by the same CPT. This can reduce data
+ traffic between CPUs. However, for some workloads, this behavior may
+ result in a poorly balanced load across CPUs.</para>
+ <para>If portal round-robin is enabled, LNet will round-robin incoming
+ events across all CPTs. This may balance load better across the CPUs but
+ can incur cross-CPU overhead.</para>
+ <para>The current policy can be changed by an administrator with
+ <literal>echo
+ <replaceable>value</replaceable> >
+ /proc/sys/lnet/portal_rotor</literal>; an example follows the list
+ below. There are four options for
+ <literal>
+ <replaceable>value</replaceable>
+ </literal>:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>OFF</literal>
+ </para>
+ <para>Disable portal round-robin on all incoming requests.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ON</literal>
+ </para>
+ <para>Enable portal round-robin on all incoming requests.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>RR_RT</literal>
+ </para>
+ <para>Enable portal round-robin only for routed messages.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>HASH_RT</literal>
+ </para>
+ <para>Routed messages will be delivered to the upper layer by hash of
+ the source NID (instead of the NID of the router). This is the default
+ value.</para>
+ </listitem>
+ </itemizedlist>
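+ <para>For example, to enable round-robin only for routed messages and
+ then confirm the setting (a minimal sketch using the proc interface
+ described above):</para>
+ <screen>
+# echo RR_RT > /proc/sys/lnet/portal_rotor
+# cat /proc/sys/lnet/portal_rotor
+</screen>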
+ </section>
+ <section>
+ <title>LNet Peer Health</title>
+ <para>Two options are available to help determine peer health:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>peer_timeout</literal> - The timeout (in seconds) before an
+ aliveness query is sent to a peer. For example, if
+ <literal>peer_timeout</literal> is set to
+ <literal>180sec</literal>, an aliveness query is sent to the peer
+ every 180 seconds. This feature only takes effect if the node is
+ configured as an LNet router.</para>
+ <para>In a routed environment, the
+ <literal>peer_timeout</literal> feature should always be on (set to a
+ value in seconds) on routers. If the router checker has been enabled,
+ the feature should be turned off by setting it to 0 on clients and
+ servers.</para>
+ <para>For a non-routed scenario, enabling the
+ <literal>peer_timeout</literal> option provides health information
+ such as whether a peer is alive or not. For example, a client is able
+ to determine if an MGS or OST is up when it sends it a message. If a
+ response is received, the peer is alive; otherwise a timeout occurs
+ when the request is made.</para>
+ <para>In general,
+ <literal>peer_timeout</literal> should be set to no less than the LND
+ timeout setting. For more information about LND timeouts, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="section_c24_nt5_dl" />.</para>
+ <para>When the
+ <literal>o2iblnd</literal> (IB) driver is used,
+ <literal>peer_timeout</literal> should be at least twice the value of
+ the
+ <literal>ko2iblnd</literal> keepalive option. For more information
+ about keepalive options, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="section_ngq_qhy_zl" />.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>avoid_asym_router_failure</literal> - When set to 1, the
+ router checker running on a client or server periodically pings
+ all the routers corresponding to the NIDs identified in the routes
+ parameter setting on the node to determine the status of each router
+ interface. The default setting is 1. (For more information about the
+ LNet routes parameter, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="lnet_module_routes" />.) A combined example follows this
+ list.</para>
+ <para>A router is considered down if any of its NIDs are down. For
+ example, router X has three NIDs:
+ <literal>Xnid1</literal>,
+ <literal>Xnid2</literal>, and
+ <literal>Xnid3</literal>. A client is connected to the router via
+ <literal>Xnid1</literal>. The client has router checker enabled. The
+ router checker periodically sends a ping to the router via
+ <literal>Xnid1</literal>. The router responds to the ping with the
+ status of each of its NIDs. In this case, it responds with
+ <literal>Xnid1=up</literal>,
+ <literal>Xnid2=up</literal>,
+ <literal>Xnid3=down</literal>. If
+ <literal>avoid_asym_router_failure==1</literal>, the router is
+ considered down if any of its NIDs are down, so router X is
+ considered down and will not be used for routing messages. If
+ <literal>avoid_asym_router_failure==0</literal>, router X will
+ continue to be used for routing messages.</para>
+ </listitem>
+ </itemizedlist></para>
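+ <para>Both options are set as module parameters. The entries below are
+ a sketch for a router node using the IB LND; the values are
+ illustrative assumptions only:</para>
+ <screen>
+options lnet avoid_asym_router_failure=1
+options ko2iblnd peer_timeout=180
+</screen>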
+ <para>The following router checker parameters must be set to the maximum
+ value of the corresponding setting for this option on any client or
+ server:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>dead_router_check_interval</literal>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>live_router_check_interval</literal>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>router_ping_timeout</literal>
+ </para>
+ </listitem>
+ </itemizedlist></para>
+ <para>For example, the
+ <literal>dead_router_check_interval</literal> parameter on any router must
+ be set to the maximum
+ <literal>dead_router_check_interval</literal> value configured on any
+ client or server.</para>
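+ <para>As an illustration only (the values are assumptions, not
+ recommendations), these parameters can be set on the
+ <literal>lnet</literal> module:</para>
+ <screen>
+options lnet live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=50
+</screen>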
+ </section>
+ </section>
+ <section xml:id="dbdoclet.libcfstuning">
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>libcfs</secondary>
+ </indexterm>libcfs Tuning</title>
+ <para>Lustre allows binding service threads via CPU Partition Tables
+ (CPTs). This allows the system administrator to fine-tune on which CPU
+ cores the Lustre service threads are run, for both OSS and MDS services,
+ as well as on the client.
+ </para>
+ <para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
+ system functions such as system monitoring, HA heartbeat, or similar
+ tasks. On the client it may be useful to restrict Lustre RPC service
+ threads to a small subset of cores so that they do not interfere with
+ computation, or because these cores are directly attached to the network
+ interfaces.
+ </para>
+ <para>By default, the Lustre software will automatically generate CPU
+ partitions (CPT) based on the number of CPUs in the system.
+ The CPT count can be explicitly set on the libcfs module using
+ <literal>cpu_npartitions=<replaceable>NUMBER</replaceable></literal>.
+ The value of <literal>cpu_npartitions</literal> must be an integer between
+ 1 and the number of online CPUs.
+ </para>
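+ <para>For example, to explicitly create four CPU partitions, the
+ following hypothetical line could be added to
+ <literal>/etc/modprobe.d/lustre.conf</literal>:</para>
+ <screen>options libcfs cpu_npartitions=4</screen>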
+ <para condition='l29'>In Lustre 2.9 and later the default is to use
+ one CPT per NUMA node. In earlier versions of Lustre, by default there
+ was a single CPT if the online CPU core count was four or fewer, and
+ additional CPTs would be created depending on the number of CPU cores,
+ typically with 4-8 cores per CPT.
+ </para>
+ <tip>
+ <para>Setting <literal>cpu_npartitions=1</literal> will disable most
+ of the SMP Node Affinity functionality.</para>
+ </tip>
+ <section>
+ <title>CPU Partition String Patterns</title>
+ <para>CPU partitions can be described using string pattern notation.
+ If <literal>cpu_pattern=N</literal> is used, then there will be one
+ CPT for each NUMA node in the system, with each CPT mapping all of
+ the CPU cores for that NUMA node.
+ </para>
+ <para>It is also possible to explicitly specify the mapping between
+ CPU cores and CPTs, for example:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>cpu_pattern="0[2,4,6] 1[3,5,7]</literal>
+ </para>
+ <para>Create two CPTs: CPT0 contains cores 2, 4, and 6, while CPT1
+ contains cores 3, 5, and 7. CPU cores 0 and 1 will not be used by Lustre
+ service threads, and could be used for node services such as
+ system monitoring, HA heartbeat threads, etc. The binding of
+ non-Lustre services to those CPU cores may be done in userspace
+ using <literal>numactl(8)</literal> or other application-specific
+ methods, but is beyond the scope of this document.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>cpu_pattern="N 0[0-3] 1[4-7]</literal>
+ </para>
+ <para>Create two CPTs, with CPT0 containing all CPUs in NUMA
+ nodes [0-3], while CPT1 contains all CPUs in NUMA nodes [4-7].</para>
+ </listitem>
+ </itemizedlist>
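+ <para>As with
+ <literal>cpu_npartitions</literal>, these patterns are passed as libcfs
+ module options, for example:</para>
+ <screen>options libcfs cpu_pattern="0[2,4,6] 1[3,5,7]"</screen>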
+ <para>The current configuration of the CPU partitions can be read via
+ <literal>lctl get_param cpu_partition_table</literal>. For example,
+ a simple 4-core system has a single CPT with all four CPU cores:
+ <screen>$ lctl get_param cpu_partition_table
+cpu_partition_table=0 : 0 1 2 3</screen>
+ while a larger NUMA system with four 12-core CPUs may have four CPTs:
+ <screen>$ lctl get_param cpu_partition_table
+cpu_partition_table=
+0 : 0 1 2 3 4 5 6 7 8 9 10 11
+1 : 12 13 14 15 16 17 18 19 20 21 22 23
+2 : 24 25 26 27 28 29 30 31 32 33 34 35
+3 : 36 37 38 39 40 41 42 43 44 45 46 47
+</screen>
+ </para>
+ </section>
+ </section>
+ <section xml:id="dbdoclet.lndtuning">
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>LND tuning</secondary>
+ </indexterm>LND Tuning</title>
+ <para>LND tuning allows the number of threads per CPU partition to be
+ specified. An administrator can set the threads for both
+ <literal>ko2iblnd</literal> and
+ <literal>ksocklnd</literal> using the
+ <literal>nscheds</literal> parameter. This adjusts the number of threads for
+ each partition, not the overall number of threads on the LND.</para>
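+ <para>For example, a hypothetical sketch raising the scheduler thread
+ count to four threads per CPT for the
+ <literal>ko2iblnd</literal> LND:</para>
+ <screen>
+options ko2iblnd nscheds=4
+</screen>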
+ <note>
+ <para>Lustre software release 2.3 has greatly decreased the default
+ number of threads for
+ <literal>ko2iblnd</literal> and
+ <literal>ksocklnd</literal> on high-core count machines. The current
+ default values are automatically set and are chosen to work well across a
+ number of typical scenarios.</para>
+ </note>
+ <section>
+ <title>ko2iblnd Tuning</title>
+ <para>The following table outlines the ko2iblnd module parameters to be used
+ for tuning:</para>
+ <informaltable frame="all">
+ <tgroup cols="3">
+ <colspec colname="c1" colwidth="50*" />
+ <colspec colname="c2" colwidth="50*" />
+ <colspec colname="c3" colwidth="50*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Module Parameter</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Default Value</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>service</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>987</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Service number (within RDMA_PS_TCP).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>cksum</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Set non-zero to enable message (not RDMA) checksums.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>50</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Timeout in seconds.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>nscheds</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of threads in each scheduler pool (per CPT). A value of
+ zero means the number is derived from the number of cores.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>conns_per_peer</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>4 (OmniPath), 1 (Everything else)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of connections to each peer. Messages
+ are sent round-robin over the connection pool. Provides significant
+ improvement with OmniPath.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ntx</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of message descriptors allocated for each pool at
+ startup. Grows at runtime. Shared by all CPTs.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>256</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends on network.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>8</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends to 1 peer. Related/limited by IB
+ queue size.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits_hiw</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Threshold (high water mark) for when to eagerly return
+ credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_buffer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of per-peer router buffer credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>180</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Seconds without aliveness news to declare peer dead (less than
+ or equal to 0 to disable).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ipif_name</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>ib0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IPoIB interface name.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>5</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Retransmissions when no ACK received.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>rnr_retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>6</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>RNR retransmissions.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>keepalive</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>100</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Idle time in seconds before sending a keepalive.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ib_mtu</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IB MTU 256/512/1024/2048/4096.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>concurrent_sends</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Send work-queue sizing. If zero, derived from
+ <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>map_on_demand</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0 (pre-4.8 Linux) 1 (4.8 Linux onward) 32 (OmniPath)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of fragments reserved for connection. If zero, use a
+ global memory region (found to be a security issue). If non-zero, use
+ FMR or FastReg for memory registration. The value needs to agree
+ between both peers of a connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_pool_size</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Size of fmr pool on each CPT (>= ntx / 4). Grows at runtime.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_flush_trigger</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>384</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of dirty FMRs that triggers a pool flush.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_cache</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Non-zero to enable FMR caching.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>dev_failover</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>require_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Require privileged port when accepting connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>use_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Use privileged port when initiating connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>wrq_sge</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>2</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of scatter/gather element groups per
+ work request. Used to deal with fragmentation, which can consume
+ double the number of work requests.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
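+ <para>Several of these parameters are typically adjusted together. As
+ a hedged illustration for an OmniPath fabric, combining the
+ OmniPath-specific defaults noted in the table above:</para>
+ <screen>
+options ko2iblnd conns_per_peer=4 map_on_demand=32
+</screen>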
+ </section>
+ </section>
+ <section xml:id="dbdoclet.nrstuning" condition='l24'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ </indexterm>Network Request Scheduler (NRS) Tuning</title>
+ <para>The Network Request Scheduler (NRS) allows the administrator to
+ influence the order in which RPCs are handled at servers, on a per-PTLRPC
+ service basis, by providing different policies that can be activated and
+ tuned in order to influence the RPC ordering. The aim of this is to provide
+ for better performance, and possibly discrete performance characteristics
+ using future policies.</para>
+ <para>The NRS policy state of a PTLRPC service can be read and set via the
+ <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
+ service's NRS policy state, run:</para>
+ <screen>
+lctl get_param {service}.nrs_policies
+</screen>
+ <para>For example, to read the NRS policy state of the
+ <literal>ost_io</literal> service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_policies
+ost.OSS.ost_io.nrs_policies=
+
+regular_requests:
+ - name: fifo
+ state: started
+ fallback: yes
+ queued: 0
+ active: 0
+
+ - name: crrn
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: orr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: trr
+ state: started
+ fallback: no
+ queued: 2420
+ active: 268
+
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+high_priority_requests:
+ - name: fifo
+ state: started
+ fallback: yes
+ queued: 0
+ active: 0
+
+ - name: crrn
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: orr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: trr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+</screen>
+ <para>NRS policy state is shown in either one or two sections, depending on
+ the PTLRPC service being queried. The first section is named
+ <literal>regular_requests</literal> and is available for all PTLRPC
+ services, optionally followed by a second section which is named
+ <literal>high_priority_requests</literal>. This is because some PTLRPC
+ services are able to treat some types of RPCs as higher priority ones, such
+ that they are handled by the server with higher priority compared to other,
+ regular RPC traffic. For PTLRPC services that do not support high-priority
+ RPCs, you will only see the
+ <literal>regular_requests</literal> section.</para>
+ <para>There is a separate instance of each NRS policy on each PTLRPC
+ service for handling regular and high-priority RPCs (if the service
+ supports high-priority RPCs). For each policy instance, the following
+ fields are shown:</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*" />
+ <colspec colname="c2" colwidth="50*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Field</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>name</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The name of the policy.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>state</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The state of the policy; this can be any of
+ <literal>invalid, stopping, stopped, starting, started</literal>.
+ A fully enabled policy is in the
+ <literal>started</literal> state.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fallback</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Whether the policy is acting as a fallback policy or not. A
+ fallback policy is used to handle RPCs that other enabled
+ policies fail to handle, or do not support the handling of. The
+ possible values are
+ <literal>no, yes</literal>. Currently, only the FIFO policy can
+ act as a fallback policy.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>queued</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The number of RPCs that the policy has waiting to be
+ serviced.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>active</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The number of RPCs that the policy is currently
+ handling.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ <para>To enable an NRS policy on a PTLRPC service run:</para>
+ <screen>
+lctl set_param {service}.nrs_policies=
+<replaceable>policy_name</replaceable>
+</screen>
+ <para>This will enable the policy
+ <replaceable>policy_name</replaceable> for both regular and high-priority
+ RPCs (if the PTLRPC service supports high-priority RPCs) on the given
+ service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
+ service, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
+ldlm.services.ldlm_cbd.nrs_policies=crrn
+
+</screen>
+ <para>For PTLRPC services that support high-priority RPCs, you can also
+ supply an optional
+ <replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
+ for handling only regular or high-priority RPCs on a given PTLRPC service,
+ by running:</para>
+ <screen>
+lctl set_param {service}.nrs_policies="
+<replaceable>policy_name</replaceable>
+<replaceable>reg|hp</replaceable>"
+</screen>
+ <para>For example, to enable the TRR policy for handling only regular, but
+ not high-priority RPCs on the
+ <literal>ost_io</literal> service, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
+ost.OSS.ost_io.nrs_policies="trr reg"
+
+</screen>
+ <note>
+ <para>When enabling an NRS policy, the policy name must be given in
+ lower-case characters, otherwise the operation will fail with an error
+ message.</para>
+ </note>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>first in, first out (FIFO) policy</tertiary>
+ </indexterm>First In, First Out (FIFO) policy</title>
+ <para>The first in, first out (FIFO) policy handles RPCs in a service in
+ the same order as they arrive from the LNet layer, so no special
+ processing takes place to modify the RPC handling stream. FIFO is the
+ default policy for all types of RPCs on all PTLRPC services, and is
+ always enabled irrespective of the state of other policies, so that it
+ can be used as a backup policy, in case a more elaborate policy that has
+ been enabled fails to handle an RPC, or does not support handling a given
+ type of RPC.</para>
+ <para>The FIFO policy has no tunables that adjust its behaviour.</para>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
+ </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
+ <para>The client round-robin over NIDs (CRR-N) policy performs batched
+ round-robin scheduling of all types of RPCs, with each batch consisting
+ of RPCs originating from the same client node, as identified by its NID.
+ CRR-N aims to provide for better resource utilization across the cluster,
+ and to help shorten completion times of jobs in some cases, by
+ distributing available bandwidth more evenly across all clients.</para>
+ <para>The CRR-N policy can be enabled on all types of PTLRPC services,
+ and has the following tunable that can be used to adjust its
+ behavior:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_crrn_quantum</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
+ maximum allowed size of each batch of RPCs; the unit of measure is in
+ number of RPCs. To read the maximum allowed batch size of a CRR-N
+ policy, run:</para>
+ <screen>
+lctl get_param {service}.nrs_crrn_quantum
+</screen>
+ <para>For example, to read the maximum allowed batch size of a CRR-N
+ policy on the ost_io service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
+ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
+hp_quantum:8
+
+</screen>
+ <para>You can see that there is a separate maximum allowed batch size
+ value for regular (
+ <literal>reg_quantum</literal>) and high-priority (
+ <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
+ high-priority RPCs).</para>
+ <para>To set the maximum allowed batch size of a CRR-N policy on a
+ given service, run:</para>
+ <screen>
+lctl set_param {service}.nrs_crrn_quantum=
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>This will set the maximum allowed batch size on a given
+ service, for both regular and high-priority RPCs (if the PTLRPC
+ service supports high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the maximum allowed batch size on the
+ ldlm_canceld service to 16 RPCs, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
+ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
+
+</screen>
+ <para>For PTLRPC services that support high-priority RPCs, you can
+ also specify a different maximum allowed batch size for regular and
+ high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param {service}.nrs_crrn_quantum=
+<replaceable>reg_quantum|hp_quantum</replaceable>:
+<replaceable>1-65535</replaceable>"
+</screen>
+ <para>For example, to set the maximum allowed batch size on the
+ ldlm_canceld service, for high-priority RPCs to 32, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
+ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
+
+</screen>
+ <para>By using the last method, you can also set the maximum regular
+ and high-priority RPC batch sizes to different values, in a single
+ command invocation.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>object-based round-robin (ORR) policy</tertiary>
+ </indexterm>Object-based Round-Robin (ORR) policy</title>
+ <para>The object-based round-robin (ORR) policy performs batched
+ round-robin scheduling of bulk read write (brw) RPCs, with each batch
+ consisting of RPCs that pertain to the same backend-file system object,
+ as identified by its OST FID.</para>
+ <para>The ORR policy is only available for use on the ost_io service. The
+ RPC batches it forms can potentially consist of mixed bulk read and bulk
+ write RPCs. The RPCs in each batch are ordered in an ascending manner,
+ based on either the file offsets, or the physical disk offsets of each
+ RPC (only applicable to bulk read RPCs).</para>
+ <para>The aim of the ORR policy is to provide for increased bulk read
+ throughput in some cases, by ordering bulk read RPCs (and potentially
+ bulk write RPCs), and thus minimizing costly disk seek operations.
+ Performance may also benefit from any resulting improvement in resource
+ utilization, or by taking advantage of better locality of reference
+ between RPCs.</para>
+ <para>The ORR policy has the following tunables that can be used to
+ adjust its behaviour:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
+ the maximum allowed size of each batch of RPCs; the unit of measure
+ is in number of RPCs. To read the maximum allowed batch size of the
+ ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
+ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
+hp_quantum:16
+
+</screen>
+ <para>You can see that there is a separate maximum allowed batch size
+ value for regular (
+ <literal>reg_quantum</literal>) and high-priority (
+ <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
+ high-priority RPCs).</para>
+ <para>To set the maximum allowed batch size for the ORR policy,
+ run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>This will set the maximum allowed batch size for both regular
+ and high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different maximum allowed batch size for
+ regular and high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
+<replaceable>reg_quantum|hp_quantum</replaceable>:
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>For example, to set the maximum allowed batch size for regular
+ RPCs to 128, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
+ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
+
+</screen>
+ <para>By using the last method, you can also set the maximum regular
+ and high-priority RPC batch sizes to different values, in a single
+ command invocation.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
+ determines whether the ORR policy orders RPCs within each batch based
+ on logical file offsets or physical disk offsets. To read the offset
+ type value for the ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
+ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
+hp_offset_type:logical
+
+</screen>
+ <para>You can see that there is a separate offset type value for
+ regular (
+ <literal>reg_offset_type</literal>) and high-priority (
+ <literal>hp_offset_type</literal>) RPCs.</para>
+ <para>To set the ordering type for the ORR policy, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
+<replaceable>physical|logical</replaceable>
+</screen>
+ <para>This will set the offset type for both regular and
+ high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different offset type for regular and
+ high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
+<replaceable>reg_offset_type|hp_offset_type</replaceable>:
+<replaceable>physical|logical</replaceable>
+</screen>
+ <para>For example, to set the offset type for high-priority RPCs to
+ physical disk offsets, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
+ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
+</screen>
+ <para>By using the last method, you can also set offset type for
+ regular and high-priority RPCs to different values, in a single
+ command invocation.</para>
+ <note>
+ <para>Irrespective of the value of this tunable, only logical
+ offsets can be, and are, used for ordering bulk write RPCs.</para>
+ </note>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
+ the type of RPCs that the ORR policy will handle. To read the types
+ of supported RPCs by the ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_supported
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
+hp_supported:reads_and_writes
+
+</screen>
+ <para>You can see that there is a separate supported 'RPC types'
+ value for regular (
+ <literal>reg_supported</literal>) and high-priority (
+ <literal>hp_supported</literal>) RPCs.</para>
+ <para>To set the supported RPC types for the ORR policy, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
+<replaceable>reads|writes|reads_and_writes</replaceable>
+</screen>
+ <para>This will set the supported RPC types for both regular and
+ high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different supported 'RPC types' value
+ for regular and high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
+<replaceable>reg_supported|hp_supported</replaceable>:
+<replaceable>reads|writes|reads_and_writes</replaceable>
+</screen>
+ <para>For example, to set the supported RPC types to bulk read and
+ bulk write RPCs for regular requests, run:</para>
+ <screen>
+$ lctl set_param
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
+
+</screen>
+ <para>By using the last method, you can also set the supported RPC
+ types for regular and high-priority RPC to different values, in a
+ single command invocation.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Target-based round-robin (TRR) policy</tertiary>
+ </indexterm>Target-based Round-Robin (TRR) policy</title>
+ <para>The target-based round-robin (TRR) policy performs batched
+ round-robin scheduling of brw RPCs, with each batch consisting of RPCs
+ that pertain to the same OST, as identified by its OST index.</para>
+ <para>The TRR policy is identical to the object-based round-robin (ORR)
+ policy, apart from using the brw RPC's target OST index instead of the
+ backend-fs object's OST FID, for determining the RPC scheduling order.
+ The goals of TRR are effectively the same as for ORR, and it uses the
+ following tunables to adjust its behaviour:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
+ policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
+ ORR policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
+ ORR policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section xml:id="dbdoclet.tbftuning" condition='l26'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Token Bucket Filter (TBF) policy</tertiary>
+ </indexterm>Token Bucket Filter (TBF) policy</title>
+ <para>The TBF (Token Bucket Filter) is a Lustre NRS policy which enables
+ Lustre services to enforce the RPC rate limit on clients/jobs for QoS
+ (Quality of Service) purposes.</para>
+ <figure>
+ <title>The internal structure of TBF policy</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata scalefit="1" width="50%"
+ fileref="figures/TBF_policy.png" />
+ </imageobject>
+ <textobject>
+ <phrase>The internal structure of TBF policy</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>When an RPC request arrives, the TBF policy puts it into a waiting
+ queue according to its classification. The classification of RPC
+ requests is based on either the NID or the JobID of the RPC, depending
+ on how TBF is configured. The TBF policy maintains multiple queues in
+ the system, one queue for each category in the classification of RPC
+ requests. Requests wait for tokens in their FIFO queue before being
+ handled, so as to keep the RPC rates under the limits.</para>
+ <para>When Lustre services are too busy to handle all of the requests in
+ time, not all of the specified rates of the queues can be satisfied.
+ Nothing bad happens except that some of the RPC rates are slower than
+ configured. In this case, queues with higher rates have an advantage
+ over queues with lower rates, but none of them will be starved.</para>
+ <para>To manage the RPC rate of the queues, there is no need to set the
+ rate of each queue manually. Instead, rules are defined, which the TBF
+ policy matches to determine the RPC rate limits. All of the defined
+ rules are organized as an ordered list. Whenever a queue is newly
+ created, it goes through the rule list and takes the first matching
+ rule as its rule, so that the queue knows its RPC token rate. A rule
+ can be added to or removed from the list at run time. Whenever the list
+ of rules is changed, the queues update their matched rules.</para>
+ <section remap="h4">
+ <title>Enable TBF policy</title>
+ <para>Command:</para>
+ <screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf <<replaceable>policy</replaceable>>"
+ </screen>
+ <para>Currently, RPCs can be classified into different types
+ according to their NID, JobID, opcode, or UID/GID. When enabling the
+ TBF policy, you can specify one of these types, or just use "tbf" to
+ enable all of them for fine-grained RPC request classification.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
+ </section>
+ <section remap="h4">
+ <title>Start a TBF rule</title>
+ <para>The TBF rule is defined in the parameter
+ <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
+ </screen>
+ <para>'<replaceable>rule_name</replaceable>' is a string of the TBF
+ policy rule's name and '<replaceable>arguments</replaceable>' is a
+ string to specify the detailed rule according to the different types.
+ </para>
+ <itemizedlist>
+ <para>Next, the different types of TBF policies will be described.</para>
+ <listitem>
+ <para><emphasis role="bold">NID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> nid={<replaceable>nidlist</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>'<replaceable>nidlist</replaceable>' uses the same format
+ as when configuring an LNet route. '<replaceable>rate</replaceable>'
+ is the upper-limit RPC rate of the rule.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start other_clients nid={192.168.*.*@tcp} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+ <para>In this example, the rate of processing RPC requests from
+ compute nodes is at most 5x that of login nodes.
+ The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks
+ like:</para>
+ <screen>lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0
+high_priority_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start loginnode nid={192.168.1.1@tcp} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">JobID based TBF policy</emphasis></para>
+ <para>For the JobID, please see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="dbdoclet.jobstats" /> for more details.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Wildcards are supported in
+ {<replaceable>jobid_list</replaceable>}.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start dd_user jobid={dd.*} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={*.600} rate=10"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user2 jobid={io*.10* *.500} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 jobid={iozone.500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 opcode={ost_read} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg][hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
+$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg][hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
+ <para>Example:</para>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ <para>Limit the rate of RPC requests from GID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name gid={500} rate=100"</screen>
+ <para>Also, you can use the following rules to control all requests
+ to the MDS:</para>
+ <para>Start the TBF UID QoS on the MDS:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Policy combination</emphasis></para>
+ <para>To support TBF rules with complex expressions of conditions,
+ the TBF classifier has been extended to classify RPCs in a more
+ fine-grained way. This feature supports logical conjunction and
+ disjunction operations among the different types.
+ In a rule,
+ "&" represents conditional conjunction and
+ "," represents conditional disjunction.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start comp_rule opcode={ost_write}&jobid={dd.0},\
+nid={192.168.1.[1-128]@tcp 0@lo} rate=100"</screen>
+ <para>In this example, those RPCs whose <literal>opcode</literal> is
+ ost_write and <literal>jobid</literal> is dd.0, or whose
+ <literal>nid</literal> satisfies the condition
+ {192.168.1.[1-128]@tcp 0@lo}, will be processed at the rate of 100
+ req/sec.
+ The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks like:
+ </para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500}&gid={500} rate=100"</screen>
+ <para>In this example, those RPC requests whose uid is 500 and
+ gid is 500 will be processed at the rate of 100 req/sec.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section remap="h4">
+ <title>Change a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp change loginnode rate=200"
+</screen>
+ </section>
+ <section remap="h4">
+ <title>Stop a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
+<replaceable>rule_name</replaceable>"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"</screen>
+ </section>
+ <section remap="h4">
+ <title>Rule options</title>
+ <para>To support more flexible rule conditions, the following options
+ are added.</para>
+ <itemizedlist>
+ <listitem>
+ <para><emphasis role="bold">Reordering of TBF rules</emphasis></para>
+ <para>By default, a newly started rule takes precedence over older
+ ones, but by specifying the argument '<literal>rank=</literal>' when
+ inserting a new rule with the "<literal>start</literal>" command,
+ the rank of the rule can be changed. It can also be changed with the
+ "<literal>change</literal>" command.
+ </para>
+ <para>Command:</para>
+ <screen>lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... rank=<replaceable>obj_rule_name</replaceable>"
+lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable> rank=<replaceable>obj_rule_name</replaceable>"
+</screen>
+ <para>By specifying the existing rule
+ '<replaceable>obj_rule_name</replaceable>', the new rule
+ '<replaceable>rule_name</replaceable>' will be moved to the front of
+ '<replaceable>obj_rule_name</replaceable>'.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={iozone.500 dd.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes"</screen>
+ <para>In this example, rule "iozone_user1" is added to the front of
+ rule "computes". We can see the order by the following command:
+ </para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">TBF realtime policies under congestion
+ </emphasis></para>
+ <para>TBF evaluation shows that when the sum of the I/O bandwidth
+ requirements for all classes exceeds the system capacity, classes
+ with the same rate limit can receive less bandwidth than evenly
+ preconfigured. The reason is that the heavy load on a congested
+ server results in missed deadlines for some classes. The number of
+ calculated tokens may be larger than 1 during dequeuing. In the
+ original implementation, all classes are handled equally, and
+ exceeding tokens are simply discarded.</para>
+ <para>To address this, a Hard Token Compensation (HTC) strategy
+ has been implemented. The HTC feature is enabled for a class by
+ the rule it matches, and indicates that requests queued in that
+ class have strict real-time requirements, so the configured
+ bandwidth must be honored as closely as possible. When a deadline
+ miss occurs, the class keeps its deadline unchanged and the time
+ residue (the remainder of the elapsed time divided by 1/r) is
+ carried over into the next round. This ensures that the next idle
+ I/O thread will keep selecting this class until all accumulated
+ excess tokens are consumed or no pending requests remain in the
+ class queue.</para>
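+ <para>The difference between the two strategies can be sketched in
+ a few lines of Python. This is an illustrative model only, not
+ Lustre source; the <literal>TbfClass</literal> name, its members,
+ and the batch consumption of tokens are assumptions:</para>
+ <screen>
+class TbfClass:
+    def __init__(self, rate, realtime=False):
+        self.interval = 1.0 / rate   # ideal request spacing, 1/r seconds
+        self.deadline = 0.0          # time the next token becomes due
+        self.realtime = realtime     # rule started with realtime=1 (HTC)
+
+    def tokens(self, now):
+        """Number of requests that may be dequeued at time now."""
+        if self.deadline > now:
+            return 0
+        # Deadline missed: more than one token may have accumulated.
+        ntoken = 1 + int((now - self.deadline) / self.interval)
+        if self.realtime:
+            # HTC: advance the deadline in exact 1/r steps, so the
+            # residue of the elapsed time is credited to the next
+            # round and no accumulated tokens are lost.
+            self.deadline += ntoken * self.interval
+        else:
+            # Default: serve one request and discard the excess by
+            # re-anchoring the deadline to the current time.
+            ntoken = 1
+            self.deadline = now + self.interval
+        return ntoken</screen>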
+ <para>Command:</para>
+ <para>A new command format is added to enable the realtime feature
+ for a rule:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=\
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start realjob jobid={dd.0} rate=100 realtime=1"</screen>
+ <para>With this rule, RPC requests whose JobID is dd.0 are
+ processed as a realtime class at a rate of 100 requests per
+ second.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ </section>
+ <section xml:id="dbdoclet.delaytuning" condition='l2A'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Delay policy</tertiary>
+ </indexterm>Delay policy</title>
+ <para>The NRS Delay policy seeks to perturb the timing of request
+ processing at the PtlRPC layer, with the goal of simulating high
+ server load and finding and exposing timing-related problems. When
+ this policy is active, upon arrival of a request the policy
+ calculates an offset from the request arrival time, within a
+ defined, user-configurable range, to determine a time after which
+ the request should be handled. The request is then stored using the
+ cfs_binheap implementation, which sorts requests according to their
+ assigned start times. Requests are removed from the binheap for
+ handling once their start times have passed.</para>
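+ <para>The following sketch models this behavior in Python: each
+ arriving request is assigned a start time offset by a value within
+ [<literal>nrs_delay_min</literal>,
+ <literal>nrs_delay_max</literal>] and held in a binary heap until
+ that time passes. It is an illustrative model, not Lustre source;
+ the <literal>DelayQueue</literal> name and the uniform distribution
+ of offsets are assumptions:</para>
+ <screen>
+import heapq
+import itertools
+import random
+
+class DelayQueue:
+    def __init__(self, delay_min=5.0, delay_max=300.0):
+        self.delay_min = delay_min    # like nrs_delay_min, seconds
+        self.delay_max = delay_max    # like nrs_delay_max, seconds
+        self.heap = []                # (start_time, seqno, request)
+        self.seq = itertools.count()  # tie-breaker for equal starts
+
+    def enqueue(self, request, now):
+        # Pick a handling time offset from arrival; the distribution
+        # used by the real policy is an assumption here.
+        start = now + random.uniform(self.delay_min, self.delay_max)
+        heapq.heappush(self.heap, (start, next(self.seq), request))
+
+    def dequeue(self, now):
+        # Hand out a request only once its start time has passed.
+        if self.heap and now >= self.heap[0][0]:
+            return heapq.heappop(self.heap)[2]
+        return None</screen>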
+ <para>The Delay policy can be enabled on all types of PtlRPC services,
+ and has the following tunables that can be used to adjust its behavior:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_min</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_min</literal> tunable controls the
+ minimum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 5 seconds. To read this value run:</para>
+ <screen>
+lctl get_param {service}.nrs_delay_min</screen>
+ <para>For example, to read the minimum delay set on the ost_io
+ service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_min
+ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5
+hp_delay_min:5</screen>
+ <para>To set the minimum delay in RPC processing, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>0-65535</replaceable></screen>
+ <para>This will set the minimum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the minimum delay time on the ost_io service
+ to 10, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
+ost.OSS.ost_io.nrs_delay_min=10</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different minimum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>reg_delay_min|hp_delay_min</replaceable>:<replaceable>0-65535</replaceable>
+ </screen>
+ <para>For example, to set the minimum delay time on the ost_io service
+ for high-priority RPCs to 3, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3
+ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3</screen>
+ <para>Note that, in all cases, the minimum delay time cannot
+ exceed the maximum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_max</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_max</literal> tunable controls the
+ maximum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 300 seconds. To read this value run:
+ </para>
+ <screen>lctl get_param {service}.nrs_delay_max</screen>
+ <para>For example, to read the maximum delay set on the ost_io
+ service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_max
+ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300
+hp_delay_max:300</screen>
+ <para>To set the maximum delay in RPC processing, run:</para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>0-65535</replaceable>
+</screen>
+ <para>This will set the maximum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the maximum delay time on the ost_io service
+ to 60, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
+ost.OSS.ost_io.nrs_delay_max=60</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different maximum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>reg_delay_max|hp_delay_max</replaceable>:<replaceable>0-65535</replaceable></screen>
+ <para>For example, to set the maximum delay time on the ost_io service
+ for high-priority RPCs to 30, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30
+ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30</screen>
+ <para>Note that, in all cases, the maximum delay time cannot be
+ less than the minimum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_pct</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_pct</literal> tunable controls the
+ percentage of requests that will be delayed by this policy. The
+ default is 100. Note that when a request is not selected for
+ handling by the delay policy due to this tunable, it is handled by
+ whatever fallback policy is defined for that service; if no
+ fallback policy is defined, the request is handled by the FIFO
+ policy. A sketch of this selection appears at the end of this
+ entry. To read this value run:</para>
+ <screen>lctl get_param {service}.nrs_delay_pct</screen>
+ <para>For example, to read the percentage of requests being delayed on
+ the ost_io service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_pct
+ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100
+hp_delay_pct:100</screen>
+ <para>To set the percentage of delayed requests, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_pct=<replaceable>0-100</replaceable></screen>
+ <para>This will set the percentage of requests delayed on a given
+ service, for both regular and high-priority RPCs (if the PtlRPC service
+ supports high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service to 50, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50
+ost.OSS.ost_io.nrs_delay_pct=50
+</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different delay percentage for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_pct=<replaceable>reg_delay_pct|hp_delay_pct</replaceable>:<replaceable>0-100</replaceable>
+</screen>
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service for high-priority RPCs to 5, run:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+</screen>
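+ <para>The selection described above can be modeled as follows.
+ This is an illustrative sketch, not Lustre source; the
+ <literal>dispatch</literal> helper and the queue objects are
+ assumptions:</para>
+ <screen>
+import collections
+import random
+
+def dispatch(request, delayed, fifo, delay_pct=100):
+    # Take the request for delaying with probability delay_pct/100;
+    # otherwise it falls through to the fallback (FIFO) queue.
+    if delay_pct > random.randrange(100):
+        delayed.append(request)    # handled by the Delay policy
+    else:
+        fifo.append(request)       # fallback: plain arrival order
+
+delayed = []                       # stands in for the delay binheap
+fifo = collections.deque()         # requests served in arrival order
+dispatch("req-1", delayed, fifo, delay_pct=50)</screen>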
+ </listitem>
+ </itemizedlist>