+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>LND tuning</secondary>
+ </indexterm>LND Tuning</title>
+ <para>LND tuning allows the number of threads per CPU partition to be
+ specified. An administrator can set the threads for both
+ <literal>ko2iblnd</literal> and
+ <literal>ksocklnd</literal> using the
+ <literal>nscheds</literal> parameter. This adjusts the number of threads for
+ each partition, not the overall number of threads on the LND.</para>
+ <note>
+ <para>The default number of threads for
+ <literal>ko2iblnd</literal> and
+ <literal>ksocklnd</literal> is set automatically and is chosen to
+ work well across a range of typical scenarios, on systems with both
+ high and low core counts.</para>
+ </note>
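+ <para>As LND module parameters, these values are set at module load time.
+ A minimal sketch of a modprobe configuration that sets
+ <literal>nscheds</literal> for both LNDs might look like the following
+ (the file name is illustrative; any file under
+ <literal>/etc/modprobe.d/</literal> works):</para>
+ <screen>
+# /etc/modprobe.d/lustre-lnd.conf (example file name)
+options ko2iblnd nscheds=4
+options ksocklnd nscheds=4
+</screen>
+ <para>The modules must be reloaded (or the node rebooted) for the new
+ values to take effect.</para>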
+ <section>
+ <title>ko2iblnd Tuning</title>
+ <para>The following table outlines the ko2iblnd module parameters to be used
+ for tuning:</para>
+ <informaltable frame="all">
+ <tgroup cols="3">
+ <colspec colname="c1" colwidth="50*" />
+ <colspec colname="c2" colwidth="50*" />
+ <colspec colname="c3" colwidth="50*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Module Parameter</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Default Value</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>service</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>987</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Service number (within RDMA_PS_TCP).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>cksum</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Set non-zero to enable message (not RDMA) checksums.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>50</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Timeout in seconds.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>nscheds</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of threads in each scheduler pool (per CPT). A value of
+ zero means the thread count is derived from the number of cores.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>conns_per_peer</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>4 (OmniPath), 1 (Everything else)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of connections to each peer. Messages
+ are sent round-robin over the connection pool. Provides significant
+ improvement with OmniPath.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ntx</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of message descriptors allocated for each pool at
+ startup. Grows at runtime. Shared by all CPTs.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>256</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends on network.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>8</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends to a single peer. Related to, and
+ limited by, the IB queue size.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits_hiw</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>High water mark that determines when to eagerly return
+ credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_buffer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of per-peer router buffer credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>180</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of seconds without aliveness news before declaring a peer
+ dead (a value less than or equal to 0 disables this).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ipif_name</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>ib0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IPoIB interface name.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>5</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of retransmissions when no ACK is received.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>rnr_retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>6</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of receiver-not-ready (RNR) retransmissions.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>keepalive</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>100</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Idle time in seconds before sending a keepalive.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ib_mtu</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IB MTU 256/512/1024/2048/4096.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>concurrent_sends</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Send work-queue sizing. If zero, derived from
+ <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>map_on_demand</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0 (pre-4.8 Linux) 1 (4.8 Linux onward) 32 (OmniPath)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of fragments reserved for a connection. If zero, use a
+ global memory region (found to be a security issue). If non-zero, use
+ FMR or FastReg for memory registration. The value must agree between
+ both peers of a connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_pool_size</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Size of the FMR pool on each CPT (>= ntx / 4). Grows at runtime.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_flush_trigger</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>384</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of dirty FMRs that triggers a pool flush.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_cache</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Non-zero to enable FMR caching.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>dev_failover</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>require_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Require privileged port when accepting connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>use_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Use privileged port when initiating connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>wrq_sge</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>2</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of scatter/gather element groups per
+ work request. Used to deal with fragmentation, which can consume
+ double the number of work requests.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
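+ <para>The current value of any of these parameters can be inspected at
+ runtime through sysfs, provided the <literal>ko2iblnd</literal> module is
+ loaded and the parameter was declared readable. For example, on a system
+ using the defaults shown in the table above:</para>
+ <screen>
+$ cat /sys/module/ko2iblnd/parameters/peer_credits
+8
+$ cat /sys/module/ko2iblnd/parameters/conns_per_peer
+1
+</screen>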
+ </section>
+ </section>
+ <section xml:id="dbdoclet.nrstuning">
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ </indexterm>Network Request Scheduler (NRS) Tuning</title>
+ <para>The Network Request Scheduler (NRS) allows the administrator to
+ influence the order in which RPCs are handled at servers, on a per-PTLRPC
+ service basis, by providing different policies that can be activated and
+ tuned in order to influence the RPC ordering. The aim of this is to provide
+ for better performance, and possibly discrete performance characteristics
+ using future policies.</para>
+ <para>The NRS policy state of a PTLRPC service can be read and set via the
+ <literal>{service}.nrs_policies</literal> tunable. To read a PTLRPC
+ service's NRS policy state, run:</para>
+ <screen>
+lctl get_param {service}.nrs_policies
+</screen>
+ <para>For example, to read the NRS policy state of the
+ <literal>ost_io</literal> service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_policies
+ost.OSS.ost_io.nrs_policies=
+
+regular_requests:
+ - name: fifo
+ state: started
+ fallback: yes
+ queued: 0
+ active: 0
+
+ - name: crrn
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: orr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: trr
+ state: started
+ fallback: no
+ queued: 2420
+ active: 268
+
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+high_priority_requests:
+ - name: fifo
+ state: started
+ fallback: yes
+ queued: 0
+ active: 0
+
+ - name: crrn
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: orr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: trr
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+</screen>
+ <para>NRS policy state is shown in either one or two sections, depending on
+ the PTLRPC service being queried. The first section is named
+ <literal>regular_requests</literal> and is available for all PTLRPC
+ services, optionally followed by a second section which is named
+ <literal>high_priority_requests</literal>. This is because some PTLRPC
+ services are able to treat some types of RPCs as higher priority ones, such
+ that they are handled by the server with higher priority compared to other,
+ regular RPC traffic. For PTLRPC services that do not support high-priority
+ RPCs, you will only see the
+ <literal>regular_requests</literal> section.</para>
+ <para>There is a separate instance of each NRS policy on each PTLRPC
+ service for handling regular and high-priority RPCs (if the service
+ supports high-priority RPCs). For each policy instance, the following
+ fields are shown:</para>
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*" />
+ <colspec colname="c2" colwidth="50*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Field</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>name</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The name of the policy.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>state</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The state of the policy; this can be any of
+ <literal>invalid, stopping, stopped, starting, started</literal>.
+ A fully enabled policy is in the
+ <literal>started</literal> state.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fallback</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Whether the policy is acting as a fallback policy or not. A
+ fallback policy is used to handle RPCs that other enabled
+ policies fail to handle, or do not support the handling of. The
+ possible values are
+ <literal>no, yes</literal>. Currently, only the FIFO policy can
+ act as a fallback policy.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>queued</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The number of RPCs that the policy has waiting to be
+ serviced.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>active</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>The number of RPCs that the policy is currently
+ handling.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ <para>To enable an NRS policy on a PTLRPC service run:</para>
+ <screen>
+lctl set_param {service}.nrs_policies=
+<replaceable>policy_name</replaceable>
+</screen>
+ <para>This will enable the policy
+ <replaceable>policy_name</replaceable> for both regular and high-priority
+ RPCs (if the PTLRPC service supports high-priority RPCs) on the given
+ service. For example, to enable the CRR-N NRS policy for the ldlm_cbd
+ service, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_cbd.nrs_policies=crrn
+ldlm.services.ldlm_cbd.nrs_policies=crrn
+
+</screen>
+ <para>For PTLRPC services that support high-priority RPCs, you can also
+ supply an optional
+ <replaceable>reg|hp</replaceable> token, in order to enable an NRS policy
+ for handling only regular or high-priority RPCs on a given PTLRPC service,
+ by running:</para>
+ <screen>
+lctl set_param {service}.nrs_policies="
+<replaceable>policy_name</replaceable>
+<replaceable>reg|hp</replaceable>"
+</screen>
+ <para>For example, to enable the TRR policy for handling only regular, but
+ not high-priority RPCs on the
+ <literal>ost_io</literal> service, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_policies="trr reg"
+ost.OSS.ost_io.nrs_policies="trr reg"
+
+</screen>
+ <note>
+ <para>When enabling an NRS policy, the policy name must be given in
+ lower-case characters, otherwise the operation will fail with an error
+ message.</para>
+ </note>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>first in, first out (FIFO) policy</tertiary>
+ </indexterm>First In, First Out (FIFO) policy</title>
+ <para>The first in, first out (FIFO) policy handles RPCs in a service in
+ the same order as they arrive from the LNet layer, so no special
+ processing takes place to modify the RPC handling stream. FIFO is the
+ default policy for all types of RPCs on all PTLRPC services, and is
+ always enabled irrespective of the state of other policies, so that it
+ can be used as a backup policy, in case a more elaborate policy that has
+ been enabled fails to handle an RPC, or does not support handling a given
+ type of RPC.</para>
+ <para>The FIFO policy has no tunables that adjust its behaviour.</para>
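+ <para>Because FIFO is always available as the fallback, a service can be
+ returned to plain FIFO ordering by enabling it like any other policy. For
+ example, on the <literal>ost_io</literal> service:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_policies="fifo"
+ost.OSS.ost_io.nrs_policies="fifo"
+</screen>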
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>client round-robin over NIDs (CRR-N) policy</tertiary>
+ </indexterm>Client Round-Robin over NIDs (CRR-N) policy</title>
+ <para>The client round-robin over NIDs (CRR-N) policy performs batched
+ round-robin scheduling of all types of RPCs, with each batch consisting
+ of RPCs originating from the same client node, as identified by its NID.
+ CRR-N aims to provide for better resource utilization across the cluster,
+ and to help shorten completion times of jobs in some cases, by
+ distributing available bandwidth more evenly across all clients.</para>
+ <para>The CRR-N policy can be enabled on all types of PTLRPC services,
+ and has the following tunable that can be used to adjust its
+ behavior:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_crrn_quantum</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_crrn_quantum</literal> tunable determines the
+ maximum allowed size of each batch of RPCs; the unit of measure is in
+ number of RPCs. To read the maximum allowed batch size of a CRR-N
+ policy, run:</para>
+ <screen>
+lctl get_param {service}.nrs_crrn_quantum
+</screen>
+ <para>For example, to read the maximum allowed batch size of a CRR-N
+ policy on the ost_io service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_crrn_quantum
+ost.OSS.ost_io.nrs_crrn_quantum=reg_quantum:16
+hp_quantum:8
+
+</screen>
+ <para>You can see that there is a separate maximum allowed batch size
+ value for regular (
+ <literal>reg_quantum</literal>) and high-priority (
+ <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
+ high-priority RPCs).</para>
+ <para>To set the maximum allowed batch size of a CRR-N policy on a
+ given service, run:</para>
+ <screen>
+lctl set_param {service}.nrs_crrn_quantum=
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>This will set the maximum allowed batch size on a given
+ service, for both regular and high-priority RPCs (if the PTLRPC
+ service supports high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the maximum allowed batch size on the
+ ldlm_canceld service to 16 RPCs, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
+ldlm.services.ldlm_canceld.nrs_crrn_quantum=16
+
+</screen>
+ <para>For PTLRPC services that support high-priority RPCs, you can
+ also specify a different maximum allowed batch size for regular and
+ high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param {service}.nrs_crrn_quantum=
+<replaceable>reg_quantum|hp_quantum</replaceable>:
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>For example, to set the maximum allowed batch size on the
+ ldlm_canceld service, for high-priority RPCs to 32, run:</para>
+ <screen>
+$ lctl set_param ldlm.services.ldlm_canceld.nrs_crrn_quantum="hp_quantum:32"
+ldlm.services.ldlm_canceld.nrs_crrn_quantum=hp_quantum:32
+
+</screen>
+ <para>By using the last method, you can also set the maximum regular
+ and high-priority RPC batch sizes to different values, in a single
+ command invocation.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>object-based round-robin (ORR) policy</tertiary>
+ </indexterm>Object-based Round-Robin (ORR) policy</title>
+ <para>The object-based round-robin (ORR) policy performs batched
+ round-robin scheduling of bulk read write (brw) RPCs, with each batch
+ consisting of RPCs that pertain to the same backend-file system object,
+ as identified by its OST FID.</para>
+ <para>The ORR policy is only available for use on the ost_io service. The
+ RPC batches it forms can potentially consist of mixed bulk read and bulk
+ write RPCs. The RPCs in each batch are ordered in an ascending manner,
+ based on either the file offsets, or the physical disk offsets of each
+ RPC (only applicable to bulk read RPCs).</para>
+ <para>The aim of the ORR policy is to provide for increased bulk read
+ throughput in some cases, by ordering bulk read RPCs (and potentially
+ bulk write RPCs), and thus minimizing costly disk seek operations.
+ Performance may also benefit from any resulting improvement in resource
+ utilization, or by taking advantage of better locality of reference
+ between RPCs.</para>
+ <para>The ORR policy has the following tunables that can be used to
+ adjust its behaviour:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable determines
+ the maximum allowed size of each batch of RPCs; the unit of measure
+ is in number of RPCs. To read the maximum allowed batch size of the
+ ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_quantum
+ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:256
+hp_quantum:16
+
+</screen>
+ <para>You can see that there is a separate maximum allowed batch size
+ value for regular (
+ <literal>reg_quantum</literal>) and high-priority (
+ <literal>hp_quantum</literal>) RPCs (if the PTLRPC service supports
+ high-priority RPCs).</para>
+ <para>To set the maximum allowed batch size for the ORR policy,
+ run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>This will set the maximum allowed batch size for both regular
+ and high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different maximum allowed batch size for
+ regular and high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=
+<replaceable>reg_quantum|hp_quantum</replaceable>:
+<replaceable>1-65535</replaceable>
+</screen>
+ <para>For example, to set the maximum allowed batch size for regular
+ RPCs to 128, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
+ost.OSS.ost_io.nrs_orr_quantum=reg_quantum:128
+
+</screen>
+ <para>By using the last method, you can also set the maximum regular
+ and high-priority RPC batch sizes to different values, in a single
+ command invocation.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable
+ determines whether the ORR policy orders RPCs within each batch based
+ on logical file offsets or physical disk offsets. To read the offset
+ type value for the ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_offset_type
+ost.OSS.ost_io.nrs_orr_offset_type=reg_offset_type:physical
+hp_offset_type:logical
+
+</screen>
+ <para>You can see that there is a separate offset type value for
+ regular (
+ <literal>reg_offset_type</literal>) and high-priority (
+ <literal>hp_offset_type</literal>) RPCs.</para>
+ <para>To set the ordering type for the ORR policy, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
+<replaceable>physical|logical</replaceable>
+</screen>
+ <para>This will set the offset type for both regular and
+ high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different offset type for regular and
+ high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=
+<replaceable>reg_offset_type|hp_offset_type</replaceable>:
+<replaceable>physical|logical</replaceable>
+</screen>
+ <para>For example, to set the offset type for high-priority RPCs to
+ physical disk offsets, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
+ost.OSS.ost_io.nrs_orr_offset_type=hp_offset_type:physical
+</screen>
+ <para>By using the last method, you can also set offset type for
+ regular and high-priority RPCs to different values, in a single
+ command invocation.</para>
+ <note>
+ <para>Irrespective of the value of this tunable, only logical
+ offsets can be, and are, used for ordering bulk write RPCs.</para>
+ </note>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal>
+ </para>
+ <para>The
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable determines
+ the type of RPCs that the ORR policy will handle. To read the types
+ of supported RPCs by the ORR policy, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_orr_supported
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads
+hp_supported:reads_and_writes
+
+</screen>
+ <para>You can see that there is a separate supported 'RPC types'
+ value for regular (
+ <literal>reg_supported</literal>) and high-priority (
+ <literal>hp_supported</literal>) RPCs.</para>
+ <para>To set the supported RPC types for the ORR policy, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
+<replaceable>reads|writes|reads_and_writes</replaceable>
+</screen>
+ <para>This will set the supported RPC types for both regular and
+ high-priority RPCs, to the indicated value.</para>
+ <para>You can also specify a different supported 'RPC types' value
+ for regular and high-priority RPCs, by running:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_orr_supported=
+<replaceable>reg_supported|hp_supported</replaceable>:
+<replaceable>reads|writes|reads_and_writes</replaceable>
+</screen>
+ <para>For example, to set the supported RPC types to bulk read and
+ bulk write RPCs for regular requests, run:</para>
+ <screen>
+$ lctl set_param
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
+ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes
+
+</screen>
+ <para>By using the last method, you can also set the supported RPC
+ types for regular and high-priority RPC to different values, in a
+ single command invocation.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Target-based round-robin (TRR) policy</tertiary>
+ </indexterm>Target-based Round-Robin (TRR) policy</title>
+ <para>The target-based round-robin (TRR) policy performs batched
+ round-robin scheduling of brw RPCs, with each batch consisting of RPCs
+ that pertain to the same OST, as identified by its OST index.</para>
+ <para>The TRR policy is identical to the object-based round-robin (ORR)
+ policy, apart from using the brw RPC's target OST index instead of the
+ backend-fs object's OST FID, for determining the RPC scheduling order.
+ The goals of TRR are effectively the same as for ORR, and it uses the
+ following tunables to adjust its behaviour:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_quantum</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_quantum</literal> tunable for the ORR
+ policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_offset_type</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_offset_type</literal> tunable for the
+ ORR policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>ost.OSS.ost_io.nrs_trr_supported</literal>
+ </para>
+ <para>The purpose of this tunable is exactly the same as for the
+ <literal>ost.OSS.ost_io.nrs_orr_supported</literal> tunable for the
+ ORR policy, and you can use it in exactly the same way.</para>
+ </listitem>
+ </itemizedlist>
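+ <para>These tunables follow the same syntax as their ORR counterparts. For
+ example, to set the maximum allowed batch size of the TRR policy for
+ regular RPCs to 64 (the value 64 is illustrative), run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_trr_quantum=reg_quantum:64
+ost.OSS.ost_io.nrs_trr_quantum=reg_quantum:64
+</screen>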
+ </section>
+ <section xml:id="dbdoclet.tbftuning" condition='l26'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Token Bucket Filter (TBF) policy</tertiary>
+ </indexterm>Token Bucket Filter (TBF) policy</title>
+ <para>The TBF (Token Bucket Filter) is a Lustre NRS policy which enables
+ Lustre services to enforce the RPC rate limit on clients/jobs for QoS
+ (Quality of Service) purposes.</para>
+ <figure>
+ <title>The internal structure of TBF policy</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata scalefit="1" width="50%"
+ fileref="figures/TBF_policy.png" />
+ </imageobject>
+ <textobject>
+ <phrase>The internal structure of TBF policy</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>When an RPC request arrives, the TBF policy puts it into a waiting
+ queue according to its classification. RPC requests are classified by
+ either the NID or the JobID of the RPC, depending on how TBF is
+ configured. The TBF policy maintains multiple queues in the system, one
+ queue for each category in the classification of RPC requests. Requests
+ wait for tokens in their FIFO queue before being handled, so as to keep
+ the RPC rates under the configured limits.</para>
+ <para>When Lustre services are too busy to handle all of the requests in
+ time, not all of the specified rates of the queues can be satisfied.
+ Nothing bad happens except that some RPC rates are slower than
+ configured. In this case, a queue with a higher rate will have an
+ advantage over queues with lower rates, but none of them will be
+ starved.</para>
+ <para>To manage the RPC rate of queues, the rate of each queue does not
+ need to be set manually. Instead, rules are defined that the TBF policy
+ matches to determine RPC rate limits. All of the defined rules are
+ organized as an ordered list. Whenever a queue is newly created, it goes
+ through the rule list and takes the first matching rule as its rule, so
+ that the queue knows its RPC token rate. A rule can be added to or
+ removed from the list at run time. Whenever the list of rules is
+ changed, the queues update their matched rules.</para>
+ <section remap="h4">
+ <title>Enable TBF policy</title>
+ <para>Command:</para>
+      <screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf <replaceable>policy</replaceable>"
+ </screen>
+ <para>RPCs can be classified into different types according to their
+ NID, JobID, opcode, or UID/GID. When enabling the TBF policy, you can
+ specify one of these types, or just use "tbf" to enable all of them for
+ fine-grained RPC request classification.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
+ </section>
+ <section remap="h4">
+ <title>Start a TBF rule</title>
+ <para>The TBF rule is defined in the parameter
+ <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
+ </screen>
+ <para>'<replaceable>rule_name</replaceable>' is a string of the TBF
+ policy rule's name and '<replaceable>arguments</replaceable>' is a
+ string to specify the detailed rule according to the different types.
+ </para>
+ <itemizedlist>
+ <para>Next, the different types of TBF policies will be described.</para>
+ <listitem>
+ <para><emphasis role="bold">NID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> nid={<replaceable>nidlist</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>'<replaceable>nidlist</replaceable>' uses the same format
+ as configuring LNET route. '<replaceable>rate</replaceable>' is
+ the (upper limit) RPC rate of the rule.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start other_clients nid={192.168.*.*@tcp} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+          <para>In this example, RPC requests from compute nodes are
+          processed at up to five times the rate of requests from login
+          nodes. The output of
+          <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks like
+          this:</para>
+ <screen>lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0
+high_priority_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start loginnode nid={192.168.1.1@tcp} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">JobID based TBF policy</emphasis></para>
+ <para>For more details about the JobID, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="dbdoclet.jobstats" />.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Wildcards are supported in
+ {<replaceable>jobid_list</replaceable>}.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start dd_user jobid={dd.*} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={*.600} rate=10"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user2 jobid={io*.10* *.500} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 jobid={iozone.500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 opcode={ost_read} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
+$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
+ <para>Example:</para>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ <para>Limit the rate of RPC requests from GID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name gid={500} rate=100"</screen>
+ <para>The following rules can be used to control all requests
+ sent to the MDS:</para>
+ <para>Start the TBF UID QoS policy on the MDS:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Policy combination</emphasis></para>
+ <para>To support TBF rules with complex condition expressions,
+ the TBF classifier has been extended to classify RPCs at a finer
+ granularity. This feature supports logical conjunction and
+ disjunction operations among conditions of different types.
+ In a rule,
+ "&amp;" represents conjunction (AND) and
+ "," represents disjunction (OR).</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start comp_rule opcode={ost_write}&amp;jobid={dd.0},\
+nid={192.168.1.[1-128]@tcp 0@lo} rate=100"</screen>
+ <para>In this example, RPCs whose <literal>opcode</literal> is
+ ost_write and whose <literal>jobid</literal> is dd.0, or whose
+ <literal>nid</literal> matches
+ {192.168.1.[1-128]@tcp 0@lo}, will be processed at a rate of 100
+ req/sec.
+ The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks
+ like:</para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500}&amp;gid={500} rate=100"</screen>
+ <para>In this example, RPC requests whose UID is 500 and whose
+ GID is 500 will be processed at a rate of 100 req/sec.</para>
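+ <para>The matching logic described above can be sketched as
+ follows. This is an illustrative model only, not the Lustre
+ classifier: the rule is represented directly as data rather than
+ parsed, and NID matching is simplified to shell-style wildcards.
+ </para>

```python
# Illustrative sketch of TBF "policy combination" matching -- NOT the
# Lustre implementation. Assumed semantics: "&" joins conditions into
# a conjunction group, "," separates groups into a disjunction, and an
# RPC matches a rule if any one group matches all of its conditions.
from fnmatch import fnmatch

def group_matches(group, rpc):
    # Conjunction: every condition type in the group must match.
    return all(any(fnmatch(rpc[key], pat) for pat in patterns)
               for key, patterns in group.items())

def rule_matches(groups, rpc):
    # Disjunction: at least one group must match.
    return any(group_matches(g, rpc) for g in groups)

# "opcode={ost_write}&jobid={dd.0},nid={192.168.1.*@tcp}" as data
# (the NID range is simplified to a wildcard for this sketch):
comp_rule = [
    {"opcode": ["ost_write"], "jobid": ["dd.0"]},  # first disjunct
    {"nid": ["192.168.1.*@tcp"]},                  # second disjunct
]

print(rule_matches(comp_rule,
      {"opcode": "ost_write", "jobid": "dd.0", "nid": "10.0.0.1@tcp"}))
# -> True (first conjunction group matches)
print(rule_matches(comp_rule,
      {"opcode": "ost_read", "jobid": "cp.1", "nid": "192.168.1.7@tcp"}))
# -> True (second group matches on NID)
print(rule_matches(comp_rule,
      {"opcode": "ost_write", "jobid": "cp.1", "nid": "10.0.0.1@tcp"}))
# -> False (no group matches in full)
```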
+ </listitem>
+ </itemizedlist>
+ </section>
+ <section remap="h4">
+ <title>Change a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp change loginnode rate=200"
+</screen>
+ </section>
+ <section remap="h4">
+ <title>Stop a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
+<replaceable>rule_name</replaceable>"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"</screen>
+ </section>
+ <section remap="h4">
+ <title>Rule options</title>
+ <para>To support more flexible rule conditions, the following options
+ are added.</para>
+ <itemizedlist>
+ <listitem>
+ <para><emphasis role="bold">Reordering of TBF rules</emphasis></para>
+ <para>By default, a newly started rule takes precedence over
+ existing rules, but the rank of a rule can be set by specifying
+ the '<literal>rank=</literal>' argument when inserting a new rule
+ with the "<literal>start</literal>" command. It can also be
+ changed with the "<literal>change</literal>" command.
+ </para>
+ <para>Command:</para>
+ <screen>lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... rank=<replaceable>obj_rule_name</replaceable>"
+lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable> rank=<replaceable>obj_rule_name</replaceable>"
+</screen>
+ <para>By specifying an existing rule
+ '<replaceable>obj_rule_name</replaceable>', the new rule
+ '<replaceable>rule_name</replaceable>' is placed in front of
+ '<replaceable>obj_rule_name</replaceable>'.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={iozone.500 dd.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes"</screen>
+ <para>In this example, rule "iozone_user1" is placed in front of
+ rule "computes". The resulting order can be seen with the
+ following command:
+ </para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0</screen>
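+ <para>The first-match ordering shown above can be sketched with a
+ simple model (illustrative only, not the Lustre code; the
+ predicates are hypothetical stand-ins for the rules' conditions):
+ </para>

```python
# Illustrative sketch of TBF rule ordering -- NOT the Lustre code.
# Rules form an ordered list; an RPC is classified by the first rule
# it matches, so "rank=<obj_rule_name>" sets priority by position.

def start_rule(rules, name, pred, rate, rank=None):
    entry = (name, pred, rate)
    if rank is None:
        rules.insert(0, entry)     # a new rule goes in front by default
    else:
        idx = [n for n, _, _ in rules].index(rank)
        rules.insert(idx, entry)   # ...or in front of the named rule

def classify(rules, rpc):
    for name, pred, rate in rules:
        if pred(rpc):
            return name, rate
    return "default", 10000        # fallback rule, lowest priority

rules = []
start_rule(rules, "computes", lambda r: r["nid"].startswith("192.168.1."), 500)
start_rule(rules, "user1", lambda r: r["jobid"] in ("iozone.500", "dd.500"), 100)
start_rule(rules, "iozone_user1",
           lambda r: r["opcode"] in ("ost_read", "ost_write"), 200,
           rank="computes")
print([n for n, _, _ in rules])
# -> ['user1', 'iozone_user1', 'computes']

# An RPC matching several rules is limited by the highest-ranked one:
print(classify(rules, {"nid": "192.168.1.5@tcp", "jobid": "iozone.500",
                       "opcode": "ost_read"}))
# -> ('user1', 100)
```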
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">TBF realtime policies under congestion
+ </emphasis></para>
+ <para>During TBF evaluation, it was found that when the sum of the
+ I/O bandwidth requirements of all classes exceeds the system
+ capacity, classes configured with the same rate limit may receive
+ less bandwidth than their configured share. The reason is that the
+ heavy load on a congested server causes some classes to miss
+ deadlines, so the number of tokens calculated at dequeue time may
+ be larger than 1. In the original implementation, all classes are
+ handled equally and any excess tokens are simply discarded.</para>
+ <para>To address this, a Hard Token Compensation (HTC) strategy
+ has been implemented. A class is given the HTC feature by the rule
+ it matches. The feature means that requests in such a class queue
+ have high real-time requirements and that their bandwidth
+ assignment must be satisfied as well as possible. When a deadline
+ miss happens, the class keeps its deadline unchanged and the time
+ residue (the remainder of the elapsed time divided by 1/r) is
+ compensated to the next round. This ensures that the next idle I/O
+ thread will keep selecting this class to serve until all
+ accumulated excess tokens are consumed or there are no pending
+ requests left in the class queue.</para>
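+ <para>The effect of token compensation can be sketched with a toy
+ token-bucket model (an illustration of the idea only, with
+ made-up numbers; capping the default case at a single token is a
+ simplification, not the actual Lustre mechanics):
+ </para>

```python
# Toy token-bucket model illustrating Hard Token Compensation (HTC).
# NOT the Lustre implementation; the numbers and the one-token cap in
# the default case are simplifications for illustration.

def dispatch_counts(check_times, rate, htc, backlog):
    """Requests dispatched at each service opportunity for a class
    rate-limited to `rate` req/sec with `backlog` queued requests."""
    tokens, last, sent = 0.0, 0.0, []
    for now in check_times:
        tokens += (now - last) * rate    # tokens accrue at `rate`
        last = now
        if not htc:
            tokens = min(tokens, 1.0)    # default: discard excess tokens
        n = min(int(tokens), backlog - sum(sent))
        tokens -= n                      # HTC: residue carries forward
        sent.append(n)
    return sent

# A congested server only reaches this class once per second. At
# rate=5 req/sec the class is entitled to 5 requests per visit:
print(dispatch_counts([1, 2], rate=5, htc=False, backlog=20))  # -> [1, 1]
print(dispatch_counts([1, 2], rate=5, htc=True, backlog=20))   # -> [5, 5]
```

With HTC enabled the class catches up on its entitled rate after each missed deadline; with the default behavior the accumulated tokens are lost and the class falls short of its configured rate.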
+ <para>Command:</para>
+ <para>A new command format is added to enable the realtime feature
+ for a rule:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=\
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start realjob jobid={dd.0} rate=100 realtime=1"</screen>
+ <para>This example rule means that RPC requests whose JobID is
+ dd.0 will be processed at a rate of 100 req/sec with the realtime
+ feature enabled.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
+ </section>
+ <section xml:id="dbdoclet.delaytuning" condition='l2A'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Delay policy</tertiary>
+ </indexterm>Delay policy</title>
+ <para>The NRS Delay policy seeks to perturb the timing of request
+ processing at the PtlRPC layer, with the goal of simulating high server
+ load, and finding and exposing timing related problems. When this policy
+ is active, upon arrival of a request the policy will calculate an offset,
+ within a defined, user-configurable range, from the request arrival
+ time, to determine a time after which the request should be handled.
+ The request is then stored in a binary heap (the cfs_binheap
+ implementation), which sorts requests according to their assigned
+ start times. Requests are removed from the binheap for handling
+ once their start time has passed.</para>
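+ <para>The scheduling described above can be sketched as follows (a
+ minimal model, assuming a uniformly random offset and using
+ Python's heapq in place of the kernel's cfs_binheap):
+ </para>

```python
# Minimal sketch of the NRS Delay policy's scheduling -- NOT the
# Lustre implementation. Assumes a uniformly random offset within
# [delay_min, delay_max] and uses Python's heapq as the binheap.
import heapq
import random

class DelayQueue:
    def __init__(self, delay_min=5, delay_max=300):   # defaults in seconds
        self.delay_min, self.delay_max = delay_min, delay_max
        self.heap = []                                # keyed on start time

    def enqueue(self, req, now):
        # Pick a start time within [now+delay_min, now+delay_max].
        start = now + random.uniform(self.delay_min, self.delay_max)
        heapq.heappush(self.heap, (start, req))

    def dequeue_ready(self, now):
        # Requests become eligible once their start time has passed.
        ready = []
        while self.heap and self.heap[0][0] <= now:
            ready.append(heapq.heappop(self.heap)[1])
        return ready

q = DelayQueue(delay_min=1, delay_max=2)
q.enqueue("RPC-1", now=0)
print(q.dequeue_ready(now=0))   # -> [] (still delayed)
print(q.dequeue_ready(now=3))   # -> ['RPC-1'] (past its start time)
```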
+ <para>The Delay policy can be enabled on all types of PtlRPC services,
+ and has the following tunables that can be used to adjust its behavior:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_min</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_min</literal> tunable controls the
+ minimum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 5 seconds. To read this value run:</para>
+ <screen>
+lctl get_param {service}.nrs_delay_min</screen>
+ <para>For example, to read the minimum delay set on the ost_io
+ service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_min
+ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5
+hp_delay_min:5</screen>
+ <para>To set the minimum delay in RPC processing, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>0-65535</replaceable></screen>
+ <para>This will set the minimum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the minimum delay time on the ost_io service
+ to 10, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
+ost.OSS.ost_io.nrs_delay_min=10</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different minimum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>reg_delay_min|hp_delay_min</replaceable>:<replaceable>0-65535</replaceable>
+ </screen>
+ <para>For example, to set the minimum delay time on the ost_io service
+ for high-priority RPCs to 3, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3
+ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3</screen>
+ <para>Note, in all cases the minimum delay time cannot exceed the
+ maximum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_max</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_max</literal> tunable controls the
+ maximum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 300 seconds. To read this value run:
+ </para>
+ <screen>lctl get_param {service}.nrs_delay_max</screen>
+ <para>For example, to read the maximum delay set on the ost_io
+ service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_max
+ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300
+hp_delay_max:300</screen>
+ <para>To set the maximum delay in RPC processing, run:</para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>0-65535</replaceable>
+</screen>
+ <para>This will set the maximum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the maximum delay time on the ost_io service
+ to 60, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
+ost.OSS.ost_io.nrs_delay_max=60</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different maximum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>reg_delay_max|hp_delay_max</replaceable>:<replaceable>0-65535</replaceable></screen>
+ <para>For example, to set the maximum delay time on the ost_io service
+ for high-priority RPCs to 30, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30
+ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30</screen>
+ <para>Note, in all cases the maximum delay time cannot be less than the
+ minimum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_pct</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_pct</literal> tunable controls the
+ percentage of requests that will be delayed by this policy. The
+ default is 100. Note that when a request is not selected for
+ handling by the delay policy due to this variable, the request is
+ handled by whatever fallback policy is defined for that service.
+ If no other fallback policy is defined, the request is handled by
+ the FIFO policy. To read this value run:</para>
+ <screen>lctl get_param {service}.nrs_delay_pct</screen>
+ <para>For example, to read the percentage of requests being delayed on
+ the ost_io service, run:</para>
+ <screen>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_pct
+ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100
+hp_delay_pct:100</screen>
+ <para>To set the percentage of delayed requests, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_pct=<replaceable>0-100</replaceable></screen>
+ <para>This will set the percentage of requests delayed on a given
+ service, for both regular and high-priority RPCs (if the PtlRPC service
+ supports high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service to 50, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50
+ost.OSS.ost_io.nrs_delay_pct=50
+</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different delay percentage for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_pct=<replaceable>reg_delay_pct|hp_delay_pct</replaceable>:<replaceable>0-100</replaceable>
+</screen>
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service for high-priority RPCs to 5, run:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+</screen>
+ </listitem>
+ </itemizedlist>
+ </section>