<?xml version='1.0' encoding='utf-8'?>
<chapter xmlns="http://docbook.org/ns/docbook"
-xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
-xml:id="lustretuning">
+ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
+ xml:id="lustretuning">
<title xml:id="lustretuning.title">Tuning a Lustre File System</title>
<para>This chapter contains information about tuning a Lustre file system for
better performance.</para>
service immediately and disables automatic thread creation behavior.
</para>
</note>
- <para condition='l23'>Lustre software release 2.3 introduced new
- parameters to provide more control to administrators.</para>
+ <para>Parameters are available to provide administrators control
+ over the number of service threads.</para>
<itemizedlist>
<listitem>
<para>
in providing the read page service. The read page service handles
file close and readdir operations.</para>
</listitem>
- <listitem>
- <para>
- <literal>mds_attr_num_threads</literal> controls the number of threads
- in providing the setattr service to clients running Lustre software
- release 1.8.</para>
- </listitem>
</itemizedlist>
</section>
</section>
- <section xml:id="dbdoclet.mdsbinding" condition='l23'>
+ <section xml:id="dbdoclet.mdsbinding">
<title>
<indexterm>
<primary>tuning</primary>
<secondary>MDS binding</secondary>
</indexterm>Binding MDS Service Thread to CPU Partitions</title>
- <para>With the introduction of Node Affinity (
- <xref linkend="nodeaffdef" />) in Lustre software release 2.3, MDS threads
- can be bound to particular CPU partitions (CPTs) to improve CPU cache
- usage and memory locality. Default values for CPT counts and CPU core
+ <para>With the Node Affinity (<xref linkend="nodeaffdef" />) feature,
+ MDS threads can be bound to particular CPU partitions (CPTs) to improve CPU
+ cache usage and memory locality. Default values for CPT counts and CPU core
bindings are selected automatically to provide good overall performance for
a given CPU count. However, an administrator can deviate from these setting
if they choose. For details on specifying the mapping of CPU cores to
to
<literal>CPT4</literal>.</para>
</listitem>
- <listitem>
- <para>
- <literal>mds_attr_num_cpts=[EXPRESSION]</literal> binds the setattr
- service threads to CPTs defined by
- <literal>EXPRESSION</literal>.</para>
- </listitem>
</itemizedlist>
- <para>Parameters must be set before module load in the file
+ <para>Parameters must be set before module load in the file
<literal>/etc/modprobe.d/lustre.conf</literal>. For example:
<example><title>lustre.conf</title>
<screen>options lnet networks=tcp0(eth0)
<para>By default, this parameter is off. As always, you should test the
performance to compare the impact of changing this parameter.</para>
</section>
- <section condition='l23'>
+ <section>
<title>
<indexterm>
<primary>tuning</primary>
<secondary>Network interface binding</secondary>
</indexterm>Binding Network Interface Against CPU Partitions</title>
- <para>Lustre software release 2.3 and beyond provide enhanced network
- interface control. The enhancement means that an administrator can bind
- an interface to one or more CPU partitions. Bindings are specified as
- options to the LNet modules. For more information on specifying module
- options, see
+ <para>Lustre allows enhanced network interface control. This means that
+ an administrator can bind an interface to one or more CPU partitions.
+ Bindings are specified as options to the LNet modules. For more
+ information on specifying module options, see
<xref linkend="dbdoclet.50438293_15350" /></para>
<para>For example,
<literal>o2ib0(ib0)[0,1]</literal> will ensure that all messages for
<screen>
ko2iblnd credits=256
</screen>
- <note condition="l23">
- <para>In Lustre software release 2.3 and beyond, LNet may revalidate
- the NI credits, so the administrator's request may not persist.</para>
+ <note>
+ <para>LNet may revalidate the NI credits, so the administrator's
+ request may not persist.</para>
</note>
</section>
<section>
<screen>
lnet large_router_buffers=8192
</screen>
- <note condition="l23">
- <para>In Lustre software release 2.3 and beyond, LNet may revalidate
- the router buffer setting, so the administrator's request may not
- persist.</para>
+ <note>
+ <para>LNet may revalidate the router buffer setting, so the
+ administrator's request may not persist.</para>
</note>
</section>
<section>
events across all CPTs. This may balance load better across the CPU but
can incur a cross CPU overhead.</para>
<para>The current policy can be changed by an administrator with
- <literal>echo
- <replaceable>value</replaceable>>
- /proc/sys/lnet/portal_rotor</literal>. There are four options for
+ <literal>lctl set_param portal_rotor=<replaceable>value</replaceable></literal>.
+ There are four options for
<literal>
<replaceable>value</replaceable>
</literal>:</para>
interface. The default setting is 1. (For more information about the
LNet routes parameter, see
<xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="dbdoclet.50438216_71227" /></para>
+ linkend="lnet_module_routes" /></para>
<para>A router is considered down if any of its NIDs are down. For
example, router X has three NIDs:
<literal>Xnid1</literal>,
be MAX.</para>
</section>
</section>
- <section xml:id="dbdoclet.libcfstuning" condition='l23'>
+ <section xml:id="dbdoclet.libcfstuning">
<title>
<indexterm>
<primary>tuning</primary>
<secondary>libcfs</secondary>
</indexterm>libcfs Tuning</title>
- <para>Lustre software release 2.3 introduced binding service threads via
- CPU Partition Tables (CPTs). This allows the system administrator to
- fine-tune on which CPU cores the Lustre service threads are run, for both
- OSS and MDS services, as well as on the client.
+ <para>Lustre allows binding service threads via CPU Partition Tables
+ (CPTs). This allows the system administrator to fine-tune on which CPU
+ cores the Lustre service threads are run, for both OSS and MDS services,
+ as well as on the client.
</para>
<para>CPTs are useful to reserve some cores on the OSS or MDS nodes for
system functions such as system monitoring, HA heartbeat, or similar
<literal>nscheds</literal> parameter. This adjusts the number of threads for
each partition, not the overall number of threads on the LND.</para>
<note>
- <para>Lustre software release 2.3 has greatly decreased the default
- number of threads for
+ <para>The default number of threads for
<literal>ko2iblnd</literal> and
- <literal>ksocklnd</literal> on high-core count machines. The current
- default values are automatically set and are chosen to work well across a
- number of typical scenarios.</para>
+ <literal>ksocklnd</literal> are automatically set and are chosen to
+ work well across a number of typical scenarios, for systems with both
+ high and low core counts.</para>
</note>
+ <section>
+ <title>ko2iblnd Tuning</title>
+ <para>The following table outlines the ko2iblnd module parameters to be used
+ for tuning:</para>
+ <informaltable frame="all">
+ <tgroup cols="3">
+ <colspec colname="c1" colwidth="50*" />
+ <colspec colname="c2" colwidth="50*" />
+ <colspec colname="c3" colwidth="50*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Module Parameter</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Default Value</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Description</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <literal>service</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>987</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Service number (within RDMA_PS_TCP).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>cksum</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Set non-zero to enable message (not RDMA) checksums.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>50</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Timeout in seconds.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>nscheds</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of threads in each scheduler pool (per CPT). A value of
+ zero means the number is derived from the number of cores.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>conns_per_peer</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>4 (OmniPath), 1 (Everything else)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of connections to each peer. Messages
+ are sent round-robin over the connection pool. Provides significant
+ improvement with OmniPath.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ntx</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of message descriptors allocated for each pool at
+ startup. Grows at runtime. Shared by all CPTs.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>256</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends on network.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>8</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of concurrent sends to a single peer. Related to, and
+ limited by, the IB queue size.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_credits_hiw</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>High-water mark at which credits are eagerly returned.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_buffer_credits</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of per-peer router buffer credits.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>peer_timeout</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>180</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Seconds without aliveness news to declare peer dead (less than
+ or equal to 0 to disable).</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ipif_name</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>ib0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IPoIB interface name.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>5</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Retransmissions when no ACK received.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>rnr_retry_count</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>6</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>RNR retransmissions.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>keepalive</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>100</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Idle time in seconds before sending a keepalive.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>ib_mtu</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>IB MTU 256/512/1024/2048/4096.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>concurrent_sends</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Send work-queue sizing. If zero, derived from
+ <literal>map_on_demand</literal> and <literal>peer_credits</literal>.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>map_on_demand</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0 (pre-4.8 Linux), 1 (4.8 Linux onward), 32 (OmniPath)</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of fragments reserved for a connection. If zero, use the
+ global memory region (found to be a security issue). If non-zero, use
+ FMR or FastReg for memory registration. The value must agree between
+ both peers of a connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_pool_size</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>512</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Size of the FMR pool on each CPT (>= ntx / 4). Grows at runtime.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_flush_trigger</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>384</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Number of dirty FMRs that triggers a pool flush.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>fmr_cache</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Non-zero to enable FMR caching.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>dev_failover</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>HCA failover for bonding (0 OFF, 1 ON, other values reserved).
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>require_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>0</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Require privileged port when accepting connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>use_privileged_port</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>1</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Use privileged port when initiating connection.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <literal>wrq_sge</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <literal>2</literal>
+ </para>
+ </entry>
+ <entry>
+ <para>Introduced in 2.10. Number of scatter/gather element groups per
+ work request. Used to deal with fragmentation, which can consume
+ double the number of work requests.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+ </section>
</section>
- <section xml:id="dbdoclet.nrstuning" condition='l24'>
+ <section xml:id="dbdoclet.nrstuning">
<title>
<indexterm>
<primary>tuning</primary>
queued: 2420
active: 268
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
high_priority_requests:
- name: fifo
state: started
fallback: no
queued: 0
active: 0
-
+
+ - name: tbf
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
+ - name: delay
+ state: stopped
+ fallback: no
+ queued: 0
+ active: 0
+
</screen>
<para>NRS policy state is shown in either one or two sections, depending on
the PTLRPC service being queried. The first section is named
</listitem>
</itemizedlist>
</section>
- <section condition='l26'>
+ <section xml:id="dbdoclet.tbftuning" condition='l26'>
<title>
<indexterm>
<primary>tuning</primary>
<title>The internal structure of TBF policy</title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="100%"
- fileref="figures/TBF_policy.svg" />
+ <imagedata scalefit="1" width="50%"
+ fileref="figures/TBF_policy.png" />
</imageobject>
<textobject>
<phrase>The internal structure of TBF policy</phrase>
knows its RPC token rate. A rule can be added to or removed from the list
at run time. Whenever the list of rules is changed, the queues will
update their matched rules.</para>
+ <section remap="h4">
+ <title>Enable TBF policy</title>
+ <para>Command:</para>
+ <screen>lctl set_param ost.OSS.ost_io.nrs_policies="tbf &lt;<replaceable>policy</replaceable>&gt;"
+ </screen>
+ <para>RPCs can currently be classified by NID, JobID, opcode, or
+ UID/GID. When enabling the TBF policy, you can specify one of these
+ types, or use just "tbf" to enable all of them for fine-grained RPC
+ request classification.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
+$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"</screen>
+ </section>
+ <section remap="h4">
+ <title>Start a TBF rule</title>
+ <para>The TBF rule is defined in the parameter
+ <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>..."
+ </screen>
+ <para>'<replaceable>rule_name</replaceable>' is a string of the TBF
+ policy rule's name and '<replaceable>arguments</replaceable>' is a
+ string to specify the detailed rule according to the different types.
+ </para>
+ <itemizedlist>
+ <para>Next, the different types of TBF policies will be described.</para>
+ <listitem>
+ <para><emphasis role="bold">NID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> nid={<replaceable>nidlist</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>'<replaceable>nidlist</replaceable>' uses the same format
+ as when configuring LNet routes. '<replaceable>rate</replaceable>' is
+ the upper limit of the rule's RPC rate, in requests per second.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start other_clients nid={192.168.*.*@tcp} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+ <para>In this example, RPC requests from compute nodes are processed
+ at a rate of at most 5x that of requests from login nodes.
+ The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks
+ like this:</para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0
+high_priority_requests:
+CPT 0:
+loginnode {192.168.1.1@tcp} 100, ref 0
+computes {192.168.1.[2-128]@tcp} 500, ref 0
+other_clients {192.168.*.*@tcp} 50, ref 0
+default {*} 10000, ref 0</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start loginnode nid={192.168.1.1@tcp} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start loginnode nid={192.168.1.1@tcp} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">JobID based TBF policy</emphasis></para>
+ <para>For the JobID, please see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="dbdoclet.jobstats" /> for more details.</para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> jobid={<replaceable>jobid_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Wildcards are supported in
+ {<replaceable>jobid_list</replaceable>}.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start dd_user jobid={dd.*} rate=50"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={*.600} rate=10"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user2 jobid={io*.10* *.500} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 jobid={iozone.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 jobid={iozone.500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Opcode based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] start <replaceable>rule_name</replaceable> opcode={<replaceable>opcode_list</replaceable>} rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200"</screen>
+ <para>Also, the rule can be written in <literal>reg</literal> and
+ <literal>hp</literal> formats:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp start iozone_user1 opcode={ost_read} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg start iozone_user1 opcode={ost_read} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">UID/GID based TBF policy</emphasis></para>
+ <para>Command:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> uid={<replaceable>uid</replaceable>} rate=<replaceable>rate</replaceable>"
+$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"[reg|hp] start <replaceable>rule_name</replaceable> gid={<replaceable>gid</replaceable>} rate=<replaceable>rate</replaceable>"</screen>
+ <para>Example:</para>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ <para>Limit the rate of RPC requests from GID 500:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name gid={500} rate=100"</screen>
+ <para>You can also use the following rules to control all requests
+ to the MDS:</para>
+ <para>Start the TBF UID QoS on the MDS:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_policies="tbf uid"</screen>
+ <para>Limit the rate of RPC requests from UID 500:</para>
+ <screen>$ lctl set_param mds.MDS.*.nrs_tbf_rule=\
+"start tbf_name uid={500} rate=100"</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">Policy combination</emphasis></para>
+ <para>To support TBF rules with complex expressions of conditions,
+ the TBF classifier is extended to classify RPCs in a more fine-grained
+ way. This feature supports logical conjunction and
+ disjunction operations among different types.
+ In the rule:
+ "&amp;" represents the conditional conjunction and
+ "," represents the conditional disjunction.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start comp_rule opcode={ost_write}&amp;jobid={dd.0},\
+nid={192.168.1.[1-128]@tcp 0@lo} rate=100"</screen>
+ <para>In this example, RPCs whose <literal>opcode</literal> is
+ ost_write and whose <literal>jobid</literal> is dd.0, or whose
+ <literal>nid</literal> satisfies the condition
+ {192.168.1.[1-128]@tcp 0@lo}, will be processed at a rate of 100
+ req/sec.
+ The output of <literal>ost.OSS.ost_io.nrs_tbf_rule</literal> looks like this:
+ </para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0
+CPT 1:
+comp_rule opcode={ost_write}&amp;jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
+default * 10000, ref 0</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
+"start tbf_name uid={500}&amp;gid={500} rate=100"</screen>
+ <para>In this example, those RPC requests whose uid is 500 and
+ gid is 500 will be processed at the rate of 100 req/sec.</para>
+ </listitem>
+ </itemizedlist>
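+ <para>How the "&amp;" and "," operators combine can be sketched in Python.
+ This is an illustrative model only, not the Lustre implementation: the
+ names <literal>COMP_RULE</literal>, <literal>value_matches</literal> and
+ <literal>rule_matches</literal> are hypothetical, shell-style matching via
+ <literal>fnmatchcase</literal> is an assumption about wildcard behavior,
+ and the bracketed NID range from the example above is left out because
+ this sketch does not parse NID lists.</para>

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of a combined rule: a list of alternatives
# (separated by "," in the rule text). Each alternative is a dict of
# conditions that must all hold (joined by the conjunction operator).
COMP_RULE = [
    {"opcode": ["ost_write"], "jobid": ["dd.0"]},
    {"nid": ["0@lo"]},
]


def value_matches(patterns, value):
    """Return True if value matches any pattern in the list."""
    return any(fnmatchcase(str(value), pattern) for pattern in patterns)


def rule_matches(rule, rpc):
    """Return True if any alternative has all of its conditions satisfied."""
    return any(
        all(
            key in rpc and value_matches(patterns, rpc[key])
            for key, patterns in alternative.items()
        )
        for alternative in rule
    )
```

+ <para>In this model, an RPC described as
+ <literal>{"opcode": "ost_write", "jobid": "dd.0", "nid": "192.168.2.1@tcp"}</literal>
+ matches through the first alternative, while an
+ <literal>ost_read</literal> RPC matches only if its NID is
+ <literal>0@lo</literal>.</para>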
+ </section>
+ <section remap="h4">
+ <title>Change a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=
+"[reg|hp] change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable>"
+ </screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"reg change loginnode rate=200"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"hp change loginnode rate=200"
+</screen>
+ </section>
+ <section remap="h4">
+ <title>Stop a TBF rule</title>
+ <para>Command:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
+<replaceable>rule_name</replaceable>"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"</screen>
+ </section>
+ <section remap="h4">
+ <title>Rule options</title>
+ <para>To support more flexible rule conditions, the following options
+ are added.</para>
+ <itemizedlist>
+ <listitem>
+ <para><emphasis role="bold">Reordering of TBF rules</emphasis></para>
+ <para>By default, a newly started rule takes precedence over older
+ rules. Specifying the argument '<literal>rank=</literal>' when
+ inserting a new rule with the "<literal>start</literal>" command
+ changes the rank of the rule. The rank can also be changed with the
+ "<literal>change</literal>" command.
+ </para>
+ <para>Command:</para>
+ <screen>lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... rank=<replaceable>obj_rule_name</replaceable>"
+lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
+"change <replaceable>rule_name</replaceable> rate=<replaceable>rate</replaceable> rank=<replaceable>obj_rule_name</replaceable>"
+</screen>
+ <para>When the existing rule
+ '<replaceable>obj_rule_name</replaceable>' is specified, the rule
+ '<replaceable>rule_name</replaceable>' is placed immediately before
+ '<replaceable>obj_rule_name</replaceable>'.</para>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start computes nid={192.168.1.[2-128]@tcp} rate=500"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start user1 jobid={iozone.500 dd.500} rate=100"
+$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes"</screen>
+ <para>In this example, rule "iozone_user1" is inserted immediately
+ before rule "computes". The resulting order can be displayed with the
+ following command:
+ </para>
+ <screen>$ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
+ost.OSS.ost_io.nrs_tbf_rule=
+regular_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+high_priority_requests:
+CPT 0:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0
+CPT 1:
+user1 jobid={iozone.500 dd.500} 100, ref 0
+iozone_user1 opcode={ost_read ost_write} 200, ref 0
+computes nid={192.168.1.[2-128]@tcp} 500, ref 0
+default * 10000, ref 0</screen>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">TBF realtime policies under congestion
+ </emphasis></para>
+ <para>During TBF evaluation, it was found that when the sum of the I/O
+ bandwidth requirements of all classes exceeds the system capacity,
+ classes with the same rate limit get less bandwidth than if the
+ bandwidth were preconfigured evenly. The reason is that the heavy load
+ on a congested server results in missed deadlines for some
+ classes. The number of calculated tokens may then be larger than 1
+ during dequeuing. In the original implementation, all classes are
+ handled equally by simply discarding the excess tokens.</para>
+ <para>Thus, a Hard Token Compensation (HTC) strategy has been
+ implemented. A class can be configured with the HTC feature by the
+ rule it matches. This feature means that requests in this kind of
+ class queue have high real-time requirements and that the bandwidth
+ assignment must be satisfied as well as possible. When deadline
+ misses happen, the class keeps the deadline unchanged and the time
+ residue (the remainder of elapsed time divided by 1/r) is compensated
+ to the next round. This ensures that the next idle I/O thread will
+ always select this class to serve until all accumulated excess
+ tokens are handled or there are no pending requests in the class
+ queue.</para>
+ <para>Command:</para>
+ <para>A new command format is added to enable the realtime feature
+ for a rule:</para>
+ <screen>lctl set_param x.x.x.nrs_tbf_rule=\
+"start <replaceable>rule_name</replaceable> <replaceable>arguments</replaceable>... realtime=1"</screen>
+ <para>Example:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
+"start realjob jobid={dd.0} rate=100 realtime=1"</screen>
+ <para>This example rule means that RPC requests whose JobID is dd.0
+ will be processed at the rate of 100 req/sec in realtime.</para>
+ </listitem>
+ </itemizedlist>
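+ <para>The time residue described for HTC can be illustrated with a small
+ worked calculation in Python. This is a sketch of the arithmetic only,
+ under the assumption that one token is earned every 1/r seconds; the
+ function name <literal>tokens_and_residue</literal> and the use of
+ nanosecond integers are choices made for this example and are not taken
+ from the Lustre source.</para>

```python
NSEC_PER_SEC = 1_000_000_000


def tokens_and_residue(elapsed_ns, rate):
    """Split elapsed time into whole tokens and the leftover time residue.

    elapsed_ns: time elapsed since the class deadline, in nanoseconds.
    rate:       the rule's rate r, in requests per second.
    """
    interval_ns = NSEC_PER_SEC // rate  # time needed to earn one token (1/r)
    # Quotient: tokens accumulated; remainder: time carried to the next round.
    return divmod(elapsed_ns, interval_ns)
```

+ <para>With <literal>rate=100</literal>, one token corresponds to 10 ms,
+ so a class that is served 35 ms late has accumulated 3 tokens and a
+ residue of 5 ms.</para>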
+ </section>
+ </section>
+ <section xml:id="dbdoclet.delaytuning" condition='l2A'>
+ <title>
+ <indexterm>
+ <primary>tuning</primary>
+ <secondary>Network Request Scheduler (NRS) Tuning</secondary>
+ <tertiary>Delay policy</tertiary>
+ </indexterm>Delay policy</title>
+ <para>The NRS Delay policy seeks to perturb the timing of request
+ processing at the PtlRPC layer, with the goal of simulating high server
+ load, and finding and exposing timing related problems. When this policy
+ is active, upon arrival of a request the policy will calculate an offset,
+ within a defined, user-configurable range, from the request arrival
+ time, to determine a time after which the request should be handled.
+ The request is then stored using the cfs_binheap implementation,
+ which sorts requests according to their assigned start time.
+ Requests are removed from the binheap for handling once their start
+ time has passed.</para>
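+ <para>The queueing behavior described above can be modeled with a binary
+ heap in Python. This is a toy sketch, not the cfs_binheap code: the class
+ <literal>DelayQueue</literal>, its method names, and the
+ <literal>delay_min</literal>/<literal>delay_max</literal> arguments are
+ hypothetical, and drawing the offset uniformly from the range is an
+ assumption made for illustration.</para>

```python
import heapq
import itertools
import random


class DelayQueue:
    """Toy model of the Delay policy: hold each request until its start time."""

    def __init__(self, delay_min, delay_max, rng=None):
        self.delay_min = delay_min
        self.delay_max = delay_max
        self.rng = rng or random.Random()
        self.heap = []
        self.counter = itertools.count()  # tie-breaker for equal start times

    def enqueue(self, request, arrival):
        """Assign a start time at an offset from arrival and store the request."""
        offset = self.rng.uniform(self.delay_min, self.delay_max)
        heapq.heappush(self.heap, (arrival + offset, next(self.counter), request))

    def dequeue_ready(self, now):
        """Remove and return, in start-time order, requests whose time has come."""
        ready = []
        while self.heap and now >= self.heap[0][0]:
            ready.append(heapq.heappop(self.heap)[2])
        return ready
```

+ <para>Setting both bounds to the same value makes the model
+ deterministic, which is convenient for checking that a request arriving
+ at time 0 with a 5-second delay is released at time 5 and not before.</para>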
+ <para>The Delay policy can be enabled on all types of PtlRPC services,
+ and has the following tunables that can be used to adjust its behavior:
+ </para>
<itemizedlist>
<listitem>
<para>
- <literal>ost.OSS.ost_io.nrs_tbf_rule</literal>
+ <literal>{service}.nrs_delay_min</literal>
</para>
- <para>The format of the rule start command of TBF policy is as
- follows:</para>
- <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
- "[reg|hp] start
-<replaceable>rule_name</replaceable>
-<replaceable>arguments</replaceable>..."
-</screen>
- <para>The '
- <replaceable>rule_name</replaceable>' argument is a string which
- identifies a rule. The format of the '
- <replaceable>arguments</replaceable>' is changing according to the
- type of the TBF policy. For the NID based TBF policy, its format is
- as follows:</para>
- <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
- "[reg|hp] start
-<replaceable>rule_name</replaceable> {
-<replaceable>nidlist</replaceable>}
-<replaceable>rate</replaceable>"
-</screen>
- <para>The format of '
- <replaceable>nidlist</replaceable>' argument is the same as the
- format when configuring LNet route. The '
- <replaceable>rate</replaceable>' argument is the RPC rate of the
- rule, means the upper limit number of requests per second.</para>
- <para>Following commands are valid. Please note that a newly started
- rule is prior to old rules, so the order of starting rules is
- critical too.</para>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "start other_clients {192.168.*.*@tcp} 50"
-</screen>
+ <para>The
+ <literal>{service}.nrs_delay_min</literal> tunable controls the
+ minimum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 5 seconds. To read this value run:</para>
<screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "start loginnode {192.168.1.1@tcp} 100"
-</screen>
- <para>General rule can be replaced by two rules (reg and hp) as
- follows:</para>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "reg start loginnode {192.168.1.1@tcp} 100"
-</screen>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "hp start loginnode {192.168.1.1@tcp} 100"
-</screen>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "start computes {192.168.1.[2-128]@tcp} 500"
-</screen>
- <para>The above rules will put an upper limit for servers to process
- at most 5x as many RPCs from compute nodes as login nodes.</para>
- <para>For the JobID (please see
- <xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="dbdoclet.jobstats" />for more details) based TBF policy, its
- format is as follows:</para>
- <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
- "[reg|hp] start
-<replaceable>name</replaceable> {
-<replaceable>jobid_list</replaceable>}
-<replaceable>rate</replaceable>"
-</screen>
- <para>Following commands are valid:</para>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "start user1 {iozone.500 dd.500} 100"
-</screen>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "start iozone_user1 {iozone.500} 100"
-</screen>
- <para>Same as nid, could use reg and hp rules separately:</para>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "hp start iozone_user1 {iozone.500} 100"
-</screen>
+lctl get_param {service}.nrs_delay_min</screen>
+ <para>For example, to read the minimum delay set on the ost_io
+ service, run:</para>
<screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=
- "reg start iozone_user1 {iozone.500} 100"
-</screen>
- <para>The format of the rule change command of TBF policy is as
- follows:</para>
- <screen>
-$ lctl set_param x.x.x.nrs_tbf_rule=
- "[reg|hp] change
-<replaceable>rule_name</replaceable>
-<replaceable>rate</replaceable>"
-</screen>
- <para>Following commands are valid:</para>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change loginnode 200"
-</screen>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg change loginnode 200"
-</screen>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp change loginnode 200"
-</screen>
- <para>The format of the rule stop command of TBF policy is as
- follows:</para>
+$ lctl get_param ost.OSS.ost_io.nrs_delay_min
+ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5
+hp_delay_min:5</screen>
+ <para>To set the minimum delay in RPC processing, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>0-65535</replaceable></screen>
+ <para>This will set the minimum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the minimum delay time on the ost_io service
+ to 10, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
+ost.OSS.ost_io.nrs_delay_min=10</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different minimum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>
+lctl set_param {service}.nrs_delay_min=<replaceable>reg_delay_min|hp_delay_min</replaceable>:<replaceable>0-65535</replaceable>
+ </screen>
+ <para>For example, to set the minimum delay time on the ost_io service
+ for high-priority RPCs to 3, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3
+ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3</screen>
+ <para>Note, in all cases the minimum delay time cannot exceed the
+ maximum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_max</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_max</literal> tunable controls the
+ maximum amount of time, in seconds, that a request will be delayed by
+ this policy. The default is 300 seconds. To read this value run:
+ </para>
+ <screen>lctl get_param {service}.nrs_delay_max</screen>
+ <para>For example, to read the maximum delay set on the ost_io
+ service, run:</para>
<screen>
-$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop
-<replaceable>rule_name</replaceable>"
+$ lctl get_param ost.OSS.ost_io.nrs_delay_max
+ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300
+hp_delay_max:300</screen>
+ <para>To set the maximum delay in RPC processing, run:</para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>0-65535</replaceable>
</screen>
- <para>Following commands are valid:</para>
+ <para>This will set the maximum delay time on a given service, for both
+ regular and high-priority RPCs (if the PtlRPC service supports
+ high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the maximum delay time on the ost_io service
+ to 60, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
+ost.OSS.ost_io.nrs_delay_max=60</screen>
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different maximum delay time for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_max=<replaceable>reg_delay_max|hp_delay_max</replaceable>:<replaceable>0-65535</replaceable></screen>
+ <para>For example, to set the maximum delay time on the ost_io service
+ for high-priority RPCs to 30, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30
+ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30</screen>
+ <para>Note, in all cases the maximum delay time cannot be less than the
+ minimum delay time.</para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>{service}.nrs_delay_pct</literal>
+ </para>
+ <para>The
+ <literal>{service}.nrs_delay_pct</literal> tunable controls the
+ percentage of requests that will be delayed by this policy. The
+ default is 100. A request that is not selected for delay because of
+ this setting is handled by the fallback policy defined for that
+ service. If no other fallback policy is defined, the request is
+ handled by the FIFO policy. To read this value run:</para>
+ <screen>lctl get_param {service}.nrs_delay_pct</screen>
+ <para>For example, to read the percentage of requests being delayed on
+ the ost_io service, run:</para>
<screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode"
+$ lctl get_param ost.OSS.ost_io.nrs_delay_pct
+ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100
+hp_delay_pct:100</screen>
+ <para>To set the percentage of delayed requests, run:</para>
+ <screen>
+lctl set_param {service}.nrs_delay_pct=<replaceable>0-100</replaceable></screen>
+ <para>This will set the percentage of requests delayed on a given
+ service, for both regular and high-priority RPCs (if the PtlRPC service
+ supports high-priority RPCs), to the indicated value.</para>
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service to 50, run:</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50
+ost.OSS.ost_io.nrs_delay_pct=50
</screen>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode"
+ <para>For PtlRPC services that support high-priority RPCs, to set a
+ different delay percentage for regular and high-priority RPCs, run:
+ </para>
+ <screen>lctl set_param {service}.nrs_delay_pct=<replaceable>reg_delay_pct|hp_delay_pct</replaceable>:<replaceable>0-100</replaceable>
</screen>
- <screen>
-$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode"
+ <para>For example, to set the percentage of delayed requests on the
+ ost_io service for high-priority RPCs to 5, run:</para>
+ <screen>$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
+ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5
</screen>
</listitem>
</itemizedlist>
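+ <para>As a minimal sketch of how these tunables fit together, the
+ following sequence enables the Delay policy on the ost_io service and
+ then delays half of all requests by between 10 and 60 seconds. It
+ assumes that the policy is enabled through the
+ <literal>nrs_policies</literal> tunable under the name
+ <literal>delay</literal>, as for other NRS policies; the values are
+ illustrative only. Setting the maximum before the minimum keeps the
+ minimum no greater than the maximum at each step.</para>
+ <screen>
+$ lctl set_param ost.OSS.ost_io.nrs_policies="delay"
+$ lctl set_param ost.OSS.ost_io.nrs_delay_max=60
+$ lctl set_param ost.OSS.ost_io.nrs_delay_min=10
+$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50</screen>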
<secondary>lockless I/O</secondary>
</indexterm>Lockless I/O Tunables</title>
<para>The lockless I/O tunable feature allows servers to ask clients to do
- lockless I/O (liblustre-style where the server does the locking) on
- contended files.</para>
+ lockless I/O (the server does the locking on behalf of clients) for
+ contended files to avoid lock ping-pong.</para>
<para>The lockless I/O patch introduces these tunables:</para>
<itemizedlist>
<listitem>
<emphasis role="bold">OST-side:</emphasis>
</para>
<screen>
-/proc/fs/lustre/ldlm/namespaces/filter-lustre-*
+ldlm.namespaces.filter-<replaceable>fsname</replaceable>-*.
</screen>
<para>
<literal>contended_locks</literal>- If the number of lock conflicts in
contended state as set in the parameter.</para>
<para>
<literal>max_nolock_bytes</literal>- Server-side locking set only for
- requests less than the blocks set in the
- <literal>max_nolock_bytes</literal> parameter. If this tunable is set to
- zero (0), it disables server-side locking for read/write
+ requests less than the blocks set in the
+ <literal>max_nolock_bytes</literal> parameter. If this tunable is
+ set to zero (0), it disables server-side locking for read/write
requests.</para>
</listitem>
<listitem>
<para>
<emphasis role="bold">Client-side:</emphasis>
</para>
- <screen>
-/proc/fs/lustre/llite/lustre-*
-</screen>
+ <screen>llite.<replaceable>fsname</replaceable>-*</screen>
<para>
<literal>contention_seconds</literal>-
<literal>llite</literal> inode remembers its contended state for the
<emphasis role="bold">Client-side statistics:</emphasis>
</para>
<para>The
- <literal>/proc/fs/lustre/llite/lustre-*/stats</literal> file has new
- rows for lockless I/O statistics.</para>
+ <literal>llite.<replaceable>fsname</replaceable>-*.stats</literal>
+ parameter has several entries for lockless I/O statistics.</para>
<para>
<literal>lockless_read_bytes</literal> and
<literal>lockless_write_bytes</literal>- To count the total bytes read
Server-Side Advice and Hinting
</title>
<section><title>Overview</title>
- <para>Use the <literal>lfs ladvise</literal> command give file access
+ <para>Use the <literal>lfs ladvise</literal> command to give file access
advices or hints to servers.</para>
<screen>lfs ladvise [--advice|-a ADVICE ] [--background|-b]
[--start|-s START[kMGT]]
cache</para>
<para><literal>dontneed</literal> to cleanup data cache on
server</para>
+ <para><literal>lockahead</literal> to request an LDLM extent lock
+ of the given mode on the given byte range</para>
+ <para><literal>noexpand</literal> to disable extent lock expansion
+ behavior for I/O to this file descriptor</para>
</entry>
</row>
<row>
<literal>-e</literal> option.</para>
</entry>
</row>
+ <row>
+ <entry>
+ <para><literal>-m</literal>, <literal>--mode=</literal>
+ <literal>MODE</literal></para>
+ </entry>
+ <entry>
+ <para>Lockahead request mode <literal>{READ,WRITE}</literal>.
+ Request a lock with this mode.</para>
+ </entry>
+ </row>
</tbody>
</tgroup>
</informaltable>
random IO is a net benefit. Fetching that data into each client cache with
fadvise() may not be, due to much more data being sent to the client.
</para>
+ <para>
+ <literal>ladvise lockahead</literal> is different in that it attempts to
+ control LDLM locking behavior by explicitly requesting LDLM locks in
+ advance of use. This does not directly affect caching behavior;
+ instead, it is used in special cases to avoid pathological results
+ (lock exchange) from the normal LDLM locking behavior.
+ </para>
+ <para>
+ Note that the <literal>noexpand</literal> advice works on a specific
+ file descriptor, so using it via <literal>lfs</literal> has no effect.
+ To have any effect, it must be set on the file descriptor that is used
+ for I/O.
+ </para>
<para>The main difference between the Linux <literal>fadvise()</literal>
system call and <literal>lfs ladvise</literal> is that
<literal>fadvise()</literal> is only a client side mechanism that does
cache of the file in the memory.</para>
<screen>client1$ lfs ladvise -a dontneed -s 0 -e 1048576000 /mnt/lustre/file1
</screen>
+ <para>The following example requests an LDLM read lock on the first
+ 1 MiB of <literal>/mnt/lustre/file1</literal>. This will attempt to
+ request a lock from the OST holding that region of the file.</para>
+ <screen>client1$ lfs ladvise -a lockahead -m READ -s 0 -e 1M /mnt/lustre/file1
+ </screen>
+ <para>The following example requests an LDLM write lock on
+ [3 MiB, 10 MiB] of <literal>/mnt/lustre/file1</literal>. This will
+ attempt to request a lock from the OST holding that region of the
+ file.</para>
+ <screen>client1$ lfs ladvise -a lockahead -m WRITE -s 3M -e 10M /mnt/lustre/file1
+ </screen>
</section>
</section>
<section condition="l29">
<para>Beginning with Lustre 2.9, Lustre is extended to support RPCs up
to 16MB in size. By enabling a larger RPC size, fewer RPCs will be
required to transfer the same amount of data between clients and
- servers. With a larger RPC size, the OST can submit more data to the
+ servers. With a larger RPC size, the OSS can submit more data to the
underlying disks at once, therefore it can produce larger disk I/Os
to fully utilize the increasing bandwidth of disks.</para>
- <para>At client connecting time, clients will negotiate with
- servers for the RPC size it is going to use.</para>
- <para>A new parameter, <literal>brw_size</literal>, is introduced on
- the OST to tell the client the preferred IO size. All clients that
+ <para>At connection time, each client negotiates with the
+ servers the maximum RPC size that it may use, but the
+ client can always send RPCs smaller than this maximum.</para>
+ <para>The parameter <literal>brw_size</literal> is used on the OST
+ to tell the client the maximum (preferred) IO size. All clients that
talk to this target should never send an RPC greater than this size.
+ Clients can individually set a smaller RPC size limit via the
+ <literal>osc.*.max_pages_per_rpc</literal> tunable.
+ </para>
+ <note>
+ <para>The smallest <literal>brw_size</literal> that can be set for
+ ZFS OSTs is the <literal>recordsize</literal> of that dataset. This
+ ensures that the client can always write a full ZFS file block if it
+ has enough dirty data, and does not otherwise force it to do
+ read-modify-write operations for every RPC.
</para>
+ </note>
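+ <para>As an illustrative sketch, the <literal>recordsize</literal>
+ of a ZFS OST dataset and the current <literal>brw_size</literal> can
+ be compared on the OSS as follows. The dataset name
+ <replaceable>pool</replaceable>/<replaceable>ost0</replaceable> is a
+ placeholder, not a name taken from this manual.</para>
+ <screen>oss# zfs get recordsize <replaceable>pool</replaceable>/<replaceable>ost0</replaceable>
+oss# lctl get_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size</screen>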
</section>
<section><title>Usage</title>
<para>In order to enable a larger RPC size,
16MB. To temporarily change <literal>brw_size</literal>, the
following command should be run on the OSS:</para>
<screen>oss# lctl set_param obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
- <para>To persistently change <literal>brw_size</literal>, one of the following
- commands should be run on the OSS:</para>
+ <para>To persistently change <literal>brw_size</literal>, the
+ following command should be run:</para>
<screen>oss# lctl set_param -P obdfilter.<replaceable>fsname</replaceable>-OST*.brw_size=16</screen>
- <screen>oss# lctl conf_param <replaceable>fsname</replaceable>-OST*.obdfilter.brw_size=16</screen>
<para>When a client connects to an OST target, it will fetch
<literal>brw_size</literal> from the target and pick the maximum value
of <literal>brw_size</literal> and its local setting for
<screen>client$ lctl set_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
<para>To persistently make this change, the following command should
be run:</para>
- <screen>client$ lctl conf_param <replaceable>fsname</replaceable>-OST*.osc.max_pages_per_rpc=16M</screen>
+ <screen>client$ lctl set_param -P osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc=16M</screen>
<caution><para>The <literal>brw_size</literal> of an OST can be
changed on the fly. However, clients have to be remounted to
- renegotiate the new RPC size.</para></caution>
+ renegotiate the new maximum RPC size.</para></caution>
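+ <para>For example, assuming the client has been remounted after
+ the change, the RPC size limit in effect on the client can be checked
+ with:</para>
+ <screen>client$ lctl get_param osc.<replaceable>fsname</replaceable>-OST*.max_pages_per_rpc</screen>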
</section>
</section>
<section xml:id="dbdoclet.50438272_80545">
<indexterm>
<primary>tuning</primary>
<secondary>for small files</secondary>
- </indexterm>Improving Lustre File System Performance When Working with
- Small Files</title>
+ </indexterm>Improving Lustre I/O Performance for Small Files</title>
<para>An environment where an application writes small file chunks from
- many clients to a single file will result in bad I/O performance. To
+ many clients to a single file can result in poor I/O performance. To
improve the performance of the Lustre file system with small files:</para>
<itemizedlist>
<listitem>
<para>Have the application aggregate writes some amount before
submitting them to the Lustre file system. By default, the Lustre
software enforces POSIX coherency semantics, so it results in lock
- ping-pong between client nodes if they are all writing to the same file
- at one time.</para>
+ ping-pong between client nodes if they are all writing to the same
+ file at one time.</para>
+ <para>Using the MPI-IO collective write functionality in
+ the Lustre ADIO driver is a straightforward way to achieve this
+ if the application is already using MPI-IO.</para>
</listitem>
<listitem>
- <para>Have the application do 4kB
- <literal>O_DIRECT</literal> sized I/O to the file and disable locking on
- the output file. This avoids partial-page IO submissions and, by
+ <para>Have the application do 4kB
+ <literal>O_DIRECT</literal> sized I/O to the file and disable locking
+ on the output file. This avoids partial-page IO submissions and, by
disabling locking, you avoid contention between clients.</para>
</listitem>
<listitem>
client is more likely to become CPU-bound during reads than writes.</para>
</section>
</chapter>
+<!--
+ vim:expandtab:shiftwidth=2:tabstop=8:
+ -->