diff --git a/LustreTuning.xml b/LustreTuning.xml
index 7caf0e6..f51e325 100644
--- a/LustreTuning.xml
+++ b/LustreTuning.xml
@@ -1,7 +1,7 @@
+ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
+ xml:id="lustretuning">
 Tuning a Lustre File System
 This chapter contains information about tuning a Lustre file system for
 better performance.
@@ -149,8 +149,8 @@ lctl {get,set}_param {service}.thread_{min,max,started}
 service immediately and disables automatic thread creation behavior.
 
 
 - Lustre software release 2.3 introduced new
 - parameters to provide more control to administrators.
 + Parameters are available to provide administrators control
 + over the number of service threads.
 
 
 @@ -158,25 +158,18 @@ lctl {get,set}_param {service}.thread_{min,max,started}
 in providing the read page service. The read page service handles file
 close and readdir operations.
 
 
 - 
 - mds_attr_num_threads controls the number of threads
 - in providing the setattr service to clients running Lustre software
 - release 1.8.
 - 
 -
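 As an illustrative sketch only (the service name and the values are
 placeholders, not recommendations), the thread count of a given service can
 be inspected and capped at runtime; this assumes the ost_io service and the
 threads_max parameter name:
 oss# lctl get_param ost.OSS.ost_io.threads_max
 ost.OSS.ost_io.threads_max=512
 oss# lctl set_param ost.OSS.ost_io.threads_max=256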
+
<indexterm>
      <primary>tuning</primary>
      <secondary>MDS binding</secondary>
    </indexterm>Binding MDS Service Thread to CPU Partitions
    
 - With the introduction of Node Affinity (
 - ) in Lustre software release 2.3, MDS threads
 - can be bound to particular CPU partitions (CPTs) to improve CPU cache
 - usage and memory locality. Default values for CPT counts and CPU core
 + With the Node Affinity () feature,
 + MDS threads can be bound to particular CPU partitions (CPTs) to improve CPU
 + cache usage and memory locality. Default values for CPT counts and CPU core
 bindings are selected automatically to provide good overall performance for
 a given CPU count. However, an administrator can deviate from these
 settings if they choose. For details on specifying the mapping of CPU cores to
 @@ -202,14 +195,8 @@
 to CPT4.
 
 
 - 
 - mds_attr_num_cpts=[EXPRESSION] binds the setattr
 - service threads to CPTs defined by
 - EXPRESSION.
 - 
 
 - Parameters must be set before module load in the file
 + Parameters must be set before module load in the file
 /etc/modprobe.d/lustre.conf. For example:
 lustre.conf
options lnet networks=tcp0(eth0)
 @@ -268,17 +255,16 @@
 options ksocklnd enable_irq_affinity=0
 By default, this parameter is off. As always, you should test the
 performance to compare the impact of changing this parameter.
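 A minimal sketch of how the module options discussed in this section might
 be combined in /etc/modprobe.d/lustre.conf. The CPT expression and values
 are illustrative assumptions, not recommendations, and mds_num_cpts is
 assumed to accept a CPT expression as described above:
 options lnet networks=tcp0(eth0)
 options mdt mds_num_cpts=[0]
 options ksocklnd enable_irq_affinity=0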
-
+
<indexterm> <primary>tuning</primary> <secondary>Network interface binding</secondary> </indexterm>Binding Network Interface Against CPU Partitions - Lustre software release 2.3 and beyond provide enhanced network - interface control. The enhancement means that an administrator can bind - an interface to one or more CPU partitions. Bindings are specified as - options to the LNet modules. For more information on specifying module - options, see + Lustre allows enhanced network interface control. This means that + an administrator can bind an interface to one or more CPU partitions. + Bindings are specified as options to the LNet modules. For more + information on specifying module options, see For example, o2ib0(ib0)[0,1] will ensure that all messages for @@ -324,9 +310,9 @@ ksocklnd credits=256 ko2iblnd credits=256 - - In Lustre software release 2.3 and beyond, LNet may revalidate - the NI credits, so the administrator's request may not persist. + + LNet may revalidate the NI credits, so the administrator's + request may not persist.
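 A hedged sketch combining the interface-to-CPT binding syntax shown above
 with LND credit settings in /etc/modprobe.d/lustre.conf; the CPT list and
 the credit values are illustrative only:
 options lnet networks="o2ib0(ib0)[0,1]"
 options ko2iblnd credits=256 peer_credits=8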
@@ -375,10 +361,9 @@ ko2iblnd credits=256 lnet large_router_buffers=8192 - - In Lustre software release 2.3 and beyond, LNet may revalidate - the router buffer setting, so the administrator's request may not - persist. + + LNet may revalidate the router buffer setting, so the + administrator's request may not persist.
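 A sketch of how router buffer counts might be set as LNet module options on
 a router node; large_router_buffers appears above, while tiny_router_buffers
 and small_router_buffers are assumed companion parameters, and all values
 are illustrative:
 options lnet tiny_router_buffers=1024 small_router_buffers=8192 large_router_buffers=8192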
@@ -399,9 +384,8 @@ lnet large_router_buffers=8192 events across all CPTs. This may balance load better across the CPU but can incur a cross CPU overhead. The current policy can be changed by an administrator with - echo - value> - /proc/sys/lnet/portal_rotor. There are four options for + lctl set_param portal_rotor=value. + There are four options for value : @@ -480,7 +464,7 @@ lnet large_router_buffers=8192 interface. The default setting is 1. (For more information about the LNet routes parameter, see + linkend="lnet_module_routes" /> A router is considered down if any of its NIDs are down. For example, router X has three NIDs: Xnid1, @@ -525,16 +509,16 @@ lnet large_router_buffers=8192 be MAX.
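 As a hedged sketch, the router aliveness checking described above is
 typically tuned through LNet module options; the parameter names
 live_router_check_interval and dead_router_check_interval are assumptions
 not shown in the text above, and the values are illustrative:
 options lnet live_router_check_interval=60 dead_router_check_interval=60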
-
+
<indexterm> <primary>tuning</primary> <secondary>libcfs</secondary> </indexterm>libcfs Tuning - Lustre software release 2.3 introduced binding service threads via - CPU Partition Tables (CPTs). This allows the system administrator to - fine-tune on which CPU cores the Lustre service threads are run, for both - OSS and MDS services, as well as on the client. + Lustre allows binding service threads via CPU Partition Tables + (CPTs). This allows the system administrator to fine-tune on which CPU + cores the Lustre service threads are run, for both OSS and MDS services, + as well as on the client. CPTs are useful to reserve some cores on the OSS or MDS nodes for system functions such as system monitoring, HA heartbeat, or similar @@ -619,15 +603,437 @@ cpu_partition_table= nscheds parameter. This adjusts the number of threads for each partition, not the overall number of threads on the LND. - Lustre software release 2.3 has greatly decreased the default - number of threads for + The default number of threads for ko2iblnd and - ksocklnd on high-core count machines. The current - default values are automatically set and are chosen to work well across a - number of typical scenarios. + ksocklnd are automatically set and are chosen to + work well across a number of typical scenarios, for systems with both + high and low core counts. +
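 A minimal sketch of a CPT-related fragment for /etc/modprobe.d/lustre.conf;
 cpu_npartitions is an assumed libcfs option for defining the number of CPTs,
 and the nscheds values are illustrative, not recommendations:
 options libcfs cpu_npartitions=4
 options ksocklnd nscheds=3
 options ko2iblnd nscheds=3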
+ ko2iblnd Tuning + The following table outlines the ko2iblnd module parameters to be used + for tuning: + + + + + + + + + + Module Parameter + + + + + Default Value + + + + + Description + + + + + + + + + service + + + + + 987 + + + + Service number (within RDMA_PS_TCP). + + + + + + cksum + + + + + 0 + + + + Set non-zero to enable message (not RDMA) checksums. + + + + + + timeout + + + + + 50 + + + + Timeout in seconds. + + + + + + nscheds + + + + + 0 + + + + Number of threads in each scheduler pool (per CPT). Value of + zero means we derive the number from the number of cores. + + + + + + conns_per_peer + + + + + 4 (OmniPath), 1 (Everything else) + + + + Introduced in 2.10. Number of connections to each peer. Messages + are sent round-robin over the connection pool. Provides significant + improvement with OmniPath. + + + + + + ntx + + + + + 512 + + + + Number of message descriptors allocated for each pool at + startup. Grows at runtime. Shared by all CPTs. + + + + + + credits + + + + + 256 + + + + Number of concurrent sends on network. + + + + + + peer_credits + + + + + 8 + + + + Number of concurrent sends to 1 peer. Related/limited by IB + queue size. + + + + + + peer_credits_hiw + + + + + 0 + + + + When eagerly to return credits. + + + + + + peer_buffer_credits + + + + + 0 + + + + Number per-peer router buffer credits. + + + + + + peer_timeout + + + + + 180 + + + + Seconds without aliveness news to declare peer dead (less than + or equal to 0 to disable). + + + + + + ipif_name + + + + + ib0 + + + + IPoIB interface name. + + + + + + retry_count + + + + + 5 + + + + Retransmissions when no ACK received. + + + + + + rnr_retry_count + + + + + 6 + + + + RNR retransmissions. + + + + + + keepalive + + + + + 100 + + + + Idle time in seconds before sending a keepalive. + + + + + + ib_mtu + + + + + 0 + + + + IB MTU 256/512/1024/2048/4096. + + + + + + concurrent_sends + + + + + 0 + + + + Send work-queue sizing. If zero, derived from + map_on_demand and peer_credits. + + + + + + + map_on_demand + + + + + 0 (pre-4.8 Linux) 1 (4.8 Linux onward) 32 (OmniPath) + + + + Number of fragments reserved for connection. If zero, use + global memory region (found to be security issue). If non-zero, use + FMR or FastReg for memory registration. Value needs to agree between + both peers of connection. + + + + + + fmr_pool_size + + + + + 512 + + + + Size of fmr pool on each CPT (>= ntx / 4). Grows at runtime. + + + + + + + fmr_flush_trigger + + + + + 384 + + + + Number dirty FMRs that triggers pool flush. + + + + + + fmr_cache + + + + + 1 + + + + Non-zero to enable FMR caching. + + + + + + dev_failover + + + + + 0 + + + + HCA failover for bonding (0 OFF, 1 ON, other values reserved). + + + + + + + require_privileged_port + + + + + 0 + + + + Require privileged port when accepting connection. + + + + + + use_privileged_port + + + + + 1 + + + + Use privileged port when initiating connection. + + + + + + wrq_sge + + + + + 2 + + + + Introduced in 2.10. Number scatter/gather element groups per + work request. Used to deal with fragmentations which can consume + double the number of work requests. + + + + + +
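 As an illustration of how several of the parameters in this table might be
 combined, a hedged /etc/modprobe.d/lustre.conf sketch follows; the values
 are examples only and should be validated for the fabric in use:
 options ko2iblnd peer_credits=32 concurrent_sends=64 conns_per_peer=4 map_on_demand=32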
-
+
<indexterm> <primary>tuning</primary> @@ -676,6 +1082,18 @@ regular_requests: queued: 2420 active: 268 + - name: tbf + state: stopped + fallback: no + queued: 0 + active: 0 + + - name: delay + state: stopped + fallback: no + queued: 0 + active: 0 + high_priority_requests: - name: fifo state: started @@ -700,7 +1118,19 @@ high_priority_requests: fallback: no queued: 0 active: 0 - + + - name: tbf + state: stopped + fallback: no + queued: 0 + active: 0 + + - name: delay + state: stopped + fallback: no + queued: 0 + active: 0 + </screen> <para>NRS policy state is shown in either one or two sections, depending on the PTLRPC service being queried. The first section is named @@ -1152,8 +1582,8 @@ ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes <title>The internal structure of TBF policy - + The internal structure of TBF policy @@ -1181,116 +1611,449 @@ ost.OSS.ost_io.nrs_orr_supported=reg_supported:reads_and_writes knows its RPC token rate. A rule can be added to or removed from the list at run time. Whenever the list of rules is changed, the queues will update their matched rules. +
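 A hedged example of checking which NRS policies are active for the ost_io
 service before changing them, using the nrs_policies parameter shown above;
 reverting to the default fifo policy is shown purely as an illustration:
 $ lctl get_param ost.OSS.ost_io.nrs_policies
 $ lctl set_param ost.OSS.ost_io.nrs_policies="fifo"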
Enable TBF policy
          Command:
          lctl set_param ost.OSS.ost_io.nrs_policies="tbf <policy>"
          
          RPCs can be classified into different types according to their
          NID, JobID, opcode, or UID/GID. When enabling the TBF policy, you
          can specify one of these types, or just use "tbf" to enable all of
          them for fine-grained classification of RPC requests.
          Example:
          $ lctl set_param ost.OSS.ost_io.nrs_policies="tbf"
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
$ lctl set_param ost.OSS.ost_io.nrs_policies="tbf gid"
+
+ Start a TBF rule + The TBF rule is defined in the parameter + ost.OSS.ost_io.nrs_tbf_rule. + Command: + lctl set_param x.x.x.nrs_tbf_rule= +"[reg|hp] start rule_name arguments..." + + 'rule_name' is a string of the TBF + policy rule's name and 'arguments' is a + string to specify the detailed rule according to the different types. + + + Next, the different types of TBF policies will be described. + + NID based TBF policy + Command: + lctl set_param x.x.x.nrs_tbf_rule= +"[reg|hp] start rule_name nid={nidlist} rate=rate" + + 'nidlist' uses the same format + as configuring LNET route. 'rate' is + the (upper limit) RPC rate of the rule. + Example: + $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"start other_clients nid={192.168.*.*@tcp} rate=50" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"start computes nid={192.168.1.[2-128]@tcp} rate=500" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"start loginnode nid={192.168.1.1@tcp} rate=100" + In this example, the rate of processing RPC requests from + compute nodes is at most 5x as fast as those from login nodes. + The output of ost.OSS.ost_io.nrs_tbf_rule is + like: + lctl get_param ost.OSS.ost_io.nrs_tbf_rule +ost.OSS.ost_io.nrs_tbf_rule= +regular_requests: +CPT 0: +loginnode {192.168.1.1@tcp} 100, ref 0 +computes {192.168.1.[2-128]@tcp} 500, ref 0 +other_clients {192.168.*.*@tcp} 50, ref 0 +default {*} 10000, ref 0 +high_priority_requests: +CPT 0: +loginnode {192.168.1.1@tcp} 100, ref 0 +computes {192.168.1.[2-128]@tcp} 500, ref 0 +other_clients {192.168.*.*@tcp} 50, ref 0 +default {*} 10000, ref 0 + Also, the rule can be written in reg and + hp formats: + $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"reg start loginnode nid={192.168.1.1@tcp} rate=100" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"hp start loginnode nid={192.168.1.1@tcp} rate=100" + + + JobID based TBF policy + For the JobID, please see + for more details. + Command: + lctl set_param x.x.x.nrs_tbf_rule= +"[reg|hp] start rule_name jobid={jobid_list} rate=rate" + + Wildcard is supported in + {jobid_list}. 
Example:
            $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start iozone_user jobid={iozone.500} rate=100"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start dd_user jobid={dd.*} rate=50"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start user1 jobid={*.600} rate=10"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start user2 jobid={io*.10* *.500} rate=200"
            Also, the rule can be written in reg and
            hp formats:
            $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"hp start iozone_user1 jobid={iozone.500} rate=100"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"reg start iozone_user1 jobid={iozone.500} rate=100"
          
          
            Opcode based TBF policy
            Command:
            $ lctl set_param x.x.x.nrs_tbf_rule=
"[reg|hp] start rule_name opcode={opcode_list} rate=rate"
            
            Example:
            $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start user1 opcode={ost_read} rate=100"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start iozone_user1 opcode={ost_read ost_write} rate=200"
            Also, the rule can be written in reg and
            hp formats:
            $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"hp start iozone_user1 opcode={ost_read} rate=100"
$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"reg start iozone_user1 opcode={ost_read} rate=100"
          
          
            UID/GID based TBF policy
            Command:
            $ lctl set_param ost.OSS.*.nrs_tbf_rule=\
"[reg][hp] start rule_name uid={uid} rate=rate"
$ lctl set_param ost.OSS.*.nrs_tbf_rule=\
"[reg][hp] start rule_name gid={gid} rate=rate"
            Example:
            Limit the rate of RPC requests of the uid 500:
            $ lctl set_param ost.OSS.*.nrs_tbf_rule=\
"start tbf_name uid={500} rate=100"
            Limit the rate of RPC requests of the gid 500:
            $ lctl set_param ost.OSS.*.nrs_tbf_rule=\
"start tbf_name gid={500} rate=100"
            Also, you can use the following rules to control all requests
            to the MDS:
            Start the TBF uid QoS policy on the MDS:
            $ lctl set_param mds.MDS.*.nrs_policies="tbf uid"
            Limit the rate of RPC requests of the uid 500:
            $ lctl set_param mds.MDS.*.nrs_tbf_rule=\
"start tbf_name uid={500} rate=100"
          
          
            Policy combination
            To support TBF rules with complex expressions of conditions,
            the TBF classifier is extended to classify RPCs in a more
            fine-grained way. This feature supports logical conjunction and
            disjunction operations among the different types.
            In a rule,
            "&" represents conditional conjunction and
            "," represents conditional disjunction.
            Example:
            $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start comp_rule opcode={ost_write}&jobid={dd.0},\
nid={192.168.1.[1-128]@tcp 0@lo} rate=100"
            In this example, RPCs whose opcode is
            ost_write and JobID is dd.0, or
            whose NID matches
            {192.168.1.[1-128]@tcp 0@lo}, will be processed at a rate of 100
            req/sec. 
The output of ost.OSS.ost_io.nrs_tbf_rule looks like this:
            
            $ lctl get_param ost.OSS.ost_io.nrs_tbf_rule
ost.OSS.ost_io.nrs_tbf_rule=
regular_requests:
CPT 0:
comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
default * 10000, ref 0
CPT 1:
comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
default * 10000, ref 0
high_priority_requests:
CPT 0:
comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
default * 10000, ref 0
CPT 1:
comp_rule opcode={ost_write}&jobid={dd.0},nid={192.168.1.[1-128]@tcp 0@lo} 100, ref 0
default * 10000, ref 0
            Example:
            $ lctl set_param ost.OSS.*.nrs_tbf_rule=\
"start tbf_name uid={500}&gid={500} rate=100"
            In this example, RPC requests whose uid is 500 and
            gid is 500 will be processed at a rate of 100 req/sec.
+
+ Change a TBF rule + Command: + lctl set_param x.x.x.nrs_tbf_rule= +"[reg|hp] change rule_name rate=rate" + + Example: + $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"change loginnode rate=200" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"reg change loginnode rate=200" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"hp change loginnode rate=200" + +
+
+ Stop a TBF rule + Command: + lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop +rule_name" + Example: + $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode" +
+
+ Rule options + To support more flexible rule conditions, the following options + are added. + + + Reordering of TBF rules + By default, a newly started rule is prior to the old ones, + but by specifying the argument 'rank=' when + inserting a new rule with "start" command, + the rank of the rule can be changed. Also, it can be changed by + "change" command. + + Command: + lctl set_param ost.OSS.ost_io.nrs_tbf_rule= +"start rule_name arguments... rank=obj_rule_name" +lctl set_param ost.OSS.ost_io.nrs_tbf_rule= +"change rule_name rate=rate rank=obj_rule_name" + + By specifying the existing rule + 'obj_rule_name', the new rule + 'rule_name' will be moved to the front of + 'obj_rule_name'. + Example: + $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"start computes nid={192.168.1.[2-128]@tcp} rate=500" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"start user1 jobid={iozone.500 dd.500} rate=100" +$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\ +"start iozone_user1 opcode={ost_read ost_write} rate=200 rank=computes" + In this example, rule "iozone_user1" is added to the front of + rule "computes". We can see the order by the following command: + + $ lctl get_param ost.OSS.ost_io.nrs_tbf_rule +ost.OSS.ost_io.nrs_tbf_rule= +regular_requests: +CPT 0: +user1 jobid={iozone.500 dd.500} 100, ref 0 +iozone_user1 opcode={ost_read ost_write} 200, ref 0 +computes nid={192.168.1.[2-128]@tcp} 500, ref 0 +default * 10000, ref 0 +CPT 1: +user1 jobid={iozone.500 dd.500} 100, ref 0 +iozone_user1 opcode={ost_read ost_write} 200, ref 0 +computes nid={192.168.1.[2-128]@tcp} 500, ref 0 +default * 10000, ref 0 +high_priority_requests: +CPT 0: +user1 jobid={iozone.500 dd.500} 100, ref 0 +iozone_user1 opcode={ost_read ost_write} 200, ref 0 +computes nid={192.168.1.[2-128]@tcp} 500, ref 0 +default * 10000, ref 0 +CPT 1: +user1 jobid={iozone.500 dd.500} 100, ref 0 +iozone_user1 opcode={ost_read ost_write} 200, ref 0 +computes nid={192.168.1.[2-128]@tcp} 500, ref 0 +default * 10000, ref 0 + + + TBF realtime policies under congestion + + During TBF evaluation, we find that when the sum of I/O + bandwidth requirements for all classes exceeds the system capacity, + the classes with the same rate limits get less bandwidth than if + preconfigured evenly. The reason for this is the heavy load on a + congested server will result in some missed deadlines for some + classes. The number of the calculated tokens may be larger than 1 + during dequeuing. In the original implementation, all classes are + equally handled to simply discard exceeding tokens. + Thus, a Hard Token Compensation (HTC) strategy has been + implemented. A class can be configured with the HTC feature by the + rule it matches. This feature means that requests in this kind of + class queues have high real-time requirements and that the bandwidth + assignment must be satisfied as good as possible. When deadline + misses happen, the class keeps the deadline unchanged and the time + residue(the remainder of elapsed time divided by 1/r) is compensated + to the next round. This ensures that the next idle I/O thread will + always select this class to serve until all accumulated exceeding + tokens are handled or there are no pending requests in the class + queue. + Command: + A new command format is added to enable the realtime feature + for a rule: + lctl set_param x.x.x.nrs_tbf_rule=\ +"start rule_name arguments... 
realtime=1"
            Example:
            $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"start realjob jobid={dd.0} rate=100 realtime=1"
            This example rule means that RPC requests whose JobID is dd.0
            will be processed at a rate of 100 req/sec in realtime.
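 A hedged example combining the rank option with a rate change, following the
 change-command format given earlier; the rule names realjob and computes are
 taken from the examples above:
 $ lctl set_param ost.OSS.ost_io.nrs_tbf_rule=\
"change realjob rate=200 rank=computes"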
+
+
+ + <indexterm> + <primary>tuning</primary> + <secondary>Network Request Scheduler (NRS) Tuning</secondary> + <tertiary>Delay policy</tertiary> + </indexterm>Delay policy + The NRS Delay policy seeks to perturb the timing of request + processing at the PtlRPC layer, with the goal of simulating high server + load, and finding and exposing timing related problems. When this policy + is active, upon arrival of a request the policy will calculate an offset, + within a defined, user-configurable range, from the request arrival + time, to determine a time after which the request should be handled. + The request is then stored using the cfs_binheap implementation, + which sorts the request according to the assigned start time. + Requests are removed from the binheap for handling once their start + time has been passed. + The Delay policy can be enabled on all types of PtlRPC services, + and has the following tunables that can be used to adjust its behavior: + - ost.OSS.ost_io.nrs_tbf_rule + {service}.nrs_delay_min - The format of the rule start command of TBF policy is as - follows: - -$ lctl set_param x.x.x.nrs_tbf_rule= - "[reg|hp] start rule_name arguments..." - - The ' - rule_name' argument is a string which - identifies a rule. The format of the ' - arguments' is changing according to the - type of the TBF policy. For the NID based TBF policy, its format is - as follows: - -$ lctl set_param x.x.x.nrs_tbf_rule= - "[reg|hp] start rule_name {nidlist} rate" - - The format of ' - nidlist' argument is the same as the - format when configuring LNet route. The ' - rate' argument is the RPC rate of the - rule, means the upper limit number of requests per second. - Following commands are valid. Please note that a newly started - rule is prior to old rules, so the order of starting rules is - critical too. - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "start other_clients {192.168.*.*@tcp} 50" - + The + {service}.nrs_delay_min tunable controls the + minimum amount of time, in seconds, that a request will be delayed by + this policy. The default is 5 seconds. To read this value run: -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "start loginnode {192.168.1.1@tcp} 100" - - General rule can be replaced by two rules (reg and hp) as - follows: - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "reg start loginnode {192.168.1.1@tcp} 100" - - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "hp start loginnode {192.168.1.1@tcp} 100" - - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "start computes {192.168.1.[2-128]@tcp} 500" - - The above rules will put an upper limit for servers to process - at most 5x as many RPCs from compute nodes as login nodes. 
- For the JobID (please see - for more details) based TBF - policy, its format is as follows: - -$ lctl set_param x.x.x.nrs_tbf_rule= - "[reg|hp] start name {jobid_list} rate" - - Following commands are valid: - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "start user1 {iozone.500 dd.500} 100" - - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "start iozone_user1 {iozone.500} 100" - - Same as nid, could use reg and hp rules separately: - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "hp start iozone_user1 {iozone.500} 100" - +lctl get_param {service}.nrs_delay_min + For example, to read the minimum delay set on the ost_io + service, run: -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule= - "reg start iozone_user1 {iozone.500} 100" - - The format of the rule change command of TBF policy is as - follows: - -$ lctl set_param x.x.x.nrs_tbf_rule= - "[reg|hp] change rule_name rate" - - Following commands are valid: - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change loginnode 200" - - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg change loginnode 200" - - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp change loginnode 200" - - The format of the rule stop command of TBF policy is as - follows: +$ lctl get_param ost.OSS.ost_io.nrs_delay_min +ost.OSS.ost_io.nrs_delay_min=reg_delay_min:5 +hp_delay_min:5 + To set the minimum delay in RPC processing, run: + +lctl set_param {service}.nrs_delay_min=0-65535 + This will set the minimum delay time on a given service, for both + regular and high-priority RPCs (if the PtlRPC service supports + high-priority RPCs), to the indicated value. + For example, to set the minimum delay time on the ost_io service + to 10, run: + +$ lctl set_param ost.OSS.ost_io.nrs_delay_min=10 +ost.OSS.ost_io.nrs_delay_min=10 + For PtlRPC services that support high-priority RPCs, to set a + different minimum delay time for regular and high-priority RPCs, run: + + +lctl set_param {service}.nrs_delay_min=reg_delay_min|hp_delay_min:0-65535 + + For example, to set the minimum delay time on the ost_io service + for high-priority RPCs to 3, run: + +$ lctl set_param ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3 +ost.OSS.ost_io.nrs_delay_min=hp_delay_min:3 + Note, in all cases the minimum delay time cannot exceed the + maximum delay time. + + + + {service}.nrs_delay_max + + The + {service}.nrs_delay_max tunable controls the + maximum amount of time, in seconds, that a request will be delayed by + this policy. The default is 300 seconds. To read this value run: + + lctl get_param {service}.nrs_delay_max + For example, to read the maximum delay set on the ost_io + service, run: -$ lctl set_param x.x.x.nrs_tbf_rule="[reg|hp] stop -rule_name" +$ lctl get_param ost.OSS.ost_io.nrs_delay_max +ost.OSS.ost_io.nrs_delay_max=reg_delay_max:300 +hp_delay_max:300 + To set the maximum delay in RPC processing, run: + lctl set_param {service}.nrs_delay_max=0-65535 - Following commands are valid: + This will set the maximum delay time on a given service, for both + regular and high-priority RPCs (if the PtlRPC service supports + high-priority RPCs), to the indicated value. 
+ For example, to set the maximum delay time on the ost_io service + to 60, run: + +$ lctl set_param ost.OSS.ost_io.nrs_delay_max=60 +ost.OSS.ost_io.nrs_delay_max=60 + For PtlRPC services that support high-priority RPCs, to set a + different maximum delay time for regular and high-priority RPCs, run: + + lctl set_param {service}.nrs_delay_max=reg_delay_max|hp_delay_max:0-65535 + For example, to set the maximum delay time on the ost_io service + for high-priority RPCs to 30, run: + +$ lctl set_param ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30 +ost.OSS.ost_io.nrs_delay_max=hp_delay_max:30 + Note, in all cases the maximum delay time cannot be less than the + minimum delay time. + + + + {service}.nrs_delay_pct + + The + {service}.nrs_delay_pct tunable controls the + percentage of requests that will be delayed by this policy. The + default is 100. Note, when a request is not selected for handling by + the delay policy due to this variable then the request will be handled + by whatever fallback policy is defined for that service. If no other + fallback policy is defined then the request will be handled by the + FIFO policy. To read this value run: + lctl get_param {service}.nrs_delay_pct + For example, to read the percentage of requests being delayed on + the ost_io service, run: -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop loginnode" +$ lctl get_param ost.OSS.ost_io.nrs_delay_pct +ost.OSS.ost_io.nrs_delay_pct=reg_delay_pct:100 +hp_delay_pct:100 + To set the percentage of delayed requests, run: + +lctl set_param {service}.nrs_delay_pct=0-100 + This will set the percentage of requests delayed on a given + service, for both regular and high-priority RPCs (if the PtlRPC service + supports high-priority RPCs), to the indicated value. + For example, to set the percentage of delayed requests on the + ost_io service to 50, run: + +$ lctl set_param ost.OSS.ost_io.nrs_delay_pct=50 +ost.OSS.ost_io.nrs_delay_pct=50 - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="reg stop loginnode" + For PtlRPC services that support high-priority RPCs, to set a + different delay percentage for regular and high-priority RPCs, run: + + lctl set_param {service}.nrs_delay_pct=reg_delay_pct|hp_delay_pct:0-100 - -$ lctl set_param ost.OSS.ost_io.nrs_tbf_rule="hp stop loginnode" + For example, to set the percentage of delayed requests on the + ost_io service for high-priority RPCs to 5, run: + $ lctl set_param ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5 +ost.OSS.ost_io.nrs_delay_pct=hp_delay_pct:5 @@ -1332,9 +2095,7 @@ ldlm.namespaces.filter-fsname-*. Client-side: - -/proc/fs/lustre/llite/lustre-* - + llite.fsname-* contention_seconds- llite inode remembers its contended state for the @@ -1345,8 +2106,8 @@ ldlm.namespaces.filter-fsname-*. Client-side statistics: The - /proc/fs/lustre/llite/lustre-*/stats file has new - rows for lockless I/O statistics. + llite.fsname-*.stats + parameter has several entries for lockless I/O statistics. lockless_read_bytes and lockless_write_bytes- To count the total bytes read @@ -1367,7 +2128,7 @@ ldlm.namespaces.filter-fsname-*. Server-Side Advice and Hinting
Overview - Use the lfs ladvise command give file access + Use the lfs ladvise command to give file access advices or hints to servers. lfs ladvise [--advice|-a ADVICE ] [--background|-b] [--start|-s START[kMGT]] @@ -1402,6 +2163,10 @@ ldlm.namespaces.filter-fsname-*. cache dontneed to cleanup data cache on server + lockahead Request an LDLM extent lock + of the given mode on the given byte range + noexpand Disable extent lock expansion + behavior for I/O to this file descriptor @@ -1447,6 +2212,16 @@ ldlm.namespaces.filter-fsname-*. -e option. + + + -m, --mode= + MODE + + + Lockahead request mode {READ,WRITE}. + Request a lock with this mode. + + @@ -1463,6 +2238,18 @@ ldlm.namespaces.filter-fsname-*. random IO is a net benefit. Fetching that data into each client cache with fadvise() may not be, due to much more data being sent to the client. + + ladvise lockahead is different in that it attempts to + control LDLM locking behavior by explicitly requesting LDLM locks in + advance of use. This does not directly affect caching behavior, instead + it is used in special cases to avoid pathological results (lock exchange) + from the normal LDLM locking behavior. + + + Note that the noexpand advice works on a specific + file descriptor, so using it via lfs has no effect. It must be used + on a particular file descriptor which is used for i/o to have any effect. + The main difference between the Linux fadvise() system call and lfs ladvise is that fadvise() is only a client side mechanism that does @@ -1481,6 +2268,17 @@ ldlm.namespaces.filter-fsname-*. cache of the file in the memory. client1$ lfs ladvise -a dontneed -s 0 -e 1048576000 /mnt/lustre/file1 + The following example requests an LDLM read lock on the first + 1 MiB of /mnt/lustre/file1. This will attempt to + request a lock from the OST holding that region of the file. + client1$ lfs ladvise -a lockahead -m READ -s 0 -e 1M /mnt/lustre/file1 + + The following example requests an LDLM write lock on + [3 MiB, 10 MiB] of /mnt/lustre/file1. This will + attempt to request a lock from the OST holding that region of the + file. + client1$ lfs ladvise -a lockahead -m WRITE -s 3M -e 10M /mnt/lustre/file1 +
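 A further hedged example based on the synopsis above: the
 --background (-b) option submits the advice asynchronously, so the command
 returns before the server has finished acting on it. The file name and byte
 range are illustrative:
 client1$ lfs ladvise -a dontneed -b -s 0 -e 1M /mnt/lustre/file1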
@@ -1495,15 +2293,26 @@ ldlm.namespaces.filter-fsname-*. Beginning with Lustre 2.9, Lustre is extended to support RPCs up to 16MB in size. By enabling a larger RPC size, fewer RPCs will be required to transfer the same amount of data between clients and - servers. With a larger RPC size, the OST can submit more data to the + servers. With a larger RPC size, the OSS can submit more data to the underlying disks at once, therefore it can produce larger disk I/Os to fully utilize the increasing bandwidth of disks. - At client connecting time, clients will negotiate with - servers for the RPC size it is going to use. - A new parameter, brw_size, is introduced on - the OST to tell the client the preferred IO size. All clients that + At client connection time, clients will negotiate with + servers what the maximum RPC size it is possible to use, but the + client can always send RPCs smaller than this maximum. + The parameter brw_size is used on the OST + to tell the client the maximum (preferred) IO size. All clients that talk to this target should never send an RPC greater than this size. + Clients can individually set a smaller RPC size limit via the + osc.*.max_pages_per_rpc tunable. + + + The smallest brw_size that can be set for + ZFS OSTs is the recordsize of that dataset. This + ensures that the client can always write a full ZFS file block if it + has enough dirty data, and does not otherwise force it to do read- + modify-write operations for every RPC. +
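 Before changing anything, the current preferred IO size advertised by an OST
 can be inspected with the brw_size parameter named above; a minimal sketch,
 where fsname is a placeholder:
 oss# lctl get_param obdfilter.fsname-OST*.brw_size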
Usage In order to enable a larger RPC size, @@ -1511,10 +2320,9 @@ ldlm.namespaces.filter-fsname-*. 16MB. To temporarily change brw_size, the following command should be run on the OSS: oss# lctl set_param obdfilter.fsname-OST*.brw_size=16 - To persistently change brw_size, one of the following - commands should be run on the OSS: + To persistently change brw_size, the + following command should be run: oss# lctl set_param -P obdfilter.fsname-OST*.brw_size=16 - oss# lctl conf_param fsname-OST*.obdfilter.brw_size=16 When a client connects to an OST target, it will fetch brw_size from the target and pick the maximum value of brw_size and its local setting for @@ -1527,10 +2335,10 @@ ldlm.namespaces.filter-fsname-*. client$ lctl set_param osc.fsname-OST*.max_pages_per_rpc=16M To persistently make this change, the following command should be run: - client$ lctl conf_param fsname-OST*.osc.max_pages_per_rpc=16M + client$ lctl set_param -P obdfilter.fsname-OST*.osc.max_pages_per_rpc=16M The brw_size of an OST can be changed on the fly. However, clients have to be remounted to - renegotiate the new RPC size. + renegotiate the new maximum RPC size.
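 After remounting, a hedged way to confirm the negotiated limit on a client
 is to read max_pages_per_rpc, which reports the value in 4 KiB pages (4096
 pages corresponds to a 16MB RPC); the device instance shown is illustrative:
 client$ lctl get_param osc.fsname-OST*.max_pages_per_rpc
 osc.fsname-OST0000-osc-ffff8803c26bf000.max_pages_per_rpc=4096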
@@ -1600,3 +2408,6 @@ ldlm.namespaces.filter-fsname-*. client is more likely to become CPU-bound during reads than writes.
+