<?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="benchmarkingtests">
<title xml:id="benchmarkingtests.title">Benchmarking Lustre File System Performance (Lustre I/O
Kit)</title>
<para>This chapter describes the Lustre I/O kit, a collection of I/O
benchmarking tools for a Lustre cluster. It includes:</para>
<para><xref linkend="benchmark.iokit"/></para>
<para><xref linkend="benchmark.sgpdd-survey"/></para>
<para><xref linkend="benchmark.ost_perf"/></para>
<para><xref linkend="benchmark.ost_io"/></para>
<para><xref linkend="benchmark.mds_survey_ref"/></para>
<para><xref linkend="benchmark.stats-collect"/></para>
<section xml:id="benchmark.iokit">
<title><indexterm>
<primary>benchmarking</primary>
<secondary>with Lustre I/O Kit</secondary>
</indexterm>
<indexterm><primary>profiling</primary><see>benchmarking</see></indexterm>
<indexterm><primary>tuning</primary><see>benchmarking</see></indexterm>
<indexterm>
<primary>performance</primary><see>benchmarking</see>
</indexterm>
Using Lustre I/O Kit Tools</title>
<para>The tools in the Lustre I/O Kit are used to benchmark Lustre file
system hardware and validate that it is working as expected before you
install the Lustre software. The kit can also be used to validate the
performance of the various hardware and software layers in the cluster,
and to find and troubleshoot I/O issues.</para>
<para>Typically, performance is measured starting with single raw devices
and then proceeding to groups of devices. Once raw performance has been
established, other software layers are then added incrementally and tested.
</para>
<title>Contents of the Lustre I/O Kit</title>
<para>The I/O kit contains three tests, each of which tests a
progressively higher layer in the Lustre software stack:</para>
<para><literal>sgpdd-survey</literal> - Measures basic
'bare metal' performance of devices while bypassing the
kernel block device layers, buffer cache, and file system.</para>
<para><literal>obdfilter-survey</literal> - Measures the performance of
one or more OSTs directly on the OSS node, or alternately over the
network from a Lustre client.</para>
<para><literal>ost-survey</literal> - Performs I/O against each OST
individually to allow performance comparisons that detect whether an OST
is performing sub-optimally due to hardware issues.</para>
<para>Typically with these tests, a Lustre file system should deliver
85-90% of the raw device performance.</para>
<para>A utility <literal>stats-collect</literal> is also provided to
collect application profiling information from Lustre clients and servers.
See <xref linkend="benchmark.stats-collect"/> for more information.</para>
<title>Preparing to Use the Lustre I/O Kit</title>
<para>The following prerequisites must be met to use the tests in the
Lustre I/O kit:</para>
<para>Password-free remote access to nodes in the system (provided by
<literal>ssh</literal> or <literal>rsh</literal>).</para>
<para>LNet self-test completed to test that Lustre networking has been
properly installed and configured. See <xref linkend="lnetselftest"/>.
</para>
<para>Lustre file system software installed.</para>
<para><literal>sg3_utils</literal> package providing the
<literal>sgp_dd</literal> tool (<literal>sg3_utils</literal> is a
separate RPM package available online using YUM; an installation example
follows this list).</para>
<para>Download the Lustre I/O kit (<literal>lustre-iokit</literal>) from:
<link xl:href="http://downloads.whamcloud.com/">http://downloads.whamcloud.com/</link>
</para>
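<para>As an illustration, on an RPM-based distribution the
<literal>sg3_utils</literal> prerequisite can typically be installed with
YUM; package and repository names may vary by distribution:</para>
<screen># install the sg3_utils package that provides sgp_dd
$ yum install sg3_utils
# confirm that the sgp_dd tool is now available
$ which sgp_dd</screen>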
<section xml:id="benchmark.sgpdd-survey">
<title><indexterm>
<primary>benchmarking</primary>
<secondary>raw hardware with sgpdd-survey</secondary></indexterm>
Testing I/O Performance of Raw Hardware (<literal>sgpdd-survey</literal>)
</title>
<para>The <literal>sgpdd-survey</literal> tool is used to test bare metal
I/O performance of the raw hardware, while bypassing as much of the kernel
as possible. This survey may be used to characterize the performance of a
SCSI device by simulating an OST serving multiple stripe files. The data
gathered by this survey can help set expectations for the performance of a
Lustre OST using this device.</para>
<para>The script uses <literal>sgp_dd</literal> to carry out raw sequential
disk I/O. It runs with variable numbers of <literal>sgp_dd</literal> threads
to show how performance varies with different request queue depths.</para>
<para>The script spawns variable numbers of <literal>sgp_dd</literal>
instances, each reading or writing a separate area of the disk to
demonstrate performance variance within a number of concurrent stripe files.
</para>
<para>Several tips and insights for disk performance measurement are
described below. Some of this information is specific to RAID arrays and/or
the Linux RAID implementation.</para>
<para><emphasis>Performance is limited by the slowest disk.</emphasis>
</para>
<para>Before creating a RAID array, benchmark all disks individually.
We have frequently encountered situations where drive performance was
not consistent for all devices in the array. Replace any disks that are
significantly slower than the rest.</para>
<para>
<emphasis>Disks and arrays are very sensitive to request size.</emphasis>
</para>
<para>To identify the optimal request size for a given disk, benchmark
the disk with different record sizes ranging from 4 KB up to 1 or 2 MB.
</para>
<para>The <literal>sgpdd-survey</literal> script overwrites the device
being tested, which results in the <emphasis><emphasis role="bold">
LOSS OF ALL DATA</emphasis></emphasis> on that device. Exercise caution
when selecting the device to be tested.</para>
<para>Array performance with all LUNs loaded does not always match the
performance of a single LUN when tested in isolation.</para>
<para><emphasis role="bold">Prerequisites:</emphasis></para>
<para><literal>sgp_dd</literal> tool in the <literal>sg3_utils</literal>
package</para>
<para>Lustre software is <emphasis>NOT</emphasis> required</para>
<para>The device(s) being tested must meet one of these two requirements:
</para>
<para>If the device is a SCSI device, it must appear in the output of
<literal>sg_map</literal> (make sure the kernel module
<literal>sg</literal> is loaded).</para>
<para>If the device is a raw device, it must appear in the output of
<literal>raw -qa</literal>.</para>
<para>Raw and SCSI devices cannot be mixed in the test specification.</para>
<para>If you need to create raw devices to use the
<literal>sgpdd-survey</literal> tool, note that raw device 0 cannot be
used due to a bug in certain versions of the "raw" utility
(including the version shipped with Red Hat Enterprise Linux 4U4).</para>
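<para>As a minimal sketch, the two device checks described above might
look like the following; the device names shown are illustrative:</para>
<screen># ensure the SCSI generic module is loaded, then map sg devices
$ modprobe sg
$ sg_map
/dev/sg0  /dev/sda
/dev/sg1  /dev/sdb
# alternatively, query the existing raw device bindings
$ raw -qa</screen>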
<title><indexterm><primary>benchmarking</primary>
<secondary>tuning storage</secondary></indexterm>
Tuning Linux Storage Devices</title>
<para>To get large I/O transfers (1 MB) to disk, it may be necessary to
tune several kernel parameters as follows:</para>
<screen>/sys/block/sdN/queue/max_sectors_kb = 4096
/sys/block/sdN/queue/max_phys_segments = 256
/proc/scsi/sg/allow_dio = 1
/sys/module/ib_srp/parameters/srp_sg_tablesize = 255
/sys/block/sdN/queue/scheduler</screen>
<para>Recommended schedulers are <emphasis role="bold">deadline</emphasis>
and <emphasis role="bold">noop</emphasis>. The scheduler is set by default
to <emphasis role="bold">deadline</emphasis>, unless it
has already been set to <emphasis role="bold">noop</emphasis>.</para>
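<para>These parameters are normally set by writing to the corresponding
<literal>/sys</literal> and <literal>/proc</literal> files, as in the
sketch below; the device name <literal>sdb</literal> is an assumption,
and settings made this way do not persist across reboots:</para>
<screen># allow large requests and direct I/O through the sg driver
$ echo 4096 > /sys/block/sdb/queue/max_sectors_kb
$ echo 256 > /sys/block/sdb/queue/max_phys_segments
$ echo 1 > /proc/scsi/sg/allow_dio
# select the deadline I/O scheduler for this device
$ echo deadline > /sys/block/sdb/queue/scheduler</screen>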
<title>Running sgpdd-survey</title>
<para>The <literal>sgpdd-survey</literal> script must be customized for
the particular device being tested and for the location where the script
saves its working and result files (by specifying the
<literal>${rslt}</literal> variable). Customization variables are
described at the beginning of the script.</para>
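<para>A hypothetical invocation is sketched below. The variable names
(such as <literal>scsidevs</literal> and <literal>rslt</literal>) and
their limits should be confirmed against the header of the
<literal>sgpdd-survey</literal> script in your copy of the I/O kit, and
the device names are illustrative only:</para>
<screen># WARNING: this destroys all data on the listed devices
$ rslt=/tmp/sgpdd_survey scsidevs="/dev/sg0 /dev/sg1" ./sgpdd-survey</screen>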
<para>When the <literal>sgpdd-survey</literal> script runs, it creates a
number of working files and a pair of result files. The names of all the
files created start with the prefix defined in the variable
<literal>${rslt}</literal> (the default value is <literal>/tmp</literal>).
The files include:</para>
<para>File containing standard output data (same as
<literal>stdout</literal>)</para>
<screen><replaceable>rslt_date_time</replaceable>.summary</screen>
<para>Temporary (tmp) files</para>
<screen><replaceable>rslt_date_time</replaceable>_*</screen>
<para>Collected tmp files for post-mortem</para>
<screen><replaceable>rslt_date_time</replaceable>.detail</screen>
<para>The <literal>stdout</literal> and the <literal>.summary</literal>
file will contain lines like this:</para>
<screen>total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 \
= 180.50 MB/s</screen>
<para>Each line corresponds to a run of the test. Each test run will have
a different number of threads, record size, or number of regions.</para>
<para><literal>total_size</literal> - Size of file being tested in
KBs (8 GB in above example).</para>
<para><literal>rsz</literal> - Record size in KBs (1 MB in above
example).</para>
<para><literal>thr</literal> - Number of threads generating I/O (1
thread in above example).</para>
<para><literal>crg</literal> - Current regions, the number of
disjoint areas on the disk to which I/O is being sent (1 region in
above example, indicating that no seeking is done).</para>
<para><literal>MB/s</literal> - Aggregate bandwidth measured by
dividing the total amount of data by the elapsed time (180.45 MB/s in
the above example).</para>
<para><literal>MB/s</literal> - The remaining numbers show the number
of regions multiplied by the performance of the slowest disk, as a
sanity check on the aggregate bandwidth.</para>
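<para>In the sample line above, the sanity check reads
<literal>1 x 180.50 = 180.50 MB/s</literal>: one region multiplied by a
slowest-disk bandwidth of 180.50 MB/s. As a hypothetical contrast, a
four-region run whose slowest region sustained 45.00 MB/s would end with
<literal>4 x 45.00 = 180.00 MB/s</literal>; a sanity-check value well
below the measured aggregate bandwidth indicates that one region is much
slower than the others.</para>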
<para>If there are so many threads that the <literal>sgp_dd</literal>
script is unlikely to be able to allocate I/O buffers, then
<literal>ENOMEM</literal> is printed in place of the aggregate bandwidth
result.</para>
<para>If one or more <literal>sgp_dd</literal> instances do not
successfully report a bandwidth number, then <literal>FAILED</literal>
is printed in place of the aggregate bandwidth result.</para>
<section xml:id="benchmark.ost_perf">
<title><indexterm>
<primary>benchmarking</primary>
<secondary>OST performance</secondary>
</indexterm>Testing OST Performance (<literal>obdfilter-survey</literal>)
</title>
<para>The <literal>obdfilter-survey</literal> script generates sequential
I/O from varying numbers of threads and objects (files) to simulate the I/O
patterns of a Lustre client.</para>
<para>The <literal>obdfilter-survey</literal> script can be run directly on
the OSS node to measure the OST storage performance without any intervening
network, or it can be run remotely on a Lustre client to measure the OST
performance including network overhead.</para>
<para>The <literal>obdfilter-survey</literal> script is used to characterize
the performance of the following:</para>
<para><emphasis role="bold">Local file system</emphasis> - In this mode,
the <literal>obdfilter-survey</literal> script exercises one or more
instances of the obdfilter directly. The script may run on one or more
OSS nodes, for example, when the OSSs are all attached to the same
multi-ported disk subsystem.</para>
<para>Run the script using the <literal>case=disk</literal> parameter to
run the test against all the local OSTs. The script automatically
detects all local OSTs and includes them in the survey.</para>
<para>To run the test against only specific OSTs, run the script using
the <literal>targets=</literal> parameter to list the OSTs to be tested
explicitly. If some OSTs are on remote nodes, specify their hostnames in
addition to the OST name (for example,
<literal>oss2:lustre-OST0004</literal>).</para>
<para>All <literal>obdfilter</literal> instances are driven directly.
The script automatically loads the <literal>obdecho</literal> module (if
required) and creates one instance of <literal>echo_client</literal> for
each <literal>obdfilter</literal> instance in order to generate I/O
requests directly to the OST.</para>
<para>For more details, see <xref linkend="dbdoclet.50438212_59319"/>.
</para>
<para><emphasis role="bold">Network</emphasis> - In this mode, the
Lustre client generates I/O requests over the network but these requests
are not sent to the OST file system. The OSS node runs the obdecho
server to receive the requests but discards them before they are sent to
the disk.</para>
<para>Pass the parameters <literal>case=network</literal> and
<literal>targets=<replaceable>hostname|IP_of_server</replaceable>
</literal> to the script. For each network case, the script does the
required setup.</para>
<para>For more details, see <xref linkend="benchmark.network"/>.
</para>
<para><emphasis role="bold">Remote file system over the network
</emphasis> - In this mode the <literal>obdfilter-survey</literal>
script generates I/O from a Lustre client to a remote OSS to write the
data to the file system.</para>
<para>To run the test against all the local OSCs, pass the parameter
<literal>case=netdisk</literal> to the script. Alternately you can pass
the <literal>targets=</literal> parameter with one or more OSC devices
(for example,
<literal>lustre-OST0000-osc-ffff88007754bc00</literal>) against which
the tests are to be run.</para>
<para>For more details, see <xref linkend="benchmark.remote_disk"/>.
</para>
<para>The <literal>obdfilter-survey</literal> script is potentially
destructive and there is a small risk data may be lost. To reduce this
risk, <literal>obdfilter-survey</literal> should not be run on devices
that contain data that needs to be preserved. Thus, the best time to run
<literal>obdfilter-survey</literal> is before the Lustre file system is
put into production. <literal>obdfilter-survey</literal> may be safe to
run on a production file system because it creates objects with object
sequence 2, while normal file system objects are typically created with
object sequence 0.</para>
<para>If the <literal>obdfilter-survey</literal> test is terminated before
it completes, some small amount of space is leaked. You can either ignore
it or reformat the file system.</para>
<para>The <literal>obdfilter-survey</literal> script is
<emphasis>NOT</emphasis> scalable beyond tens of OSTs since it is only
intended to measure the I/O performance of individual storage subsystems,
not the scalability of the entire system.</para>
<para>The <literal>obdfilter-survey</literal> script must be customized,
depending on the components under test and where the script's working
files should be kept. Customization variables are described at the
beginning of the <literal>obdfilter-survey</literal> script. In
particular, pay attention to the maximum values listed for each
parameter in the script.</para>
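<para>As an illustrative sketch only, a customized run typically overrides
a few of these variables on the command line; the values below are
hypothetical and must stay within the limits noted in the script header:
</para>
<screen># write results under /tmp/obdfilter_results* and survey all local OSTs
$ rslt=/tmp/obdfilter_results nobjhi=4 thrhi=32 size=8192 \
case=disk sh obdfilter-survey</screen>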
<section xml:id="dbdoclet.50438212_59319">
<title><indexterm><primary>benchmarking</primary>
<secondary>local disk</secondary></indexterm>
Testing Local Disk Performance</title>
<para>The <literal>obdfilter-survey</literal> script can be run
automatically or manually against a local disk. This script profiles the
overall throughput of storage hardware, including the file system and RAID
layers managing the storage, by sending workloads to the OSTs that vary in
thread count, object count, and I/O size.</para>
<para>When the <literal>obdfilter-survey</literal> script is run, it
provides information about the performance abilities of the storage
hardware and shows the saturation points.</para>
<para>The <literal>plot-obdfilter</literal> script generates, from the
output of the <literal>obdfilter-survey</literal> script, a CSV file and
parameters for importing into a spreadsheet or gnuplot to visualize the
data.</para>
<para>To run the <literal>obdfilter-survey</literal> script, create a
standard Lustre file system configuration; no special setup is needed.
</para>
<para><emphasis role="bold">To perform an automatic run:</emphasis></para>
<para>Start the Lustre OSTs.</para>
<para>The Lustre OSTs should be mounted on the OSS node(s) to be
tested. The Lustre client is not required to be mounted at this time.
</para>
<para>Verify that the obdecho module is loaded. Run:</para>
<screen>modprobe obdecho</screen>
<para>Run the <literal>obdfilter-survey</literal> script with the
parameter <literal>case=disk</literal>.</para>
<para>For example, to run a local test with up to two objects
(nobjhi), up to two threads (thrhi), and 1024 MB transfer size (size):
</para>
<screen>$ nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey</screen>
<para>Performance measurements for write, rewrite, read, etc. are
provided below:</para>
<screen># example output
Fri Sep 25 11:14:03 EDT 2015 Obdfilter-survey for case=disk from hds1fnb6123
ost 10 sz 167772160K rsz 1024K obj 10 thr 10 write 10982.73 [ 601.97,2912.91] rewrite 15696.54 [1160.92,3450.85] read 12358.60 [ 938.96,2634.87]</screen>
<para>The file
<literal>./lustre-iokit/obdfilter-survey/README.obdfilter-survey</literal>
provides an explanation of the output as follows:</para>
<screen>ost 10 is the total number of OSTs under test.
sz 167772160K is the total amount of data read or written (in KB).
rsz 1024K is the record size (size of each echo_client I/O, in KB).
obj 10 is the total number of objects over all OSTs
thr 10 is the total number of threads over all OSTs and objects
write is the test name. If more tests have been specified they
all appear on the same line.
10982.73 is the aggregate bandwidth over all OSTs measured by
dividing the total number of MB by the elapsed time.
[601.97,2912.91] are the minimum and maximum instantaneous bandwidths seen on
any individual OST.
Note that although the numbers of threads and objects are specified per-OST
in the customization section of the script, results are reported aggregated
over all OSTs.</screen>
<para><emphasis role="bold">To perform a manual run:</emphasis></para>
<para>Start the Lustre OSTs.</para>
<para>The Lustre OSTs should be mounted on the OSS node(s) to be
tested. The Lustre client is not required to be mounted at this time.
</para>
<para>Verify that the <literal>obdecho</literal> module is loaded.
Run:</para>
<screen>modprobe obdecho</screen>
<para>Determine the OST names.</para>
<para>On the OSS nodes to be tested, run the
<literal>lctl dl</literal> command. The OST device names are listed in
the fourth column of the output. For example:</para>
<screen>$ lctl dl |grep obdfilter
0 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 1159
2 UP obdfilter lustre-OST0002 lustre-OST0002_UUID 1159</screen>
<para>List all OSTs you want to test.</para>
<para>Use the <literal>targets=</literal> parameter to list the OSTs
separated by spaces. List the individual OSTs by name using the format
<literal>
<replaceable>fsname</replaceable>-<replaceable>OSTnumber</replaceable>
</literal> (for example, <literal>lustre-OST0001</literal>). You do
not have to specify an MDS or LOV.</para>
<para>Run the <literal>obdfilter-survey</literal> script with the
<literal>targets=</literal> parameter.</para>
<para>For example, to run a local test with up to two objects
(<literal>nobjhi</literal>), up to two threads (
<literal>thrhi</literal>), and 1024 MB transfer size
(<literal>size</literal>):</para>
<screen>$ nobjhi=2 thrhi=2 size=1024 targets="lustre-OST0001 \
lustre-OST0002" sh obdfilter-survey</screen>
<section xml:id="benchmark.network">
<title><indexterm><primary>benchmarking</primary>
<secondary>network</secondary></indexterm>
Testing Network Performance</title>
<para>The <literal>obdfilter-survey</literal> script can only be run
automatically against a network; no manual test is provided.</para>
<para>To run the network test, a specific Lustre file system setup is
needed. Make sure that these configuration requirements have been met.
</para>
<para><emphasis role="bold">To perform an automatic run:</emphasis></para>
<para>Start the Lustre OSTs.</para>
<para>The Lustre OSTs should be mounted on the OSS node(s) to be
tested. The Lustre client is not required to be mounted at this time.
</para>
<para>Verify that the <literal>obdecho</literal> module is loaded.
Run:</para>
<screen>modprobe obdecho</screen>
<para>Start <literal>lctl</literal> and check the device list, which
must be empty. Run:</para>
<screen>lctl dl</screen>
<para>Run the <literal>obdfilter-survey</literal> script with the
parameters <literal>case=network</literal> and
<literal>targets=<replaceable>hostname|ip_of_server</replaceable>
</literal>. For example:</para>
<screen>$ nobjhi=2 thrhi=2 size=1024 targets="oss0 oss1" \
case=network sh obdfilter-survey</screen>
<para>On the server side, view the statistics at:</para>
<screen>lctl get_param obdecho.<replaceable>echo_srv</replaceable>.stats</screen>
<para>where <literal><replaceable>echo_srv</replaceable></literal>
is the <literal>obdecho</literal> server created by the script.</para>
<section xml:id="benchmark.remote_disk">
<title><indexterm><primary>benchmarking</primary>
<secondary>remote disk</secondary></indexterm>
Testing Remote Disk Performance</title>
<para>The <literal>obdfilter-survey</literal> script can be run
automatically or manually against a network disk. To run the network disk
test, start with a standard Lustre configuration. No special setup is
needed.</para>
<para><emphasis role="bold">To perform an automatic run:</emphasis></para>
<para>Start the Lustre OSTs.</para>
<para>The Lustre OSTs should be mounted on the OSS node(s) to be
tested. The Lustre client is not required to be mounted at this time.
</para>
<para>Verify that the <literal>obdecho</literal> module is loaded.
Run:</para>
<screen>modprobe obdecho</screen>
<para>Run the <literal>obdfilter-survey</literal> script with the
parameter <literal>case=netdisk</literal>. For example:</para>
<screen>$ nobjhi=2 thrhi=2 size=1024 case=netdisk sh obdfilter-survey
</screen>
<para><emphasis role="bold">To perform a manual run:</emphasis></para>
<para>Start the Lustre OSTs.</para>
<para>The Lustre OSTs should be mounted on the OSS node(s) to be
tested. The Lustre client is not required to be mounted at this time.
</para>
<para>Verify that the <literal>obdecho</literal> module is loaded.
Run:</para>
<screen>modprobe obdecho</screen>
<para>Determine the OSC names.</para>
<para>On the Lustre client to be tested, run the
<literal>lctl dl</literal> command. The OSC device names are listed in
the fourth column of the output. For example:</para>
<screen>$ lctl dl |grep osc
3 UP osc lustre-OST0000-osc-ffff88007754bc00 \
54b91eab-0ea9-1516-b571-5e6df349592e 5
4 UP osc lustre-OST0001-osc-ffff88007754bc00 \
54b91eab-0ea9-1516-b571-5e6df349592e 5</screen>
<para>List all OSCs you want to test.</para>
<para>Use the <literal>targets=</literal> parameter to list the OSCs
separated by spaces. List the individual OSCs by name separated by
spaces using the format <literal>
<replaceable>fsname</replaceable>-<replaceable>OST_name</replaceable>-osc-<replaceable>instance</replaceable>
</literal> (for example,
<literal>lustre-OST0000-osc-ffff88007754bc00</literal>). You
<emphasis>do not have to specify an MDS or LOV.</emphasis></para>
<para>Run the <literal>obdfilter-survey</literal> script with the
<literal>targets=<replaceable>osc</replaceable></literal> and
<literal>case=netdisk</literal> parameters.</para>
<para>An example of a local test run with up to two objects
(<literal>nobjhi</literal>), up to two threads
(<literal>thrhi</literal>), and 1024 MB transfer size
(<literal>size</literal>) is shown below:</para>
<screen>$ nobjhi=2 thrhi=2 size=1024 \
targets="lustre-OST0000-osc-ffff88007754bc00 \
lustre-OST0001-osc-ffff88007754bc00" sh obdfilter-survey</screen>
<title>Output Files</title>
<para>When the <literal>obdfilter-survey</literal> script runs, it creates
a number of working files and a pair of result files. All files start with
the prefix defined in the variable <literal>${rslt}</literal>.</para>
<informaltable frame="all">
<tgroup cols="2">
<colspec colname="c1" colwidth="50*"/>
<colspec colname="c2" colwidth="50*"/>
<thead>
<row>
<entry><para><emphasis role="bold">File</emphasis></para></entry>
<entry><para><emphasis role="bold">Description</emphasis></para></entry>
</row>
</thead>
<tbody>
<row>
<entry><para><literal>${rslt}.summary</literal></para></entry>
<entry><para>Same as stdout</para></entry>
</row>
<row>
<entry><para><literal>${rslt}.script_*</literal></para></entry>
<entry><para>Per-host test script files</para></entry>
</row>
<row>
<entry><para><literal>${rslt}.detail_tmp*</literal></para></entry>
<entry><para>Per-OST result files</para></entry>
</row>
<row>
<entry><para><literal>${rslt}.detail</literal></para></entry>
<entry><para>Collected result files for post-mortem</para></entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>The <literal>obdfilter-survey</literal> script iterates over the
given number of threads and objects performing the specified tests and
checks that all test processes have completed successfully.</para>
<para>The <literal>obdfilter-survey</literal> script may not clean up
properly if it is aborted or if it encounters an unrecoverable error. In
this case, a manual cleanup may be required, possibly including killing
any running instances of <literal>lctl</literal> (local or remote),
removing <literal>echo_client</literal> instances created by the script
and unloading <literal>obdecho</literal>.</para>
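<para>A minimal cleanup sketch is shown below; the device number is
hypothetical, so first identify any leftover
<literal>echo_client</literal> instances in the <literal>lctl dl</literal>
output:</para>
<screen># list devices to find leftover echo_client instances
$ lctl dl
# clean up and detach a leftover echo_client (device 5 in this sketch)
$ lctl --device 5 cleanup
$ lctl --device 5 detach
# unload the obdecho module once no users remain
$ rmmod obdecho</screen>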
<title>Script Output</title>
<para>The <literal>.summary</literal> file and <literal>stdout</literal>
of the <literal>obdfilter-survey</literal> script contain lines like:
</para>
<screen>ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]</screen>
<informaltable frame="all">
<tgroup cols="2">
<colspec colname="c1" colwidth="50*"/>
<colspec colname="c2" colwidth="50*"/>
<thead>
<row>
<entry><para><emphasis role="bold">Parameter and value</emphasis></para></entry>
<entry><para><emphasis role="bold">Description</emphasis></para></entry>
</row>
</thead>
<tbody>
<row>
<entry><para>ost 8</para></entry>
<entry><para>Total number of OSTs being tested.</para></entry>
</row>
<row>
<entry><para>sz 67108864K</para></entry>
<entry><para>Total amount of data read or written (in KB).</para></entry>
</row>
<row>
<entry><para>rsz 1024</para></entry>
<entry><para>Record size (size of each echo_client I/O, in KB).</para></entry>
</row>
<row>
<entry><para>obj 8</para></entry>
<entry><para>Total number of objects over all OSTs.</para></entry>
</row>
<row>
<entry><para>thr 8</para></entry>
<entry><para>Total number of threads over all OSTs and objects.</para></entry>
</row>
<row>
<entry><para>write</para></entry>
<entry><para>Test name. If more tests have been specified, they all
appear on the same line.</para></entry>
</row>
<row>
<entry><para>613.54</para></entry>
<entry><para>Aggregate bandwidth over all OSTs (measured by dividing
the total number of MB by the elapsed time).</para></entry>
</row>
<row>
<entry><para>[64.00, 82.00]</para></entry>
<entry><para>Minimum and maximum instantaneous bandwidths on an
individual OST.</para></entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>Although the numbers of threads and objects are specified
per-OST in the customization section of the script, the reported
results are aggregated over all OSTs.</para>
<title>Visualizing Results</title>
<para>It is useful to import the <literal>obdfilter-survey</literal>
script summary data (it is fixed width) into Excel (or any graphing
package) and graph the bandwidth versus the number of threads for
varying numbers of concurrent regions. This shows how the OSS performs
for a given number of concurrently-accessed objects (files) with varying
numbers of I/Os in flight.</para>
<para>It is also useful to monitor and record average disk I/O sizes
during each test using the 'disk io size' histogram reported by
<literal>lctl get_param obdfilter.*.brw_stats</literal>
(see <xref linkend="dbdoclet.50438271_55057"/> for details). These
numbers help identify problems in the system when full-sized I/Os are
not submitted to the underlying disk. This may be caused by problems in
the device driver or Linux block layer.</para>
<para>The <literal>plot-obdfilter</literal> script included in the I/O
kit is an example of processing output files to a .csv format and
plotting a graph using <literal>gnuplot</literal>.</para>
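<para>A hypothetical invocation is sketched below; confirm the exact
arguments against the comments at the top of the
<literal>plot-obdfilter</literal> script:</para>
<screen># post-process a survey summary into .csv data and a gnuplot graph
$ plot-obdfilter /tmp/obdfilter_results.summary</screen>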
<section xml:id="benchmark.ost_io">
<title><indexterm>
<primary>benchmarking</primary>
<secondary>OST I/O</secondary></indexterm>
Testing OST I/O Performance (<literal>ost-survey</literal>)</title>
<para>The <literal>ost-survey</literal> tool is a shell script that uses
<literal>lfs setstripe</literal> to perform I/O against a single OST. The
script writes a file (currently using <literal>dd</literal>) to each OST
in the Lustre file system, and compares read and write speeds. The
<literal>ost-survey</literal> tool is used to detect anomalies between
otherwise identical disk subsystems.</para>
<para>We have frequently discovered wide performance variations across
all LUNs in a cluster. This may be caused by faulty disks, RAID parity
reconstruction during the test, or faulty network hardware.</para>
<para>To run the <literal>ost-survey</literal> script, supply a file size
(in KB) and the Lustre file system mount point. For example, run:</para>
<screen>$ ./ost-survey.sh -s 10 /mnt/lustre</screen>
<para>Typical output is:</para>
<screen>Number of Active OST devices : 4
Worst Read OST indx: 2 speed: 2835.272725
Best Read OST indx: 3 speed: 2872.889668
Read Average: 2852.508999 +/- 16.444792 MB/s
Worst Write OST indx: 3 speed: 17.705545
Best Write OST indx: 2 speed: 128.172576
Write Average: 95.437735 +/- 45.518117 MB/s
Ost# Read(MB/s) Write(MB/s) Read-time Write-time
----------------------------------------------------
0 2837.440 126.918 0.035 0.788
1 2864.433 108.954 0.035 0.918
2 2835.273 128.173 0.035 0.780
3 2872.890 17.706 0.035 5.648</screen>
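<para>In this example, OST 3 writes at only 17.7 MB/s while the other
OSTs sustain roughly 109-128 MB/s; an outlier of this kind is exactly the
anomaly that <literal>ost-survey</literal> is intended to expose, and the
affected OST should be investigated for hardware problems.</para>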
<section xml:id="benchmark.mds_survey_ref">
<title><indexterm><primary>benchmarking</primary>
<secondary>MDS performance</secondary></indexterm>
Testing MDS Performance (<literal>mds-survey</literal>)</title>
<para><literal>mds-survey</literal> is available in Lustre software release
2.2 and beyond. The <literal>mds-survey</literal> script tests the local
metadata performance using the echo_client to drive different layers of the
MDS stack: mdd, mdt, osd (the Lustre software only supports the mdd stack).
It can be used with the following classes of operations:</para>
<para><literal>Open-create/mkdir/create</literal></para>
<para><literal>Lookup/getattr/setxattr</literal></para>
<para><literal>Delete/destroy</literal></para>
<para><literal>Unlink/rmdir</literal></para>
<para>These operations will be run by a variable number of concurrent
threads and will test with the number of directories specified by the user.
The run can be executed such that all threads operate in a single directory
(dir_count=1) or in private/unique directories (dir_count=x thrlo=x thrhi=x).
</para>
<para>The mdd instance is driven directly. The script automatically loads
the obdecho module if required and creates an instance of echo_client.</para>
<para>This script can also create OST objects by providing a stripe_count
greater than zero.</para>
<para><emphasis role="bold">To perform a run:</emphasis></para>
<para>Start the Lustre MDT.</para>
<para>The Lustre MDT should be mounted on the MDS node to be tested.
</para>
<para>Start the Lustre OSTs (optional, only required when testing with
OST objects).</para>
<para>The Lustre OSTs should be mounted on the OSS node(s).</para>
<para>Run the <literal>mds-survey</literal> script as explained below.
</para>
<para>The script must be customized according to the components under
test and where it should keep its working files. Customization
variables are described as follows (a combined example follows this
list):</para>
<para><literal>thrlo</literal> - minimum number of threads to test;
skipped if less than <literal>dir_count</literal></para>
<para><literal>thrhi</literal> - maximum number of threads to test
</para>
<para><literal>targets</literal> - MDT instance</para>
<para><literal>file_count</literal> - number of files per thread to
create</para>
<para><literal>dir_count</literal> - total number of directories
to test. Must be less than or equal to <literal>thrhi</literal>
</para>
<para><literal>stripe_count</literal> - number of stripes on OST
objects</para>
<para><literal>tests_str</literal> - test operations. Must have at
least "create" and "destroy"</para>
<para><literal>start_number</literal> - base number for each
thread to prevent name collisions</para>
<para><literal>layer</literal> - MDS stack's layer to be tested
</para>
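<para>As an illustrative sketch combining several of the variables above
(all values are hypothetical), four threads could each operate in a
private directory while also creating single-striped OST objects:</para>
<screen>$ thrlo=4 thrhi=4 dir_count=4 file_count=50000 stripe_count=1 \
tests_str="create lookup destroy" sh mds-survey</screen>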
<para>Run without OST objects creation:</para>
<para>Set up the Lustre MDS without OSTs mounted. Then invoke the
<literal>mds-survey</literal> script:</para>
<screen>$ thrhi=64 file_count=200000 sh mds-survey</screen>
<para>Run with OST objects creation:</para>
<para>Set up the Lustre MDS with at least one OST mounted. Then invoke
the <literal>mds-survey</literal> script with the
<literal>stripe_count</literal> parameter:</para>
<screen>$ thrhi=64 file_count=200000 stripe_count=2 sh mds-survey</screen>
<para>Note: a specific MDT instance can be specified using the
<literal>targets</literal> variable:</para>
<screen>$ targets=lustre-MDT0000 thrhi=64 file_count=200000 stripe_count=2 sh mds-survey</screen>
<title>Output Files</title>
<para>When the <literal>mds-survey</literal> script runs, it creates a
number of working files and a pair of result files. All files start with
the prefix defined in the variable <literal>${rslt}</literal>.</para>
<informaltable frame="all">
<tgroup cols="2">
<colspec colname="c1" colwidth="50*"/>
<colspec colname="c2" colwidth="50*"/>
<thead>
<row>
<entry><para><emphasis role="bold">File</emphasis></para></entry>
<entry><para><emphasis role="bold">Description</emphasis></para></entry>
</row>
</thead>
<tbody>
<row>
<entry><para><literal>${rslt}.summary</literal></para></entry>
<entry><para>Same as stdout</para></entry>
</row>
<row>
<entry><para><literal>${rslt}.script_*</literal></para></entry>
<entry><para>Per-host test script files</para></entry>
</row>
<row>
<entry><para><literal>${rslt}.detail_tmp*</literal></para></entry>
<entry><para>Per-MDT result files</para></entry>
</row>
<row>
<entry><para><literal>${rslt}.detail</literal></para></entry>
<entry><para>Collected result files for post-mortem</para></entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>The <literal>mds-survey</literal> script iterates over the given
number of threads performing the specified tests and checks that all test
processes have completed successfully.</para>
<para>The <literal>mds-survey</literal> script may not clean up properly
if it is aborted or if it encounters an unrecoverable error. In this
case, a manual cleanup may be required, possibly including killing any
running instances of <literal>lctl</literal>, removing
<literal>echo_client</literal> instances created by the script and
unloading <literal>obdecho</literal>.</para>
<title>Script Output</title>
<para>The <literal>.summary</literal> file and <literal>stdout</literal>
of the <literal>mds-survey</literal> script contain lines like:</para>
<screen>mdt 1 file 100000 dir 4 thr 4 create 5652.05 [ 999.01,46940.48] destroy 5797.79 [ 0.00,52951.55]</screen>
<informaltable frame="all">
<tgroup cols="2">
<colspec colname="c1" colwidth="50*"/>
<colspec colname="c2" colwidth="50*"/>
<thead>
<row>
<entry><para><emphasis role="bold">Parameter and value</emphasis></para></entry>
<entry><para><emphasis role="bold">Description</emphasis></para></entry>
</row>
</thead>
<tbody>
<row>
<entry><para>mdt 1</para></entry>
<entry><para>Total number of MDTs under test</para></entry>
</row>
<row>
<entry><para>file 100000</para></entry>
<entry><para>Total number of files per thread to operate on</para></entry>
</row>
<row>
<entry><para>dir 4</para></entry>
<entry><para>Total number of directories to operate on</para></entry>
</row>
<row>
<entry><para>thr 4</para></entry>
<entry><para>Total number of threads operating over all directories</para></entry>
</row>
<row>
<entry><para>create, destroy</para></entry>
<entry><para>Test names. More tests are displayed on the same line.</para></entry>
</row>
<row>
<entry><para>5652.05</para></entry>
<entry><para>Aggregate operation rate over all MDTs, measured by dividing
the total number of operations by the elapsed time.</para></entry>
</row>
<row>
<entry><para>[999.01,46940.48]</para></entry>
<entry><para>Minimum and maximum instantaneous operation rates seen on any
individual MDT</para></entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>If the script output contains "ERROR", this usually means there was
an issue during the run, such as running out of space on the MDT and/or
OST. More detailed debug information is available in the ${rslt}.detail
file.</para>
<section xml:id="benchmark.stats-collect">
<title><indexterm><primary>benchmarking</primary>
<secondary>application profiling</secondary></indexterm>
Collecting Application Profiling Information (
<literal>stats-collect</literal>)</title>
<para>The <literal>stats-collect</literal> utility contains the following
scripts used to collect application profiling information from Lustre
clients and servers:</para>
<para><literal>lstat.sh</literal> - Script for a single node that is
run on each profile node.</para>
<para><literal>gather_stats_everywhere.sh</literal> - Script that
collects statistics.</para>
<para><literal>config.sh</literal> - Script that contains customized
configuration descriptions.</para>
<para>The <literal>stats-collect</literal> utility requires:</para>
<para>Lustre software to be installed and set up on your cluster</para>
<para>SSH and SCP access to these nodes without requiring a password</para>
<section remap="h3">
<title>Using <literal>stats-collect</literal></title>
<para>The stats-collect utility is configured by including profiling
configuration variables in the <literal>config.sh</literal> script. Each
configuration variable takes the following form, where 0 indicates
statistics are to be collected only when the script starts and stops and
<emphasis>n</emphasis> indicates the interval in seconds at which
statistics are to be collected:</para>
<screen><replaceable>statistic</replaceable>_INTERVAL=<replaceable>0|n</replaceable></screen>
<para>Statistics that can be collected include (a sample
<literal>config.sh</literal> fragment follows this list):</para>
<para><literal>VMSTAT</literal> - Memory and CPU usage and aggregate
read/write operations</para>
<para><literal>SERVICE</literal> - Lustre OST and MDT RPC service
statistics</para>
<para><literal>BRW</literal> - OST bulk read/write statistics
(brw_stats)</para>
<para><literal>SDIO</literal> - SCSI disk IO statistics (sd_iostats)
</para>
<para><literal>MBALLOC</literal> - <literal>ldiskfs</literal> block
allocation statistics</para>
<para><literal>IO</literal> - Lustre target operations statistics
</para>
<para><literal>JBD</literal> - ldiskfs journal statistics</para>
<para><literal>CLIENT</literal> - Lustre OSC request statistics
</para>
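<para>A minimal <literal>config.sh</literal> fragment might look like the
following; the interval values are illustrative only:</para>
<screen># collect vmstat data every 2 seconds for the duration of the run
VMSTAT_INTERVAL=2
# collect service and bulk read/write statistics only at start and stop
SERVICE_INTERVAL=0
BRW_INTERVAL=0</screen>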
<para>To collect profile information:</para>
<para>Begin collecting statistics on each node specified in the
<literal>config.sh</literal> script.</para>
<para>Start the collect profile daemon on each node by entering:
</para>
<screen>sh gather_stats_everywhere.sh config.sh start</screen>
<para>Run the test.</para>
<para>Stop collecting statistics on each node, clean up the temporary
file, and create a profiling tarball.</para>
<screen>sh gather_stats_everywhere.sh config.sh stop <replaceable>log_name</replaceable>.tgz</screen>
<para>When <literal><replaceable>log_name</replaceable>.tgz</literal>
is specified, a profile tarball
<literal>/tmp/<replaceable>log_name</replaceable>.tgz</literal> is
created.</para>
<para>Analyze the collected statistics and create a csv tarball for
the specified profiling data.</para>
<screen>sh gather_stats_everywhere.sh config.sh analyse log_tarball.tgz csv</screen>