BenchmarkingTests.xml

   1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="benchmarkingtests">
   2   <title xml:id="benchmarkingtests.title">Benchmarking Lustre File System Performance (Lustre I/O
   3     Kit)</title>
   4   <para>This chapter describes the Lustre I/O kit, a collection of I/O benchmarking tools for a
   5     Lustre cluster, and PIOS, a parallel I/O simulator for Linux and
   6       Solaris<superscript>*</superscript> operating systems. It includes:</para>
   7   <itemizedlist>
   8     <listitem>
   9       <para><xref linkend="dbdoclet.50438212_44437"/></para>
  10     </listitem>
  11     <listitem>
  12       <para><xref linkend="dbdoclet.50438212_51053"/></para>
  13     </listitem>
  14     <listitem>
  15       <para><xref linkend="dbdoclet.50438212_26516"/></para>
  16     </listitem>
  17     <listitem>
  18       <para><xref linkend="dbdoclet.50438212_85136"/></para>
  19     </listitem>
  20     <listitem>
  21       <para><xref linkend="mds_survey_ref"/></para>
  22     </listitem>
  23     <listitem>
  24       <para><xref linkend="dbdoclet.50438212_58201"/></para>
  25     </listitem>
  26   </itemizedlist>
  27   <section xml:id="dbdoclet.50438212_44437">
  28       <title>
  29           <indexterm><primary>benchmarking</primary><secondary>with Lustre I/O Kit</secondary></indexterm>
  30           <indexterm><primary>profiling</primary><see>benchmarking</see></indexterm>
  31           <indexterm><primary>tuning</primary><see>benchmarking</see></indexterm>
  32           <indexterm><primary>performance</primary><see>benchmarking</see></indexterm>
  33
  34           Using Lustre I/O Kit Tools</title>
  35     <para>The tools in the Lustre I/O Kit are used to benchmark Lustre file system hardware and
  36       validate that it is working as expected before you install the Lustre software. They can also be
  37       used to validate the performance of the various hardware and software layers in the cluster
  38       and to find and troubleshoot I/O issues.</para>
  39     <para>Typically, performance is measured starting with single raw devices and then proceeding to groups of devices. Once raw performance has been established, other software layers are then added incrementally and tested.</para>
  40     <section remap="h3">
  41       <title>Contents of the Lustre I/O Kit</title>
  42       <para>The I/O kit contains three tests, each of which tests a progressively higher layer in
  43         the Lustre software stack:</para>
  44       <itemizedlist>
  45         <listitem>
  46           <para><literal>sgpdd-survey</literal> - Measure basic &apos;bare metal&apos; performance
  47             of devices while bypassing the kernel block device layers, buffer cache, and file
  48             system.</para>
  49         </listitem>
  50         <listitem>
  51           <para><literal>obdfilter-survey</literal> - Measure the performance of one or more OSTs
  52             directly on the OSS node or alternately over the network from a Lustre client.</para>
  53         </listitem>
  54         <listitem>
  55           <para><literal>ost-survey</literal> - Performs I/O against OSTs individually to allow
  56             performance comparisons to detect if an OST is performing suboptimally due to hardware
  57             issues.</para>
  58         </listitem>
  59       </itemizedlist>
  60       <para>Typically with these tests, a Lustre file system should deliver 85-90% of the raw device
  61         performance.</para>
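<para>As a rough illustration of the 85-90% guideline, the sketch below converts a measured raw device figure into the range a Lustre file system would be expected to deliver. The function name <literal>expected_range</literal> and the input value of 180.45 MB/s are assumptions chosen for this example only.</para>

```shell
# Estimate the throughput a Lustre file system is expected to deliver, taking
# 85-90% of a measured raw device figure (in MB/s) as stated above.
# expected_range is a hypothetical helper, not part of the I/O kit.
expected_range() {
    awk -v raw="$1" 'BEGIN { printf "%.1f-%.1f MB/s\n", raw * 0.85, raw * 0.90 }'
}

expected_range 180.45
# prints: 153.4-162.4 MB/s
```

<para>A result well below the lower bound suggests that one of the layers added on top of the raw device deserves a closer look.</para>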
  62       <para>A utility <literal>stats-collect</literal> is also provided to collect application profiling information from Lustre clients and servers. See <xref linkend="dbdoclet.50438212_58201"/> for more information.</para>
  63     </section>
  64     <section remap="h3">
  65       <title>Preparing to Use the Lustre I/O Kit</title>
  66       <para>The following prerequisites must be met to use the tests in the Lustre I/O kit:</para>
  67       <itemizedlist>
  68         <listitem>
  69           <para>Password-free remote access to nodes in the system (provided by <literal>ssh</literal> or <literal>rsh</literal>).</para>
  70         </listitem>
  71         <listitem>
  72           <para>LNET self-test completed to test that Lustre networking has been properly installed
  73             and configured. See <xref linkend="lnetselftest"/>.</para>
  74         </listitem>
  75         <listitem>
  76           <para>Lustre file system software installed.</para>
  77         </listitem>
  78         <listitem>
  79           <para><literal>sg3_utils</literal>  package providing the <literal>sgp_dd</literal> tool (<literal>sg3_utils</literal> is a separate RPM package available online using YUM).</para>
  80         </listitem>
  81       </itemizedlist>
  82       <para>Download the Lustre I/O kit (<literal>lustre-iokit</literal>) from:</para>
  83       <para><link xl:href="http://downloads.hpdd.intel.com/">http://downloads.hpdd.intel.com/</link></para>
  84     </section>
  85   </section>
  86   <section xml:id="dbdoclet.50438212_51053">
  87     <title><indexterm>
  88         <primary>benchmarking</primary>
  89         <secondary>raw hardware with sgpdd-survey</secondary>
  90       </indexterm>Testing I/O Performance of Raw Hardware (<literal>sgpdd-survey</literal>)</title>
  91     <para>The <literal>sgpdd-survey</literal> tool is used to test bare metal I/O performance of the
  92       raw hardware, while bypassing as much of the kernel as possible. This survey may be used to
  93       characterize the performance of a SCSI device by simulating an OST serving multiple stripe
  94       files. The data gathered by this survey can help set expectations for the performance of a
  95       Lustre OST using this device.</para>
  96     <para>The script uses <literal>sgp_dd</literal> to carry out raw sequential disk I/O. It runs with variable numbers of <literal>sgp_dd</literal> threads to show how performance varies with different request queue depths.</para>
  97     <para>The script spawns variable numbers of <literal>sgp_dd</literal> instances, each reading or writing a separate area of the disk to demonstrate performance variance within a number of concurrent stripe files.</para>
  98     <para>Several tips and insights for disk performance measurement are described below. Some of this information is specific to RAID arrays and/or the Linux RAID implementation.</para>
  99     <itemizedlist>
 100       <listitem>
 101         <para><emphasis>Performance is limited by the slowest disk.</emphasis></para>
 102         <para>Before creating a RAID array, benchmark all disks individually. We have frequently encountered situations where drive performance was not consistent for all devices in the array. Replace any disks that are significantly slower than the rest.</para>
 103       </listitem>
 104       <listitem>
 105         <para><emphasis>Disks and arrays are very sensitive to request size.</emphasis></para>
 106         <para>To identify the optimal request size for a given disk, benchmark the disk with different record sizes ranging from 4 KB up to 1 MB or 2 MB.</para>
 107       </listitem>
 108     </itemizedlist>
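<para>One way to pick the record sizes for such a sweep is sketched below. The doubling progression and the helper name <literal>record_sizes_kb</literal> are assumptions for illustration; the text only gives the range, from 4 KB to about 2 MB.</para>

```shell
# Print candidate record sizes in KB, doubling from 4 KB up to 2 MB (2048 KB).
# The doubling step is an assumption; any progression covering the range works.
record_sizes_kb() {
    rsz=4
    while [ "$rsz" -le 2048 ]; do
        printf '%s\n' "$rsz"
        rsz=$((rsz * 2))
    done
}

record_sizes_kb
```

<para>Each printed value can then be used as the record size for one benchmark run against the disk under test.</para>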
 109     <caution>
 110       <para>The <literal>sgpdd-survey</literal> script overwrites the device being tested, which
 111         results in the <emphasis>
 112           <emphasis role="bold">LOSS OF ALL DATA</emphasis>
 113         </emphasis> on that device. Exercise caution when selecting the device to be tested.</para>
 114     </caution>
 115     <note>
 116       <para>Array performance with all LUNs loaded does not always match the performance of a single LUN when tested in isolation.</para>
 117     </note>
 118     <para><emphasis role="bold">Prerequisites:</emphasis></para>
 119     <itemizedlist>
 120       <listitem>
 121         <para><literal>sgp_dd</literal>  tool in the <literal>sg3_utils</literal> package</para>
 122       </listitem>
 123       <listitem>
 124         <para>Lustre software is <emphasis>NOT</emphasis> required</para>
 125       </listitem>
 126     </itemizedlist>
 127     <para>The device(s) being tested must meet one of these two requirements:</para>
 128     <itemizedlist>
 129       <listitem>
 130         <para>If the device is a SCSI device, it must appear in the output of <literal>sg_map</literal> (make sure the kernel module <literal>sg</literal> is loaded).</para>
 131       </listitem>
 132       <listitem>
 133         <para>If the device is a raw device, it must appear in the output of <literal>raw -qa</literal>.</para>
 134       </listitem>
 135     </itemizedlist>
 136     <para>Raw and SCSI devices cannot be mixed in the test specification.</para>
 137     <note>
 138       <para>If you need to create raw devices to use the <literal>sgpdd-survey</literal> tool, note
 139         that raw device 0 cannot be used due to a bug in certain versions of the &quot;raw&quot;
 140         utility (including the version shipped with Red Hat Enterprise Linux 4U4).</para>
 141     </note>
 142     <section remap="h3">
 143       <title><indexterm><primary>benchmarking</primary><secondary>tuning storage</secondary></indexterm>Tuning Linux Storage Devices</title>
 144       <para>To get large I/O transfers (1 MB) to disk, it may be necessary to tune several kernel parameters as specified:</para>
 145       <screen>/sys/block/sdN/queue/max_sectors_kb = 4096
 146 /sys/block/sdN/queue/max_phys_segments = 256
 147 /proc/scsi/sg/allow_dio = 1
 148 /sys/module/ib_srp/parameters/srp_sg_tablesize = 255
 149 /sys/block/sdN/queue/scheduler</screen>
 150       <note>
 151         <para>Recommended schedulers are <emphasis role="bold">deadline</emphasis> and <emphasis
 152             role="bold">noop</emphasis>. The  scheduler is set by default to <emphasis role="bold"
 153             >deadline</emphasis>, unless it has already been set to <emphasis role="bold"
 154             >noop</emphasis>.</para>
 155       </note>
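<para>The following dry-run sketch prints the commands that would write the per-device values listed above, without changing anything. The device name <literal>sdb</literal> and the helper name <literal>print_tuning_cmds</literal> are placeholders. Whether each file exists and is writable depends on the kernel version, so treat this as an assumption to verify on the target system. The <literal>ib_srp</literal> module parameter is left out because module parameters are commonly set when the module is loaded.</para>

```shell
# Dry run: print the commands that would apply the settings listed above to
# one block device. Nothing is written to /sys or /proc by this function.
# "sdb" below is a placeholder device name.
print_tuning_cmds() {
    dev=$1
    sched=$2
    printf 'echo 4096 > /sys/block/%s/queue/max_sectors_kb\n' "$dev"
    printf 'echo 256 > /sys/block/%s/queue/max_phys_segments\n' "$dev"
    printf 'echo 1 > /proc/scsi/sg/allow_dio\n'
    printf 'echo %s > /sys/block/%s/queue/scheduler\n' "$sched" "$dev"
}

print_tuning_cmds sdb deadline
```

<para>Reviewing the printed commands before running them as root makes it easier to catch a wrong device name.</para>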
 156     </section>
 157     <section remap="h3">
 158       <title>Running sgpdd-survey</title>
 159       <para>The <literal>sgpdd-survey</literal> script must be customized for the particular device
 160         being tested and for the location where the script saves its working and result files (by
 161         specifying the <literal>${rslt}</literal> variable). Customization variables are described
 162         at the beginning of the script.</para>
 163       <para>When the <literal>sgpdd-survey</literal> script runs, it creates a number of working
 164         files and a pair of result files. The names of all the files created start with the prefix
 165         defined in the variable <literal>${rslt}</literal>. (The default value is
 166           <literal>/tmp</literal>.) The files include:</para>
 167       <itemizedlist>
 168         <listitem>
 169           <para>File containing standard output data (same as <literal>stdout</literal>)</para>
 170           <screen><replaceable>rslt_date_time</replaceable>.summary</screen>
 171         </listitem>
 172         <listitem>
 173           <para>Temporary (tmp) files</para>
 174           <screen><replaceable>rslt_date_time</replaceable>_*
 175 </screen>
 176         </listitem>
 177         <listitem>
 178           <para>Collected tmp files for post-mortem</para>
 179           <screen><replaceable>rslt_date_time</replaceable>.detail
 180 </screen>
 181         </listitem>
 182       </itemizedlist>
 183       <para>The <literal>stdout</literal> and the <literal>.summary</literal> file will contain lines like this:</para>
 184       <screen>total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 \
 185         = 180.50 MB/s
 186 </screen>
 187       <para>Each line corresponds to a run of the test. Each test run will have a different number of threads, record size, or number of regions.</para>
 188       <itemizedlist>
 189         <listitem>
 190           <para><literal>total_size</literal>  - Size of file being tested in KBs (8 GB in above example).</para>
 191         </listitem>
 192         <listitem>
 193           <para><literal>rsz</literal>  - Record size in KBs (1 MB in above example).</para>
 194         </listitem>
 195         <listitem>
 196           <para><literal>thr</literal>  - Number of threads generating I/O (1 thread in above example).</para>
 197         </listitem>
 198         <listitem>
 199           <para><literal>crg</literal> - Current regions, the number of disjoint areas on the disk to which I/O is being sent (1 region in above example, indicating that no seeking is done).</para>
 200         </listitem>
 201         <listitem>
 202           <para><literal>MB/s</literal>  - Aggregate bandwidth measured by dividing the total amount of data by the elapsed time (180.45 MB/s in the above example).</para>
 203         </listitem>
 204         <listitem>
 205           <para><literal>MB/s</literal>  - The remaining numbers show the number of regions X performance of the slowest disk as a sanity check on the aggregate bandwidth.</para>
 206         </listitem>
 207       </itemizedlist>
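<para>The field layout described above can be illustrated by splitting the sample line with <literal>awk</literal>. This is a minimal sketch: the field positions are inferred from the single sample line and are an assumption about the general format, and <literal>parse_sgpdd_line</literal> is a hypothetical helper name.</para>

```shell
# Split one sgpdd-survey summary line into labelled fields. Field positions
# are taken from the sample line above; they are an assumption, not a
# documented interface.
parse_sgpdd_line() {
    printf '%s\n' "$1" | awk '{
        sub(/K$/, "", $2)    # strip the trailing K from the total_size value
        printf "total_gb=%d rsz_kb=%s thr=%s crg=%s aggregate=%s\n",
            $2 / 1048576, $4, $6, $8, $9
    }'
}

parse_sgpdd_line "total_size 8388608K rsz 1024 thr 1 crg 1 180.45 MB/s 1 x 180.50 = 180.50 MB/s"
# prints: total_gb=8 rsz_kb=1024 thr=1 crg=1 aggregate=180.45
```

<para>The 8388608 KB total divides out to 8 GB, matching the description of <literal>total_size</literal> above.</para>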
 208       <para>If there are so many threads that the <literal>sgp_dd</literal> script is unlikely to be able to allocate I/O buffers, then <literal>ENOMEM</literal> is printed in place of the aggregate bandwidth result.</para>
 209       <para>If one or more <literal>sgp_dd</literal> instances do not successfully report a bandwidth number, then <literal>FAILED</literal> is printed in place of the aggregate bandwidth result.</para>
 210     </section>
 211   </section>
 212   <section xml:id="dbdoclet.50438212_26516">
 213     <title><indexterm>
 214         <primary>benchmarking</primary>
 215         <secondary>OST performance</secondary>
 216       </indexterm>Testing OST Performance (<literal>obdfilter-survey</literal>)</title>
 217     <para>The <literal>obdfilter-survey</literal> script generates sequential I/O from varying
 218       numbers of threads and objects (files) to simulate the I/O patterns of a Lustre client.</para>
 219     <para>The <literal>obdfilter-survey</literal> script can be run directly on the OSS node to
 220       measure the OST storage performance without any intervening network, or it can be run remotely
 221       on a Lustre client to measure the OST performance including network overhead.</para>
 222     <para>The <literal>obdfilter-survey</literal> script is used to characterize the performance of the
 223       following:</para>
 224     <itemizedlist>
 225       <listitem>
 226         <para><emphasis role="bold">Local file system</emphasis> - In this mode, the
 227             <literal>obdfilter-survey</literal> script exercises one or more instances of the
 228           obdfilter directly. The script may run on one or more OSS nodes, for example, when the
 229           OSSs are all attached to the same multi-ported disk subsystem.</para>
 230         <para>Run the script using the <literal>case=disk</literal> parameter to run the test against all the local OSTs. The script automatically detects all local OSTs and includes them in the survey.</para>
 231         <para>To run the test against only specific OSTs, run the script using the <literal>targets=</literal> parameter to list the OSTs to be tested explicitly. If some OSTs are on remote nodes, specify their hostnames in addition to the OST name (for example, <literal>oss2:lustre-OST0004</literal>).</para>
 232         <para>All <literal>obdfilter</literal> instances are driven directly. The script automatically loads the <literal>obdecho</literal> module (if required) and creates one instance of <literal>echo_client</literal> for each <literal>obdfilter</literal> instance in order to generate I/O requests directly to the OST.</para>
 233         <para>For more details, see <xref linkend="dbdoclet.50438212_59319"/>.</para>
 234       </listitem>
 235       <listitem>
 236         <para><emphasis role="bold">Network</emphasis>  - In this mode, the Lustre client generates I/O requests over the network but these requests are not sent to the OST file system. The OSS node runs the obdecho server to receive the requests but discards them before they are sent to the disk.</para>
 237         <para>Pass the parameters <literal>case=network</literal> and <literal>targets=<replaceable>hostname|IP_of_server</replaceable></literal> to the script. For each network case, the script does the required setup.</para>
 238         <para>For more details, see <xref linkend="dbdoclet.50438212_36037"/>.</para>
 239       </listitem>
 240       <listitem>
 241         <para><emphasis role="bold">Remote file system over the network</emphasis> - In this mode
 242           the <literal>obdfilter-survey</literal> script generates I/O from a Lustre client to a
 243           remote OSS to write the data to the file system.</para>
 244         <para>To run the test against all the local OSCs, pass the parameter <literal>case=netdisk</literal> to the script. Alternatively, you can pass the <literal>targets=</literal> parameter with one or more OSC devices (e.g., <literal>lustre-OST0000-osc-ffff88007754bc00</literal>) against which the tests are to be run.</para>
 245         <para>For more details, see <xref linkend="dbdoclet.50438212_62662"/>.</para>
 246       </listitem>
 247     </itemizedlist>
 248     <caution>
 249       <para>The <literal>obdfilter-survey</literal> script is potentially destructive and there is a
 250         small risk that data may be lost. To reduce this risk, <literal>obdfilter-survey</literal> should
 251         not be run on devices that contain data that needs to be preserved. Thus, the best time to
 252         run <literal>obdfilter-survey</literal> is before the Lustre file system is put into
 253         production. The reason <literal>obdfilter-survey</literal> may be safe to run on a
 254         production file system is that it creates objects with object sequence 2. Normal file
 255         system objects are typically created with object sequence 0.</para>
 256     </caution>
 257     <note>
 258       <para>If the <literal>obdfilter-survey</literal> test is terminated before it completes, some
 259         small amount of space is leaked. You can either ignore it or reformat the file
 260         system.</para>
 261     </note>
 262     <note>
 263       <para>The <literal>obdfilter-survey</literal> script is <emphasis>NOT</emphasis> scalable
 264         beyond tens of OSTs since it is only intended to measure the I/O performance of individual
 265         storage subsystems, not the scalability of the entire system.</para>
 266     </note>
 267     <note>
 268       <para>The <literal>obdfilter-survey</literal> script must be customized, depending on the
 269         components under test and where the script&apos;s working files should be kept.
 270         Customization variables are described at the beginning of the
 271           <literal>obdfilter-survey</literal> script. In particular, pay attention to the
 272         maximum values listed for each parameter in the script.</para>
 273     </note>
 274     <section xml:id="dbdoclet.50438212_59319">
 275       <title><indexterm><primary>benchmarking</primary><secondary>local disk</secondary></indexterm>Testing Local Disk Performance</title>
 276       <para>The <literal>obdfilter-survey</literal> script can be run automatically or manually
 277         against a local disk. This script profiles the overall throughput of storage hardware,
 278         including the file system and RAID layers managing the storage, by sending workloads to the
 279         OSTs that vary in thread count, object count, and I/O size.</para>
 280       <para>When the <literal>obdfilter-survey</literal> script is run, it provides information
 281         about the performance abilities of the storage hardware and shows the saturation
 282         points.</para>
 283       <para>The <literal>plot-obdfilter</literal> script takes the output of the
 284           <literal>obdfilter-survey</literal> script and generates a CSV file and parameters for importing into a
 285         spreadsheet or gnuplot to visualize the data.</para>
 286       <para>To run the <literal>obdfilter-survey</literal> script, create a standard Lustre file
 287         system configuration; no special setup is needed.</para>
 288       <para><emphasis role="bold">To perform an automatic run:</emphasis></para>
 289       <orderedlist>
 290         <listitem>
 291           <para>Start the Lustre OSTs.</para>
 292           <para>The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.</para>
 293         </listitem>
 294         <listitem>
 295           <para>Verify that the obdecho module is loaded. Run:</para>
 296           <screen>modprobe obdecho</screen>
 297         </listitem>
 298         <listitem>
 299           <para>Run the <literal>obdfilter-survey</literal> script with the parameter
 300               <literal>case=disk</literal>.</para>
 301           <para>For example, to run a local test with up to two objects (<literal>nobjhi</literal>), up to two threads (<literal>thrhi</literal>), and 1024 MB transfer size (<literal>size</literal>):</para>
 302           <screen>$ nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey</screen>
 303         </listitem>
 304       </orderedlist>
 305       <para><emphasis role="bold">To perform a manual run:</emphasis></para>
 306       <orderedlist>
 307         <listitem>
 308           <para>Start the Lustre OSTs.</para>
 309           <para>The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.</para>
 310         </listitem>
 311         <listitem>
 312           <para>Verify that the <literal>obdecho</literal> module is loaded. Run:</para>
 313           <screen>modprobe obdecho</screen>
 314         </listitem>
 315         <listitem>
 316           <para>Determine the OST names.</para>
 317           <para>On the OSS nodes to be tested, run the <literal>lctl dl</literal> command. The OST device names are listed in the fourth column of the output. For example:</para>
 318           <screen>$ lctl dl |grep obdfilter
 319 0 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 1159
 320 2 UP obdfilter lustre-OST0002 lustre-OST0002_UUID 1159
 321 ...</screen>
 322         </listitem>
 323         <listitem>
 324           <para>List all OSTs you want to test.</para>
 325           <para>Use the <literal>targets=</literal> parameter to list the OSTs separated by spaces. List the individual OSTs by name using the format
 326               <literal><replaceable>fsname</replaceable>-<replaceable>OSTnumber</replaceable></literal>
 327             (for example, <literal>lustre-OST0001</literal>). You do not have to specify an MDS or LOV.</para>
 328         </listitem>
 329         <listitem>
 330           <para>Run the <literal>obdfilter-survey</literal> script with the
 331               <literal>targets=</literal> parameter.</para>
 332           <para>For example, to run a local test with up to two objects (<literal>nobjhi</literal>), up to two threads (<literal>thrhi</literal>), and 1024 MB transfer size (<literal>size</literal>):</para>
 333           <screen>$ nobjhi=2 thrhi=2 size=1024 targets=&quot;lustre-OST0001 \
 334            lustre-OST0002&quot; sh obdfilter-survey</screen>
 335         </listitem>
 336       </orderedlist>
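<para>The OST names for a manual run can also be collected mechanically. The sketch below builds a space-separated value for <literal>targets=</literal> from <literal>lctl dl</literal> style output. It assumes the column layout shown in the sample listing (device type in the third column, device name in the fourth); <literal>ost_targets</literal> is a hypothetical helper name.</para>

```shell
# Build a targets= value from "lctl dl" style output read on stdin.
# Assumes the device type is in column 3 and the device name in column 4,
# as in the sample listing above.
ost_targets() {
    awk '$3 == "obdfilter" { names = names sep $4; sep = " " } END { print names }'
}

# Sample input copied from the listing above. On an OSS node the input
# would instead come from: lctl dl | ost_targets
printf '%s\n' \
    '0 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 1159' \
    '2 UP obdfilter lustre-OST0002 lustre-OST0002_UUID 1159' | ost_targets
# prints: lustre-OST0001 lustre-OST0002
```

<para>The resulting string can be quoted and passed as the <literal>targets=</literal> value when invoking the script.</para>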
 337     </section>
 338     <section xml:id="dbdoclet.50438212_36037">
 339       <title><indexterm><primary>benchmarking</primary><secondary>network</secondary></indexterm>Testing Network Performance</title>
 340       <para>The <literal>obdfilter-survey</literal> script can only be run automatically against a
 341         network; no manual test is provided.</para>
 342       <para>To run the network test, a specific Lustre file system setup is needed. Make sure that
 343         these configuration requirements have been met.</para>
 344       <para><emphasis role="bold">To perform an automatic run:</emphasis></para>
 345       <orderedlist>
 346         <listitem>
 347           <para>Start the Lustre OSTs.</para>
 348           <para>The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.</para>
 349         </listitem>
 350         <listitem>
 351           <para>Verify that the <literal>obdecho</literal> module is loaded. Run:</para>
 352           <screen>modprobe obdecho</screen>
 353         </listitem>
 354         <listitem>
 355           <para>Start <literal>lctl</literal> and check the device list, which must be empty. Run:</para>
 356           <screen>lctl dl</screen>
 357         </listitem>
 358         <listitem>
 359           <para>Run the <literal>obdfilter-survey</literal> script with the parameters
 360               <literal>case=network</literal> and
 361                 <literal>targets=<replaceable>hostname|ip_of_server</replaceable></literal>. For
 362             example:</para>
 363           <screen>$ nobjhi=2 thrhi=2 size=1024 targets=&quot;oss0 oss1&quot; \
 364            case=network sh obdfilter-survey</screen>
 365         </listitem>
 366         <listitem>
 367           <para>On the server side, view the statistics at:</para>
 368           <screen>/proc/fs/lustre/obdecho/<replaceable>echo_srv</replaceable>/stats</screen>
 369           <para>where <literal><replaceable>echo_srv</replaceable></literal>
 370             is the <literal>obdecho</literal> server created by the script.</para>
 371         </listitem>
 372       </orderedlist>
 373     </section>
 374     <section xml:id="dbdoclet.50438212_62662">
 375       <title><indexterm><primary>benchmarking</primary><secondary>remote disk</secondary></indexterm>Testing Remote Disk Performance</title>
 376       <para>The <literal>obdfilter-survey</literal> script can be run automatically or manually
 377         against a network disk. To run the network disk test, start with a standard Lustre
 378         configuration. No special setup is needed.</para>
 379       <para><emphasis role="bold">To perform an automatic run:</emphasis></para>
 380       <orderedlist>
 381         <listitem>
 382           <para>Start the Lustre OSTs.</para>
 383           <para>The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.</para>
 384         </listitem>
 385         <listitem>
 386           <para>Verify that the <literal>obdecho</literal> module is loaded. Run:</para>
 387           <screen>modprobe obdecho</screen>
 388         </listitem>
 389         <listitem>
 390           <para>Run the <literal>obdfilter-survey</literal> script with the parameter
 391               <literal>case=netdisk</literal>. For example:</para>
 392           <screen>$ nobjhi=2 thrhi=2 size=1024 case=netdisk sh obdfilter-survey
 393 </screen>
 394         </listitem>
 395       </orderedlist>
 396       <para><emphasis role="bold">To perform a manual run:</emphasis></para>
 397       <orderedlist>
 398         <listitem>
 399           <para>Start the Lustre OSTs.</para>
 400           <para>The Lustre OSTs should be mounted on the OSS node(s) to be tested. The Lustre client is not required to be mounted at this time.</para>
 401         </listitem>
 402         <listitem>
 403           <para>Verify that the <literal>obdecho</literal> module is loaded. Run:</para>
 404           <screen>modprobe obdecho</screen>
 405         </listitem>
 406         <listitem>
 407           <para>Determine the OSC names.</para>
 408           <para>On the OSS nodes to be tested, run the <literal>lctl dl</literal> command. The OSC device names are listed in the fourth column of the output. For example:</para>
 409           <screen>$ lctl dl |grep osc
 410 3 UP osc lustre-OST0000-osc-ffff88007754bc00 \
 411            54b91eab-0ea9-1516-b571-5e6df349592e 5
 412 4 UP osc lustre-OST0001-osc-ffff88007754bc00 \
 413            54b91eab-0ea9-1516-b571-5e6df349592e 5
 414 ...
 415 </screen>
 416         </listitem>
 417         <listitem>
 418           <para>List all OSCs you want to test.</para>
 419           <para>Use the <literal>targets=</literal> parameter to list the OSCs by name, separated by spaces, using the format <literal><replaceable>fsname</replaceable>-<replaceable>OST_name</replaceable>-osc-<replaceable>instance</replaceable></literal> (for example, <literal>lustre-OST0000-osc-ffff88007754bc00</literal>). You <emphasis>do not have to specify an MDS or LOV.</emphasis></para>
 420         </listitem>
 421         <listitem>
 422           <para>Run the <literal>obdfilter-survey</literal> script with the
 423                 <literal>targets=<replaceable>osc</replaceable></literal> and
 424               <literal>case=netdisk</literal>.</para>
 425           <para>An example of a remote disk test run with up to two objects (<literal>nobjhi</literal>), up to two threads (<literal>thrhi</literal>), and 1024 MB transfer size (<literal>size</literal>) is shown below:</para>
 426           <screen>$ nobjhi=2 thrhi=2 size=1024 case=netdisk \
 427            targets=&quot;lustre-OST0000-osc-ffff88007754bc00 \
 428            lustre-OST0001-osc-ffff88007754bc00&quot; sh obdfilter-survey
 429 </screen>
 430         </listitem>
 431       </orderedlist>
 432     </section>
 433     <section remap="h3">
 434       <title>Output Files</title>
 435       <para>When the <literal>obdfilter-survey</literal> script runs, it creates a number of working
 436         files and a pair of result files. All files start with the prefix defined in the variable
 437           <literal>${rslt}</literal>.</para>
 438       <informaltable frame="all">
 439         <tgroup cols="2">
 440           <colspec colname="c1" colwidth="50*"/>
 441           <colspec colname="c2" colwidth="50*"/>
 442           <thead>
 443             <row>
 444               <entry>
 445                 <para><emphasis role="bold">File</emphasis></para>
 446               </entry>
 447               <entry>
 448                 <para><emphasis role="bold">Description</emphasis></para>
 449               </entry>
 450             </row>
 451           </thead>
 452           <tbody>
 453             <row>
 454               <entry>
 455                 <para> <literal>${rslt}.summary</literal></para>
 456               </entry>
 457               <entry>
 458                 <para> Same as stdout</para>
 459               </entry>
 460             </row>
 461             <row>
 462               <entry>
 463                 <para> <literal>${rslt}.script_*</literal></para>
 464               </entry>
 465               <entry>
 466                 <para> Per-host test script files</para>
 467               </entry>
 468             </row>
 469             <row>
 470               <entry>
 471                 <para> <literal>${rslt}.detail_tmp*</literal></para>
 472               </entry>
 473               <entry>
 474                 <para> Per-OST result files</para>
 475               </entry>
 476             </row>
 477             <row>
 478               <entry>
 479                 <para> <literal>${rslt}.detail</literal></para>
 480               </entry>
 481               <entry>
 482                 <para> Collected result files for post-mortem</para>
 483               </entry>
 484             </row>
 485           </tbody>
 486         </tgroup>
 487       </informaltable>
 488       <para>The <literal>obdfilter-survey</literal> script iterates over the given number of threads
 489         and objects performing the specified tests and checks that all test processes have completed
 490         successfully.</para>
 491       <note>
 492         <para>The <literal>obdfilter-survey</literal> script may not clean up properly if it is
 493           aborted or if it encounters an unrecoverable error. In this case, a manual cleanup may be
 494           required, possibly including killing any running instances of <literal>lctl</literal>
 495           (local or remote), removing <literal>echo_client</literal> instances created by the script
 496           and unloading <literal>obdecho</literal>.</para>
 497       </note>
 498       <section remap="h4">
 499         <title>Script Output</title>
 500         <para>The <literal>.summary</literal> file and <literal>stdout</literal> of the
 501             <literal>obdfilter-survey</literal> script contain lines like:</para>
 502         <screen>ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]
 503 </screen>
 504         <para>Where:</para>
 505         <informaltable frame="all">
 506           <tgroup cols="2">
 507             <colspec colname="c1" colwidth="50*"/>
 508             <colspec colname="c2" colwidth="50*"/>
 509             <thead>
 510               <row>
 511                 <entry>
 512                   <para><emphasis role="bold">Parameter and value</emphasis></para>
 513                 </entry>
 514                 <entry>
 515                   <para><emphasis role="bold">Description</emphasis></para>
 516                 </entry>
 517               </row>
 518             </thead>
 519             <tbody>
 520               <row>
 521                 <entry>
 522                   <para> ost 8</para>
 523                 </entry>
 524                 <entry>
 525                   <para> Total number of OSTs being tested.</para>
 526                 </entry>
 527               </row>
 528               <row>
 529                 <entry>
 530                   <para> sz 67108864K</para>
 531                 </entry>
 532                 <entry>
 533                   <para> Total amount of data read or written (in KB).</para>
 534                 </entry>
 535               </row>
 536               <row>
 537                 <entry>
 538                   <para> rsz 1024</para>
 539                 </entry>
 540                 <entry>
 541                   <para> Record size (size of each echo_client I/O, in KB).</para>
 542                 </entry>
 543               </row>
 544               <row>
 545                 <entry>
 546                   <para> obj 8</para>
 547                 </entry>
 548                 <entry>
 549                   <para> Total number of objects over all OSTs.</para>
 550                 </entry>
 551               </row>
 552               <row>
 553                 <entry>
 554                   <para> thr 8</para>
 555                 </entry>
 556                 <entry>
 557                   <para> Total number of threads over all OSTs and objects.</para>
 558                 </entry>
 559               </row>
 560               <row>
 561                 <entry>
 562                   <para> write</para>
 563                 </entry>
 564                 <entry>
                  <para> Test name. If more than one test is specified, all results appear on the same line.</para>
 566                 </entry>
 567               </row>
 568               <row>
 569                 <entry>
 570                   <para> 613.54</para>
 571                 </entry>
 572                 <entry>
 573                   <para> Aggregate bandwidth over all OSTs (measured by dividing the total number of MB by the elapsed time).</para>
 574                 </entry>
 575               </row>
 576               <row>
 577                 <entry>
                  <para> [ 64.00, 82.00]</para>
 579                 </entry>
 580                 <entry>
 581                   <para> Minimum and maximum instantaneous bandwidths on an individual OST.</para>
 582                 </entry>
 583               </row>
 584             </tbody>
 585           </tgroup>
 586         </informaltable>
 587         <note>
 588           <para>Although the numbers of threads and objects are specified per-OST in the customization section of the script, the reported results are aggregated over all OSTs.</para>
 589         </note>
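        <para>Because the reported values are aggregates, it can help to convert a summary line back to per-OST figures. The following is a minimal sketch, not part of the I/O kit: the helper name <literal>per_ost_summary</literal> is hypothetical, and the field positions are assumed from the sample line above (a single test per line).</para>

```shell
# Hypothetical helper (not part of the Lustre I/O kit).
# Divides the aggregate values on one obdfilter-survey summary line
# by the number of OSTs. Field positions are assumed from the sample line.
per_ost_summary() {
    printf '%s\n' "$1" | awk '{
        # Turn "[ 64.00, 82.00]" into plain whitespace-separated numbers.
        gsub(/[][,]/, " ")
        osts = $2
        printf "obj_per_ost=%d thr_per_ost=%d %s_per_ost=%.2f\n",
               $8 / osts, $10 / osts, $11, $12 / osts
    }'
}

per_ost_summary "ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]"
```

        <para>For the sample line, the per-OST average write bandwidth (613.54 / 8) falls between the reported minimum and maximum, which is a quick consistency check on the run.</para>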
 590       </section>
 591       <section remap="h4">
 592         <title>Visualizing Results</title>
        <para>The <literal>obdfilter-survey</literal> script summary data is fixed width, so it
          can be imported into Excel (or any graphing package) to graph the bandwidth versus the
          number of threads for varying numbers of concurrent regions. This shows how the OSS
          performs for a given number of concurrently-accessed objects (files) with varying numbers
          of I/Os in flight.</para>
        <para>It is also useful to monitor and record average disk I/O sizes during each test using the &apos;disk io size&apos; histogram in the file <literal>/proc/fs/lustre/obdfilter/*/brw_stats</literal> (see <xref linkend="dbdoclet.50438271_55057"/> for details). These numbers help identify problems in the system when full-sized I/Os are not submitted to the underlying disk. This may be caused by problems in the device driver or Linux block layer.</para>
 600         <para>The <literal>plot-obdfilter</literal> script included in the I/O toolkit is an example of processing output files to a .csv format and plotting a graph using <literal>gnuplot</literal>.</para>
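        <para>As an illustration of this kind of post-processing, the sketch below converts summary lines to CSV with <literal>awk</literal>. It is not the <literal>plot-obdfilter</literal> script itself. The function name <literal>summary_to_csv</literal> and the column names are hypothetical, and the sketch assumes each test contributes a name, an aggregate value and a bracketed minimum/maximum pair, as in the sample line shown earlier.</para>

```shell
# Hypothetical converter (not the plot-obdfilter script).
# Reads obdfilter-survey summary lines from the named files or stdin and
# writes one CSV row per test found on each line.
summary_to_csv() {
    awk '
    BEGIN { print "ost,size_kb,rsz_kb,obj,thr,test,bandwidth,min,max" }
    $1 == "ost" {
        gsub(/[][,]/, " ")      # drop brackets and commas around min/max
        sub(/K$/, "", $4)       # "67108864K" becomes "67108864"
        # Each test is assumed to occupy four fields: name, aggregate, min, max.
        for (i = 11; NF >= i + 3; i += 4)
            printf "%s,%s,%s,%s,%s,%s,%s,%s,%s\n",
                   $2, $4, $6, $8, $10, $i, $(i+1), $(i+2), $(i+3)
    }' "$@"
}
```

        <para>The resulting file can be loaded into a spreadsheet or passed to <literal>gnuplot</literal>. Lines that do not begin with <literal>ost</literal> are ignored.</para>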
 601       </section>
 602     </section>
 603   </section>
 604   <section xml:id="dbdoclet.50438212_85136">
 605       <title><indexterm>
 606         <primary>benchmarking</primary>
 607         <secondary>OST I/O</secondary>
 608       </indexterm>Testing OST I/O Performance (<literal>ost-survey</literal>)</title>
 609     <para>The <literal>ost-survey</literal> tool is a shell script that uses <literal>lfs
 610         setstripe</literal> to perform I/O against a single OST. The script writes a file (currently
 611       using <literal>dd</literal>) to each OST in the Lustre file system, and compares read and
 612       write speeds. The <literal>ost-survey</literal> tool is used to detect anomalies between
 613       otherwise identical disk subsystems.</para>
 614     <note>
 615       <para>We have frequently discovered wide performance variations across all LUNs in a cluster.
 616         This may be caused by faulty disks, RAID parity reconstruction during the test, or faulty
 617         network hardware.</para>
 618     </note>
 619     <para>To run the <literal>ost-survey</literal> script, supply a file size (in KB) and the Lustre
 620       file system mount point. For example, run:</para>
 621     <screen>$ ./ost-survey.sh -s 10 /mnt/lustre
 622 </screen>
 623     <para>Typical output is:</para>
 624     <screen>
 625 Number of Active OST devices : 4
 626 Worst  Read OST indx: 2 speed: 2835.272725
 627 Best   Read OST indx: 3 speed: 2872.889668
 628 Read Average: 2852.508999 +/- 16.444792 MB/s
 629 Worst  Write OST indx: 3 speed: 17.705545
 630 Best   Write OST indx: 2 speed: 128.172576
 631 Write Average: 95.437735 +/- 45.518117 MB/s
 632 Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
 633 ----------------------------------------------------
 634 0     2837.440       126.918        0.035      0.788
 635 1     2864.433       108.954        0.035      0.918
 636 2     2835.273       128.173        0.035      0.780
 637 3     2872.890       17.706        0.035      5.648
 638 </screen>
 639   </section>
 640   <section xml:id="mds_survey_ref">
 641     <title><indexterm><primary>benchmarking</primary><secondary>MDS
 642 performance</secondary></indexterm>Testing MDS Performance (<literal>mds-survey</literal>)</title>
    <para><literal>mds-survey</literal> is available in Lustre software release 2.2 and beyond. The
      <literal>mds-survey</literal> script tests the local metadata performance using the
      echo_client to drive different layers of the MDS stack: mdd, mdt, osd (the Lustre software
      supports only the mdd layer). It can be used with the following classes of operations:</para>
 647
 648     <itemizedlist>
 649       <listitem>
 650         <para><literal>Open-create/mkdir/create</literal></para>
 651       </listitem>
 652       <listitem>
 653         <para><literal>Lookup/getattr/setxattr</literal></para>
 654       </listitem>
 655       <listitem>
 656         <para><literal>Delete/destroy</literal></para>
 657       </listitem>
 658       <listitem>
 659         <para><literal>Unlink/rmdir</literal></para>
 660       </listitem>
 661     </itemizedlist>
    <para>These operations are run by a variable number of concurrent threads and are tested with the number of directories specified by the user. The run can be executed such that all threads operate in a single directory (<literal>dir_count=1</literal>) or each thread operates in its own private directory (<literal>dir_count=x thrlo=x thrhi=x</literal>).</para>

    <para>The mdd instance is driven directly. The script automatically loads the <literal>obdecho</literal> module if required and creates an instance of <literal>echo_client</literal>.</para>

    <para>The script can also create OST objects if <literal>stripe_count</literal> is set to a value greater than zero.</para>
 667
 668     <para><emphasis role="bold">To perform a run:</emphasis></para>
 669       <orderedlist>
 670         <listitem>
 671           <para>Start the Lustre MDT.</para>
 672           <para>The Lustre MDT should be mounted on the MDS node to be tested.</para>
 673         </listitem>
 674         <listitem>
          <para>Start the Lustre OSTs (optional; only required when testing with OST objects).</para>
 676           <para>The Lustre OSTs should be mounted on the OSS node(s).</para>
 677         </listitem>
 678         <listitem>
          <para>Run the <literal>mds-survey</literal> script as explained below.</para>
          <para>The script must be customized according to the components under test and where it should keep its working files. The customization variables are as follows:</para>
 681           <itemizedlist>
 682             <listitem>
              <para><literal>thrlo</literal> - number of threads to start testing with. Skipped if less than
                <literal>dir_count</literal></para>
 685             </listitem>
 686             <listitem>
 687               <para><literal>thrhi</literal> - maximum number of threads to test</para>
 688             </listitem>
 689             <listitem>
 690               <para><literal>targets</literal> - MDT instance</para>
 691             </listitem>
 692             <listitem>
 693               <para><literal>file_count</literal> - number of files per thread to test</para>
 694             </listitem>
 695             <listitem>
 696               <para><literal>dir_count</literal> - total number of directories to test. Must be less
 697               than or equal to <literal>thrhi</literal></para>
 698             </listitem>
 699             <listitem>
              <para><literal>stripe_count</literal> - number of stripes on OST objects</para>
 701             </listitem>
 702             <listitem>
 703               <para><literal>tests_str</literal> - test operations. Must have at least "create" and
 704               "destroy"</para>
 705             </listitem>
 706             <listitem>
 707               <para><literal>start_number</literal> - base number for each thread to prevent name
 708               collisions</para>
 709             </listitem>
 710             <listitem>
              <para><literal>layer</literal> - layer of the MDS stack to be tested</para>
 712             </listitem>
 713           </itemizedlist>
          <para>Run without OST object creation:</para>
          <para>Set up the Lustre MDS without any OST mounted. Then invoke the <literal>mds-survey</literal> script:</para>
 716           <screen>$ thrhi=64 file_count=200000 sh mds-survey</screen>
          <para>Run with OST object creation:</para>
          <para>Set up the Lustre MDS with at least one OST mounted. Then invoke the
            <literal>mds-survey</literal> script with the <literal>stripe_count</literal>
          parameter:</para>
 721           <screen>$ thrhi=64 file_count=200000 stripe_count=2 sh mds-survey</screen>
          <para>Note: a specific MDT instance can be specified using the <literal>targets</literal> variable.</para>
 723           <screen>$ targets=lustre-MDT0000 thrhi=64 file_count=200000 stripe_count=2 sh mds-survey</screen>
 724         </listitem>
 725       </orderedlist>
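    <para>The constraints listed above (<literal>dir_count</literal> no greater than <literal>thrhi</literal>, and <literal>tests_str</literal> containing at least "create" and "destroy") can be checked before starting a long run. The following is a minimal sketch, not part of <literal>mds-survey</literal>: the function name <literal>check_mds_survey_vars</literal> is hypothetical, and <literal>tests_str</literal> is assumed to be a space-separated list.</para>

```shell
# Hypothetical pre-flight check (not part of mds-survey).
# Arguments: thrhi, dir_count, tests_str (assumed space-separated).
check_mds_survey_vars() {
    thrhi_arg=$1
    dir_count_arg=$2
    tests_arg=$3

    if [ "$dir_count_arg" -gt "$thrhi_arg" ]; then
        echo "dir_count ($dir_count_arg) must be less than or equal to thrhi ($thrhi_arg)"
        return 1
    fi
    for required in create destroy; do
        case " $tests_arg " in
            *" $required "*) ;;
            *) echo "tests_str must include $required"; return 1 ;;
        esac
    done
    echo "ok"
}

check_mds_survey_vars 64 4 "create lookup destroy"
```

    <para>If the check succeeds, the same values can then be passed to the script, for example as <literal>thrhi=64 dir_count=4 sh mds-survey</literal>.</para>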
 726     <section remap="h3">
 727       <title>Output Files</title>
 728       <para>When the <literal>mds-survey</literal> script runs, it creates a number of working files and a pair of result files. All files start with the prefix defined in the variable <literal>${rslt}</literal>.</para>
 729       <informaltable frame="all">
 730         <tgroup cols="2">
 731           <colspec colname="c1" colwidth="50*"/>
 732           <colspec colname="c2" colwidth="50*"/>
 733           <thead>
 734             <row>
 735               <entry>
 736                 <para><emphasis role="bold">File</emphasis></para>
 737               </entry>
 738               <entry>
 739                 <para><emphasis role="bold">Description</emphasis></para>
 740               </entry>
 741             </row>
 742           </thead>
 743           <tbody>
 744             <row>
 745               <entry>
 746                 <para> <literal>${rslt}.summary</literal></para>
 747               </entry>
 748               <entry>
 749                 <para> Same as stdout</para>
 750               </entry>
 751             </row>
 752             <row>
 753               <entry>
 754                 <para> <literal>${rslt}.script_*</literal></para>
 755               </entry>
 756               <entry>
 757                 <para> Per-host test script files</para>
 758               </entry>
 759             </row>
 760             <row>
 761               <entry>
 762                 <para> <literal>${rslt}.detail_tmp*</literal></para>
 763               </entry>
 764               <entry>
                <para> Per-MDT result files</para>
 766               </entry>
 767             </row>
 768             <row>
 769               <entry>
 770                 <para> <literal>${rslt}.detail</literal></para>
 771               </entry>
 772               <entry>
 773                 <para> Collected result files for post-mortem</para>
 774               </entry>
 775             </row>
 776           </tbody>
 777         </tgroup>
 778       </informaltable>
 779       <para>The <literal>mds-survey</literal> script iterates over the given number of threads performing the specified tests and checks that all test processes have completed successfully.</para>
 780       <note>
 781       <para>The <literal>mds-survey</literal> script may not clean up properly if it is aborted or if it encounters an unrecoverable error. In this case, a manual cleanup may be required, possibly including killing any running instances of <literal>lctl</literal>, removing <literal>echo_client</literal> instances created by the script and unloading <literal>obdecho</literal>.</para>
 782       </note>
 783     </section>
 784       <section remap="h4">
 785         <title>Script Output</title>
 786         <para>The <literal>.summary</literal> file and <literal>stdout</literal> of the <literal>mds-survey</literal> script contain lines like:</para>
 787         <screen>mdt 1 file 100000 dir 4 thr 4 create 5652.05 [ 999.01,46940.48] destroy 5797.79 [ 0.00,52951.55] </screen>
 788         <para>Where:</para>
 789         <informaltable frame="all">
 790           <tgroup cols="2">
 791             <colspec colname="c1" colwidth="50*"/>
 792             <colspec colname="c2" colwidth="50*"/>
 793             <thead>
 794               <row>
 795                 <entry>
 796                   <para><emphasis role="bold">Parameter and value</emphasis></para>
 797                 </entry>
 798                 <entry>
 799                   <para><emphasis role="bold">Description</emphasis></para>
 800                 </entry>
 801               </row>
 802             </thead>
 803             <tbody>
 804               <row>
 805                 <entry>
 806                   <para>mdt 1</para>
 807                 </entry>
 808                 <entry>
                  <para>Total number of MDTs under test</para>
 810                 </entry>
 811               </row>
 812               <row>
 813                 <entry>
 814                   <para>file 100000</para>
 815                 </entry>
 816                 <entry>
                  <para>Total number of files per thread to operate on</para>
 818                 </entry>
 819               </row>
 820               <row>
 821                 <entry>
 822                   <para>dir 4</para>
 823                 </entry>
 824                 <entry>
                  <para>Total number of directories to operate on</para>
 826                 </entry>
 827               </row>
 828               <row>
 829                 <entry>
 830                   <para>thr 4</para>
 831                 </entry>
 832                 <entry>
                  <para>Total number of threads operating over all directories</para>
 834                 </entry>
 835               </row>
 836               <row>
 837                 <entry>
 838                   <para>create, destroy</para>
 839                 </entry>
 840                 <entry>
                  <para>Test names. If more than one test is specified, all results appear on the same line.</para>
 842                 </entry>
 843               </row>
 844               <row>
 845                 <entry>
                  <para>5652.05</para>
 847                 </entry>
 848                 <entry>
                  <para>Aggregate operation rate over the MDT, measured by dividing the total number of operations by the elapsed time.</para>
 850                 </entry>
 851               </row>
 852               <row>
 853                 <entry>
 854                   <para>[999.01,46940.48]</para>
 855                 </entry>
 856                 <entry>
                  <para>Minimum and maximum instantaneous operation rates seen on any individual MDT</para>
 858                 </entry>
 859               </row>
 860             </tbody>
 861           </tgroup>
 862         </informaltable>
 863         <note>
        <para>If the script output contains "ERROR", this usually means there was an issue during the run, such as running out of space on the MDT and/or OST. More detailed debug information is available in the <literal>${rslt}.detail</literal> file.</para>
 865       </note>
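        <para>A summary line with several tests can be split into one line per test for easier reading. The following is a minimal sketch, not part of <literal>mds-survey</literal>: the function names are hypothetical, and the field layout is assumed from the sample line above. Lines containing "ERROR" may not follow that layout, so they are detected separately.</para>

```shell
# Hypothetical helpers (not part of mds-survey).

# Print one line per test found on an mds-survey summary line.
# Each test is assumed to occupy four fields: name, aggregate, min, max.
parse_mds_line() {
    printf '%s\n' "$1" | awk '{
        gsub(/[][,]/, " ")      # drop brackets and commas around min/max
        for (i = 9; NF >= i + 3; i += 4)
            printf "%s: %s (min %s, max %s)\n", $i, $(i+1), $(i+2), $(i+3)
    }'
}

# Succeed if the summary text read from stdin contains "ERROR".
summary_has_error() {
    grep -q 'ERROR'
}

parse_mds_line "mdt 1 file 100000 dir 4 thr 4 create 5652.05 [ 999.01,46940.48] destroy 5797.79 [ 0.00,52951.55]"
```

        <para>For the sample line this prints one line for <literal>create</literal> and one for <literal>destroy</literal>, each with its aggregate, minimum and maximum values.</para>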
 866       </section>
 867   </section>
 868   <section xml:id="dbdoclet.50438212_58201">
 869     <title><indexterm><primary>benchmarking</primary><secondary>application profiling</secondary></indexterm>Collecting Application Profiling Information (<literal>stats-collect</literal>)</title>
 870     <para>The <literal>stats-collect</literal> utility contains the following scripts used to collect application profiling information from Lustre clients and servers:</para>
 871     <itemizedlist>
 872       <listitem>
 873         <para><literal>lstat.sh</literal>  - Script for a single node that is run on each profile node.</para>
 874       </listitem>
 875       <listitem>
        <para><literal>gather_stats_everywhere.sh</literal>  - Script that collects statistics.</para>
 877       </listitem>
 878       <listitem>
 879         <para><literal>config.sh</literal>  - Script that contains customized configuration descriptions.</para>
 880       </listitem>
 881     </itemizedlist>
 882     <para>The <literal>stats-collect</literal> utility requires:</para>
 883     <itemizedlist>
 884       <listitem>
 885         <para>Lustre software to be installed and set up on your cluster</para>
 886       </listitem>
 887       <listitem>
 888         <para>SSH and SCP access to these nodes without requiring a password</para>
 889       </listitem>
 890     </itemizedlist>
 891     <section remap="h3">
 892       <title>Using <literal>stats-collect</literal></title>
 893       <para>The stats-collect utility is configured by including profiling configuration variables in the config.sh script. Each configuration variable takes the following form, where 0 indicates statistics are to be collected only when the script starts and stops and <emphasis>n</emphasis> indicates the interval in seconds at which statistics are to be collected:</para>
 894       <screen><replaceable>statistic</replaceable>_INTERVAL=<replaceable>0|n</replaceable></screen>
 895       <para>Statistics that can be collected include:</para>
 896       <itemizedlist>
 897         <listitem>
 898           <para><literal>VMSTAT</literal>  - Memory and CPU usage and aggregate read/write operations</para>
 899         </listitem>
 900         <listitem>
 901           <para><literal>SERVICE</literal>  - Lustre OST and MDT RPC service statistics</para>
 902         </listitem>
 903         <listitem>
 904           <para><literal>BRW</literal> - OST bulk read/write statistics (brw_stats)</para>
 905         </listitem>
 906         <listitem>
 907           <para><literal>SDIO</literal>  - SCSI disk IO statistics (sd_iostats)</para>
 908         </listitem>
 909         <listitem>
 910           <para><literal>MBALLOC</literal>  - <literal>ldiskfs</literal> block allocation statistics</para>
 911         </listitem>
 912         <listitem>
 913           <para><literal>IO</literal>  - Lustre target operations statistics</para>
 914         </listitem>
 915         <listitem>
 916           <para><literal>JBD</literal>  - ldiskfs journal statistics</para>
 917         </listitem>
 918         <listitem>
 919           <para><literal>CLIENT</literal>  - Lustre OSC request statistics</para>
 920         </listitem>
 921       </itemizedlist>
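      <para>As an illustration of the variable form described above, a <literal>config.sh</literal> fragment might look like the following. The variable names are derived from the statistic names in the list. The interval values are assumptions chosen for the example, not recommended settings.</para>

```shell
# Illustrative config.sh fragment; the interval values are assumptions.
VMSTAT_INTERVAL=0     # 0: collect only when the script starts and stops
SERVICE_INTERVAL=2    # n: collect every n seconds (here, every 2 seconds)
BRW_INTERVAL=0
SDIO_INTERVAL=0
MBALLOC_INTERVAL=0
IO_INTERVAL=3
JBD_INTERVAL=1
CLIENT_INTERVAL=0
```

      <para>Shorter intervals give finer-grained profiles at the cost of more collected data on each node.</para>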
 922       <para>To collect profile information:</para>
      <orderedlist>
        <listitem>
          <para>Begin collecting statistics on each node specified in the <literal>config.sh</literal> script.</para>
          <para>Start the profile collection daemon on each node by entering:</para>
 927           <screen>sh gather_stats_everywhere.sh config.sh start
 928 </screen>
 929         </listitem>
 930         <listitem>
 931           <para>Run the test.</para>
 932         </listitem>
 933         <listitem>
 934           <para>Stop collecting statistics on each node, clean up the temporary file, and create a profiling tarball.</para>
 935           <para>Enter:</para>
 936           <screen>sh gather_stats_everywhere.sh config.sh stop <replaceable>log_name</replaceable>.tgz</screen>
          <para>When <literal><replaceable>log_name</replaceable>.tgz</literal>
            is specified, a profile tarball <literal>/tmp/<replaceable>log_name</replaceable>.tgz</literal> is created.</para>
 939         </listitem>
 940         <listitem>
 941           <para>Analyze the collected statistics and create a csv tarball for the specified profiling data.</para>
 942           <screen>sh gather_stats_everywhere.sh config.sh analyse log_tarball.tgz csv
 943 </screen>
 944         </listitem>
 945       </orderedlist>
 946     </section>
 947   </section>
 948 </chapter>