lustre-iokit/obdfilter-survey/README


Requirements
------------

. Lustre OSS up and running


Overview
--------

This survey may be used to characterise the performance of a Lustre OSS.
It can exercise the OSS either locally or remotely via the network.

The script uses lctl::test_brw to drive the echo_client doing sequential
I/O with varying numbers of threads and objects (files).  One instance of
lctl is spawned for each OST.


Running
-------

The script must be customised according to the particular device under test
and where it should keep its working files.  Customisation variables are
described clearly at the start of the script.

When the script runs, it creates a number of working files and a pair of
result files.  All files start with the prefix given by ${rslt}.

${rslt}_<date/time>.summary       same as stdout
${rslt}_<date/time>.detail_tmp*   tmp files
${rslt}_<date/time>.detail        collected tmp files for post-mortem

The script iterates over the given numbers of threads and objects
performing all the specified tests and checking that all test processes
completed successfully.


Local OSS
---------

To test a local OSS, set up 'ost_names' with the name of each OST.  If you
are unsure, run 'lctl device_list' and look for obdfilter instances, e.g.

[root@ns9 root]# lctl device_list
  0 UP confobd conf_ost3 OSD_ost3_ns9_UUID 1
  1 UP obdfilter ost3 ost3_UUID 1
  2 UP ost OSS OSS_UUID 1
  3 AT confobd conf_ost12 OSD_ost12_ns9_UUID 1
[root@ns9 root]#

Here device number 1 is an obdfilter instance called 'ost3'.
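
The lookup above can also be scripted.  The following is only a sketch in
Python: it assumes the column layout seen in the sample output (index, state,
device type, device name, UUID, ...), and the function name obdfilter_names is
made up for this example and is not part of the survey script.

```python
def obdfilter_names(device_list_output):
    """Return the names of all obdfilter devices found in the given text.

    Assumes each line looks like the sample in this README:
        <index> <state> <type> <name> <uuid> ...
    """
    names = []
    for line in device_list_output.splitlines():
        fields = line.split()
        # Third column is the device type, fourth is the device name.
        if len(fields) >= 4 and fields[2] == "obdfilter":
            names.append(fields[3])
    return names


# Sample output copied from this README.
SAMPLE = """\
  0 UP confobd conf_ost3 OSD_ost3_ns9_UUID 1
  1 UP obdfilter ost3 ost3_UUID 1
  2 UP ost OSS OSS_UUID 1
  3 AT confobd conf_ost12 OSD_ost12_ns9_UUID 1
"""

# Space-separated list, suitable for pasting into ost_names.
print(" ".join(obdfilter_names(SAMPLE)))
```

On a live OSS the text could come from running 'lctl device_list' (for
example via subprocess.run) instead of the embedded sample.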

The script configures an instance of echo_client for each name in ost_names
and tears it down on normal completion.  Note that it does NOT clean up
properly (i.e. manual cleanup is required) if it is not allowed to run to
completion.


Remote OSS
----------

To test OSS performance over the network, you need to create a Lustre
configuration that creates an echo_client instance for each OST.


Script output
-------------

The summary file and stdout contain lines like...

ost 8 sz 67108864K rsz 1024 obj    8 thr    8 write  613.54 [ 64.00, 82.00]

ost 8          is the total number of OSTs under test.
sz 67108864K   is the total amount of data read or written (in KB).
rsz 1024       is the record size (size of each echo_client I/O).
obj    8       is the total number of objects over all OSTs.
thr    8       is the total number of threads over all OSTs and objects.
write          is the test name.  If more tests have been specified they
               all appear on the same line.
613.54         is the aggregate bandwidth over all OSTs (in MB/s), measured
               by dividing the total number of MB by the elapsed time.
[64.00, 82.00] are the minimum and maximum instantaneous bandwidths seen on
               any individual OST.

Note that although the numbers of threads and objects are specified per-OST
in the customisation section of the script, results are reported aggregated
over all OSTs.
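
For post-processing, a summary line can be split into its fields with a
couple of regular expressions.  This is a sketch only: the patterns are
inferred from the single sample line above, parse_summary_line is a
hypothetical helper name, and lines for failed or differently formatted tests
are not handled.

```python
import re

# Fixed leading fields: ost, sz (in K), rsz, obj, thr.
HEADER = re.compile(
    r"ost\s+(\d+)\s+sz\s+(\d+)K\s+rsz\s+(\d+)\s+obj\s+(\d+)\s+thr\s+(\d+)"
)
# One group per test: name, aggregate bandwidth, [min, max].
RESULT = re.compile(r"(\w+)\s+([\d.]+)\s+\[\s*([\d.]+),\s*([\d.]+)\]")


def parse_summary_line(line):
    """Return a dict of the fields in one summary line, or None if no match."""
    line = line.strip()
    m = HEADER.match(line)
    if m is None:
        return None
    ost, sz_k, rsz, obj, thr = (int(g) for g in m.groups())
    record = {"ost": ost, "sz_k": sz_k, "rsz": rsz,
              "obj": obj, "thr": thr, "tests": {}}
    # Everything after the header is a sequence of per-test results.
    for name, bw, lo, hi in RESULT.findall(line[m.end():]):
        record["tests"][name] = {
            "bw": float(bw), "min": float(lo), "max": float(hi),
        }
    return record


# Sample line copied from this README.
SAMPLE = ("ost 8 sz 67108864K rsz 1024 obj    8 thr    8 "
          "write  613.54 [ 64.00, 82.00]")

rec = parse_summary_line(SAMPLE)
print(rec["tests"]["write"]["bw"])
# Totals are aggregated over all OSTs; divide to recover per-OST settings.
print(rec["obj"] // rec["ost"], rec["thr"] // rec["ost"])
```

Dividing the reported obj and thr totals by the ost count recovers the
per-OST values that were set in the customisation section.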


Visualising Results
-------------------

I've found it most useful to import the summary data (it's fixed width)
into Excel (or any graphing package) and graph bandwidth against the number
of threads for varying numbers of concurrent regions.  This shows how the
OSS performs for a given number of concurrently accessed objects (i.e.
files) with varying numbers of I/Os in flight.
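
As an alternative to a fixed-width import, the summary can be pivoted into a
CSV table with one row per thread count and one column per object count,
which most graphing packages can plot directly.  The sketch below assumes the
summary line layout shown under "Script output"; the helper names are
hypothetical, and the bandwidth values in SAMPLE are invented purely to
illustrate the layout and are not measurements.

```python
import csv
import io
import re
from collections import defaultdict

# Pick out the obj and thr totals plus the rest of the line (test results).
LINE = re.compile(r"obj\s+(\d+)\s+thr\s+(\d+)\s+(.*)")
# Test name followed by its aggregate bandwidth and the opening bracket.
RESULT = re.compile(r"(\w+)\s+([\d.]+)\s+\[")


def bandwidth_table(summary_text, test="write"):
    """Map object count -> {thread count: bandwidth} for one test name."""
    table = defaultdict(dict)
    for line in summary_text.splitlines():
        m = LINE.search(line)
        if m is None:
            continue
        obj, thr = int(m.group(1)), int(m.group(2))
        for name, bw in RESULT.findall(m.group(3)):
            if name == test:
                table[obj][thr] = float(bw)
    return table


def to_csv(table):
    """Render the table as CSV: rows are thread counts, columns are objects."""
    objs = sorted(table)
    thrs = sorted({t for series in table.values() for t in series})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["thr"] + ["obj=%d" % o for o in objs])
    for t in thrs:
        # Leave the cell empty where a combination was not measured.
        writer.writerow([t] + [table[o].get(t, "") for o in objs])
    return out.getvalue()


# Invented values, for illustration only.
SAMPLE = """\
ost 8 sz 67108864K rsz 1024 obj    8 thr    8 write  100.00 [ 10.00, 15.00]
ost 8 sz 67108864K rsz 1024 obj    8 thr   16 write  200.00 [ 20.00, 30.00]
ost 8 sz 67108864K rsz 1024 obj   16 thr   16 write  150.00 [ 15.00, 25.00]
"""

print(to_csv(bandwidth_table(SAMPLE)), end="")
```

Each obj column then becomes one curve of bandwidth against thread count.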

It is also extremely useful to record average disk I/O sizes during each
test.  These numbers help find pathologies in the file system block
allocator and the block device elevator.