ManagingFileSystemIO.xml

   1 <?xml version='1.0' encoding='utf-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4 xml:id="managingfilesystemio">
   5   <title xml:id="managingfilesystemio.title">Managing the File System and
   6   I/O</title>
   7   <section xml:id="dbdoclet.50438211_17536">
   8     <title>
   9     <indexterm>
  10       <primary>I/O</primary>
  11     </indexterm>
  12     <indexterm>
  13       <primary>I/O</primary>
  14       <secondary>full OSTs</secondary>
  15     </indexterm>Handling Full OSTs</title>
  16     <para>Sometimes a Lustre file system becomes unbalanced, often due to
  17     incorrectly-specified stripe settings, or when very large files are created
  18     that are not striped over all of the OSTs. If an OST is full and an attempt
  19     is made to write more information to the file system, an error occurs. The
  20     procedures below describe how to handle a full OST.</para>
  21     <para>The MDS will normally handle space balancing automatically at file
  22     creation time, and this procedure is normally not needed, but may be
  23     desirable in certain circumstances (e.g. when creating very large files
  24     that would consume more than the total free space of the full OSTs).</para>
  25     <section remap="h3">
  26       <title>
  27       <indexterm>
  28         <primary>I/O</primary>
  29         <secondary>OST space usage</secondary>
  30       </indexterm>Checking OST Space Usage</title>
  31       <para>The example below shows an unbalanced file system:</para>
  32       <screen>
  33 client# lfs df -h
  34 UUID                       bytes           Used            Available       \
  35 Use%            Mounted on
  36 testfs-MDT0000_UUID        4.4G            214.5M          3.9G            \
  37 4%              /mnt/testfs[MDT:0]
  38 testfs-OST0000_UUID        2.0G            751.3M          1.1G            \
  39 37%             /mnt/testfs[OST:0]
  40 testfs-OST0001_UUID        2.0G            755.3M          1.1G            \
  41 37%             /mnt/testfs[OST:1]
  42 testfs-OST0002_UUID        2.0G            1.7G            155.1M          \
  43 86%             /mnt/testfs[OST:2] ****
  44 testfs-OST0003_UUID        2.0G            751.3M          1.1G            \
  45 37%             /mnt/testfs[OST:3]
  46 testfs-OST0004_UUID        2.0G            747.3M          1.1G            \
  47 37%             /mnt/testfs[OST:4]
  48 testfs-OST0005_UUID        2.0G            743.3M          1.1G            \
  49 36%             /mnt/testfs[OST:5]
  50
  51 filesystem summary:        11.8G           5.4G            5.8G            \
  52 45%             /mnt/testfs
  53 </screen>
  54       <para>In this case, OST0002 is almost full and when an attempt is made to
  55       write additional information to the file system (even with uniform
  56       striping over all the OSTs), the write command fails as follows:</para>
      <screen>
client# lfs setstripe --size 4M --offset 0 --count -1 /mnt/testfs
client# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100
dd: writing '/mnt/testfs/test_3': No space left on device
98+0 records in
97+0 records out
1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s
</screen>
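      <para>An imbalance like the one above can also be spotted by filtering
      the <literal>lfs df</literal> output instead of reading the table by
      eye. The following is a minimal sketch and not part of the Lustre
      tools: the function name <literal>full_osts</literal> and the 80%
      threshold are assumptions chosen for illustration, and the sketch
      expects the unwrapped layout with one line per target and the usage
      percentage in the fifth column.</para>
      <screen>
# full_osts LIMIT: read "lfs df" output on stdin and print each OST
# whose Use% is at or above LIMIT (hypothetical helper).
full_osts() {
    awk -v limit="$1" '
        $1 ~ /OST[0-9a-fA-F]+_UUID$/ {
            pct = $5
            sub(/%/, "", pct)          # "86%" becomes "86"
            if (pct + 0 >= limit + 0)
                print $1, $5
        }'
}

# Example use on a client:
#   lfs df /mnt/testfs | full_osts 80
</screen>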
  65     </section>
  66     <section remap="h3">
  67       <title>
  68       <indexterm>
  69         <primary>I/O</primary>
  70         <secondary>taking OST offline</secondary>
  71       </indexterm>Taking a Full OST Offline</title>
  72       <para>To avoid running out of space in the file system, if the OST usage
  73       is imbalanced and one or more OSTs are close to being full while there
  74       are others that have a lot of space, the full OSTs may optionally be
  75       deactivated at the MDS to prevent the MDS from allocating new objects
  76       there.</para>
  77       <orderedlist>
  78         <listitem>
  79           <para>Log into the MDS server:</para>
  80           <screen>
  81 client# ssh root@192.168.0.10
  82 root@192.168.0.10's password:
  83 Last login: Wed Nov 26 13:35:12 2008 from 192.168.0.6
  84 </screen>
  85         </listitem>
  86         <listitem>
  87           <para>Use the
  88           <literal>lctl dl</literal> command to show the status of all file
  89           system components:</para>
  90           <screen>
  91 mds# lctl dl
  92 0 UP mgs MGS MGS 9
  93 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
  94 2 UP mdt MDS MDS_uuid 3
  95 3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
  96 4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
  97 5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
  98 6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
  99 7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5
 100 8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
 101 9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
 102 10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
 103 </screen>
 104         </listitem>
 105         <listitem>
          <para>Use
          <literal>lctl deactivate</literal> to take the full OST
          offline:</para>
 109           <screen>
 110 mds# lctl --device 7 deactivate
 111 </screen>
 112         </listitem>
 113         <listitem>
 114           <para>Display the status of the file system components:</para>
 115           <screen>
 116 mds# lctl dl
 117 0 UP mgs MGS MGS 9
 118 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
 119 2 UP mdt MDS MDS_uuid 3
 120 3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
 121 4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
 122 5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
 123 6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
 124 7 IN osc testfs-OST0002-osc testfs-mdtlov_UUID 5
 125 8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
 126 9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
 127 10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
 128 </screen>
 129         </listitem>
 130       </orderedlist>
 131       <para>The device list shows that OST0002 is now inactive. When new files
 132       are created in the file system, they will only use the remaining active
 133       OSTs. Either manual space rebalancing can be done by migrating data to
 134       other OSTs, as shown in the next section, or normal file deletion and
 135       creation can be allowed to passively rebalance the space usage.</para>
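      <para>The device number passed to
      <literal>lctl --device</literal> is the first column of the
      <literal>lctl dl</literal> listing, and it can differ between systems.
      As a sketch, the lookup can be scripted rather than read by hand. The
      helper name <literal>osc_devno</literal> is an assumption made for this
      example.</para>
      <screen>
# osc_devno NAME: read "lctl dl" output on stdin and print the device
# number of the osc device called NAME (hypothetical helper).
osc_devno() {
    awk -v name="$1" '$3 == "osc" { if ($4 == name) print $1 }'
}

# Example use on the MDS:
#   dev=$(lctl dl | osc_devno testfs-OST0002-osc)
#   lctl --device "$dev" deactivate
</screen>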
 136     </section>
 137     <section remap="h3">
 138       <title>
 139       <indexterm>
 140         <primary>I/O</primary>
 141         <secondary>migrating data</secondary>
 142       </indexterm>
 143       <indexterm>
 144         <primary>migrating metadata</primary>
 145       </indexterm>
 146       <indexterm>
 147         <primary>maintenance</primary>
 148         <secondary>full OSTs</secondary>
 149       </indexterm>Migrating Data within a File System</title>
 150
 151       <para condition='l28'>Lustre software version 2.8 includes a feature
 152       to migrate metadata (directories and inodes therein) between MDTs.
 153       This migration can only be performed on whole directories. For example,
 154       to migrate the contents of the <literal>/testfs/testremote</literal>
 155       directory from the MDT it currently resides on to MDT0000, the
 156       sequence of commands is as follows:</para>
      <screen>$ cd /testfs
$ lfs getdirstripe -M ./testremote <lineannotation>which MDT is dir on?</lineannotation>
1
$ for i in $(seq 3); do touch ./testremote/${i}.txt; done <lineannotation>create test files</lineannotation>
$ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done <lineannotation>check files are on MDT 1</lineannotation>
1
1
1
$ lfs migrate -m 0 ./testremote <lineannotation>migrate testremote to MDT 0</lineannotation>
$ lfs getdirstripe -M ./testremote <lineannotation>which MDT is dir on now?</lineannotation>
0
$ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done <lineannotation>check files are on MDT 0 too</lineannotation>
0
0
0</screen>
      <para>For more information, see <literal>man lfs</literal>.</para>
 173       <warning><para>Currently, only whole directories can be migrated
 174       between MDTs. During migration each file receives a new identifier
 175       (FID). As a consequence, the file receives a new inode number. Some
 176       system tools (for example, backup and archiving tools) may consider
 177       the migrated files to be new, even though the contents are unchanged.
 178       </para></warning>
 179       <para>If there is a need to migrate the file data from the current
 180       OST(s) to new OSTs, the data must be migrated (copied) to the new
 181       location.  The simplest way to do this is to use the
 182       <literal>lfs_migrate</literal> command (see
 183       <xref linkend="dbdoclet.50438206_42260" />). However, the steps for
 184       migrating a file by hand are also shown here for reference.</para>
 185       <orderedlist>
 186         <listitem>
 187           <para>Identify the file(s) to be moved.</para>
          <para>In the example below, the object information portion of the
          output from the <literal>lfs getstripe</literal> command shows that
          the <literal>test_2</literal> file is located entirely on
          OST0002:</para>
 191           <screen>
 192 client# lfs getstripe /mnt/testfs/test_2
 193 /mnt/testfs/test_2
 194 obdidx     objid   objid   group
 195      2      8     0x8       0
 196 </screen>
 197         </listitem>
 198         <listitem>
          <para>To move the data, create a copy and rename it over the
          original:</para>
 200           <screen>
 201 client# cp -a /mnt/testfs/test_2 /mnt/testfs/test_2.tmp
 202 client# mv /mnt/testfs/test_2.tmp /mnt/testfs/test_2
 203 </screen>
 204         </listitem>
 205         <listitem>
          <para>If the space usage of OSTs is severely imbalanced, large
          files can be found on a full OST and migrated onto OSTs that have
          more space by running:</para>
          <screen>
client# lfs find /mnt/testfs --ost <replaceable>ost_name</replaceable> --size +1G | lfs_migrate -y
</screen>
 213         </listitem>
 214         <listitem>
 215           <para>Check the file system balance.</para>
 216           <para>The
 217           <literal>lfs df</literal> output in the example below shows a more
 218           balanced system compared to the
 219           <literal>lfs df</literal> output in the example in
 220           <xref linkend="dbdoclet.50438211_17536" />.</para>
          <screen>
client# lfs df -h
UUID                 bytes         Used            Available       Use%    \
        Mounted on
testfs-MDT0000_UUID   4.4G         214.5M          3.9G            4%      \
        /mnt/testfs[MDT:0]
testfs-OST0000_UUID   2.0G         1.3G            598.1M          65%     \
        /mnt/testfs[OST:0]
testfs-OST0001_UUID   2.0G         1.3G            594.1M          65%     \
        /mnt/testfs[OST:1]
testfs-OST0002_UUID   2.0G         913.4M          1000.0M         45%     \
        /mnt/testfs[OST:2]
testfs-OST0003_UUID   2.0G         1.3G            602.1M          65%     \
        /mnt/testfs[OST:3]
testfs-OST0004_UUID   2.0G         1.3G            606.1M          64%     \
        /mnt/testfs[OST:4]
testfs-OST0005_UUID   2.0G         1.3G            610.1M          64%     \
        /mnt/testfs[OST:5]

filesystem summary:   11.8G        7.3G            3.9G            61%     \
        /mnt/testfs
</screen>
 243         </listitem>
 244       </orderedlist>
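      <para>The copy-and-rename step shown above can be wrapped so that the
      original is only replaced when the copy succeeds. This is a sketch
      under stated assumptions, not a replacement for
      <literal>lfs_migrate</literal>: the helper name
      <literal>migrate_by_copy</literal> and the temporary file suffix are
      hypothetical, and the file must not be open for writing while it is
      copied.</para>
      <screen>
# migrate_by_copy FILE: copy FILE to a temporary name in the same
# directory, then rename the copy over the original. The rename only
# happens if the copy succeeded; a failed attempt removes the temporary
# file and returns 1 (hypothetical helper).
migrate_by_copy() {
    src=$1
    tmp="$src.migrate.$$"
    if cp -a -- "$src" "$tmp"; then
        if mv -- "$tmp" "$src"; then
            return 0
        fi
    fi
    rm -f -- "$tmp"
    return 1
}

# Example use on a client:
#   migrate_by_copy /mnt/testfs/test_2
</screen>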
 245     </section>
 246     <section remap="h3">
 247       <title>
 248       <indexterm>
 249         <primary>I/O</primary>
 250         <secondary>bringing OST online</secondary>
 251       </indexterm>
 252       <indexterm>
 253         <primary>maintenance</primary>
 254         <secondary>bringing OST online</secondary>
 255       </indexterm>Returning an Inactive OST Back Online</title>
      <para>Once the deactivated OST(s) are no longer severely imbalanced, due
      to either active or passive data redistribution, they should be
      reactivated so they will again have new files allocated on them.</para>
      <screen>
mds# lctl --device 7 activate
mds# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
  4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
  5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
  6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
  7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5
  8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
  9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
 10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
</screen>
 274     </section>
 275   </section>
 276   <section xml:id="dbdoclet.50438211_75549">
 277     <title>
 278     <indexterm>
 279       <primary>I/O</primary>
 280       <secondary>pools</secondary>
 281     </indexterm>
 282     <indexterm>
 283       <primary>maintenance</primary>
 284       <secondary>pools</secondary>
 285     </indexterm>
 286     <indexterm>
 287       <primary>pools</primary>
 288     </indexterm>Creating and Managing OST Pools</title>
 289     <para>The OST pools feature enables users to group OSTs together to make
 290     object placement more flexible. A 'pool' is the name associated with an
 291     arbitrary subset of OSTs in a Lustre cluster.</para>
 292     <para>OST pools follow these rules:</para>
 293     <itemizedlist>
 294       <listitem>
 295         <para>An OST can be a member of multiple pools.</para>
 296       </listitem>
 297       <listitem>
 298         <para>No ordering of OSTs in a pool is defined or implied.</para>
 299       </listitem>
 300       <listitem>
 301         <para>Stripe allocation within a pool follows the same rules as the
 302         normal stripe allocator.</para>
 303       </listitem>
 304       <listitem>
 305         <para>OST membership in a pool is flexible, and can change over
 306         time.</para>
 307       </listitem>
 308     </itemizedlist>
 309     <para>When an OST pool is defined, it can be used to allocate files. When
 310     file or directory striping is set to a pool, only OSTs in the pool are
 311     candidates for striping. If a stripe_index is specified which refers to an
 312     OST that is not a member of the pool, an error is returned.</para>
 313     <para>OST pools are used only at file creation. If the definition of a pool
 314     changes (an OST is added or removed or the pool is destroyed),
 315     already-created files are not affected.</para>
 316     <note>
 317       <para>An error (
 318       <literal>EINVAL</literal>) results if you create a file using an empty
 319       pool.</para>
 320     </note>
 321     <note>
 322       <para>If a directory has pool striping set and the pool is subsequently
 323       removed, the new files created in this directory have the (non-pool)
 324       default striping pattern for that directory applied and no error is
 325       returned.</para>
 326     </note>
 327     <section remap="h3">
 328       <title>Working with OST Pools</title>
 329       <para>OST pools are defined in the configuration log on the MGS. Use the
 330       lctl command to:</para>
 331       <itemizedlist>
 332         <listitem>
 333           <para>Create/destroy a pool</para>
 334         </listitem>
 335         <listitem>
 336           <para>Add/remove OSTs in a pool</para>
 337         </listitem>
 338         <listitem>
 339           <para>List pools and OSTs in a specific pool</para>
 340         </listitem>
 341       </itemizedlist>
      <para>The <literal>lctl</literal> command must be run on the MGS.
      Managing OST pools also requires either that the MDT and MGS are on
      the same node, or that a Lustre client is mounted on the MGS node if
      it is separate from the MDS. This is needed to validate that the pool
      commands being run are correct.</para>
 347       <caution>
 348         <para>Running the
 349         <literal>writeconf</literal> command on the MDS erases all pools
 350         information (as well as any other parameters set using
 351         <literal>lctl conf_param</literal>). We recommend that the pools
 352         definitions (and
 353         <literal>conf_param</literal> settings) be executed using a script, so
 354         they can be reproduced easily after a
 355         <literal>writeconf</literal> is performed.</para>
 356       </caution>
 357       <para>To create a new pool, run:</para>
 358       <screen>
 359 mgs# lctl pool_new
 360 <replaceable>fsname</replaceable>.
 361 <replaceable>poolname</replaceable>
 362 </screen>
 363       <note>
 364         <para>The pool name is an ASCII string up to 15 characters.</para>
 365       </note>
 366       <para>To add the named OST to a pool, run:</para>
 367       <screen>
 368 mgs# lctl pool_add
 369 <replaceable>fsname</replaceable>.
 370 <replaceable>poolname</replaceable>
 371 <replaceable>ost_list</replaceable>
 372 </screen>
 373       <para>Where:</para>
 374       <itemizedlist>
        <listitem>
          <para>
            <literal><replaceable>ost_list</replaceable></literal> is
            <literal><replaceable>fsname</replaceable>-OST<replaceable>index_range</replaceable></literal>
          </para>
        </listitem>
        <listitem>
          <para>
            <literal><replaceable>index_range</replaceable></literal> is
            <literal><replaceable>ost_index_start</replaceable>-<replaceable>ost_index_end</replaceable>[,<replaceable>index_range</replaceable>]</literal> or
            <literal><replaceable>ost_index_start</replaceable>-<replaceable>ost_index_end</replaceable>/<replaceable>step</replaceable></literal>
          </para>
        </listitem>
 393       </itemizedlist>
 394       <para>If the leading
 395       <literal>
 396         <replaceable>fsname</replaceable>
 397       </literal> and/or ending
 398       <literal>_UUID</literal> are missing, they are automatically added.</para>
 399       <para>For example, to add even-numbered OSTs to
 400       <literal>pool1</literal> on file system
 401       <literal>testfs</literal>, run a single command (
 402       <literal>pool_add</literal>) to add many OSTs to the pool at one
 403       time:</para>
 404       <para>
        <screen>
mgs# lctl pool_add testfs.pool1 OST[0-10/2]
</screen>
 408       </para>
 409       <note>
 410         <para>Each time an OST is added to a pool, a new
 411         <literal>llog</literal> configuration record is created. For
 412         convenience, you can run a single command.</para>
 413       </note>
 414       <para>To remove a named OST from a pool, run:</para>
 415       <screen>
 416 mgs# lctl pool_remove
 417 <replaceable>fsname</replaceable>.
 418 <replaceable>poolname</replaceable>
 419 <replaceable>ost_list</replaceable>
 420 </screen>
 421       <para>To destroy a pool, run:</para>
 422       <screen>
 423 mgs# lctl pool_destroy
 424 <replaceable>fsname</replaceable>.
 425 <replaceable>poolname</replaceable>
 426 </screen>
 427       <note>
 428         <para>All OSTs must be removed from a pool before it can be
 429         destroyed.</para>
 430       </note>
 431       <para>To list pools in the named file system, run:</para>
 432       <screen>
 433 mgs# lctl pool_list
 434 <replaceable>fsname|pathname</replaceable>
 435 </screen>
 436       <para>To list OSTs in a named pool, run:</para>
 437       <screen>
 438 lctl pool_list
 439 <replaceable>fsname</replaceable>.
 440 <replaceable>poolname</replaceable>
 441 </screen>
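      <para>Because pool definitions are lost after a
      <literal>writeconf</literal>, it can help to keep them in a script.
      The sketch below only prints the commands described above so that they
      can be reviewed before being run on the MGS. The helper name
      <literal>pool_setup_cmds</literal> is an assumption made for this
      example, and the file system, pool name and OST range are passed in as
      arguments.</para>
      <screen>
# pool_setup_cmds FSNAME POOL RANGE: print, without running them, the
# lctl commands that create POOL, add the OSTs in RANGE to it and list
# the result (hypothetical helper).
pool_setup_cmds() {
    printf 'lctl pool_new %s.%s\n' "$1" "$2"
    printf 'lctl pool_add %s.%s OST[%s]\n' "$1" "$2" "$3"
    printf 'lctl pool_list %s.%s\n' "$1" "$2"
}

# Example use:
#   pool_setup_cmds testfs pool1 0-10/2
</screen>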
 442       <section remap="h4">
 443         <title>Using the lfs Command with OST Pools</title>
 444         <para>Several lfs commands can be run with OST pools. Use the
 445         <literal>lfs setstripe</literal> command to associate a directory with
 446         an OST pool. This causes all new regular files and directories in the
 447         directory to be created in the pool. The lfs command can be used to
 448         list pools in a file system and OSTs in a named pool.</para>
 449         <para>To associate a directory with a pool, so all new files and
 450         directories will be created in the pool, run:</para>
 451         <screen>
 452 client# lfs setstripe --pool|-p pool_name
 453 <replaceable>filename|dirname</replaceable>
 454 </screen>
 455         <para>To set striping patterns, run:</para>
        <screen>
client# lfs setstripe [--size|-s stripe_size] [--offset|-o start_ost]
           [--count|-c stripe_count] [--pool|-p pool_name]
           <replaceable>dir|filename</replaceable>
</screen>
 462         <note>
 463           <para>If you specify striping with an invalid pool name, because the
 464           pool does not exist or the pool name was mistyped,
 465           <literal>lfs setstripe</literal> returns an error. Run
 466           <literal>lfs pool_list</literal> to make sure the pool exists and the
 467           pool name is entered correctly.</para>
 468         </note>
 469         <note>
 470           <para>The
 471           <literal>--pool</literal> option for lfs setstripe is compatible with
 472           other modifiers. For example, you can set striping on a directory to
 473           use an explicit starting index.</para>
 474         </note>
 475       </section>
 476     </section>
 477     <section remap="h3">
 478       <title>
 479       <indexterm>
 480         <primary>pools</primary>
 481         <secondary>usage tips</secondary>
 482       </indexterm>Tips for Using OST Pools</title>
 483       <para>Here are several suggestions for using OST pools.</para>
 484       <itemizedlist>
 485         <listitem>
          <para>A directory or file can be given an extended attribute (EA)
          that restricts striping to a pool.</para>
 488         </listitem>
 489         <listitem>
 490           <para>Pools can be used to group OSTs with the same technology or
 491           performance (slower or faster), or that are preferred for certain
 492           jobs. Examples are SATA OSTs versus SAS OSTs or remote OSTs versus
 493           local OSTs.</para>
 494         </listitem>
 495         <listitem>
 496           <para>A file created in an OST pool tracks the pool by keeping the
 497           pool name in the file LOV EA.</para>
 498         </listitem>
 499       </itemizedlist>
 500     </section>
 501   </section>
 502   <section xml:id="dbdoclet.50438211_11204">
 503     <title>
 504     <indexterm>
 505       <primary>I/O</primary>
 506       <secondary>adding an OST</secondary>
 507     </indexterm>Adding an OST to a Lustre File System</title>
    <para>To add an OST to an existing Lustre file system:</para>
 509     <orderedlist>
 510       <listitem>
        <para>Add a new OST by running the following commands:</para>
 512         <screen>
 513 oss# mkfs.lustre --fsname=testfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
 514 oss# mkdir -p /mnt/testfs/ost12
 515 oss# mount -t lustre /dev/sda /mnt/testfs/ost12
 516 </screen>
 517       </listitem>
 518       <listitem>
 519         <para>Migrate the data (possibly).</para>
 520         <para>The file system is quite unbalanced when new empty OSTs are
 521         added. New file creations are automatically balanced. If this is a
 522         scratch file system or files are pruned at a regular interval, then no
 523         further work may be needed. Files existing prior to the expansion can
 524         be rebalanced with an in-place copy, which can be done with a simple
 525         script.</para>
 526         <para>The basic method is to copy existing files to a temporary file,
 527         then move the temp file over the old one. This should not be attempted
 528         with files which are currently being written to by users or
 529         applications. This operation redistributes the stripes over the entire
 530         set of OSTs.</para>
 531         <para>A very clever migration script would do the following:</para>
 532         <itemizedlist>
 533           <listitem>
 534             <para>Examine the current distribution of data.</para>
 535           </listitem>
 536           <listitem>
 537             <para>Calculate how much data should move from each full OST to the
 538             empty ones.</para>
 539           </listitem>
 540           <listitem>
 541             <para>Search for files on a given full OST (using
 542             <literal>lfs getstripe</literal>).</para>
 543           </listitem>
 544           <listitem>
 545             <para>Force the new destination OST (using
 546             <literal>lfs setstripe</literal>).</para>
 547           </listitem>
 548           <listitem>
 549             <para>Copy only enough files to address the imbalance.</para>
 550           </listitem>
 551         </itemizedlist>
 552       </listitem>
 553     </orderedlist>
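    <para>The second step in the list above, calculating how much data should
    move off each full OST, can be sketched as follows. This is an
    illustration only: the helper name <literal>ost_excess_kb</literal> is
    hypothetical, it assumes all OSTs are the same size so that the mean
    used space is a sensible target, and it expects
    <literal>lfs df</literal> output in 1K blocks (without
    <literal>-h</literal>) with one line per target.</para>
    <screen>
# ost_excess_kb: read "lfs df" output (1K blocks) on stdin. For each
# OST whose used space is above the mean over all OSTs, print the OST
# name and the number of KB above the mean (hypothetical helper).
ost_excess_kb() {
    awk '
        $1 ~ /OST[0-9a-fA-F]+_UUID$/ {
            n++
            name[n] = $1
            used[n] = $3
            total += $3
        }
        END {
            if (n == 0)
                exit
            mean = total / n
            for (i = 1; n >= i; i++)
                if (used[i] > mean)
                    printf "%s %d\n", name[i], used[i] - mean
        }'
}

# Example use on a client:
#   lfs df /mnt/testfs | ost_excess_kb
</screen>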
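    <para>If a Lustre file system administrator wants to explore this approach
    further, per-OST space usage can be read from the
    <literal>/proc/fs/lustre/osc/*/kbytesavail</literal> and
    <literal>/proc/fs/lustre/osc/*/kbytestotal</literal> files, and per-OST
    RPC statistics can be found under
    <literal>/proc/fs/lustre/osc/*/rpc_stats</literal>.</para>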
 557   </section>
 558   <section xml:id="dbdoclet.50438211_80295">
 559     <title>
 560     <indexterm>
 561       <primary>I/O</primary>
 562       <secondary>direct</secondary>
 563     </indexterm>Performing Direct I/O</title>
    <para>The Lustre software supports the
    <literal>O_DIRECT</literal> flag to <literal>open()</literal>.</para>
 566     <para>Applications using the
 567     <literal>read()</literal> and
 568     <literal>write()</literal> calls must supply buffers aligned on a page
 569     boundary (usually 4 K). If the alignment is not correct, the call returns
 570     <literal>-EINVAL</literal>. Direct I/O may help performance in cases where
 571     the client is doing a large amount of I/O and is CPU-bound (CPU utilization
 572     100%).</para>
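    <para>From the shell, direct I/O can be exercised with the GNU
    <literal>dd</literal> option <literal>oflag=direct</literal>, which
    opens the output file with <literal>O_DIRECT</literal>. The sketch below
    takes the conservative approach of also keeping the block size a
    multiple of the page size. The helper name
    <literal>is_page_multiple</literal> and the output file name are
    assumptions made for this example.</para>
    <screen>
# is_page_multiple N: succeed if N is a whole multiple of the system
# page size as reported by getconf (hypothetical helper).
is_page_multiple() {
    page=$(getconf PAGESIZE)
    [ $(( $1 % page )) -eq 0 ]
}

# Example use on a client (writes 16 MB with O_DIRECT):
#   bs=1048576
#   if is_page_multiple "$bs"; then
#       dd if=/dev/zero of=/mnt/testfs/direct_test bs="$bs" count=16 oflag=direct
#   fi
</screen>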
 573     <section remap="h3">
 574       <title>Making File System Objects Immutable</title>
      <para>An immutable file or directory is one that cannot be modified,
      renamed or removed. To make a file or directory immutable, run:</para>
 577       <screen>
 578 chattr +i
 579 <replaceable>file</replaceable>
 580 </screen>
      <para>To remove this flag, use
      <literal>chattr -i</literal>.</para>
 583     </section>
 584   </section>
 585   <section xml:id="dbdoclet.50438211_61024">
 586     <title>Other I/O Options</title>
 587     <para>This section describes other I/O options, including checksums, and
 588     the ptlrpcd thread pool.</para>
 589     <section remap="h3">
 590       <title>Lustre Checksums</title>
 591       <para>To guard against network data corruption, a Lustre client can
 592       perform two types of data checksums: in-memory (for data in client
 593       memory) and wire (for data sent over the network). For each checksum
 594       type, a 32-bit checksum of the data read or written on both the client
 595       and server is computed, to ensure that the data has not been corrupted in
 596       transit over the network. The
 597       <literal>ldiskfs</literal> backing file system does NOT do any persistent
 598       checksumming, so it does not detect corruption of data in the OST file
 599       system.</para>
 600       <para>The checksumming feature is enabled, by default, on individual
 601       client nodes. If the client or OST detects a checksum mismatch, then an
 602       error is logged in the syslog of the form:</para>
 603       <screen>
 604 LustreError: BAD WRITE CHECKSUM: changed in transit before arrival at OST: \
 605 from 192.168.1.1@tcp inum 8991479/2386814769 object 1127239/0 extent [10240\
 606 0-106495]
 607 </screen>
 608       <para>If this happens, the client will re-read or re-write the affected
 609       data up to five times to get a good copy of the data over the network. If
 610       it is still not possible, then an I/O error is returned to the
 611       application.</para>
 612       <para>To enable both types of checksums (in-memory and wire), run:</para>
 613       <screen>
 614 lctl set_param llite.*.checksum_pages=1
 615 </screen>
 616       <para>To disable both types of checksums (in-memory and wire),
 617       run:</para>
 618       <screen>
 619 lctl set_param llite.*.checksum_pages=0
 620 </screen>
 621       <para>To check the status of a wire checksum, run:</para>
 622       <screen>
 623 lctl get_param osc.*.checksums
 624 </screen>
 625       <section remap="h4">
 626         <title>Changing Checksum Algorithms</title>
        <para>By default, the Lustre software uses the adler32 checksum
        algorithm, because it is robust and has a lower impact on performance
        than crc32. The Lustre file system administrator can change the
        checksum algorithm via
        <literal>lctl set_param</literal>, depending on what is supported in
        the kernel.</para>
 633         <para>To check which checksum algorithm is being used by the Lustre
 634         software, run:</para>
 635         <screen>
 636 $ lctl get_param osc.*.checksum_type
 637 </screen>
 638         <para>To change the wire checksum algorithm, run:</para>
 639         <screen>
 640 $ lctl set_param osc.*.checksum_type=
 641 <replaceable>algorithm</replaceable>
 642 </screen>
 643         <note>
 644           <para>The in-memory checksum always uses the adler32 algorithm, if
 645           available, and only falls back to crc32 if adler32 cannot be
 646           used.</para>
 647         </note>
 648         <para>In the following example, the
 649         <literal>lctl get_param</literal> command is used to determine that the
 650         Lustre software is using the adler32 checksum algorithm. Then the
 651         <literal>lctl set_param</literal> command is used to change the checksum
 652         algorithm to crc32. A second
 653         <literal>lctl get_param</literal> command confirms that the crc32
 654         checksum algorithm is now in use.</para>
 655         <screen>
 656 $ lctl get_param osc.*.checksum_type
 657 osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler]
 658 $ lctl set_param osc.*.checksum_type=crc32
 659 osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32
 660 $ lctl get_param osc.*.checksum_type
 661 osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=[crc32] adler
 662 </screen>
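        <para>In the <literal>checksum_type</literal> output above, the
        algorithm in square brackets is the one currently in use. As a
        sketch, that value can be extracted per device with
        <literal>sed</literal>. The helper name
        <literal>active_checksum</literal> is an assumption made for this
        example.</para>
        <screen>
# active_checksum: read "lctl get_param osc.*.checksum_type" output on
# stdin and print, for each line, the parameter name followed by the
# algorithm shown in square brackets (hypothetical helper).
active_checksum() {
    sed -n 's/^\([^=]*\)=.*\[\([^]]*\)\].*$/\1 \2/p'
}

# Example use on a client:
#   lctl get_param osc.*.checksum_type | active_checksum
</screen>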
 663       </section>
 664     </section>
 665     <section remap="h3">
 666       <title>Ptlrpc Thread Pool</title>
 667       <para>Releases prior to Lustre software release 2.2 used two portal RPC
 668       daemons for each client/server pair. One daemon handled all synchronous
 669       IO requests, and the second daemon handled all asynchronous (non-IO)
 670       RPCs. The increasing use of large SMP nodes for Lustre servers exposed
 671       some scaling issues. The lack of threads for large SMP nodes resulted in
 672       cases where a single CPU would be 100% utilized and other CPUs would be
      relatively idle. This is especially noticeable when a single client
 674       traverses a large directory.</para>
 675       <para>Lustre software release 2.2.x implements a ptlrpc thread pool, so
 676       that multiple threads can be created to serve asynchronous RPC requests.
 677       The number of threads spawned is controlled at module load time using
 678       module options. By default one thread is spawned per CPU, with a minimum
 679       of 2 threads spawned irrespective of module options.</para>
 680       <para>One of the issues with thread operations is the cost of moving a
 681       thread context from one CPU to another with the resulting loss of CPU
 682       cache warmth. To reduce this cost, ptlrpc threads can be bound to a CPU.
 683       However, if the CPUs are busy, a bound thread may not be able to respond
 684       quickly, as the bound CPU may be busy with other tasks and the thread
 685       must wait to schedule.</para>
 686       <para>Because of these considerations, the pool of ptlrpc threads can be
 687       a mixture of bound and unbound threads. The system operator can balance
 688       the thread mixture based on system size and workload.</para>
 689       <section>
 690         <title>ptlrpcd parameters</title>
        <para>These parameters should be set in
        <literal>/etc/modprobe.conf</literal> or in the
        <literal>/etc/modprobe.d</literal> directory, as options for the ptlrpc
        module.
        <screen>
options ptlrpc max_ptlrpcds=XXX
</screen></para>
        <para>Sets the number of ptlrpcd threads created at module load time.
        The default if not specified is one thread per CPU, including
        hyper-threaded CPUs. The lower bound is 2 (old ptlrpcd behavior).
        <screen>
options ptlrpc ptlrpcd_bind_policy=[1-4]
</screen></para>
 704         <para>Controls the binding of threads to CPUs. There are four policy
 705         options.</para>
        <itemizedlist>
          <listitem>
            <para>
            <literal role="bold">PDB_POLICY_NONE</literal>
            (ptlrpcd_bind_policy=1) All threads are unbound.</para>
          </listitem>
          <listitem>
            <para>
            <literal role="bold">PDB_POLICY_FULL</literal>
            (ptlrpcd_bind_policy=2) All threads attempt to bind to a
            CPU.</para>
          </listitem>
          <listitem>
            <para>
            <literal role="bold">PDB_POLICY_PAIR</literal>
            (ptlrpcd_bind_policy=3) This is the default policy. Threads are
            allocated as a bound/unbound pair. Each thread (bound or free)
            has a partner thread. The partnering is used by the ptlrpcd load
            policy, which determines how threads are allocated to CPUs.</para>
          </listitem>
          <listitem>
            <para>
            <literal role="bold">PDB_POLICY_NEIGHBOR</literal>
            (ptlrpcd_bind_policy=4) Threads are allocated as a bound/unbound
            pair. Each thread (bound or free) has two partner threads.</para>
          </listitem>
        </itemizedlist>
 736       </section>
 737     </section>
 738   </section>
 739 </chapter>