ManagingFileSystemIO.xml

   1 <?xml version='1.0' encoding='utf-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4 xml:id="managingfilesystemio">
   5   <title xml:id="managingfilesystemio.title">Managing the File System and
   6   I/O</title>
   7   <section xml:id="dbdoclet.50438211_17536">
   8     <title>
   9     <indexterm>
  10       <primary>I/O</primary>
  11     </indexterm>
  12     <indexterm>
  13       <primary>I/O</primary>
  14       <secondary>full OSTs</secondary>
  15     </indexterm>Handling Full OSTs</title>
    <para>Sometimes a Lustre file system becomes unbalanced, often due to
    incorrectly specified stripe settings, or when very large files are created
    that are not striped over all of the OSTs. If an OST is full and an attempt
    is made to write more data to the file system, an error occurs. The
    procedures below describe how to handle a full OST.</para>
  21     <para>The MDS will normally handle space balancing automatically at file
  22     creation time, and this procedure is normally not needed, but may be
  23     desirable in certain circumstances (e.g. when creating very large files
  24     that would consume more than the total free space of the full OSTs).</para>
  25     <section remap="h3">
  26       <title>
  27       <indexterm>
  28         <primary>I/O</primary>
  29         <secondary>OST space usage</secondary>
  30       </indexterm>Checking OST Space Usage</title>
  31       <para>The example below shows an unbalanced file system:</para>
  32       <screen>
  33 client# lfs df -h
  34 UUID                       bytes           Used            Available       \
  35 Use%            Mounted on
  36 lustre-MDT0000_UUID        4.4G            214.5M          3.9G            \
  37 4%              /mnt/lustre[MDT:0]
  38 lustre-OST0000_UUID        2.0G            751.3M          1.1G            \
  39 37%             /mnt/lustre[OST:0]
  40 lustre-OST0001_UUID        2.0G            755.3M          1.1G            \
  41 37%             /mnt/lustre[OST:1]
  42 lustre-OST0002_UUID        2.0G            1.7G            155.1M          \
  43 86%             /mnt/lustre[OST:2] &lt;-
  44 lustre-OST0003_UUID        2.0G            751.3M          1.1G            \
  45 37%             /mnt/lustre[OST:3]
  46 lustre-OST0004_UUID        2.0G            747.3M          1.1G            \
  47 37%             /mnt/lustre[OST:4]
  48 lustre-OST0005_UUID        2.0G            743.3M          1.1G            \
  49 36%             /mnt/lustre[OST:5]
  50
  51 filesystem summary:        11.8G           5.4G            5.8G            \
  52 45%             /mnt/lustre
  53 </screen>
  54       <para>In this case, OST:2 is almost full and when an attempt is made to
  55       write additional information to the file system (even with uniform
  56       striping over all the OSTs), the write command fails as follows:</para>
  57       <screen>
  58 client# lfs setstripe /mnt/lustre 4M 0 -1
  59 client# dd if=/dev/zero of=/mnt/lustre/test_3 bs=10M count=100
  60 dd: writing '/mnt/lustre/test_3': No space left on device
  61 98+0 records in
  62 97+0 records out
  63 1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s
  64 </screen>
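      <para>The check above can also be scripted. The following is only a
      minimal sketch and not part of the Lustre tools. It assumes that each
      target occupies a single line of
      <literal>lfs df</literal> output (not wrapped as in the listing above)
      and that the bracketed OST index is decimal. The helper name
      <literal>find_full_osts</literal> and the 80% threshold are
      hypothetical choices made for illustration.</para>
      <programlisting language="python"><![CDATA[
```python
import re

# Matches one OST line of `lfs df` output, for example:
# "lustre-OST0002_UUID  2.0G  1.7G  155.1M  86%  /mnt/lustre[OST:2]"
# Assumption: one target per line, columns separated by whitespace.
OST_LINE = re.compile(
    r"^(?P<uuid>\S+_UUID)\s+\S+\s+\S+\s+\S+\s+(?P<pct>\d+)%\s+\S+\[OST:(?P<index>\d+)\]"
)


def find_full_osts(lfs_df_output, threshold=80):
    """Return (index, uuid, use_percent) for each OST at or above threshold."""
    full = []
    for line in lfs_df_output.splitlines():
        match = OST_LINE.match(line.strip())
        if match and int(match.group("pct")) >= threshold:
            full.append(
                (int(match.group("index")), match.group("uuid"), int(match.group("pct")))
            )
    return full


# Shortened sample based on the listing above.
SAMPLE = """\
UUID                 bytes   Used    Available  Use%  Mounted on
lustre-MDT0000_UUID  4.4G    214.5M  3.9G       4%    /mnt/lustre[MDT:0]
lustre-OST0000_UUID  2.0G    751.3M  1.1G       37%   /mnt/lustre[OST:0]
lustre-OST0002_UUID  2.0G    1.7G    155.1M     86%   /mnt/lustre[OST:2]
"""

print(find_full_osts(SAMPLE))  # [(2, 'lustre-OST0002_UUID', 86)]
```
]]></programlisting>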
  65     </section>
  66     <section remap="h3">
  67       <title>
  68       <indexterm>
  69         <primary>I/O</primary>
  70         <secondary>taking OST offline</secondary>
  71       </indexterm>Taking a Full OST Offline</title>
      <para>If OST usage is imbalanced, with one or more OSTs close to full
      while others have plenty of free space, the full OSTs may optionally be
      deactivated at the MDS. This prevents the MDS from allocating new
      objects on them and helps avoid running out of space in the file
      system.</para>
  77       <orderedlist>
  78         <listitem>
  79           <para>Log into the MDS server:</para>
  80           <screen>
  81 client# ssh root@192.168.0.10
  82 root@192.168.0.10's password:
  83 Last login: Wed Nov 26 13:35:12 2008 from 192.168.0.6
  84 </screen>
  85         </listitem>
  86         <listitem>
  87           <para>Use the
  88           <literal>lctl dl</literal> command to show the status of all file
  89           system components:</para>
  90           <screen>
  91 mds# lctl dl
  92 0 UP mgs MGS MGS 9
  93 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
  94 2 UP mdt MDS MDS_uuid 3
  95 3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
  96 4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 5
  97 5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  98 6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  99 7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
 100 8 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
 101 9 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5
 102 10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5
 103 </screen>
 104         </listitem>
 105         <listitem>
          <para>Use
          <literal>lctl deactivate</literal> to take the full OST
          offline:</para>
 109           <screen>
 110 mds# lctl --device 7 deactivate
 111 </screen>
 112         </listitem>
 113         <listitem>
 114           <para>Display the status of the file system components:</para>
 115           <screen>
 116 mds# lctl dl
 117 0 UP mgs MGS MGS 9
 118 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
 119 2 UP mdt MDS MDS_uuid 3
 120 3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
 121 4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 5
 122 5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
 123 6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
 124 7 IN osc lustre-OST0002-osc lustre-mdtlov_UUID 5
 125 8 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
 126 9 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5
 127 10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5
 128 </screen>
 129         </listitem>
 130       </orderedlist>
 131       <para>The device list shows that OST0002 is now inactive. When new files
 132       are created in the file system, they will only use the remaining active
 133       OSTs. Either manual space rebalancing can be done by migrating data to
 134       other OSTs, as shown in the next section, or normal file deletion and
 135       creation can be allowed to passively rebalance the space usage.</para>
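      <para>The device number passed to
      <literal>lctl --device</literal> in the steps above was read from the
      <literal>lctl dl</literal> listing by eye. As a rough sketch, the lookup
      could be automated as follows. The function name
      <literal>find_osc_device</literal> is hypothetical, and the code assumes
      the six-column layout shown in the listings above.</para>
      <programlisting language="python"><![CDATA[
```python
def find_osc_device(lctl_dl_output, ost_name):
    """Return (device_number, status) of the osc device for ost_name, or None.

    Assumed columns of `lctl dl`: number, status, type, name, uuid, refcount.
    """
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        if (
            len(fields) >= 4
            and fields[2] == "osc"
            and fields[3].startswith(ost_name + "-osc")
        ):
            return int(fields[0]), fields[1]
    return None


# Shortened sample based on the listing above.
SAMPLE = """\
4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 5
6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
"""

device, status = find_osc_device(SAMPLE, "lustre-OST0002")
# Only builds the command string; running it is left to the administrator.
print("lctl --device %d deactivate" % device)  # lctl --device 7 deactivate
```
]]></programlisting>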
 136     </section>
 137     <section remap="h3">
 138       <title>
 139       <indexterm>
 140         <primary>I/O</primary>
 141         <secondary>migrating data</secondary>
 142       </indexterm>
 143       <indexterm>
 144         <primary>migrating metadata</primary>
 145       </indexterm>
 146       <indexterm>
 147         <primary>maintenance</primary>
 148         <secondary>full OSTs</secondary>
 149       </indexterm>Migrating Data within a File System</title>
 150
 151           <para condition='l28'>Lustre software version 2.8 includes a
 152       feature to migrate metadata between MDTs. This migration can only be
 153       performed on whole directories. To migrate the contents of
 154       <literal>/lustre/testremote</literal> from the current MDT to
 155       MDT index 0, the sequence of commands is as follows:</para>
 156       <screen>$ cd /lustre
$ lfs getdirstripe -M ./testremote <lineannotation>which MDT is dir on?</lineannotation>
 158 1
 159 $ for i in $(seq 3); do touch ./testremote/${i}.txt; done <lineannotation>create test files</lineannotation>
 160 $ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done <lineannotation>check files are on MDT 1</lineannotation>
 161 1
 162 1
 163 1
 164 $ lfs migrate -m 0 ./testremote <lineannotation>migrate testremote to MDT 0</lineannotation>
 165 $ lfs getdirstripe -M ./testremote <lineannotation>which MDT is dir on now?</lineannotation>
 166 0
 167 $ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done <lineannotation>check files are on MDT 0 too</lineannotation>
 168 0
 169 0
 170 0</screen>
      <para>For more information, see <literal>man lfs</literal>.</para>
 172           <warning><para>Currently, only whole directories can be migrated
 173       between MDTs. During migration each file receives a new identifier
 174       (FID). As a consequence, the file receives a new inode number. File
 175       system tools (for example, backup and archiving tools) may behave
 176       incorrectly with files that are unchanged except for a new inode number.
 177       </para></warning>
 178       <para>As stripes cannot be moved within the file system, data must be
 179       migrated manually by copying and renaming the file, removing the original
 180       file, and renaming the new file with the original file name. The simplest
 181       way to do this is to use the
 182       <literal>lfs_migrate</literal> command (see
 183       <xref linkend="dbdoclet.50438206_42260" />). However, the steps for
 184       migrating a file by hand are also shown here for reference.</para>
 185       <orderedlist>
 186         <listitem>
 187           <para>Identify the file(s) to be moved.</para>
 188           <para>In the example below, output from the
 189           <literal>getstripe</literal> command indicates that the file
 190           <literal>test_2</literal> is located entirely on OST2:</para>
 191           <screen>
 192 client# lfs getstripe /mnt/lustre/test_2
 193 /mnt/lustre/test_2
 194 obdidx     objid   objid   group
 195      2      8     0x8       0
 196 </screen>
 197         </listitem>
 198         <listitem>
 199           <para>To move single object(s), create a new copy and remove the
 200           original. Enter:</para>
 201           <screen>
 202 client# cp -a /mnt/lustre/test_2 /mnt/lustre/test_2.tmp
 203 client# mv /mnt/lustre/test_2.tmp /mnt/lustre/test_2
 204 </screen>
 205         </listitem>
 206         <listitem>
 207           <para>To migrate large files from one or more OSTs, enter:</para>
 208           <screen>
client# lfs find <replaceable>/mount/point</replaceable> --ost <replaceable>ost_name</replaceable> -size +1G | lfs_migrate -y
 211 </screen>
 212         </listitem>
 213         <listitem>
 214           <para>Check the file system balance.</para>
 215           <para>The
 216           <literal>df</literal> output in the example below shows a more
 217           balanced system compared to the
 218           <literal>df</literal> output in the example in
 219           <xref linkend="dbdoclet.50438211_17536" />.</para>
 220           <screen>
 221 client# lfs df -h
 222 UUID                 bytes         Used            Available       Use%    \
 223         Mounted on
 224 lustre-MDT0000_UUID   4.4G         214.5M          3.9G            4%      \
 225         /mnt/lustre[MDT:0]
 226 lustre-OST0000_UUID   2.0G         1.3G            598.1M          65%     \
 227         /mnt/lustre[OST:0]
 228 lustre-OST0001_UUID   2.0G         1.3G            594.1M          65%     \
 229         /mnt/lustre[OST:1]
 230 lustre-OST0002_UUID   2.0G         913.4M          1000.0M         45%     \
 231         /mnt/lustre[OST:2]
 232 lustre-OST0003_UUID   2.0G         1.3G            602.1M          65%     \
 233         /mnt/lustre[OST:3]
 234 lustre-OST0004_UUID   2.0G         1.3G            606.1M          64%     \
 235         /mnt/lustre[OST:4]
 236 lustre-OST0005_UUID   2.0G         1.3G            610.1M          64%     \
 237         /mnt/lustre[OST:5]
 238
 239 filesystem summary:  11.8G 7.3G            3.9G    61%                     \
 240 /mnt/lustre
 241 </screen>
 242         </listitem>
 243       </orderedlist>
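      <para>The copy-and-rename step shown above can be sketched in Python as
      follows. This is an illustration of the general technique only, not a
      replacement for
      <literal>lfs_migrate</literal>. The helper name
      <literal>migrate_by_copy</literal> and the
      <literal>.tmp</literal> suffix are assumptions, and unlike
      <literal>cp -a</literal> the copy here does not preserve ownership. As
      with the manual procedure, it should not be used on files that are
      being written.</para>
      <programlisting language="python"><![CDATA[
```python
import os
import shutil
import tempfile


def migrate_by_copy(path):
    """Rewrite path by copying it to a temporary file and renaming the copy over it."""
    tmp_path = path + ".tmp"      # hypothetical temporary name
    shutil.copy2(path, tmp_path)  # copies data, timestamps and permission bits
    os.replace(tmp_path, path)    # renames the copy over the original
    return path


# Demonstration in an ordinary temporary directory (no Lustre required).
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "test_2")
    with open(sample, "wb") as handle:
        handle.write(b"example data")
    migrate_by_copy(sample)
    with open(sample, "rb") as handle:
        print(handle.read())  # b'example data'
```
]]></programlisting>
      <para>On a Lustre file system the new copy is allocated fresh objects,
      so its stripes are placed according to the current allocation policy
      rather than staying on the original OST.</para>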
 244     </section>
 245     <section remap="h3">
 246       <title>
 247       <indexterm>
 248         <primary>I/O</primary>
 249         <secondary>bringing OST online</secondary>
 250       </indexterm>
 251       <indexterm>
 252         <primary>maintenance</primary>
 253         <secondary>bringing OST online</secondary>
 254       </indexterm>Returning an Inactive OST Back Online</title>
      <para>Once the deactivated OST(s) are no longer severely imbalanced, due
      to either active or passive data redistribution, they should be
      reactivated so that new files are again allocated on them.</para>
 258       <screen>
mds# lctl --device 7 activate
mds# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
  4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 5
  5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
  8 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
  9 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5
 10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5
 272 </screen>
 273     </section>
 274   </section>
 275   <section xml:id="dbdoclet.50438211_75549">
 276     <title>
 277     <indexterm>
 278       <primary>I/O</primary>
 279       <secondary>pools</secondary>
 280     </indexterm>
 281     <indexterm>
 282       <primary>maintenance</primary>
 283       <secondary>pools</secondary>
 284     </indexterm>
 285     <indexterm>
 286       <primary>pools</primary>
 287     </indexterm>Creating and Managing OST Pools</title>
 288     <para>The OST pools feature enables users to group OSTs together to make
 289     object placement more flexible. A 'pool' is the name associated with an
 290     arbitrary subset of OSTs in a Lustre cluster.</para>
 291     <para>OST pools follow these rules:</para>
 292     <itemizedlist>
 293       <listitem>
 294         <para>An OST can be a member of multiple pools.</para>
 295       </listitem>
 296       <listitem>
 297         <para>No ordering of OSTs in a pool is defined or implied.</para>
 298       </listitem>
 299       <listitem>
 300         <para>Stripe allocation within a pool follows the same rules as the
 301         normal stripe allocator.</para>
 302       </listitem>
 303       <listitem>
 304         <para>OST membership in a pool is flexible, and can change over
 305         time.</para>
 306       </listitem>
 307     </itemizedlist>
 308     <para>When an OST pool is defined, it can be used to allocate files. When
 309     file or directory striping is set to a pool, only OSTs in the pool are
 310     candidates for striping. If a stripe_index is specified which refers to an
 311     OST that is not a member of the pool, an error is returned.</para>
 312     <para>OST pools are used only at file creation. If the definition of a pool
 313     changes (an OST is added or removed or the pool is destroyed),
 314     already-created files are not affected.</para>
 315     <note>
 316       <para>An error (
 317       <literal>EINVAL</literal>) results if you create a file using an empty
 318       pool.</para>
 319     </note>
 320     <note>
 321       <para>If a directory has pool striping set and the pool is subsequently
 322       removed, the new files created in this directory have the (non-pool)
 323       default striping pattern for that directory applied and no error is
 324       returned.</para>
 325     </note>
 326     <section remap="h3">
 327       <title>Working with OST Pools</title>
 328       <para>OST pools are defined in the configuration log on the MGS. Use the
 329       lctl command to:</para>
 330       <itemizedlist>
 331         <listitem>
 332           <para>Create/destroy a pool</para>
 333         </listitem>
 334         <listitem>
 335           <para>Add/remove OSTs in a pool</para>
 336         </listitem>
 337         <listitem>
 338           <para>List pools and OSTs in a specific pool</para>
 339         </listitem>
 340       </itemizedlist>
 341       <para>The lctl command MUST be run on the MGS. Another requirement for
 342       managing OST pools is to either have the MDT and MGS on the same node or
 343       have a Lustre client mounted on the MGS node, if it is separate from the
      MDS. This is needed to validate that the pool commands being run are
      correct.</para>
 346       <caution>
 347         <para>Running the
 348         <literal>writeconf</literal> command on the MDS erases all pools
 349         information (as well as any other parameters set using
 350         <literal>lctl conf_param</literal>). We recommend that the pools
 351         definitions (and
 352         <literal>conf_param</literal> settings) be executed using a script, so
 353         they can be reproduced easily after a
 354         <literal>writeconf</literal> is performed.</para>
 355       </caution>
 356       <para>To create a new pool, run:</para>
 357       <screen>
mgs# lctl pool_new <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable>
 361 </screen>
 362       <note>
 363         <para>The pool name is an ASCII string up to 16 characters.</para>
 364       </note>
 365       <para>To add the named OST to a pool, run:</para>
 366       <screen>
mgs# lctl pool_add <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable> <replaceable>ost_list</replaceable>
 371 </screen>
 372       <para>Where:</para>
 373       <itemizedlist>
 374         <listitem>
          <para>
            <literal><replaceable>ost_list</replaceable></literal> is
            <literal><replaceable>fsname</replaceable>-OST<replaceable>index_range</replaceable></literal>
          </para>
 381         </listitem>
 382         <listitem>
          <para>
          <literal><replaceable>index_range</replaceable></literal> is
          <literal><replaceable>ost_index_start</replaceable>-<replaceable>ost_index_end[,index_range]</replaceable></literal> or
          <literal><replaceable>ost_index_start</replaceable>-<replaceable>ost_index_end/step</replaceable></literal></para>
 391         </listitem>
 392       </itemizedlist>
 393       <para>If the leading
 394       <literal>
 395         <replaceable>fsname</replaceable>
 396       </literal> and/or ending
 397       <literal>_UUID</literal> are missing, they are automatically added.</para>
 398       <para>For example, to add even-numbered OSTs to
 399       <literal>pool1</literal> on file system
 400       <literal>lustre</literal>, run a single command (
 401       <literal>pool_add</literal>) to add many OSTs to the pool at one
 402       time:</para>
 403       <para>
 404         <screen>
mgs# lctl pool_add lustre.pool1 OST[0-10/2]
 406 </screen>
 407       </para>
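      <para>To make the
      <literal><replaceable>index_range</replaceable></literal> syntax
      concrete, the sketch below expands a range expression into the list of
      OST indices it selects. It is an illustration only and not the parser
      used by
      <literal>lctl</literal>. It assumes that indices are written in decimal
      and that the end of a range is inclusive. The function name
      <literal>expand_index_range</literal> is hypothetical.</para>
      <programlisting language="python"><![CDATA[
```python
def expand_index_range(index_range):
    """Expand comma-separated 'start-end[/step]' items into a list of indices.

    Assumptions: decimal indices, inclusive end, step defaults to 1.
    """
    indices = []
    for item in index_range.split(","):
        span, _, step_text = item.partition("/")
        start_text, _, end_text = span.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        step = int(step_text) if step_text else 1
        indices.extend(range(start, end + 1, step))
    return indices


# The expression inside the brackets of OST[0-10/2]:
print(expand_index_range("0-10/2"))  # [0, 2, 4, 6, 8, 10]
```
]]></programlisting>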
 408       <note>
        <para>Each time an OST is added to a pool, a new
        <literal>llog</literal> configuration record is created. For
        convenience, you can add many OSTs with a single
        <literal>pool_add</literal> command.</para>
 412       </note>
 413       <para>To remove a named OST from a pool, run:</para>
 414       <screen>
mgs# lctl pool_remove <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable> <replaceable>ost_list</replaceable>
 419 </screen>
 420       <para>To destroy a pool, run:</para>
 421       <screen>
mgs# lctl pool_destroy <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable>
 425 </screen>
 426       <note>
 427         <para>All OSTs must be removed from a pool before it can be
 428         destroyed.</para>
 429       </note>
 430       <para>To list pools in the named file system, run:</para>
 431       <screen>
mgs# lctl pool_list <replaceable>fsname|pathname</replaceable>
 434 </screen>
 435       <para>To list OSTs in a named pool, run:</para>
 436       <screen>
mgs# lctl pool_list <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable>
 440 </screen>
 441       <section remap="h4">
 442         <title>Using the lfs Command with OST Pools</title>
 443         <para>Several lfs commands can be run with OST pools. Use the
 444         <literal>lfs setstripe</literal> command to associate a directory with
 445         an OST pool. This causes all new regular files and directories in the
 446         directory to be created in the pool. The lfs command can be used to
 447         list pools in a file system and OSTs in a named pool.</para>
 448         <para>To associate a directory with a pool, so all new files and
 449         directories will be created in the pool, run:</para>
 450         <screen>
client# lfs setstripe --pool|-p pool_name <replaceable>filename|dirname</replaceable>
 453 </screen>
 454         <para>To set striping patterns, run:</para>
 455         <screen>
 456 client# lfs setstripe [--size|-s stripe_size] [--offset|-o start_ost]
 457            [--count|-c stripe_count] [--pool|-p pool_name]
           <replaceable>dir|filename</replaceable>
 460 </screen>
 461         <note>
 462           <para>If you specify striping with an invalid pool name, because the
 463           pool does not exist or the pool name was mistyped,
 464           <literal>lfs setstripe</literal> returns an error. Run
 465           <literal>lfs pool_list</literal> to make sure the pool exists and the
 466           pool name is entered correctly.</para>
 467         </note>
 468         <note>
 469           <para>The
 470           <literal>--pool</literal> option for lfs setstripe is compatible with
 471           other modifiers. For example, you can set striping on a directory to
 472           use an explicit starting index.</para>
 473         </note>
 474       </section>
 475     </section>
 476     <section remap="h3">
 477       <title>
 478       <indexterm>
 479         <primary>pools</primary>
 480         <secondary>usage tips</secondary>
 481       </indexterm>Tips for Using OST Pools</title>
 482       <para>Here are several suggestions for using OST pools.</para>
 483       <itemizedlist>
 484         <listitem>
 485           <para>A directory or file can be given an extended attribute (EA),
 486           that restricts striping to a pool.</para>
 487         </listitem>
 488         <listitem>
 489           <para>Pools can be used to group OSTs with the same technology or
 490           performance (slower or faster), or that are preferred for certain
 491           jobs. Examples are SATA OSTs versus SAS OSTs or remote OSTs versus
 492           local OSTs.</para>
 493         </listitem>
 494         <listitem>
 495           <para>A file created in an OST pool tracks the pool by keeping the
 496           pool name in the file LOV EA.</para>
 497         </listitem>
 498       </itemizedlist>
 499     </section>
 500   </section>
 501   <section xml:id="dbdoclet.50438211_11204">
 502     <title>
 503     <indexterm>
 504       <primary>I/O</primary>
 505       <secondary>adding an OST</secondary>
 506     </indexterm>Adding an OST to a Lustre File System</title>
    <para>To add an OST to an existing Lustre file system:</para>
 508     <orderedlist>
 509       <listitem>
        <para>Add a new OST by running the following commands:</para>
 511         <screen>
 512 oss# mkfs.lustre --fsname=spfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
 513 oss# mkdir -p /mnt/test/ost12
 514 oss# mount -t lustre /dev/sda /mnt/test/ost12
 515 </screen>
 516       </listitem>
 517       <listitem>
 518         <para>Migrate the data (possibly).</para>
 519         <para>The file system is quite unbalanced when new empty OSTs are
 520         added. New file creations are automatically balanced. If this is a
 521         scratch file system or files are pruned at a regular interval, then no
 522         further work may be needed. Files existing prior to the expansion can
 523         be rebalanced with an in-place copy, which can be done with a simple
 524         script.</para>
 525         <para>The basic method is to copy existing files to a temporary file,
 526         then move the temp file over the old one. This should not be attempted
 527         with files which are currently being written to by users or
 528         applications. This operation redistributes the stripes over the entire
 529         set of OSTs.</para>
 530         <para>A very clever migration script would do the following:</para>
 531         <itemizedlist>
 532           <listitem>
 533             <para>Examine the current distribution of data.</para>
 534           </listitem>
 535           <listitem>
 536             <para>Calculate how much data should move from each full OST to the
 537             empty ones.</para>
 538           </listitem>
 539           <listitem>
 540             <para>Search for files on a given full OST (using
 541             <literal>lfs getstripe</literal>).</para>
 542           </listitem>
 543           <listitem>
 544             <para>Force the new destination OST (using
 545             <literal>lfs setstripe</literal>).</para>
 546           </listitem>
 547           <listitem>
 548             <para>Copy only enough files to address the imbalance.</para>
 549           </listitem>
 550         </itemizedlist>
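        <para>The second step in the list above, calculating how much data
        should move, can be sketched as follows. This is a simplified
        illustration that assumes all OSTs have the same capacity and aims
        for the mean usage. The function name
        <literal>plan_rebalance</literal> and the sample figures are
        hypothetical.</para>
        <programlisting language="python"><![CDATA[
```python
def plan_rebalance(used_by_ost):
    """Return {ost: amount_to_move_off} for OSTs above the mean usage.

    Assumption: every OST has the same capacity, so the target is the mean.
    """
    target = sum(used_by_ost.values()) / len(used_by_ost)
    return {
        ost: used - target
        for ost, used in used_by_ost.items()
        if used > target
    }


# Usage in MB before a new, empty OST0003 receives any data (made-up figures).
usage = {"OST0000": 750, "OST0001": 750, "OST0002": 1700, "OST0003": 0}
print(plan_rebalance(usage))  # {'OST0002': 900.0}
```
]]></programlisting>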
 551       </listitem>
 552     </orderedlist>
 553     <para>If a Lustre file system administrator wants to explore this approach
 554     further, per-OST disk-usage statistics can be found under
    <literal>/proc/fs/lustre/osc/*/rpc_stats</literal>.</para>
 556   </section>
 557   <section xml:id="dbdoclet.50438211_80295">
 558     <title>
 559     <indexterm>
 560       <primary>I/O</primary>
 561       <secondary>direct</secondary>
 562     </indexterm>Performing Direct I/O</title>
    <para>The Lustre software supports the
    <literal>O_DIRECT</literal> flag to
    <literal>open()</literal>.</para>
    <para>Applications using the
    <literal>read()</literal> and
    <literal>write()</literal> calls must supply buffers aligned on a page
    boundary (usually 4 KB). If the alignment is not correct, the call returns
    <literal>-EINVAL</literal>. Direct I/O may help performance in cases where
    the client is doing a large amount of I/O and is CPU-bound (CPU utilization
    at 100%).</para>
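    <para>One way to obtain a page-aligned buffer is an anonymous memory
    mapping, which always starts on a page boundary. The sketch below shows
    the idea in Python on Linux. It is a hedged illustration: the helper names
    are hypothetical, and padding the transfer length to a whole number of
    pages is an assumption based on common
    <literal>O_DIRECT</literal> practice rather than a requirement stated
    above.</para>
    <programlisting language="python"><![CDATA[
```python
import mmap
import os

PAGE_SIZE = mmap.PAGESIZE  # usually 4096 bytes


def round_up_to_page(nbytes, page=PAGE_SIZE):
    """Round nbytes up to the next multiple of the page size."""
    return -(-nbytes // page) * page


def direct_write(path, data):
    """Write data to path with O_DIRECT using a page-aligned buffer.

    Assumes Linux and a file system that accepts O_DIRECT (tmpfs does not).
    """
    length = round_up_to_page(len(data))
    buf = mmap.mmap(-1, length)  # anonymous mapping, page-aligned and zero-filled
    buf.write(data)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        written = os.write(fd, buf)  # transfers the whole padded buffer
        os.ftruncate(fd, len(data))  # drop the zero padding again
    finally:
        os.close(fd)
        buf.close()
    return written


print(round_up_to_page(5000, 4096))  # 8192
```
]]></programlisting>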
 572     <section remap="h3">
 573       <title>Making File System Objects Immutable</title>
      <para>An immutable file or directory is one that cannot be modified,
      renamed or removed. To make a file or directory immutable, run:</para>
      <screen>
chattr +i <replaceable>file</replaceable>
</screen>
      <para>To remove this flag, use
      <literal>chattr -i</literal>.</para>
 582     </section>
 583   </section>
 584   <section xml:id="dbdoclet.50438211_61024">
 585     <title>Other I/O Options</title>
 586     <para>This section describes other I/O options, including checksums, and
 587     the ptlrpcd thread pool.</para>
 588     <section remap="h3">
 589       <title>Lustre Checksums</title>
 590       <para>To guard against network data corruption, a Lustre client can
 591       perform two types of data checksums: in-memory (for data in client
 592       memory) and wire (for data sent over the network). For each checksum
 593       type, a 32-bit checksum of the data read or written on both the client
 594       and server is computed, to ensure that the data has not been corrupted in
 595       transit over the network. The
 596       <literal>ldiskfs</literal> backing file system does NOT do any persistent
 597       checksumming, so it does not detect corruption of data in the OST file
 598       system.</para>
 599       <para>The checksumming feature is enabled, by default, on individual
 600       client nodes. If the client or OST detects a checksum mismatch, then an
 601       error is logged in the syslog of the form:</para>
 602       <screen>
 603 LustreError: BAD WRITE CHECKSUM: changed in transit before arrival at OST: \
 604 from 192.168.1.1@tcp inum 8991479/2386814769 object 1127239/0 extent [10240\
 605 0-106495]
 606 </screen>
 607       <para>If this happens, the client will re-read or re-write the affected
 608       data up to five times to get a good copy of the data over the network. If
 609       it is still not possible, then an I/O error is returned to the
 610       application.</para>
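      <para>The detect-and-retry behavior described above can be illustrated
      with a small simulation. This is not the Lustre implementation. It
      only shows the general pattern of comparing a sender-side and a
      receiver-side 32-bit checksum and retrying on mismatch. The function
      names are hypothetical, the simulated channel stands in for the
      network, and treating "up to five times" as five attempts in total is
      an assumption.</para>
      <programlisting language="python"><![CDATA[
```python
import zlib


def checksum(data, algorithm="adler32"):
    """Return a 32-bit checksum of data using adler32 or crc32."""
    if algorithm == "adler32":
        return zlib.adler32(data) & 0xFFFFFFFF
    if algorithm == "crc32":
        return zlib.crc32(data) & 0xFFFFFFFF
    raise ValueError("unsupported algorithm: %s" % algorithm)


def transfer_with_retries(data, channel, algorithm="adler32", max_attempts=5):
    """Send data through channel until the checksums match; return (data, attempts)."""
    expected = checksum(data, algorithm)
    for attempt in range(1, max_attempts + 1):
        received = channel(data)
        if checksum(received, algorithm) == expected:
            return received, attempt
    raise OSError("checksum mismatch after %d attempts" % max_attempts)


def make_flaky_channel(failures):
    """Return a simulated channel that flips one bit in the first `failures` transfers."""
    state = {"remaining": failures}

    def channel(data):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return channel


received, attempts = transfer_with_retries(b"lustre data", make_flaky_channel(2))
print(attempts)  # 3
```
]]></programlisting>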
 611       <para>To enable both types of checksums (in-memory and wire), run:</para>
 612       <screen>
 613 lctl set_param llite.*.checksum_pages=1
 614 </screen>
 615       <para>To disable both types of checksums (in-memory and wire),
 616       run:</para>
 617       <screen>
 618 lctl set_param llite.*.checksum_pages=0
 619 </screen>
 620       <para>To check the status of a wire checksum, run:</para>
 621       <screen>
 622 lctl get_param osc.*.checksums
 623 </screen>
 624       <section remap="h4">
 625         <title>Changing Checksum Algorithms</title>
 626         <para>By default, the Lustre software uses the adler32 checksum
 627         algorithm, because it is robust and has a lower impact on performance
 628         than crc32. The Lustre file system administrator can change the
 629         checksum algorithm via
        <literal>lctl set_param</literal>, depending on what is supported in
 631         the kernel.</para>
 632         <para>To check which checksum algorithm is being used by the Lustre
 633         software, run:</para>
 634         <screen>
 635 $ lctl get_param osc.*.checksum_type
 636 </screen>
 637         <para>To change the wire checksum algorithm, run:</para>
 638         <screen>
$ lctl set_param osc.*.checksum_type=<replaceable>algorithm</replaceable>
 641 </screen>
 642         <note>
 643           <para>The in-memory checksum always uses the adler32 algorithm, if
 644           available, and only falls back to crc32 if adler32 cannot be
 645           used.</para>
 646         </note>
 647         <para>In the following example, the
 648         <literal>lctl get_param</literal> command is used to determine that the
 649         Lustre software is using the adler32 checksum algorithm. Then the
 650         <literal>lctl set_param</literal> command is used to change the checksum
 651         algorithm to crc32. A second
 652         <literal>lctl get_param</literal> command confirms that the crc32
 653         checksum algorithm is now in use.</para>
 654         <screen>
 655 $ lctl get_param osc.*.checksum_type
 656 osc.lustre-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler]
 657 $ lctl set_param osc.*.checksum_type=crc32
 658 osc.lustre-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32
 659 $ lctl get_param osc.*.checksum_type
 660 osc.lustre-OST0000-osc-ffff81012b2c48e0.checksum_type=[crc32] adler
 661 </screen>
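        <para>In the output above, the algorithm currently in use is the one
        enclosed in square brackets. A script that needs to read this value
        could extract it as in the following sketch. The function name
        <literal>active_checksum</literal> is hypothetical, and the code
        assumes the
        <literal>name=value</literal> line format shown in the example.</para>
        <programlisting language="python"><![CDATA[
```python
import re


def active_checksum(param_line):
    """Return the bracketed (active) algorithm from a checksum_type line, or None."""
    _, _, value = param_line.partition("=")
    match = re.search(r"\[(\w+)\]", value)
    return match.group(1) if match else None


line = "osc.lustre-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler]"
print(active_checksum(line))  # adler
```
]]></programlisting>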
 662       </section>
 663     </section>
 664     <section remap="h3">
 665       <title>Ptlrpc Thread Pool</title>
 666       <para>Releases prior to Lustre software release 2.2 used two portal RPC
 667       daemons for each client/server pair. One daemon handled all synchronous
 668       IO requests, and the second daemon handled all asynchronous (non-IO)
 669       RPCs. The increasing use of large SMP nodes for Lustre servers exposed
 670       some scaling issues. The lack of threads for large SMP nodes resulted in
 671       cases where a single CPU would be 100% utilized and other CPUs would be
      relatively idle. This is especially noticeable when a single client
 673       traverses a large directory.</para>
 674       <para>Lustre software release 2.2.x implements a ptlrpc thread pool, so
 675       that multiple threads can be created to serve asynchronous RPC requests.
 676       The number of threads spawned is controlled at module load time using
 677       module options. By default one thread is spawned per CPU, with a minimum
 678       of 2 threads spawned irrespective of module options.</para>
 679       <para>One of the issues with thread operations is the cost of moving a
 680       thread context from one CPU to another with the resulting loss of CPU
 681       cache warmth. To reduce this cost, ptlrpc threads can be bound to a CPU.
 682       However, if the CPUs are busy, a bound thread may not be able to respond
 683       quickly, as the bound CPU may be busy with other tasks and the thread
 684       must wait to schedule.</para>
 685       <para>Because of these considerations, the pool of ptlrpc threads can be
 686       a mixture of bound and unbound threads. The system operator can balance
 687       the thread mixture based on system size and workload.</para>
 688       <section>
 689         <title>ptlrpcd parameters</title>
 690         <para>These parameters should be set in
 691         <literal>/etc/modprobe.conf</literal> or in the
        <literal>/etc/modprobe.d</literal> directory, as options for the ptlrpc
 693         module.
 694         <screen>
 695 options ptlrpcd max_ptlrpcds=XXX
 696 </screen></para>
 697         <para>Sets the number of ptlrpcd threads created at module load time.
 698         The default if not specified is one thread per CPU, including
        hyper-threaded CPUs. The lower bound is 2 (old ptlrpcd behaviour).
 700         <screen>
 701 options ptlrpcd ptlrpcd_bind_policy=[1-4]
 702 </screen></para>
 703         <para>Controls the binding of threads to CPUs. There are four policy
 704         options.</para>
 705         <itemizedlist>
 706           <listitem>
 707             <para>
            <literal role="bold">PDB_POLICY_NONE</literal>
            (ptlrpcd_bind_policy=1) All threads are unbound.</para>
 711           </listitem>
 712           <listitem>
 713             <para>
            <literal role="bold">PDB_POLICY_FULL</literal>
            (ptlrpcd_bind_policy=2) All threads attempt to bind to a
            CPU.</para>
 717           </listitem>
 718           <listitem>
 719             <para>
            <literal role="bold">PDB_POLICY_PAIR</literal>
            (ptlrpcd_bind_policy=3) This is the default policy. Threads are
            allocated as a bound/unbound pair. Each thread (bound or free)
            has a partner thread. The partnering is used by the ptlrpcd load
            policy, which determines how threads are allocated to CPUs.</para>
 726           </listitem>
 727           <listitem>
 728             <para>
            <literal role="bold">PDB_POLICY_NEIGHBOR</literal>
            (ptlrpcd_bind_policy=4) Threads are allocated as a bound/unbound
            pair. Each thread (bound or free) has two partner threads.</para>
 733           </listitem>
 734         </itemizedlist>
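        <para>As a small illustration of how the two parameters fit
        together, the sketch below maps the policy names listed above to
        their numeric values and produces option lines in the form shown in
        this section. The helper names are hypothetical, and clamping the
        thread count to the documented lower bound of 2 is done here only to
        mirror the description above.</para>
        <programlisting language="python"><![CDATA[
```python
# Numeric values taken from the policy list above.
BIND_POLICIES = {
    "PDB_POLICY_NONE": 1,
    "PDB_POLICY_FULL": 2,
    "PDB_POLICY_PAIR": 3,
    "PDB_POLICY_NEIGHBOR": 4,
}


def default_thread_count(cpu_count):
    """One thread per CPU, with the documented minimum of 2."""
    return max(2, cpu_count)


def ptlrpcd_option_lines(max_ptlrpcds, bind_policy="PDB_POLICY_PAIR"):
    """Return option lines in the form used in this section."""
    threads = max(2, max_ptlrpcds)
    return [
        "options ptlrpcd max_ptlrpcds=%d" % threads,
        "options ptlrpcd ptlrpcd_bind_policy=%d" % BIND_POLICIES[bind_policy],
    ]


for line in ptlrpcd_option_lines(8, "PDB_POLICY_FULL"):
    print(line)
# options ptlrpcd max_ptlrpcds=8
# options ptlrpcd ptlrpcd_bind_policy=2
```
]]></programlisting>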
 735       </section>
 736     </section>
 737   </section>
 738 </chapter>