LustreOperations.xml

   1 <?xml version='1.0' encoding='utf-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3  xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4  xml:id="lustreoperations">
   5   <title xml:id="lustreoperations.title">Lustre Operations</title>
   6   <para>Once you have the Lustre file system up and running, you can use the
   7   procedures in this section to perform these basic Lustre administration
   8   tasks.</para>
   9   <section xml:id="mount_by_label">
  10     <title>
  11     <indexterm>
  12       <primary>operations</primary>
  13     </indexterm>
  14     <indexterm>
  15       <primary>operations</primary>
  16       <secondary>mounting by label</secondary>
  17     </indexterm>Mounting by Label</title>
  18     <para>The file system name is limited to 8 characters. We have encoded the
  19     file system and target information in the disk label, so you can mount by
  20     label. This allows system administrators to move disks around without
  21     worrying about issues such as SCSI disk reordering or getting the
  22     <literal>/dev/device</literal> wrong for a shared target. Soon, file system
  23     naming will be made as fail-safe as possible. Currently, Linux disk labels
  24     are limited to 16 characters. To identify the target within the file
  25     system, 8 characters are reserved, leaving 8 characters for the file system
  26     name:</para>
  27 <screen>
  28 <replaceable>fsname</replaceable>-MDT0000 or
  29 <replaceable>fsname</replaceable>-OST0a19
  30 </screen>
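    <para>The label layout above can be checked with a little arithmetic: a
    dash, a three-letter target type and a four-digit hexadecimal index take 8
    characters, which leaves 8 of the 16 label characters for the file system
    name. The following is a minimal sketch only; the helper name
    <literal>make_label</literal> is hypothetical and is not part of the Lustre
    tools.</para>
<screen>
```shell
#!/bin/bash
# Hypothetical helper (not a Lustre tool): compose a target label from a
# file system name, a target type (MDT or OST) and a numeric target index.
make_label() {
    local fsname=$1 type=$2 index=$3
    # The file system name may use at most 8 of the 16 label characters.
    if [ "${#fsname}" -gt 8 ]; then
        echo "fsname '$fsname' is longer than 8 characters" >/dev/stderr
        return 1
    fi
    # Dash + type + 4-digit hexadecimal index = 8 characters.
    printf '%s-%s%04x\n' "$fsname" "$type" "$index"
}

make_label testfs MDT 0       # prints testfs-MDT0000
make_label testfs OST 2585    # prints testfs-OST0a19
```
</screen>
    <para>Index 2585 is 0xa19 in hexadecimal, which is why the second call
    yields the <literal>OST0a19</literal> suffix shown above.</para>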
  31     <para>To mount by label, use this command:</para>
  32 <screen>
  33 mount -t lustre -L <replaceable>file_system_label</replaceable> <replaceable>/mount_point</replaceable>
  34 </screen>
  35     <para>This is an example of mount-by-label:</para>
  36 <screen>
  37 mds# mount -t lustre -L testfs-MDT0000 /mnt/mdt
  38 </screen>
  39     <caution>
  40       <para>Mount-by-label should NOT be used in a multi-path environment or
  41       when snapshots are being created of the device, since multiple block
  42       devices will have the same label.</para>
  43     </caution>
  44     <para>Although the file system name is internally limited to 8 characters,
  45     you can mount the clients at any mount point, so file system users are not
  46     subjected to short names. Here is an example:</para>
  47 <screen>
  48 client# mount -t lustre mds0@tcp0:/short <replaceable>/dev/long_mountpoint_name</replaceable>
  49 </screen>
  50   </section>
  51   <section xml:id="starting_lustre">
  52     <title>
  53     <indexterm>
  54       <primary>operations</primary>
  55       <secondary>starting</secondary>
  56     </indexterm>Starting Lustre</title>
  57     <para>On the first start of a Lustre file system, the components must be
  58     started in the following order:</para>
  59     <orderedlist>
  60       <listitem>
  61         <para>Mount the MGT.</para>
  62         <note>
  63           <para>If a combined MGT/MDT is present, Lustre will correctly mount
  64           the MGT and MDT automatically.</para>
  65         </note>
  66       </listitem>
  67       <listitem>
  68         <para>Mount the MDT.</para>
  69         <note>
  70           <para>Mount all MDTs if multiple MDTs are present.</para>
  71         </note>
  72       </listitem>
  73       <listitem>
  74         <para>Mount the OST(s).</para>
  75       </listitem>
  76       <listitem>
  77         <para>Mount the client(s).</para>
  78       </listitem>
  79     </orderedlist>
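    <para>The ordering above can be expressed as a simple ranking. The sketch
    below only sorts a list of components into start order and does not mount
    anything; the function names and the sample component names are
    hypothetical.</para>
<screen>
```shell
#!/bin/bash
# Rank a component type by its position in the first-start order.
start_rank() {
    case "$1" in
        mgt)    echo 1 ;;
        mdt)    echo 2 ;;
        ost)    echo 3 ;;
        client) echo 4 ;;
        *)      echo 9 ;;   # unknown types sort last
    esac
}

# Read "type name" pairs on stdin and print them in start order.
sort_for_start() {
    while read -r type name; do
        echo "$(start_rank "$type") $type $name"
    done | sort -s -n -k1,1 | cut -d' ' -f2-
}

# Sample input in arbitrary order (hypothetical names).
printf '%s\n' "client /mnt/testfs" "ost testfs-OST0000" \
    "mdt testfs-MDT0000" "mgt MGS" | sort_for_start
```
</screen>
    <para>For the sample input, the MGT is listed first, then the MDT, the
    OST, and finally the client.</para>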
  80   </section>
  81   <section xml:id="mounting_server">
  82     <title>
  83     <indexterm>
  84       <primary>operations</primary>
  85       <secondary>mounting</secondary>
  86     </indexterm>Mounting a Server</title>
  87     <para>Starting a Lustre server is straightforward and only involves the
  88     mount command. Lustre servers can also be added to <literal>/etc/fstab</literal>
  89     (see the entries below). To list the mounted Lustre targets, run:</para>
  90 <screen>
  91 mount -t lustre
  92 </screen>
  93     <para>The mount command generates output similar to this:</para>
  94 <screen>
  95 /dev/sda1 on /mnt/test/mdt type lustre (rw)
  96 /dev/sda2 on /mnt/test/ost0 type lustre (rw)
  97 192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)
  98 </screen>
  99     <para>In this example, the MDT, an OST (ost0) and file system (testfs) are
 100     mounted. The <literal>/etc/fstab</literal> entries for the two server targets are:</para>
 101 <screen>
 102 LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0
 103 LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0
 104 </screen>
 105     <para>In general, it is wise to specify noauto and let your
 106     high-availability (HA) package manage when to mount the device. If you are
 107     not using failover, make sure that networking has been started before
 108     mounting a Lustre server. If you are running Red Hat Enterprise Linux, SUSE
 109     Linux Enterprise Server, or Debian (and perhaps other distributions), use
 110     the <literal>_netdev</literal> flag to ensure that these disks are mounted
 111     after the network is up, unless you are using systemd 232 or greater, which
 112     recognizes <literal>lustre</literal> as a network filesystem.
 113     If you are using <literal>lnet.service</literal>, use
 114     <literal>x-systemd.requires=lnet.service</literal> regardless of systemd
 115     version.</para>
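    <para>As an illustration of the options discussed above, the sketch below
    prints an <literal>/etc/fstab</literal> entry that mounts by label and
    depends on <literal>lnet.service</literal>. The helper name
    <literal>fstab_entry</literal> is hypothetical, and the particular
    combination of options is an assumption that should be adapted to the
    local HA and systemd setup.</para>
<screen>
```shell
#!/bin/bash
# Hypothetical helper: print an /etc/fstab line for a Lustre server target.
# Assumes the target is mounted by label and that lnet.service is in use.
fstab_entry() {
    local label=$1 mountpoint=$2
    local opts="defaults,_netdev,noauto,x-systemd.requires=lnet.service"
    printf 'LABEL=%s %s lustre %s 0 0\n' "$label" "$mountpoint" "$opts"
}

fstab_entry testfs-OST0000 /mnt/test/ost0
```
</screen>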
 116     <para>We are mounting by disk label here. The label of a device can be read
 117     with <literal>e2label</literal>. The label of a newly-formatted Lustre
 118     server may end in <literal>FFFF</literal> if the
 119     <literal>--index</literal> option is not specified to
 120     <literal>mkfs.lustre</literal>, meaning that it has yet to be assigned. The
 121     assignment takes place when the server is first started, and the disk label
 122     is updated. It is recommended that the
 123     <literal>--index</literal> option always be used, which will also ensure
 124     that the label is set at format time.</para>
 125     <caution>
 126       <para>Do not do this when the client and OSS are on the same node, as
 127       memory pressure between the client and OSS can lead to deadlocks.</para>
 128     </caution>
 129     <caution>
 130       <para>Mount-by-label should NOT be used in a multi-path
 131       environment.</para>
 132     </caution>
 133   </section>
 134   <section xml:id="shutdownLustre">
 135       <title>
 136           <indexterm>
 137               <primary>operations</primary>
 138               <secondary>shutting down</secondary>
 139           </indexterm>Stopping the Filesystem</title>
 140       <para>A complete Lustre filesystem shutdown occurs by unmounting all
 141       clients and servers in the order shown below.  Please note that unmounting
 142       a block device causes the Lustre software to be shut down on that node.
 143       </para>
 144       <note><para>The <literal>-a -t lustre</literal> options in the
 145           commands below do not name a filesystem; they tell
 146           <literal>umount</literal> to unmount all entries in
 147           <literal>/etc/mtab</literal> that are of type <literal>lustre</literal>.</para></note>
 148       <orderedlist>
 149           <listitem><para>Unmount the clients</para>
 150               <para>On each client node, unmount the filesystem on that client
 151               using the <literal>umount</literal> command:</para>
 152               <para><literal>umount -a -t lustre</literal></para>
 153               <para>The example below shows the unmount of the
 154               <literal>testfs</literal> filesystem on a client node:</para>
 155               <para>
 156 <screen>
 157 [root@client1 ~]# mount -t lustre
 158 XXX.XXX.0.11@tcp:/testfs on /mnt/testfs type lustre (rw,lazystatfs)
 159
 160 [root@client1 ~]# umount -a -t lustre
 161 [154523.177714] Lustre: Unmounted testfs-client
 162 </screen>
 163             </para>
 164           </listitem>
 165           <listitem>
 166             <para>Unmount the MDT and MGT</para>
 167             <para>On the MGS and MDS node(s), run the
 168               <literal>umount</literal> command:</para>
 169             <para><literal>umount -a -t lustre</literal></para>
 170             <para>The example below shows the unmount of the MDT and MGT for
 171               the <literal>testfs</literal> filesystem on a combined MGS/MDS:
 172             </para>
 173             <para>
 174 <screen>
 175 [root@mds1 ~]# mount -t lustre
 176 /dev/sda on /mnt/mgt type lustre (ro)
 177 /dev/sdb on /mnt/mdt type lustre (ro)
 178
 179 [root@mds1 ~]# umount -a -t lustre
 180 [155263.566230] Lustre: Failing over testfs-MDT0000
 181 [155263.775355] Lustre: server umount testfs-MDT0000 complete
 182 [155269.843862] Lustre: server umount MGS complete
 183 </screen>
 184             </para>
 185             <para>For a separate MGS and MDS, the same command is used, first on
 186             the MDS and then followed by the MGS.</para>
 187           </listitem>
 188           <listitem><para>Unmount all the OSTs</para>
 189               <para>On each OSS node, use the <literal>umount</literal> command:
 190               </para>
 191               <para><literal>umount -a -t lustre</literal></para>
 192               <para>The example below shows the unmount of all OSTs for the
 193               <literal>testfs</literal> filesystem on server
 194               <literal>OSS1</literal>:
 195               </para>
 196               <para>
 197 <screen>
 198 [root@oss1 ~]# mount |grep lustre
 199 /dev/sda on /mnt/ost0 type lustre (ro)
 200 /dev/sdb on /mnt/ost1 type lustre (ro)
 201 /dev/sdc on /mnt/ost2 type lustre (ro)
 202
 203 [root@oss1 ~]# umount -a -t lustre
 204 Lustre: Failing over testfs-OST0002
 205 Lustre: server umount testfs-OST0002 complete
 206 </screen>
 207             </para>
 208           </listitem>
 209       </orderedlist>
 210       <para>For unmount command syntax for a single OST, MDT, or MGT target,
 211       please refer to <xref linkend="umountTarget"/>.</para>
 212   </section>
 213   <section xml:id="umountTarget">
 214     <title>
 215     <indexterm>
 216       <primary>operations</primary>
 217       <secondary>unmounting</secondary>
 218     </indexterm>Unmounting a Specific Target on a Server</title>
 219     <para>To stop a Lustre OST, MDT, or MGT, use the
 220     <literal>umount
 221     <replaceable>/mount_point</replaceable></literal> command.</para>
 222     <para>The example below stops an OST, <literal>ost0</literal>, on mount
 223     point <literal>/mnt/ost0</literal> for the <literal>testfs</literal>
 224     filesystem:</para>
 225 <screen>
 226 [root@oss1 ~]# umount /mnt/ost0
 227 Lustre: Failing over testfs-OST0000
 228 Lustre: server umount testfs-OST0000 complete
 229 </screen>
 230     <para>Gracefully stopping a server with the
 231     <literal>umount</literal> command preserves the state of the connected
 232     clients. The next time the server is started, it waits for clients to
 233     reconnect, and then goes through the recovery procedure.</para>
 234     <para>If the force (
 235     <literal>-f</literal>) flag is used, then the server evicts all clients and
 236     stops WITHOUT recovery. Upon restart, the server does not wait for
 237     recovery. Any currently connected clients receive I/O errors until they
 238     reconnect.</para>
 239     <note>
 240       <para>If you are using loopback devices, use the
 241       <literal>-d</literal> flag. This flag cleans up loop devices and can
 242       always be safely specified.</para>
 243     </note>
 244   </section>
 245   <section xml:id="failover_ost">
 246     <title>
 247     <indexterm>
 248       <primary>operations</primary>
 249       <secondary>failover</secondary>
 250     </indexterm>Specifying Failout/Failover Mode for OSTs</title>
 251     <para>In a Lustre file system, an OST that has become unreachable because
 252     it fails, is taken off the network, or is unmounted can be handled in one
 253     of two ways:</para>
 254     <itemizedlist>
 255       <listitem>
 256         <para>In <literal>failout</literal> mode, Lustre clients receive
 257         errors (EIOs) as soon as a timeout expires, instead of waiting for the
 258         OST to recover.</para>
 259       </listitem>
 260       <listitem>
 261         <para>In <literal>failover</literal> mode, Lustre clients wait for the
 262         OST to recover.</para>
 263       </listitem>
 264     </itemizedlist>
 265     <para>By default, the Lustre file system uses
 266     <literal>failover</literal> mode for OSTs. To specify
 267     <literal>failout</literal> mode instead, use the
 268     <literal>--param="failover.mode=failout"</literal> option as shown below
 269     (entered on one line):</para>
 270 <screen>
 271 oss# mkfs.lustre --fsname=<replaceable>fsname</replaceable> --mgsnode=<replaceable>mgs_NID</replaceable> \
 272         --param=failover.mode=failout --ost --index=<replaceable>ost_index</replaceable> <replaceable>/dev/ost_block_device</replaceable>
 273 </screen>
 274     <para>In the example below,
 275     <literal>failout</literal> mode is specified for the OSTs on the MGS
 276     <literal>mds0</literal> in the file system
 277     <literal>testfs</literal> (entered on one line).</para>
 278 <screen>
 279 oss# mkfs.lustre --fsname=testfs --mgsnode=mds0 --param=failover.mode=failout \
 280       --ost --index=3 /dev/sdb
 281 </screen>
 282     <caution>
 283       <para>Before running this command, unmount all OSTs that will be affected
 284       by a change in <literal>failover</literal>/<literal>failout</literal> mode.
 285       </para>
 286     </caution>
 287     <note>
 288       <para>After initial file system configuration, use the
 289       <literal>tunefs.lustre</literal> utility to change the mode. For example,
 290       to set the <literal>failout</literal> mode, run:</para>
 291       <para>
 292 <screen>
 293 # tunefs.lustre --param failover.mode=failout <replaceable>/dev/ost_device</replaceable>
 294 </screen>
 295       </para>
 296     </note>
 297   </section>
 298   <section xml:id="degraded_ost">
 299     <title>
 300     <indexterm>
 301       <primary>operations</primary>
 302       <secondary>degraded OST RAID</secondary>
 303     </indexterm>Handling Degraded OST RAID Arrays</title>
 304     <para>Lustre includes functionality that notifies Lustre if an external
 305     RAID array has degraded performance (resulting in reduced overall file
 306     system performance), either because a disk has failed and not been
 307     replaced, or because a disk was replaced and is undergoing a rebuild. To
 308     avoid a global performance slowdown due to a degraded OST, the MDS can
 309     avoid the OST for new object allocation if it is notified of the degraded
 310     state.</para>
 311     <para>A parameter for each OST, called
 312     <literal>degraded</literal>, specifies whether the OST is running in
 313     degraded mode or not.</para>
 314     <para>To mark the OST as degraded, use:</para>
 315 <screen>
 316 oss# lctl set_param obdfilter.{OST_name}.degraded=1
 317 </screen>
 318     <para>To mark that the OST is back in normal operation, use:</para>
 319 <screen>
 320 oss# lctl set_param obdfilter.{OST_name}.degraded=0
 321 </screen>
 322     <para>To determine if OSTs are currently in degraded mode, use:</para>
 323 <screen>
 324 oss# lctl get_param obdfilter.*.degraded
 325 </screen>
 326     <para>If the OST is remounted due to a reboot or other condition, the flag
 327     resets to
 328     <literal>0</literal>.</para>
 329     <para>It is recommended that this be implemented by an automated script
 330     that monitors the status of individual RAID devices, such as MD-RAID's
 331     <literal>mdadm(8)</literal> command with the <literal>--monitor</literal>
 332     option to mark an affected device degraded or restored.</para>
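    <para>A minimal sketch of such a script is shown below, written as a
    handler for the <literal>--program</literal> option of
    <literal>mdadm --monitor</literal>, which passes the event name and the
    array device as arguments. The mapping from MD arrays to OST names, the
    choice of which events set or clear the flag, and the
    <literal>DRY_RUN</literal> variable are all assumptions made for
    illustration. A production script should also confirm the array state
    before clearing the flag.</para>
<screen>
```shell
#!/bin/bash
# Sketch of an event handler for "mdadm --monitor --program".
# Map an mdadm event name to the value for the OST "degraded" parameter.
degraded_value_for_event() {
    case "$1" in
        Fail|DegradedArray|RebuildStarted) echo 1 ;;
        RebuildFinished|SpareActive)       echo 0 ;;
        *)                                 echo "" ;;   # ignore other events
    esac
}

# Hypothetical mapping from an MD array to the OST stored on it.
ost_for_array() {
    case "$1" in
        /dev/md0) echo testfs-OST0000 ;;
        /dev/md1) echo testfs-OST0001 ;;
    esac
}

handle_event() {
    local event=$1 array=$2 value ost
    value=$(degraded_value_for_event "$event")
    ost=$(ost_for_array "$array")
    if [ -n "$value" ]; then
        if [ -n "$ost" ]; then
            # With DRY_RUN set, print the command instead of running it.
            ${DRY_RUN:+echo} lctl set_param "obdfilter.$ost.degraded=$value"
        fi
    fi
}

# mdadm passes the event name and the array device as arguments.
if [ "$#" -ge 2 ]; then
    handle_event "$1" "$2"
fi
```
</screen>
    <para>Running the script with <literal>DRY_RUN=1</literal> and the
    arguments <literal>Fail /dev/md0</literal> prints the
    <literal>lctl set_param</literal> command that would mark
    <literal>testfs-OST0000</literal> as degraded, without executing it.</para>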
 333   </section>
 334   <section xml:id="lustre_configure_multiple_fs">
 335     <title>
 336     <indexterm>
 337       <primary>operations</primary>
 338       <secondary>multiple file systems</secondary>
 339     </indexterm>Running Multiple Lustre File Systems</title>
 340     <para>Lustre supports multiple file systems provided the combination of
 341     <literal>NID:fsname</literal> is unique. Each file system must be allocated
 342     a unique name during creation with the
 343     <literal>--fsname</literal> parameter. Unique names for file systems are
 344     enforced if a single MGS is present. If multiple MGSs are present (for
 345     example if you have an MGS on every MDS) the administrator is responsible
 346     for ensuring file system names are unique. A single MGS and unique file
 347     system names provide a single point of administration and allow commands
 348     to be issued against the file system even if it is not mounted.</para>
 349     <para>Lustre supports multiple file systems on a single MGS. With a single
 350     MGS fsnames are guaranteed to be unique. Lustre also allows multiple MGSs
 351     to co-exist. For example, multiple MGSs will be necessary if multiple file
 352     systems on different Lustre software versions are to be concurrently
 353     available. With multiple MGSs additional care must be taken to ensure file
 354     system names are unique. Each file system should have a unique fsname among
 355     all systems that may interoperate in the future.</para>
 356     <para>By default, the
 357     <literal>mkfs.lustre</literal> command creates a file system named
 358     <literal>lustre</literal>. To specify a different file system name (limited
 359     to 8 characters) at format time, use the
 360     <literal>--fsname</literal> option:</para>
 361     <para>
 362 <screen>
 363 oss# mkfs.lustre --fsname=<replaceable>file_system_name</replaceable>
 364 </screen>
 365     </para>
 366     <note>
 367       <para>The MDT, OSTs and clients in the new file system must use the same
 368       file system name (prepended to the device name). For example, for a new
 369       file system named <literal>foo</literal>, the MDT and two OSTs would be
 370       named  <literal>foo-MDT0000</literal>,
 371       <literal>foo-OST0000</literal>, and
 372       <literal>foo-OST0001</literal>.</para>
 373     </note>
 374     <para>To mount a client on the file system, run:</para>
 375 <screen>
 376 client# mount -t lustre <replaceable>mgsnode</replaceable>:<replaceable>/new_fsname</replaceable> <replaceable>/mount_point</replaceable>
 377 </screen>
 378     <para>For example, to mount a client on file system foo at mount point
 379     /mnt/foo, run:</para>
 380 <screen>
 381 client# mount -t lustre mgsnode:/foo /mnt/foo
 382 </screen>
 383     <note>
 384       <para>If a client will mount several file systems, add the
 385       following line to the <literal>/etc/xattr.conf</literal> file to avoid
 386       problems when files are moved between the file systems:
 387       <literal>lustre.* skip</literal></para>
 388     </note>
 389     <note>
 390       <para>To ensure that a new MDT is added to an existing MGS, create the MDT
 391       by specifying:
 392       <literal>--mdt --mgsnode=<replaceable>mgs_NID</replaceable></literal>.
 393       </para>
 394     </note>
 395     <para>A Lustre installation with two file systems (
 396     <literal>foo</literal> and
 397     <literal>bar</literal>) could look like this, where the MGS node is
 398     <literal>mgsnode@tcp0</literal> and the mount points are
 399     <literal>/mnt/foo</literal> and
 400     <literal>/mnt/bar</literal>.</para>
 401 <screen>
 402 mgsnode# mkfs.lustre --mgs /dev/sda
 403 mdtfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --mdt --index=0 \
 404         /dev/sdb
 405 ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=0 \
 406         /dev/sda
 407 ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=1 \
 408         /dev/sdb
 409 mdtbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --mdt --index=0 \
 410         /dev/sda
 411 ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=0 \
 412         /dev/sdc
 413 ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=1 \
 414         /dev/sdd
 415 </screen>
 416     <para>To mount a client on file system foo at mount point
 417       <literal>/mnt/foo</literal>, run:
 418     </para>
 419 <screen>
 420 client# mount -t lustre mgsnode@tcp0:/foo /mnt/foo
 421 </screen>
 422     <para>To mount a client on file system bar at mount point
 423     <literal>/mnt/bar</literal>, run:</para>
 424 <screen>
 425 client# mount -t lustre mgsnode@tcp0:/bar /mnt/bar
 426 </screen>
 427   </section>
 428   <section xml:id="lfsmkdir">
 429     <title>
 430     <indexterm>
 431       <primary>operations</primary>
 432       <secondary>remote directory</secondary>
 433     </indexterm>Creating a sub-directory on a specific MDT</title>
 434     <para>It is possible to create individual directories, along with their
 435       files and sub-directories, to be stored on specific MDTs. To create
 436       a sub-directory on a given MDT, use the command:
 437     </para>
 438 <screen>
 439 client$ lfs mkdir -i <replaceable>mdt_index</replaceable> <replaceable>/mount_point/remote_dir</replaceable>
 440 </screen>
 441     <para>This command will allocate the sub-directory
 442     <literal>remote_dir</literal> onto the MDT with index
 443     <literal>mdt_index</literal>. For more information on adding additional
 444     MDTs and <literal>mdt_index</literal> see <xref linkend='addmdtindex' />.
 445     </para>
 446     <warning>
 447       <para>An administrator can allocate remote sub-directories to separate
 448       MDTs. Creating remote sub-directories in parent directories not hosted on
 449       MDT0000 is not recommended. This is because the failure of the parent MDT
 450       will leave the namespace below it inaccessible. For this reason, by
 451       default it is only possible to create remote sub-directories off MDT0000.
 452       To relax this restriction and enable remote sub-directories off any MDT,
 453       an administrator must issue the following command on the MGS:
 454 <screen>
 455 mgs# lctl set_param -P mdt.<replaceable>fsname-MDT*</replaceable>.enable_remote_dir=1
 456 </screen>
 457       For Lustre filesystem 'scratch', the command executed is:
 458 <screen>
 459 mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir=1
 460 </screen>
 461       To verify the configuration setting execute the following command on any
 462       MDS:
 463 <screen>
 464 mds# lctl get_param mdt.*.enable_remote_dir
 465 </screen>
 466       </para>
 467     </warning>
 468     <para condition='l28'>With Lustre software version 2.8, a new
 469     tunable is available to allow users with a specific group ID to create
 470     and delete remote and striped directories. This tunable is
 471     <literal>enable_remote_dir_gid</literal>. For example, setting this
 472     parameter to the 'wheel' or 'admin' group ID allows users with that GID
 473     to create and delete remote and striped directories. Setting this
 474     parameter to <literal>-1</literal> on MDT0000 permanently allows any
 475     non-root user to create and delete remote and striped directories.
 476     On the MGS execute the following command:
 477 <screen>
 478 mgs# lctl set_param -P mdt.<replaceable>fsname-*</replaceable>.enable_remote_dir_gid=-1
 479 </screen>
 480     For the Lustre filesystem 'scratch', the command expands to:
 481 <screen>
 482 mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir_gid=-1
 483 </screen>
 484     The change can be verified by executing the following command on every MDS:
 485 <screen>
 486 mds# lctl get_param mdt.<replaceable>*</replaceable>.enable_remote_dir_gid
 487 </screen>
 488     </para>
 489   </section>
 490   <section xml:id="lfsmkdirdne2" condition='l28'>
 491     <title>
 492     <indexterm>
 493       <primary>operations</primary>
 494       <secondary>striped directory</secondary>
 495     </indexterm>
 496     <indexterm>
 497       <primary>operations</primary>
 498       <secondary>mkdir</secondary>
 499     </indexterm>
 500     <indexterm>
 501       <primary>operations</primary>
 502       <secondary>setdirstripe</secondary>
 503     </indexterm>
 504     <indexterm>
 505       <primary>striping</primary>
 506       <secondary>metadata</secondary>
 507     </indexterm>Creating a directory striped across multiple MDTs</title>
 508     <para>The Lustre 2.8 DNE feature enables files in a single large
 509     directory to be distributed across multiple MDTs (a <emphasis>striped
 510     directory</emphasis>), if there are multiple MDTs added to the
 511     filesystem, see <xref linkend="lustremaint.adding_new_mdt"/>.
 512     The result is that metadata requests for files in a single large
 513     striped directory are serviced by multiple MDTs and metadata
 514     service load is distributed over all the MDTs that service a given
 515     directory. By distributing metadata service load over multiple MDTs,
 516     performance of very large directories can be improved beyond the limit
 517     of one MDT.  Normally, all files in a directory must be created
 518     on a single MDT.</para>
 519     <para>The command to stripe a directory over
 520     <replaceable>mdt_count</replaceable> MDTs is:
 521 <screen>
 522 client$ lfs mkdir -c <replaceable>mdt_count</replaceable> <replaceable>/mount_point/new_directory</replaceable>
 523 </screen>
 524     </para>
 525     <para>The striped directory feature is most useful for distributing
 526     a single large directory (50k entries or more) across multiple MDTs.
 527     This should be used with discretion since creating and removing striped
 528     directories incurs more overhead than non-striped directories.</para>
 529     <section xml:id="lfsmkdirbyspace" condition='l2D'>
 530       <title>Directory creation by space/inode usage</title>
 531       <para>If the starting MDT is not specified when creating a new directory,
 532       this directory and its stripes will be distributed on MDTs by space usage.
 533       For example the following will create a new directory on an MDT
 534       preferring one that has less space usage:</para>
 535 <screen>
 536 client$ lfs mkdir -c 1 -i -1 <replaceable>dir1</replaceable>
 537 </screen>
 538       <para>Alternatively, if a default directory stripe is set on a directory,
 539       the subsequent use of <literal>mkdir</literal> for subdirectories in
 540       <replaceable>dir1</replaceable> will have the same effect:
 541 <screen>
 542 client$ lfs setdirstripe -D -c 1 -i -1 <replaceable>dir1</replaceable>
 543 </screen>
 544       </para>
 545       <para>The policy is:</para>
 546       <itemizedlist>
 547         <listitem><para>If free inodes/blocks on all MDTs are almost the same,
 548         i.e. <literal>max_inodes_avail * 84% &lt; min_inodes_avail</literal> and
 549         <literal>max_blocks_avail * 84% &lt; min_blocks_avail</literal>, then
 550         choose MDTs in round-robin order.</para></listitem>
 551         <listitem><para>Otherwise, create more subdirectories on MDTs with more
 552         free inodes/blocks.</para></listitem>
 553       </itemizedlist>
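      <para>The balance test in the policy above can be worked through with
      integer arithmetic. The sketch below is illustrative only: the function
      names are hypothetical, the sample numbers are made up, and the label
      <literal>weighted</literal> simply stands for the second case of the
      policy (preferring MDTs with more free inodes/blocks).</para>
<screen>
```shell
#!/bin/bash
# True when max * 84% is strictly less than min, i.e. the MDTs are
# considered almost equally free. Integer math avoids floating point.
mdts_balanced() {
    local max=$1 min=$2
    [ $((max * 84)) -lt $((min * 100)) ]
}

# Arguments: max_inodes_avail min_inodes_avail max_blocks_avail min_blocks_avail
mdt_selection_mode() {
    if mdts_balanced "$1" "$2"; then
        if mdts_balanced "$3" "$4"; then
            echo round-robin
            return
        fi
    fi
    echo weighted
}

mdt_selection_mode 1000 900 5000 4500   # prints round-robin
mdt_selection_mode 1000 800 5000 4500   # prints weighted
```
</screen>
      <para>In the second call, 84% of 1000 free inodes is 840, which is not
      less than the minimum of 800, so the inode condition fails and
      round-robin selection is not used.</para>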
 554       <para>Sometimes there are many MDTs, but it is not always desirable to
 555         stripe a directory across all MDTs, even if the directory default is
 556         <literal>stripe_count=-1</literal> (unlimited).
 557         In this case, the per-filesystem tunable parameter
 558         <literal>lod.*.max_mdt_stripecount</literal> can be used to limit the
 559         actual stripe count of a directory to fewer than the full MDT count.
 560         If <literal>lod.*.max_mdt_stripecount</literal> is not 0, and the
 561         directory <literal>stripe_count=-1</literal>, the real directory
 562         stripe count will be the minimum of the number of MDTs and
 563         <literal>max_mdt_stripecount</literal>.
 564         If <literal>lod.*.max_mdt_stripecount=0</literal>, or an explicit
 565         stripe count is given for the directory, it is ignored.
 566       </para>
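      <para>The rule described above can be summarized as a small function.
      This is a sketch of the stated rule only, not the Lustre implementation;
      the function name and its argument order are hypothetical, and it does
      not model an explicit stripe count larger than the number of MDTs.</para>
<screen>
```shell
#!/bin/bash
# Arguments: stripe_count mdt_count max_mdt_stripecount
effective_stripe_count() {
    local stripe_count=$1 mdt_count=$2 max_mdt_stripecount=$3
    if [ "$stripe_count" -ne -1 ]; then
        echo "$stripe_count"          # explicit count: the limit is ignored
    elif [ "$max_mdt_stripecount" -eq 0 ]; then
        echo "$mdt_count"             # no limit set: stripe over all MDTs
    elif [ "$max_mdt_stripecount" -lt "$mdt_count" ]; then
        echo "$max_mdt_stripecount"   # limit is the smaller value
    else
        echo "$mdt_count"             # MDT count is the smaller value
    fi
}

effective_stripe_count -1 16 4    # prints 4
effective_stripe_count -1 16 0    # prints 16
effective_stripe_count -1 2 4     # prints 2
effective_stripe_count 8 16 4     # prints 8
```
</screen>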
 567       <para>To set <literal>max_mdt_stripecount</literal> on all MDSes of
 568         the file system, run:
 569 <screen>
 570 mgs# lctl set_param -P lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount=&lt;N&gt;
 571 </screen>
 572       </para>
 573       <para>To check <literal>max_mdt_stripecount</literal>, run:
 574 <screen>
 575 mds# lctl get_param lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount
 576 </screen>
 577       </para>
 578       <para>To reset <literal>max_mdt_stripecount</literal>, run:
 579 <screen>
 580 mgs# lctl set_param -P -d lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount
 581 </screen>
 582       </para>
 583     </section>
 584     <section xml:id="fsdefaultlmv" condition='l2E'>
 585       <title>Filesystem-wide default directory striping</title>
 586       <para>Similar to file object allocation, directory objects are
 587       allocated on MDTs by a round-robin algorithm or a weighted algorithm. For
 588       the top three levels of directories from the root of the filesystem, if the
 589       amount of free inodes and blocks is well balanced (i.e., by default, when
 590       the free inodes and blocks across MDTs differ by less than 5%), the
 591       round-robin algorithm is used to select the next MDT on which a directory
 592       is to be created.
 593       </para>
 594       <para>If the directory is more than three levels below the root directory,
 595       or MDTs are not balanced, then the weighted algorithm is used to randomly
 596       select an MDT with more free inodes and blocks.
 597       </para>
 598       <para>To avoid creating unnecessary remote directories, a new directory
 599       is created on the MDT of its parent directory if that MDT is not too full
 600       (that is, the parent MDT is not more than 5% fuller than the average of
 601       all MDTs in terms of inodes and blocks).
 602       </para>
 603       <para>If an administrator wants to change this default filesystem-wide
 604       directory striping, run the following command to limit this striping to
 605       the top level below the root directory:</para>
 606 <screen>
 607 client$ lfs setdirstripe -D -i -1 -c 1 --max-inherit 0 &lt;mountpoint&gt;
 608 </screen>
 609       <para>To revert to the pre-2.15 behavior of all directories being created
 610       only on MDT0000 by default (deleting this striping won't work because it
 611       will be recreated if missing):</para>
 612 <screen>
 613 client$ lfs setdirstripe -D -i 0 -c 1 --max-inherit 0 &lt;mountpoint&gt;
 614 </screen>
 615     </section>
 616   </section>
 617   <section xml:id="default_dir_stripe_policy">
 618     <title>
 619     <indexterm>
 620       <primary>operations</primary>
 621       <secondary>default dir stripe policy</secondary>
 622     </indexterm>Default Dir Stripe Policy</title>
 623     <para>If a default directory stripe policy is set on a directory, it is
 624       applied to subdirectories created later. For example:
 625 <screen>
 626 $ mkdir testdir1
 627 $ lfs setdirstripe testdir1 -D -c 2
 628 $ lfs getdirstripe testdir1 -D
 629 lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 3 lmv_max_inherit_rr: 0
 630 $ mkdir testdir1/subdir1
 631 $ lfs getdirstripe testdir1/subdir1
 632 lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: crush
 633 mdtidx       FID[seq:oid:ver]
 634      0       [0x200000400:0x2:0x0]
 635      1       [0x240000401:0x2:0x0]
 636 </screen>
 637     </para>
 638     <para>The default directory stripe policy can be inherited by subdirectories.
 639       This behavior is controlled by the <literal>lmv_max_inherit</literal>
 640       parameter. If <literal>lmv_max_inherit</literal> is 0 or 1, the
 641       subdirectory stops inheriting the default directory stripe policy.
 642       Otherwise, the subdirectory decrements its parent's
 643       <literal>lmv_max_inherit</literal> by one and uses the result as its own
 644       <literal>lmv_max_inherit</literal>.
 645       The value -1 is special because it means unlimited. For example:
 646 <screen>
 647 $ lfs getdirstripe testdir1/subdir1 -D
 648 lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 2 lmv_max_inherit_rr: 0
 649 </screen>
 650     </para>
    <para><literal>lmv_max_inherit</literal> can be set explicitly with the
      <literal>--max-inherit</literal> option of the
      <literal>lfs setdirstripe -D</literal> command.
 654       If the max-inherit value is not specified, the default value is -1
 655       when <literal>stripe_count</literal> is 0 or 1.
 656       For other values of <literal>stripe_count</literal>, the default value
 657       is 3.
 658     </para>
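    <para>The inheritance rule described above can be sketched as a small
    shell function. This is only an illustration of the arithmetic, not a
    Lustre command: <literal>child_max_inherit</literal> is a hypothetical
    helper, and the word <literal>none</literal> is just this sketch's way
    of saying that the subdirectory no longer inherits the default
    policy.</para>
<screen>
# Hypothetical helper (not part of Lustre): print the lmv_max_inherit
# value a new subdirectory would take from its parent, following the
# rules described in this section.
child_max_inherit() {
    parent=$1
    case $parent in
        -1)  printf '%s\n' -1 ;;               # unlimited stays unlimited
        0|1) printf '%s\n' none ;;             # inheritance stops here
        *)   printf '%s\n' $((parent - 1)) ;;  # otherwise decrease by one
    esac
}

child_max_inherit 3     # prints 2
child_max_inherit 1     # prints none
child_max_inherit -1    # prints -1
</screen>
    <para>With the parent value of 3 from the example above, the sketch
    prints 2, which matches the <literal>lmv_max_inherit: 2</literal>
    reported for <literal>testdir1/subdir1</literal>.</para>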
 659   </section>
 660   <section xml:id="set_get_lustre_params">
 661     <title>
 662     <indexterm>
 663       <primary>operations</primary>
 664       <secondary>parameters</secondary>
 665     </indexterm>Setting and Retrieving Lustre Parameters</title>
 666     <para>Several options are available for setting parameters in
 667     Lustre:</para>
 668     <itemizedlist>
 669       <listitem>
        <para>When creating a file system, use
        <literal>mkfs.lustre</literal>. See
        <xref linkend="tuning_params_mkfs_lustre" /> below.</para>
 672       </listitem>
 673       <listitem>
        <para>When a server is stopped, use
        <literal>tunefs.lustre</literal>. See
        <xref linkend="setting_param_tunefs" /> below.</para>
 676       </listitem>
 677       <listitem>
        <para>When the file system is running, use
        <literal>lctl</literal> to set or retrieve Lustre parameters. See
        <xref linkend="setting_param_with_lctl" /> and
        <xref linkend="reporting_current_param" /> below.</para>
 682       </listitem>
 683     </itemizedlist>
 684     <section xml:id="tuning_params_mkfs_lustre">
 685       <title>Setting Tunable Parameters with
 686       <literal>mkfs.lustre</literal></title>
 687       <para>When the file system is first formatted, parameters can simply be
 688       added as a <literal>--param</literal> option to the
 689       <literal>mkfs.lustre</literal> command. For example:</para>
 690 <screen>
 691 mds# mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda
 692 </screen>
      <para>For more details about creating a file system, see
 694       <xref linkend="configuringlustre" />. For more details about
 695       <literal>mkfs.lustre</literal>, see
 696       <xref linkend="systemconfigurationutilities" />.</para>
 697     </section>
 698     <section xml:id="setting_param_tunefs">
 699       <title>Setting Parameters with
 700       <literal>tunefs.lustre</literal></title>
 701       <para>If a server (OSS or MDS) is stopped, parameters can be added to an
 702       existing file system using the
 703       <literal>--param</literal> option to the
 704       <literal>tunefs.lustre</literal> command. For example:</para>
 705 <screen>
 706 oss# tunefs.lustre --param=failover.node=192.168.0.13@tcp0 /dev/sda
 707 </screen>
      <para>With <literal>tunefs.lustre</literal>, parameters are
      <emphasis>additive</emphasis>: new parameters are specified in addition
      to old parameters and do not replace them. To erase all old
      <literal>tunefs.lustre</literal> parameters and use only the newly
      specified parameters, run:</para>
 713 <screen>
 714 mds# tunefs.lustre --erase-params --param=<replaceable>new_parameters</replaceable>
 715 </screen>
      <para>The <literal>tunefs.lustre</literal> command can be used to set
      any parameter that is settable via <literal>lctl conf_param</literal>
      and that has its own OBD device, so it can be specified as
      <literal><replaceable>obdname|fsname</replaceable>.<replaceable>obdtype</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable></literal>.
      For example:</para>
 724 <screen>
 725 mds# tunefs.lustre --param mdt.identity_upcall=NONE /dev/sda1
 726 </screen>
 727       <para>For more details about <literal>tunefs.lustre</literal>, see
 728       <xref linkend="systemconfigurationutilities" />.</para>
 729     </section>
 730     <section xml:id="setting_param_with_lctl">
 731       <title>Setting Parameters with
 732       <literal>lctl</literal></title>
 733       <para>When the file system is running, the
 734       <literal>lctl</literal> command can be used to set parameters (temporary
 735       or permanent) and report current parameter values. Temporary parameters
 736       are active as long as the server or client is not shut down. Permanent
 737       parameters live through server and client reboots.</para>
 738       <note>
 739         <para>The <literal>lctl list_param</literal> command enables users to
 740           list all parameters that can be set. See
 741         <xref linkend="list_params" />.</para>
 742       </note>
 743       <para>For more details about the
 744       <literal>lctl</literal> command, see the examples in the sections below
 745       and
 746       <xref linkend="systemconfigurationutilities" />.</para>
 747       <section remap="h4">
 748         <title>Setting Temporary Parameters</title>
 749         <para>Use
 750         <literal>lctl set_param</literal> to set temporary parameters on the
 751         node where it is run. These parameters internally map to corresponding
 752         items in the kernel <literal>/proc/{fs,sys}/{lnet,lustre}</literal> and
 753         <literal>/sys/{fs,kernel/debug}/lustre</literal> virtual filesystems.
 754         However, since the mapping between a particular parameter name and the
 755         underlying virtual pathname may change, it is <emphasis>not</emphasis>
 756         recommended to access the virtual pathname directly. The
 757         <literal>lctl set_param</literal> command uses this syntax:</para>
 758 <screen>
 759 # lctl set_param [-n] [-P] <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable>
 760 </screen>
 761         <para>For example:</para>
 762 <screen>
 763 # lctl set_param osc.*.max_dirty_mb=1024
osc.myth-OST0000-osc.max_dirty_mb=1024
osc.myth-OST0001-osc.max_dirty_mb=1024
osc.myth-OST0002-osc.max_dirty_mb=1024
osc.myth-OST0003-osc.max_dirty_mb=1024
osc.myth-OST0004-osc.max_dirty_mb=1024
 769 </screen>
 770       </section>
 771       <section xml:id="setting_permanent_params">
 772         <title>Setting Permanent Parameters</title>
        <para>Use the <literal>lctl set_param -P</literal> or
        <literal>lctl conf_param</literal> command to set permanent
        parameters. In general, the <literal>set_param -P</literal> command is
        preferred for new parameters, as this isolates the parameter settings
        from the MDT and OST device configuration, and is consistent with the
        common <literal>lctl get_param</literal> and
        <literal>lctl set_param</literal> commands. The
        <literal>lctl conf_param</literal> command was previously used to
        specify settable parameters, with the following syntax (the same as
        the <literal>mkfs.lustre</literal> and
        <literal>tunefs.lustre</literal> commands):</para>
 783 <screen>
<replaceable>obdname|fsname</replaceable>.<replaceable>obdtype</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable>
 785 </screen>
 786         <note><para>The <literal>lctl conf_param</literal> and
 787         <literal>lctl set_param</literal> syntax is <emphasis>not</emphasis>
 788         the same.</para></note>
 789         <para>Here are a few examples of
 790         <literal>lctl conf_param</literal> commands:</para>
 791 <screen>
 792 mgs# lctl conf_param testfs-MDT0000.sys.timeout=40
 793 mgs# lctl conf_param testfs-MDT0000.mdt.identity_upcall=NONE
 794 mgs# lctl conf_param testfs.llite.max_read_ahead_mb=16
 795 mgs# lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15
 796 mgs# lctl conf_param testfs-OST0000.ost.client_cache_seconds=15
 797 mgs# lctl conf_param testfs.sys.timeout=40
 798 </screen>
 799         <caution>
 800           <para>Parameters specified with the
 801           <literal>lctl conf_param</literal> command are set permanently in the
 802           file system's configuration file on the MGS.</para>
 803         </caution>
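        <para>The difference in field order between the two syntaxes can be
        illustrated with a short shell sketch. It only rearranges text and
        does not call <literal>lctl</literal>;
        <literal>conf_to_set_param_order</literal> is a hypothetical helper,
        and it assumes the
        <replaceable>obdname</replaceable>.<replaceable>obdtype</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable>
        form. Settings such as <literal>testfs.sys.timeout</literal> do not
        follow this pattern, and real device names may need a wildcard in the
        <literal>set_param</literal> form.</para>
<screen>
# Hypothetical helper (not part of Lustre): reorder the fields of a
# conf_param-style setting "obdname.obdtype.param=value" into the
# set_param-style order "obdtype.obdname.param=value".
conf_to_set_param_order() {
    setting=$1
    name=${setting%%=*}      # everything before the first "="
    value=${setting#*=}      # everything after the first "="
    obdname=${name%%.*}      # first dot-separated field
    rest=${name#*.}
    obdtype=${rest%%.*}      # second dot-separated field
    param=${rest#*.}         # remaining fields
    printf '%s.%s.%s=%s\n' "$obdtype" "$obdname" "$param" "$value"
}

conf_to_set_param_order testfs-MDT0000.mdt.identity_upcall=NONE
# prints mdt.testfs-MDT0000.identity_upcall=NONE
</screen>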
 804       </section>
        <para>The <literal>lctl set_param -P</literal> command can also
          set parameters permanently, using the same syntax as the
          <literal>lctl set_param</literal> and <literal>lctl
          get_param</literal> commands. Permanent parameter settings must be
          issued on the MGS. The given parameter is set on every host using
          an <literal>lctl</literal> upcall. The
          <literal>lctl set_param -P</literal> command uses the following
          syntax:</para>
 814 <screen>
 815 lctl set_param -P <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable>
 816 </screen>
 817         <para>For example:</para>
 818 <screen>
 819 mgs# lctl set_param -P timeout=40
 820 mgs# lctl set_param -P mdt.testfs-MDT*.identity_upcall=NONE
 821 mgs# lctl set_param -P llite.testfs-*.max_read_ahead_mb=16
 822 mgs# lctl set_param -P osc.testfs-OST*.max_dirty_mb=29.15
 823 mgs# lctl set_param -P ost.testfs-OST*.client_cache_seconds=15
 824 </screen>
 825         <para>Use the <literal>-P -d</literal> option to delete permanent
 826         parameters. Syntax:</para>
 827 <screen>
 828 lctl set_param -P -d <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>parameter_name</replaceable>
 829 </screen>
 830         <para>For example:</para>
 831 <screen>
 832 mgs# lctl set_param -P -d osc.*.max_dirty_mb
 833 </screen>
        <note condition='l2c'><para>Starting in Lustre 2.12, the
        <literal>lctl get_param</literal> command can provide
        <emphasis>tab completion</emphasis> when using an interactive shell
        with <literal>bash-completion</literal> installed.  This simplifies
        the use of <literal>get_param</literal> significantly, since it
        provides an interactive list of available parameters.
        </para></note>
 841       </section>
 842       <section xml:id="persistent_params">
 843         <title>Listing Persistent Parameters</title>
 844         <para>To list tunable parameters stored in the <literal>params</literal>
 845         log file by <literal>lctl set_param -P</literal> and applied to nodes at
 846         mount, run the <literal>lctl --device MGS llog_print params</literal>
 847         command on the MGS.  For example:</para>
 848 <screen>
 849 mgs# lctl --device MGS llog_print params
 850 - { index: 2, event: set_param, device: general, parameter: osc.*.max_dirty_mb, value: 1024 }
 851 </screen>
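        <para>If the stored settings need to be reviewed in
        <replaceable>parameter</replaceable>=<replaceable>value</replaceable>
        form, the records can be reduced with <literal>sed</literal>. The
        following is a minimal sketch that assumes each record has exactly the
        layout shown in the sample output above;
        <literal>llog_to_settings</literal> is a hypothetical helper name, not
        a Lustre command.</para>
<screen>
# Hypothetical helper: read llog_print records on standard input and
# print "parameter=value" for every set_param record.
llog_to_settings() {
    sed -n 's/.*event: set_param,.*parameter: \([^,]*\), value: \(.*\) }$/\1=\2/p'
}

# Sample record copied from the output above.
record='- { index: 2, event: set_param, device: general, parameter: osc.*.max_dirty_mb, value: 1024 }'

printf '%s\n' "$record" | llog_to_settings
# prints osc.*.max_dirty_mb=1024
</screen>
        <para>On a live system the same helper could be fed from
        <literal>lctl --device MGS llog_print params</literal> on the MGS,
        provided the output format matches the sample.</para>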
 852       </section>
 853       <section xml:id="list_params">
 854         <title>Listing All Tunable Parameters</title>
        <para>To list Lustre or LNet parameters that are available to set, use
        the <literal>lctl list_param</literal> command with this syntax:</para>
 857 <screen>
 858 lctl list_param [-FR] <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>
 859 </screen>
 860         <para>The following arguments are available for the
 861         <literal>lctl list_param</literal> command.</para>
        <para>
        <literal>-F</literal> Add '<literal>/</literal>',
        '<literal>@</literal>' or '<literal>=</literal>' for directories,
        symlinks and writeable files, respectively.</para>
        <para>
        <literal>-R</literal> Recursively lists all parameters under the
        specified path.</para>
 871         <para>For example:</para>
 872 <screen>
 873 oss# lctl list_param obdfilter.lustre-OST0000
 874 </screen>
 875       </section>
 876       <section xml:id="reporting_current_param">
 877         <title>Reporting Current Parameter Values</title>
 878         <para>To report current Lustre parameter values, use the
 879         <literal>lctl get_param</literal> command with this syntax:</para>
 880 <screen>
 881 lctl get_param [-n] <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>proc_file_name</replaceable>
 882 </screen>
        <note condition='l2c'><para>Starting in Lustre 2.12, the
        <literal>lctl get_param</literal> command can provide
        <emphasis>tab completion</emphasis> when using an interactive shell
        with <literal>bash-completion</literal> installed.  This simplifies
        the use of <literal>get_param</literal> significantly, since it
        provides an interactive list of available parameters.
        </para></note>
 890         <para>This example reports data on RPC service times.</para>
 891 <screen>
 892 oss# lctl get_param -n ost.*.ost_io.timeouts
 893 service : cur 1 worst 30 (at 1257150393, 85d23h58m54s ago) 1 1 1 1
 894 </screen>
 895         <para>This example reports the amount of space this client has reserved
 896         for writeback cache with each OST:</para>
 897 <screen>
 898 client# lctl get_param osc.*.cur_grant_bytes
 899 osc.myth-OST0000-osc-ffff8800376bdc00.cur_grant_bytes=2097152
 900 osc.myth-OST0001-osc-ffff8800376bdc00.cur_grant_bytes=33890304
 901 osc.myth-OST0002-osc-ffff8800376bdc00.cur_grant_bytes=35418112
 902 osc.myth-OST0003-osc-ffff8800376bdc00.cur_grant_bytes=2097152
 903 osc.myth-OST0004-osc-ffff8800376bdc00.cur_grant_bytes=33808384
 904 </screen>
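        <para>Because <literal>lctl get_param</literal> prints one
        <replaceable>name</replaceable>=<replaceable>value</replaceable> pair
        per line, numeric values are easy to aggregate with standard tools.
        The sketch below totals the grant values from the example output
        above; <literal>sum_values</literal> is a hypothetical helper name,
        and the input is the sample output rather than a live system.</para>
<screen>
# Hypothetical helper: add up the values of "name=value" lines read
# from standard input.
sum_values() {
    awk -F= '{ total += $2 } END { printf "%d\n", total }'
}

# Sample lines copied from the example output above.
printf '%s\n' \
    'osc.myth-OST0000-osc-ffff8800376bdc00.cur_grant_bytes=2097152' \
    'osc.myth-OST0001-osc-ffff8800376bdc00.cur_grant_bytes=33890304' \
    'osc.myth-OST0002-osc-ffff8800376bdc00.cur_grant_bytes=35418112' \
    'osc.myth-OST0003-osc-ffff8800376bdc00.cur_grant_bytes=2097152' \
    'osc.myth-OST0004-osc-ffff8800376bdc00.cur_grant_bytes=33808384' |
    sum_values
# prints 107311104
</screen>
        <para>On a client, the same helper could be used as
        <literal>lctl get_param osc.*.cur_grant_bytes | sum_values</literal>.
        </para>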
 905       </section>
 906     </section>
 907   </section>
 908   <section xml:id="failover_nids">
 909     <title>
 910     <indexterm>
 911       <primary>operations</primary>
 912       <secondary>failover</secondary>
 913     </indexterm>Specifying NIDs and Failover</title>
    <para>If a node has multiple network interfaces, it may have multiple NIDs,
    which must all be identified so other nodes can choose the NID that is
    appropriate for their network interfaces. Typically, NIDs are specified in
    a list delimited by commas (<literal>,</literal>). However, when failover
    nodes are specified, the NIDs are delimited by a colon
    (<literal>:</literal>) or by repeating a keyword such as
    <literal>--mgsnode=</literal> or
    <literal>--servicenode=</literal>.</para>
    <para>To display the NIDs configured on a node for use with the Lustre
    file system, run the following command on that node (while LNet is
    running):</para>
 925 <screen>
 926 # lctl list_nids
 927 </screen>
 928     <para>In the example below,
 929     <literal>mds0</literal> and
 930     <literal>mds1</literal> are configured as a combined MGS/MDT failover pair
 931     and <literal>oss0</literal> and
 932     <literal>oss1</literal> are configured as an OST failover pair. The Ethernet
 933     address for
 934     <literal>mds0</literal> is 192.168.10.1, and for
 935     <literal>mds1</literal> is 192.168.10.2. The Ethernet addresses for
 936     <literal>oss0</literal> and
 937     <literal>oss1</literal> are 192.168.10.20 and 192.168.10.21
 938     respectively.</para>
 939     <screen>
 940 mds0# mkfs.lustre --fsname=testfs --mdt --mgs \
 941         --servicenode=192.168.10.2@tcp0 \
        --servicenode=192.168.10.1@tcp0 /dev/sda1
 943 mds0# mount -t lustre /dev/sda1 /mnt/test/mdt
 944 oss0# mkfs.lustre --fsname=testfs --servicenode=192.168.10.20@tcp0 \
        --servicenode=192.168.10.21@tcp0 --ost --index=0 \
 946         --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 \
 947         /dev/sdb
 948 oss0# mount -t lustre /dev/sdb /mnt/test/ost0
 949 client# mount -t lustre 192.168.10.1@tcp0:192.168.10.2@tcp0:/testfs \
 950         /mnt/testfs
mds0# umount /mnt/test/mdt
 952 mds1# mount -t lustre /dev/sda1 /mnt/test/mdt
 953 mds1# lctl get_param mdt.testfs-MDT0000.recovery_status
 954 </screen>
 955     <para>Where multiple NIDs are specified separated by commas (for example,
 956     <literal>10.67.73.200@tcp,192.168.10.1@tcp</literal>), the two NIDs refer
 957     to the same host, and the Lustre software chooses the
 958     <emphasis>best</emphasis> one for communication. When a pair of NIDs is
 959     separated by a colon (for example,
 960     <literal>10.67.73.200@tcp:10.67.73.201@tcp</literal>), the two NIDs refer
    to two different hosts and are treated as a failover pair (the Lustre
    software tries the first one, and if that fails, it tries the second
    one).</para>
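    <para>The comma and colon rules above can be illustrated by splitting a
    NID string in the shell. This is a minimal sketch, not how the Lustre
    software parses NIDs: <literal>describe_nids</literal> is a hypothetical
    helper, and it assumes IPv4-style NIDs that contain no colons of their
    own.</para>
<screen>
# Hypothetical helper (not part of Lustre): split a NID string into
# failover hosts (separated by ":") and list the alternative NIDs of
# each host (separated by ",").
describe_nids() {
    spec=$1
    n=0
    old_ifs=$IFS
    IFS=:
    for host in $spec; do       # one word per failover host
        IFS=,
        set -- $host            # one positional parameter per NID
        IFS=$old_ifs
        n=$((n + 1))
        printf 'host %d: %s\n' "$n" "$*"
    done
    IFS=$old_ifs
}

describe_nids "10.67.73.200@tcp,192.168.10.1@tcp:10.67.73.201@tcp"
# prints:
# host 1: 10.67.73.200@tcp 192.168.10.1@tcp
# host 2: 10.67.73.201@tcp
</screen>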
 964     <para>Two options to
 965     <literal>mkfs.lustre</literal> can be used to specify failover nodes.  The
 966     <literal>--servicenode</literal> option is used to specify all service NIDs,
 967     including those for primary nodes and failover nodes. When the
 968     <literal>--servicenode</literal> option is used, the first service node to
 969     load the target device becomes the primary service node, while nodes
 970     corresponding to the other specified NIDs become failover locations for the
 971     target device. An older option, <literal>--failnode</literal>, specifies
 972     just the NIDs of failover nodes.  For more information about the
 973     <literal>--servicenode</literal> and
 974     <literal>--failnode</literal> options, see
 975     <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 976     linkend="configuringfailover" />.</para>
 977   </section>
 978   <section xml:id="erasing_filesystem">
 979     <title>
 980     <indexterm>
 981       <primary>operations</primary>
 982       <secondary>erasing a file system</secondary>
 983     </indexterm>Erasing a File System</title>
 984     <para>If you want to erase a file system and permanently delete all the
 985     data in the file system, run this command on your targets:</para>
 986 <screen>
 987 # mkfs.lustre --reformat
 988 </screen>
 989     <para>If you are using a separate MGS and want to keep other file systems
 990     defined on that MGS, then set the
 991     <literal>writeconf</literal> flag on the MDT for that file system. The
 992     <literal>writeconf</literal> flag causes the configuration logs to be
 993     erased; they are regenerated the next time the servers start.</para>
 994     <para>To set the <literal>writeconf</literal> flag on the MDT:</para>
 995     <orderedlist>
 996       <listitem>
        <para>Unmount all clients and servers using this file system. For
        example, on a client, run:</para>
 998         <screen>
 999 client# umount /mnt/lustre
1000 </screen>
1001       </listitem>
1002       <listitem>
        <para>To permanently erase the file system and, presumably, replace it
        with another file system, run:</para>
1005 <screen>
1006 mgs# mkfs.lustre --reformat --fsname spfs --mgs --mdt --index=0 /dev/<replaceable>mdsdev</replaceable>
1007 </screen>
1008       </listitem>
1009       <listitem>
        <para>If you have a separate MGS (that you do not want to reformat),
        then add the <literal>--writeconf</literal> flag to
        <literal>mkfs.lustre</literal> on the MDT:</para>
1013 <screen>
mds# mkfs.lustre --reformat --writeconf --fsname spfs --mgsnode=<replaceable>mgs_nid</replaceable> \
1015        --mdt --index=0 <replaceable>/dev/mds_device</replaceable>
1016 </screen>
1017       </listitem>
1018     </orderedlist>
1019     <note>
      <para>If you have a combined MGS/MDT, reformatting the MDT reformats the
      MGS as well, causing all configuration information to be lost; you can
      then start building your new file system. Nothing needs to be done with
      old disks that will not be part of the new file system; just do not
      mount them.</para>
1025     </note>
1026   </section>
1027   <section xml:id="reclaiming_reserved_disk_space">
1028     <title>
1029     <indexterm>
1030       <primary>operations</primary>
1031       <secondary>reclaiming space</secondary>
1032     </indexterm>Reclaiming Reserved Disk Space</title>
1033     <para>All current Lustre installations run the ldiskfs file system
1034     internally on service nodes. By default, ldiskfs reserves 5% of the disk
1035     space to avoid file system fragmentation. In order to reclaim this space,
1036     run the following command on your OSS for each OST in the file
1037     system:</para>
1038 <screen>
1039 # tune2fs [-m reserved_blocks_percent] /dev/<replaceable>ostdev</replaceable>
1040 </screen>
1041     <para>You do not need to shut down Lustre before running this command or
1042     restart it afterwards.</para>
1043     <warning>
1044       <para>Reducing the space reservation can cause severe performance
1045       degradation as the OST file system becomes more than 95% full, due to
1046       difficulty in locating large areas of contiguous free space. This
1047       performance degradation may persist even if the space usage drops below
1048       95% again. It is recommended NOT to reduce the reserved disk space below
1049       5%.</para>
1050     </warning>
1051   </section>
1052   <section xml:id="replacing_existing_ost_mdt">
1053     <title>
1054     <indexterm>
1055       <primary>operations</primary>
1056       <secondary>replacing an OST or MDS</secondary>
1057     </indexterm>Replacing an Existing OST or MDT</title>
    <para>To copy the contents of an existing OST to a new OST (or an old MDT
    to a new MDT), follow the process for either OST/MDT backups in
    <xref linkend='backup_device' /> or
    <xref linkend='backup_fs_level' />.
    For more information on removing an MDT, see
    <xref linkend='lustremaint.rmremotedir' />.</para>
1064   </section>
1065   <section xml:id="identifying_file_objects">
1066     <title>
1067     <indexterm>
1068       <primary>operations</primary>
1069       <secondary>identifying OSTs</secondary>
1070     </indexterm>Identifying To Which Lustre File an OST Object Belongs</title>
1071     <para>Use this procedure to identify the file containing a given object on
1072     a given OST.</para>
1073     <orderedlist>
1074       <listitem>
        <para>On the OST (as root), run
        <literal>debugfs</literal> to display the file identifier
        (<literal>FID</literal>) of the file associated with the object.</para>
1078         <para>For example, if the object is
1079         <literal>34976</literal> on
1080         <literal>/dev/lustre/ost_test2</literal>, the debug command is:
1081 <screen>
1082 # debugfs -c -R "stat /O/0/d$((34976 % 32))/34976" /dev/lustre/ost_test2
1083 </screen></para>
1084         <para>The command output is:
1085 <screen>
1086 debugfs 1.45.6.wc1 (20-Mar-2020)
1087 /dev/lustre/ost_test2: catastrophic mode - not reading inode or group bitmaps
1088 Inode: 352365   Type: regular    Mode:  0666   Flags: 0x80000
1089 Generation: 2393149953    Version: 0x0000002a:00005f81
1090 User:  1000   Group:  1000   Size: 260096
1091 File ACL: 0    Directory ACL: 0
1092 Links: 1   Blockcount: 512
1093 Fragment:  Address: 0    Number: 0    Size: 0
1094 ctime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
1095 atime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
1096 mtime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
1097 crtime: 0x4a216b3c:975870dc -- Sat May 30 13:22:04 2009
1098 Size of extra inode fields: 24
1099 Extended attributes stored in inode body:
  fid = "b9 da 24 00 00 00 00 00 6a fa 0d 3f 01 00 00 00 eb 5b 0b 00 00 00 00 00
1101 00 00 00 00 00 00 00 00 " (32)
1102   fid: objid=34976 seq=0 parent=[0x200000400:0x122:0x0] stripe=1
1103 EXTENTS:
1104 (0-64):4620544-4620607
1105 </screen>
1106         </para>
1107       </listitem>
1108       <listitem>
        <para>The parent FID will be of the form
        <literal>[0x200000400:0x122:0x0]</literal> and can be resolved directly
        using the command <literal>lfs fid2path /mnt/lustre
        [0x200000400:0x122:0x0]</literal> on any Lustre client, and the process
        is complete.</para>
1114       </listitem>
1115       <listitem>
        <para>In the case of an upgraded 1.x inode (if the first part of the
        FID is below 0x200000400), the MDT inode number is
        <literal>0x24dab9</literal> and the generation is
        <literal>0x3f0dfa6a</literal>, and the pathname can also be resolved
        using <literal>debugfs</literal>.</para>
1121       </listitem>
1122       <listitem>
1123         <para>On the MDS (as root), use
1124         <literal>debugfs</literal> to find the file associated with the
1125         inode:</para>
1126 <screen>
1127 # debugfs -c -R "ncheck 0x24dab9" /dev/lustre/mdt_test
1128 debugfs 1.42.3.wc3 (15-Aug-2012)
1129 /dev/lustre/mdt_test: catastrophic mode - not reading inode or group bitmaps
1130 Inode      Pathname
1131 2415289    /ROOT/brian-laptop-guest/clients/client11/~dmtmp/PWRPNT/ZD16.BMP
1132 </screen>
1133       </listitem>
1134     </orderedlist>
1135     <para>The command lists the inode and pathname associated with the
1136     object.</para>
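    <para>The arithmetic used in the steps above can be checked with a short
    shell sketch. It does not touch any device. Both function names are
    hypothetical helpers, and the byte layout (inode number in the first 8
    bytes of the <literal>fid</literal> attribute, generation in the next 4,
    both little-endian) is read off the example output above rather than
    taken from a format specification.</para>
<screen>
# Hypothetical helper: print the debugfs path of an object in
# sequence 0, as used in step 1.
ost_object_path() {
    objid=$1
    printf '/O/0/d%d/%d\n' $((objid % 32)) "$objid"
}

# Hypothetical helper: join little-endian hex bytes, as printed in the
# "fid" attribute, into one hexadecimal number.
le_bytes_to_hex() {
    joined=
    for byte in "$@"; do
        joined=$byte$joined     # prepend, so the byte order is reversed
    done
    printf '0x%x\n' "0x$joined"
}

ost_object_path 34976                      # prints /O/0/d0/34976
le_bytes_to_hex b9 da 24 00 00 00 00 00    # prints 0x24dab9
le_bytes_to_hex 6a fa 0d 3f                # prints 0x3f0dfa6a
printf '%d\n' 0x24dab9                     # prints 2415289
</screen>
    <para>The last value, 2415289, is the decimal inode number that appears
    in the <literal>ncheck</literal> output.</para>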
1137     <note>
1138       <para>
      The <literal>debugfs</literal> <literal>ncheck</literal> command is a
      brute-force search that may take a long time to complete.</para>
1141     </note>
1142     <note>
1143       <para>To find the Lustre file from a disk LBA, follow the steps listed in
1144       the document at this URL:
1145       <link xl:href="https://www.smartmontools.org/wiki/BadBlockHowto">
1146       https://www.smartmontools.org/wiki/BadBlockHowto</link>. Then,
1147       follow the steps above to resolve the Lustre filename.</para>
1148     </note>
1149   </section>
1150 </chapter>
1151 <!--
1152   vim:expandtab:shiftwidth=2:tabstop=8:
1153   -->