1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
4 xml:id="lustreoperations">
5 <title xml:id="lustreoperations.title">Lustre Operations</title>
6 <para>Once you have the Lustre file system up and running, you can use the
7 procedures in this section to perform these basic Lustre administration
9 <section xml:id="mount_by_label">
12 <primary>operations</primary>
15 <primary>operations</primary>
16 <secondary>mounting by label</secondary>
17 </indexterm>Mounting by Label</title>
18 <para>The file system name is limited to 8 characters. We have encoded the
19 file system and target information in the disk label, so you can mount by
20 label. This allows system administrators to move disks around without
21 worrying about issues such as SCSI disk reordering or getting the
22 <literal>/dev/device</literal> wrong for a shared target. Soon, file system
23 naming will be made as fail-safe as possible. Currently, Linux disk labels
24 are limited to 16 characters. To identify the target within the file
25 system, 8 characters are reserved, leaving 8 characters for the file system
28 <replaceable>fsname</replaceable>-MDT0000 or
29 <replaceable>fsname</replaceable>-OST0a19
31 <para>To mount by label, use this command:</para>
33 mount -t lustre -L <replaceable>file_system_label</replaceable> <replaceable>/mount_point</replaceable>
35 <para>This is an example of mount-by-label:</para>
37 mds# mount -t lustre -L testfs-MDT0000 /mnt/mdt
40 <para>Mount-by-label should NOT be used in a multi-path environment or
41 when snapshots are being created of the device, since multiple block
42 devices will have the same label.</para>
44 <para>Although the file system name is internally limited to 8 characters,
45 you can mount the clients at any mount point, so file system users are not
46 subjected to short names. Here is an example:</para>
48 client# mount -t lustre mds0@tcp0:/short <replaceable>/dev/long_mountpoint_name</replaceable>
51 <section xml:id="starting_lustre">
54 <primary>operations</primary>
55 <secondary>starting</secondary>
56 </indexterm>Starting Lustre</title>
57 <para>On the first start of a Lustre file system, the components must be
58 started in the following order:</para>
61 <para>Mount the MGT.</para>
63 <para>If a combined MGT/MDT is present, Lustre will correctly mount
64 the MGT and MDT automatically.</para>
68 <para>Mount the MDT.</para>
70 <para>Mount all MDTs if multiple MDTs are present.</para>
74 <para>Mount the OST(s).</para>
77 <para>Mount the client(s).</para>
81 <section xml:id="mounting_server">
84 <primary>operations</primary>
85 <secondary>mounting</secondary>
86 </indexterm>Mounting a Server</title>
87 <para>Starting a Lustre server is straightforward and only involves the
88 mount command. Lustre servers can be added to <literal>/etc/fstab</literal>:
93 <para>The mount command generates output similar to this:</para>
95 /dev/sda1 on /mnt/test/mdt type lustre (rw)
96 /dev/sda2 on /mnt/test/ost0 type lustre (rw)
97 192.168.0.21@tcp:/testfs on /mnt/testfs type lustre (rw)
99 <para>In this example, the MDT, an OST (ost0) and file system (testfs) are
102 LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0
103 LABEL=testfs-OST0000 /mnt/test/ost0 lustre defaults,_netdev,noauto 0 0
105 <para>In general, it is wise to specify noauto and let your
106 high-availability (HA) package manage when to mount the device. If you are
107 not using failover, make sure that networking has been started before
108 mounting a Lustre server. If you are running Red Hat Enterprise Linux, SUSE
109 Linux Enterprise Server, Debian operating system (and perhaps others), use
110 the <literal>_netdev</literal> flag to ensure that these disks are mounted
after the network is up, unless you are using systemd 232 or greater, which
recognizes <literal>lustre</literal> as a network filesystem.
113 If you are using <literal>lnet.service</literal>, use
114 <literal>x-systemd.requires=lnet.service</literal> regardless of systemd
116 <para>We are mounting by disk label here. The label of a device can be read
117 with <literal>e2label</literal>. The label of a newly-formatted Lustre
118 server may end in <literal>FFFF</literal> if the
119 <literal>--index</literal> option is not specified to
120 <literal>mkfs.lustre</literal>, meaning that it has yet to be assigned. The
121 assignment takes place when the server is first started, and the disk label
122 is updated. It is recommended that the
123 <literal>--index</literal> option always be used, which will also ensure
124 that the label is set at format time.</para>
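<para>As a rough illustration of the label convention described above, the
following sketch checks whether a target label still ends in
<literal>FFFF</literal>, meaning its index has not yet been assigned. The
function name <literal>label_index_assigned</literal> and the sample labels
are hypothetical; in practice the label would come from
<literal>e2label</literal>.</para>

```shell
#!/bin/sh
# Hypothetical helper: report whether a Lustre target label already carries
# an assigned index. A label ending in FFFF (any case) is treated as
# "not yet assigned", following the description in the text.
label_index_assigned() {
    case "$1" in
        *[Ff][Ff][Ff][Ff]) return 1 ;;   # index not yet assigned
        *)                 return 0 ;;   # index present in the label
    esac
}

# Sample labels (assumed values, for illustration only).
for label in testfs-OST0000 testfs-OSTFFFF; do
    if label_index_assigned "$label"; then
        echo "$label: index assigned"
    else
        echo "$label: index not yet assigned"
    fi
done
```

<para>On a real server the check could be driven by the actual label, for
example <literal>label_index_assigned "$(e2label /dev/sda1)"</literal>, where
the device path is an assumption.</para>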
126 <para>Do not do this when the client and OSS are on the same node, as
127 memory pressure between the client and OSS can lead to deadlocks.</para>
130 <para>Mount-by-label should NOT be used in a multi-path
134 <section xml:id="shutdownLustre">
137 <primary>operations</primary>
138 <secondary>shutdownLustre</secondary>
139 </indexterm>Stopping the Filesystem</title>
140 <para>A complete Lustre filesystem shutdown occurs by unmounting all
141 clients and servers in the order shown below. Please note that unmounting
142 a block device causes the Lustre software to be shut down on that node.
<note><para>The <literal>-a -t lustre</literal> in the
commands below is not the name of a filesystem; it specifies that all
entries in <literal>/etc/mtab</literal> of type
<literal>lustre</literal> are to be unmounted.</para></note>
149 <listitem><para>Unmount the clients</para>
150 <para>On each client node, unmount the filesystem on that client
151 using the <literal>umount</literal> command:</para>
152 <para><literal>umount -a -t lustre</literal></para>
153 <para>The example below shows the unmount of the
154 <literal>testfs</literal> filesystem on a client node:</para>
157 [root@client1 ~]# mount -t lustre
158 XXX.XXX.0.11@tcp:/testfs on /mnt/testfs type lustre (rw,lazystatfs)
160 [root@client1 ~]# umount -a -t lustre
161 [154523.177714] Lustre: Unmounted testfs-client
166 <para>Unmount the MDT and MGT</para>
167 <para>On the MGS and MDS node(s), run the
168 <literal>umount</literal> command:</para>
169 <para><literal>umount -a -t lustre</literal></para>
170 <para>The example below shows the unmount of the MDT and MGT for
171 the <literal>testfs</literal> filesystem on a combined MGS/MDS:
175 [root@mds1 ~]# mount -t lustre
176 /dev/sda on /mnt/mgt type lustre (ro)
177 /dev/sdb on /mnt/mdt type lustre (ro)
179 [root@mds1 ~]# umount -a -t lustre
180 [155263.566230] Lustre: Failing over testfs-MDT0000
181 [155263.775355] Lustre: server umount testfs-MDT0000 complete
182 [155269.843862] Lustre: server umount MGS complete
<para>For a separate MGS and MDS, the same command is used, first on
the MDS and then on the MGS.</para>
188 <listitem><para>Unmount all the OSTs</para>
189 <para>On each OSS node, use the <literal>umount</literal> command:
191 <para><literal>umount -a -t lustre</literal></para>
192 <para>The example below shows the unmount of all OSTs for the
193 <literal>testfs</literal> filesystem on server
194 <literal>OSS1</literal>:
198 [root@oss1 ~]# mount |grep lustre
199 /dev/sda on /mnt/ost0 type lustre (ro)
200 /dev/sdb on /mnt/ost1 type lustre (ro)
201 /dev/sdc on /mnt/ost2 type lustre (ro)
203 [root@oss1 ~]# umount -a -t lustre
204 Lustre: Failing over testfs-OST0002
205 Lustre: server umount testfs-OST0002 complete
<para>For the unmount command syntax for a single OST, MDT, or MGT target,
refer to <xref linkend="umountTarget"/>.</para>
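<para>The shutdown order above can be sketched as a dry run that only prints
the commands it would issue. The host names and the use of
<literal>ssh</literal> are assumptions for illustration; nothing is unmounted
by this sketch.</para>

```shell
#!/bin/sh
# Dry-run sketch: print the unmount commands in the documented order
# (clients first, then MDS/MGS nodes, then OSS nodes).
# All host names below are hypothetical. With a separate MGS and MDS,
# list the MDS nodes before the MGS node in MDS_NODES.
CLIENTS="client1 client2"
MDS_NODES="mds1"
OSS_NODES="oss1 oss2"

stop_order() {
    for host in $CLIENTS $MDS_NODES $OSS_NODES; do
        # Printed rather than executed, so the sketch is safe to run anywhere.
        echo "ssh $host umount -a -t lustre"
    done
}

stop_order
```

<para>Keeping the node lists in separate variables makes the ordering explicit
and easy to review before any real command is run.</para>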
213 <section xml:id="umountTarget">
216 <primary>operations</primary>
217 <secondary>unmounting</secondary>
218 </indexterm>Unmounting a Specific Target on a Server</title>
<para>To stop a Lustre OST, MDT, or MGT, use the
221 <replaceable>/mount_point</replaceable></literal> command.</para>
222 <para>The example below stops an OST, <literal>ost0</literal>, on mount
223 point <literal>/mnt/ost0</literal> for the <literal>testfs</literal>
226 [root@oss1 ~]# umount /mnt/ost0
227 Lustre: Failing over testfs-OST0000
228 Lustre: server umount testfs-OST0000 complete
230 <para>Gracefully stopping a server with the
231 <literal>umount</literal> command preserves the state of the connected
232 clients. The next time the server is started, it waits for clients to
233 reconnect, and then goes through the recovery procedure.</para>
235 <literal>-f</literal>) flag is used, then the server evicts all clients and
236 stops WITHOUT recovery. Upon restart, the server does not wait for
237 recovery. Any currently connected clients receive I/O errors until they
240 <para>If you are using loopback devices, use the
241 <literal>-d</literal> flag. This flag cleans up loop devices and can
242 always be safely specified.</para>
245 <section xml:id="failover_ost">
248 <primary>operations</primary>
249 <secondary>failover</secondary>
250 </indexterm>Specifying Failout/Failover Mode for OSTs</title>
251 <para>In a Lustre file system, an OST that has become unreachable because
252 it fails, is taken off the network, or is unmounted can be handled in one
256 <para>In <literal>failout</literal> mode, Lustre clients immediately
257 receive errors (EIOs) after a timeout, instead of waiting for the OST
261 <para>In <literal>failover</literal> mode, Lustre clients wait for the
262 OST to recover.</para>
265 <para>By default, the Lustre file system uses
266 <literal>failover</literal> mode for OSTs. To specify
267 <literal>failout</literal> mode instead, use the
268 <literal>--param="failover.mode=failout"</literal> option as shown below
269 (entered on one line):</para>
271 oss# mkfs.lustre --fsname=<replaceable>fsname</replaceable> --mgsnode=<replaceable>mgs_NID</replaceable> \
272 --param=failover.mode=failout --ost --index=<replaceable>ost_index</replaceable> <replaceable>/dev/ost_block_device</replaceable>
274 <para>In the example below,
275 <literal>failout</literal> mode is specified for the OSTs on the MGS
276 <literal>mds0</literal> in the file system
<literal>testfs</literal> (entered on one line).</para>
279 oss# mkfs.lustre --fsname=testfs --mgsnode=mds0 --param=failover.mode=failout \
280 --ost --index=3 /dev/sdb
283 <para>Before running this command, unmount all OSTs that will be affected
284 by a change in <literal>failover</literal>/<literal>failout</literal> mode.
288 <para>After initial file system configuration, use the
289 <literal>tunefs.lustre</literal> utility to change the mode. For example,
290 to set the <literal>failout</literal> mode, run:</para>
293 # tunefs.lustre --param failover.mode=failout <replaceable>/dev/ost_device</replaceable>
298 <section xml:id="degraded_ost">
301 <primary>operations</primary>
302 <secondary>degraded OST RAID</secondary>
303 </indexterm>Handling Degraded OST RAID Arrays</title>
304 <para>Lustre includes functionality that notifies Lustre if an external
305 RAID array has degraded performance (resulting in reduced overall file
306 system performance), either because a disk has failed and not been
307 replaced, or because a disk was replaced and is undergoing a rebuild. To
308 avoid a global performance slowdown due to a degraded OST, the MDS can
309 avoid the OST for new object allocation if it is notified of the degraded
311 <para>A parameter for each OST, called
312 <literal>degraded</literal>, specifies whether the OST is running in
313 degraded mode or not.</para>
314 <para>To mark the OST as degraded, use:</para>
316 oss# lctl set_param obdfilter.{OST_name}.degraded=1
318 <para>To mark that the OST is back in normal operation, use:</para>
320 oss# lctl set_param obdfilter.{OST_name}.degraded=0
322 <para>To determine if OSTs are currently in degraded mode, use:</para>
324 oss# lctl get_param obdfilter.*.degraded
326 <para>If the OST is remounted due to a reboot or other condition, the flag
328 <literal>0</literal>.</para>
329 <para>It is recommended that this be implemented by an automated script
330 that monitors the status of individual RAID devices, such as MD-RAID's
331 <literal>mdadm(8)</literal> command with the <literal>--monitor</literal>
332 option to mark an affected device degraded or restored.</para>
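<para>One possible shape for such a script is an event handler of the kind
that <literal>mdadm --monitor --program</literal> invokes, with the event
name as the first argument and the MD device as the second. The sketch below
is an assumption-laden illustration: the mapping from MD device to OST name
(<literal>OST_NAME</literal>) is hypothetical and site-specific, the choice
of which events count as degraded or restored is a judgment call, and the
<literal>lctl</literal> command is printed rather than executed.</para>

```shell
#!/bin/sh
# Sketch of an event handler in the style of "mdadm --monitor --program".
# mdadm passes the event name as $1 and the md device as $2.
# OST_NAME is a hypothetical, site-specific mapping from md device to OST.
OST_NAME="${OST_NAME:-testfs-OST0000}"

degraded_value_for_event() {
    case "$1" in
        Fail|DegradedArray|RebuildStarted) echo 1 ;;  # treat array as degraded
        RebuildFinished|SpareActive)       echo 0 ;;  # treat array as restored
        *)                                 echo "" ;; # event not handled here
    esac
}

handle_event() {
    value="$(degraded_value_for_event "$1")"
    if [ -n "$value" ]; then
        # Printed rather than executed so the sketch is safe to run anywhere.
        echo "lctl set_param obdfilter.$OST_NAME.degraded=$value"
    fi
}

handle_event "${1:-DegradedArray}" "${2:-/dev/md0}"
```

<para>A production handler would also need to confirm the array state after a
<literal>RebuildFinished</literal> event before clearing the flag, since this
sketch does not check whether the rebuild actually succeeded.</para>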
334 <section xml:id="lustre_configure_multiple_fs">
337 <primary>operations</primary>
338 <secondary>multiple file systems</secondary>
339 </indexterm>Running Multiple Lustre File Systems</title>
340 <para>Lustre supports multiple file systems provided the combination of
341 <literal>NID:fsname</literal> is unique. Each file system must be allocated
342 a unique name during creation with the
343 <literal>--fsname</literal> parameter. Unique names for file systems are
344 enforced if a single MGS is present. If multiple MGSs are present (for
345 example if you have an MGS on every MDS) the administrator is responsible
for ensuring file system names are unique. A single MGS and unique file
system names provide a single point of administration and allow commands
348 to be issued against the file system even if it is not mounted.</para>
<para>Lustre supports multiple file systems on a single MGS. With a single
MGS, fsnames are guaranteed to be unique. Lustre also allows multiple MGSs
to co-exist. For example, multiple MGSs are necessary if multiple file
systems on different Lustre software versions are to be concurrently
available. With multiple MGSs, additional care must be taken to ensure file
system names are unique. Each file system should have a unique fsname among
all systems that may interoperate in the future.</para>
356 <para>By default, the
357 <literal>mkfs.lustre</literal> command creates a file system named
358 <literal>lustre</literal>. To specify a different file system name (limited
359 to 8 characters) at format time, use the
360 <literal>--fsname</literal> option:</para>
363 oss# mkfs.lustre --fsname=<replaceable>file_system_name</replaceable>
367 <para>The MDT, OSTs and clients in the new file system must use the same
368 file system name (prepended to the device name). For example, for a new
369 file system named <literal>foo</literal>, the MDT and two OSTs would be
370 named <literal>foo-MDT0000</literal>,
371 <literal>foo-OST0000</literal>, and
372 <literal>foo-OST0001</literal>.</para>
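<para>The naming convention above can be sketched as a small helper that
builds a target name from the file system name, the target type, and the
index, and rejects file system names longer than 8 characters. The function
name <literal>target_name</literal> is hypothetical, and rendering the index
as four hexadecimal digits is inferred from labels such as
<literal>OST0a19</literal>.</para>

```shell
#!/bin/sh
# Sketch: build a target name from a file system name, a target kind
# (MDT or OST) and a decimal index. The index is rendered as four hex
# digits, which is an inference from labels such as "OST0a19".
target_name() {
    fsname="$1"; kind="$2"; index="$3"
    if [ "${#fsname}" -gt 8 ]; then
        echo "fsname '$fsname' is longer than 8 characters" > /dev/stderr
        return 1
    fi
    printf '%s-%s%04x\n' "$fsname" "$kind" "$index"
}

# Names for a file system "foo" with one MDT and two OSTs.
target_name foo MDT 0
target_name foo OST 0
target_name foo OST 1
```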
374 <para>To mount a client on the file system, run:</para>
376 client# mount -t lustre <replaceable>mgsnode</replaceable>:<replaceable>/new_fsname</replaceable> <replaceable>/mount_point</replaceable>
378 <para>For example, to mount a client on file system foo at mount point
379 /mnt/foo, run:</para>
381 client# mount -t lustre mgsnode:/foo /mnt/foo
<para>If a client will be mounted on several file systems, add the
following line to the <literal>/etc/xattr.conf</literal> file to avoid
problems when files are moved between the file systems:
<literal>lustre.* skip</literal></para>
<para>To ensure that a new MDT is added to an existing MGS, create the MDT
392 <literal>--mdt --mgsnode=<replaceable>mgs_NID</replaceable></literal>.
395 <para>A Lustre installation with two file systems (
396 <literal>foo</literal> and
397 <literal>bar</literal>) could look like this, where the MGS node is
398 <literal>mgsnode@tcp0</literal> and the mount points are
399 <literal>/mnt/foo</literal> and
400 <literal>/mnt/bar</literal>.</para>
402 mgsnode# mkfs.lustre --mgs /dev/sda
403 mdtfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --mdt --index=0
405 ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=0
407 ossfoonode# mkfs.lustre --fsname=foo --mgsnode=mgsnode@tcp0 --ost --index=1
409 mdtbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --mdt --index=0
411 ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=0
413 ossbarnode# mkfs.lustre --fsname=bar --mgsnode=mgsnode@tcp0 --ost --index=1
416 <para>To mount a client on file system foo at mount point
417 <literal>/mnt/foo</literal>, run:
420 client# mount -t lustre mgsnode@tcp0:/foo /mnt/foo
422 <para>To mount a client on file system bar at mount point
423 <literal>/mnt/bar</literal>, run:</para>
425 client# mount -t lustre mgsnode@tcp0:/bar /mnt/bar
428 <section xml:id="lfsmkdir">
431 <primary>operations</primary>
432 <secondary>remote directory</secondary>
433 </indexterm>Creating a sub-directory on a specific MDT</title>
<para>It is possible to create individual directories, along with their
files and sub-directories, to be stored on specific MDTs. To create
a sub-directory on a given MDT, use the command:
439 client$ lfs mkdir -i <replaceable>mdt_index</replaceable> <replaceable>/mount_point/remote_dir</replaceable>
441 <para>This command will allocate the sub-directory
442 <literal>remote_dir</literal> onto the MDT with index
443 <literal>mdt_index</literal>. For more information on adding additional
444 MDTs and <literal>mdt_index</literal> see <xref linkend='addmdtindex' />.
447 <para>An administrator can allocate remote sub-directories to separate
448 MDTs. Creating remote sub-directories in parent directories not hosted on
449 MDT0000 is not recommended. This is because the failure of the parent MDT
450 will leave the namespace below it inaccessible. For this reason, by
451 default it is only possible to create remote sub-directories off MDT0000.
452 To relax this restriction and enable remote sub-directories off any MDT,
453 an administrator must issue the following command on the MGS:
455 mgs# lctl set_param -P mdt.<replaceable>fsname-MDT*</replaceable>.enable_remote_dir=1
457 For Lustre filesystem 'scratch', the command executed is:
459 mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir=1
To verify the configuration setting, execute the following command on any
464 mds# lctl get_param mdt.*.enable_remote_dir
<para condition='l28'>With Lustre software version 2.8, a new
tunable is available to allow users with a specific group ID to create
and delete remote and striped directories. This tunable is
<literal>enable_remote_dir_gid</literal>. For example, setting this
parameter to the 'wheel' or 'admin' group ID allows users with that GID
to create and delete remote and striped directories. Setting this
parameter to <literal>-1</literal> on MDT0000 permanently allows any
non-root user to create and delete remote and striped directories.
On the MGS, execute the following command:
478 mgs# lctl set_param -P mdt.<replaceable>fsname-*</replaceable>.enable_remote_dir_gid=-1
For the Lustre filesystem 'scratch', the command expands to:
482 mgs# lctl set_param -P mdt.scratch-*.enable_remote_dir_gid=-1
484 The change can be verified by executing the following command on every MDS:
486 mds# lctl get_param mdt.<replaceable>*</replaceable>.enable_remote_dir_gid
490 <section xml:id="lfsmkdirdne2" condition='l28'>
493 <primary>operations</primary>
494 <secondary>striped directory</secondary>
497 <primary>operations</primary>
498 <secondary>mkdir</secondary>
501 <primary>operations</primary>
502 <secondary>setdirstripe</secondary>
505 <primary>striping</primary>
506 <secondary>metadata</secondary>
507 </indexterm>Creating a directory striped across multiple MDTs</title>
<para>The Lustre 2.8 DNE feature enables files in a single large
directory to be distributed across multiple MDTs (a <emphasis>striped
directory</emphasis>), if multiple MDTs have been added to the
filesystem; see <xref linkend="lustremaint.adding_new_mdt"/>.
512 The result is that metadata requests for files in a single large
513 striped directory are serviced by multiple MDTs and metadata
514 service load is distributed over all the MDTs that service a given
515 directory. By distributing metadata service load over multiple MDTs,
516 performance of very large directories can be improved beyond the limit
517 of one MDT. Normally, all files in a directory must be created
518 on a single MDT.</para>
<para>The command to stripe a directory over
520 <replaceable>mdt_count</replaceable> MDTs is:
522 client$ lfs mkdir -c <replaceable>mdt_count</replaceable> <replaceable>/mount_point/new_directory</replaceable>
525 <para>The striped directory feature is most useful for distributing
526 a single large directory (50k entries or more) across multiple MDTs.
527 This should be used with discretion since creating and removing striped
528 directories incurs more overhead than non-striped directories.</para>
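<para>As a planning aid, the guidance above could be approximated by counting
the entries of an existing directory and comparing the count to a threshold.
This is only a sketch: the function names are hypothetical, the default of
50000 simply mirrors the "50k entries" figure in the text, and file names
containing newlines would be miscounted.</para>

```shell
#!/bin/sh
# Sketch: suggest striping only for directories at or above a threshold.
# The 50000 default mirrors the "50k entries" guidance; adjust as needed.
count_entries() {
    # Count direct children only (not the directory itself, not descendants).
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

worth_striping() {
    dir="$1"; threshold="${2:-50000}"
    [ "$(count_entries "$dir")" -ge "$threshold" ]
}

# Example use against the current directory (or the path in DIR, if set).
if worth_striping "${DIR:-.}"; then
    echo "large directory: striping may help"
else
    echo "small directory: a non-striped directory is likely sufficient"
fi
```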
529 <section xml:id="lfsmkdirbyspace" condition='l2D'>
530 <title>Directory creation by space/inode usage</title>
531 <para>If the starting MDT is not specified when creating a new directory,
532 this directory and its stripes will be distributed on MDTs by space usage.
533 For example the following will create a new directory on an MDT
534 preferring one that has less space usage:</para>
536 client$ lfs mkdir -c 1 -i -1 <replaceable>dir1</replaceable>
538 <para>Alternatively, if a default directory stripe is set on a directory,
539 the subsequent use of <literal>mkdir</literal> for subdirectories in
540 <replaceable>dir1</replaceable> will have the same effect:
542 client$ lfs setdirstripe -D -c 1 -i -1 <replaceable>dir1</replaceable>
545 <para>The policy is:</para>
<listitem><para>If the free inodes/blocks on all MDTs are almost the same,
i.e. <literal>max_inodes_avail * 84% &lt; min_inodes_avail</literal> and
<literal>max_blocks_avail * 84% &lt; min_blocks_avail</literal>, then
MDTs are chosen round-robin.</para></listitem>
551 <listitem><para>Otherwise, create more subdirectories on MDTs with more
552 free inodes/blocks.</para></listitem>
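<para>The balance test in the policy above can be written out with integer
arithmetic. The sketch below is illustrative only: the function names are
hypothetical, and the label <literal>weighted</literal> is shorthand for
"prefer MDTs with more free inodes/blocks".</para>

```shell
#!/bin/sh
# Sketch of the balance test: two values are "almost the same" when
# max * 84% is below min. Integer arithmetic avoids floating point.
almost_same() {
    max="$1"; min="$2"
    [ $((max * 84)) -lt $((min * 100)) ]
}

choose_policy() {
    # Arguments: max_inodes_avail min_inodes_avail max_blocks_avail min_blocks_avail
    if almost_same "$1" "$2"; then
        if almost_same "$3" "$4"; then
            echo "round-robin"
            return 0
        fi
    fi
    # Otherwise prefer MDTs with more free inodes/blocks.
    echo "weighted"
}

# Sample values (assumed, for illustration only).
choose_policy 1000 900 5000 4800
choose_policy 1000 800 5000 4800
```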
<para>When there are many MDTs, it is not always desirable to
stripe a directory across all of them, even if the directory default
<literal>stripe_count=-1</literal> (unlimited).
In this case, the per-filesystem tunable parameter
<literal>lod.*.max_mdt_stripecount</literal> can be used to limit the
actual stripe count of a directory to fewer than the full MDT count.
If <literal>lod.*.max_mdt_stripecount</literal> is not 0, and the
directory <literal>stripe_count=-1</literal>, the real directory
stripe count will be the minimum of the number of MDTs and
<literal>max_mdt_stripecount</literal>.
If <literal>lod.*.max_mdt_stripecount=0</literal>, or an explicit
stripe count is given for the directory, the tunable is ignored.
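<para>The rule for the resulting stripe count can be sketched as a small
function. This is an illustration under stated assumptions: the function name
is hypothetical, and the case where the tunable is 0 and
<literal>stripe_count=-1</literal> is assumed to stripe across all MDTs,
since <literal>-1</literal> is described as unlimited.</para>

```shell
#!/bin/sh
# Sketch: compute the directory stripe count that would actually be used.
# Arguments: stripe_count mdt_count max_mdt_stripecount
effective_stripe_count() {
    stripe_count="$1"; mdt_count="$2"; max_mdt_stripecount="$3"
    if [ "$stripe_count" -ne -1 ]; then
        echo "$stripe_count"            # explicit count: tunable is ignored
    elif [ "$max_mdt_stripecount" -eq 0 ]; then
        echo "$mdt_count"               # tunable is 0: assumed to mean all MDTs
    elif [ "$max_mdt_stripecount" -lt "$mdt_count" ]; then
        echo "$max_mdt_stripecount"     # minimum of the two is the tunable
    else
        echo "$mdt_count"               # minimum of the two is the MDT count
    fi
}

# Sample values (assumed): 16 MDTs, tunable set to 4, unlimited stripe count.
effective_stripe_count -1 16 4
```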
567 <para>To set <literal>max_mdt_stripecount</literal>, on all MDSes of
570 mgs# lctl set_param -P lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount=<N>
573 <para>To check <literal>max_mdt_stripecount</literal>, run:
575 mds# lctl get_param lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount
578 <para>To reset <literal>max_mdt_stripecount</literal>, run:
580 mgs# lctl set_param -P -d lod.$fsname-MDTxxxx-mdtlov.max_mdt_stripecount
584 <section xml:id="fsdefaultlmv" condition='l2E'>
585 <title>Filesystem-wide default directory striping</title>
<para>Similar to file object allocation, directory objects are
allocated on MDTs by a round-robin algorithm or a weighted algorithm. For
the top three levels of directories from the root of the filesystem, if the
amount of free inodes and blocks is well balanced (i.e., by default, when
the free inodes and blocks across MDTs differ by less than 5%), the
round-robin algorithm is used to select the next MDT on which a directory
594 <para>If the directory is more than three levels below the root directory,
595 or MDTs are not balanced, then the weighted algorithm is used to randomly
596 select an MDT with more free inodes and blocks.
<para>To avoid creating unnecessary remote directories, a new directory
is created on the same MDT as its parent directory if that MDT is not too
full (that is, the parent MDT is not more than 5% fuller than the average
of all MDTs).
<para>If an administrator wants to change this default filesystem-wide
directory striping, the following command limits this striping to
the top level below the root directory:</para>
607 client$ lfs setdirstripe -D -i -1 -c 1 --max-inherit 0 <mountpoint>
609 <para>To revert to the pre-2.15 behavior of all directories being created
610 only on MDT0000 by default (deleting this striping won't work because it
611 will be recreated if missing):</para>
613 client$ lfs setdirstripe -D -i 0 -c 1 --max-inherit 0 <mountpoint>
617 <section xml:id="default_dir_stripe_policy">
620 <primary>operations</primary>
621 <secondary>default dir stripe policy</secondary>
622 </indexterm>Default Dir Stripe Policy</title>
<para>If a default directory stripe policy is set on a directory, it is
applied to subdirectories created later. For example:
627 $ lfs setdirstripe testdir1 -D -c 2
628 $ lfs getdirstripe testdir1 -D
629 lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 3 lmv_max_inherit_rr: 0
631 $ lfs getdirstripe testdir1/subdir1
632 lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: crush
633 mdtidx FID[seq:oid:ver]
634 0 [0x200000400:0x2:0x0]
635 1 [0x240000401:0x2:0x0]
<para>The default directory stripe policy can be inherited by
subdirectories. This behavior is controlled by the
<literal>lmv_max_inherit</literal> parameter. If
<literal>lmv_max_inherit</literal> is 0 or 1, subdirectories do not
inherit the default directory stripe policy. Otherwise, a subdirectory
takes its parent's <literal>lmv_max_inherit</literal> value decreased by
one as its own <literal>lmv_max_inherit</literal>.
The value -1 is special because it means unlimited. For example:
647 $ lfs getdirstripe testdir1/subdir1 -D
648 lmv_stripe_count: 2 lmv_stripe_offset: -1 lmv_hash_type: none lmv_max_inherit: 2 lmv_max_inherit_rr: 0
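<para>The inheritance rule can be summarized with a short sketch. The
function name is hypothetical, and the word <literal>none</literal> is used
here only to mark the case where the subdirectory does not inherit the
default policy.</para>

```shell
#!/bin/sh
# Sketch: derive a subdirectory's lmv_max_inherit from its parent's value.
child_max_inherit() {
    parent="$1"
    case "$parent" in
        -1)  printf '%s\n' "-1" ;;             # unlimited stays unlimited
        0|1) printf '%s\n' "none" ;;           # subdirectory does not inherit
        *)   printf '%s\n' "$((parent - 1))" ;; # one level fewer remains
    esac
}

# A parent with lmv_max_inherit 3 yields 2 for its subdirectory,
# matching the getdirstripe output shown in the examples.
child_max_inherit 3
```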
<para><literal>lmv_max_inherit</literal> can be set explicitly with the
<literal>--max-inherit</literal> option of the
<literal>lfs setdirstripe -D</literal> command.
654 If the max-inherit value is not specified, the default value is -1
655 when <literal>stripe_count</literal> is 0 or 1.
656 For other values of <literal>stripe_count</literal>, the default value
660 <section xml:id="set_get_lustre_params">
663 <primary>operations</primary>
664 <secondary>parameters</secondary>
665 </indexterm>Setting and Retrieving Lustre Parameters</title>
666 <para>Several options are available for setting parameters in
670 <para>When creating a file system, use mkfs.lustre. See
<xref linkend="tuning_params_mkfs_lustre" /> below.</para>
674 <para>When a server is stopped, use tunefs.lustre. See
<xref linkend="setting_param_tunefs" /> below.</para>
678 <para>When the file system is running, use lctl to set or retrieve
679 Lustre parameters. See
<xref linkend="setting_param_with_lctl" /> and
<xref linkend="reporting_current_param" /> below.</para>
684 <section xml:id="tuning_params_mkfs_lustre">
685 <title>Setting Tunable Parameters with
686 <literal>mkfs.lustre</literal></title>
687 <para>When the file system is first formatted, parameters can simply be
688 added as a <literal>--param</literal> option to the
689 <literal>mkfs.lustre</literal> command. For example:</para>
691 mds# mkfs.lustre --mdt --param="sys.timeout=50" /dev/sda
<para>For more details about creating a file system, see
694 <xref linkend="configuringlustre" />. For more details about
695 <literal>mkfs.lustre</literal>, see
696 <xref linkend="systemconfigurationutilities" />.</para>
698 <section xml:id="setting_param_tunefs">
699 <title>Setting Parameters with
700 <literal>tunefs.lustre</literal></title>
701 <para>If a server (OSS or MDS) is stopped, parameters can be added to an
702 existing file system using the
703 <literal>--param</literal> option to the
704 <literal>tunefs.lustre</literal> command. For example:</para>
706 oss# tunefs.lustre --param=failover.node=192.168.0.13@tcp0 /dev/sda
<para>With <literal>tunefs.lustre</literal>, parameters are
<emphasis>additive</emphasis>: new parameters are added to the existing
parameters and do not replace them. To erase all old
<literal>tunefs.lustre</literal> parameters and use only the newly specified
parameters, run:</para>
714 mds# tunefs.lustre --erase-params --param=<replaceable>new_parameters</replaceable>
716 <para>The tunefs.lustre command can be used to set any parameter settable
717 via <literal>lctl conf_param</literal> and that has its own OBD device,
718 so it can be specified as
720 <replaceable>obdname|fsname</replaceable>.
721 <replaceable>obdtype</replaceable>.
722 <replaceable>proc_file_name</replaceable>=
723 <replaceable>value</replaceable></literal>. For example:</para>
725 mds# tunefs.lustre --param mdt.identity_upcall=NONE /dev/sda1
727 <para>For more details about <literal>tunefs.lustre</literal>, see
728 <xref linkend="systemconfigurationutilities" />.</para>
730 <section xml:id="setting_param_with_lctl">
731 <title>Setting Parameters with
732 <literal>lctl</literal></title>
733 <para>When the file system is running, the
734 <literal>lctl</literal> command can be used to set parameters (temporary
735 or permanent) and report current parameter values. Temporary parameters
736 are active as long as the server or client is not shut down. Permanent
737 parameters live through server and client reboots.</para>
739 <para>The <literal>lctl list_param</literal> command enables users to
740 list all parameters that can be set. See
741 <xref linkend="list_params" />.</para>
743 <para>For more details about the
744 <literal>lctl</literal> command, see the examples in the sections below
746 <xref linkend="systemconfigurationutilities" />.</para>
748 <title>Setting Temporary Parameters</title>
750 <literal>lctl set_param</literal> to set temporary parameters on the
751 node where it is run. These parameters internally map to corresponding
752 items in the kernel <literal>/proc/{fs,sys}/{lnet,lustre}</literal> and
753 <literal>/sys/{fs,kernel/debug}/lustre</literal> virtual filesystems.
754 However, since the mapping between a particular parameter name and the
755 underlying virtual pathname may change, it is <emphasis>not</emphasis>
756 recommended to access the virtual pathname directly. The
757 <literal>lctl set_param</literal> command uses this syntax:</para>
759 # lctl set_param [-n] [-P] <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable>
761 <para>For example:</para>
763 # lctl set_param osc.*.max_dirty_mb=1024
osc.myth-OST0000-osc.max_dirty_mb=1024
osc.myth-OST0001-osc.max_dirty_mb=1024
osc.myth-OST0002-osc.max_dirty_mb=1024
osc.myth-OST0003-osc.max_dirty_mb=1024
osc.myth-OST0004-osc.max_dirty_mb=1024
771 <section xml:id="setting_permanent_params">
772 <title>Setting Permanent Parameters</title>
<para>Use the <literal>lctl set_param -P</literal> or
<literal>lctl conf_param</literal> command to set permanent parameters.
In general, the <literal>set_param -P</literal> command is preferred
for new parameters, as this isolates the parameter settings from the
MDT and OST device configuration, and is consistent with the common
<literal>lctl get_param</literal> and <literal>lctl set_param</literal>
commands. The <literal>lctl conf_param</literal> command
was previously used to specify settable parameters, with the following
syntax (the same as the <literal>mkfs.lustre</literal> and
<literal>tunefs.lustre</literal> commands):</para>
<replaceable>obdname|fsname</replaceable>.<replaceable>obdtype</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable>
786 <note><para>The <literal>lctl conf_param</literal> and
787 <literal>lctl set_param</literal> syntax is <emphasis>not</emphasis>
788 the same.</para></note>
789 <para>Here are a few examples of
790 <literal>lctl conf_param</literal> commands:</para>
792 mgs# lctl conf_param testfs-MDT0000.sys.timeout=40
793 mgs# lctl conf_param testfs-MDT0000.mdt.identity_upcall=NONE
794 mgs# lctl conf_param testfs.llite.max_read_ahead_mb=16
795 mgs# lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15
796 mgs# lctl conf_param testfs-OST0000.ost.client_cache_seconds=15
797 mgs# lctl conf_param testfs.sys.timeout=40
800 <para>Parameters specified with the
801 <literal>lctl conf_param</literal> command are set permanently in the
802 file system's configuration file on the MGS.</para>
805 <section xml:id="setparamp" condition='l25'>
806 <title>Setting Permanent Parameters with lctl set_param -P</title>
807 <para>The <literal>lctl set_param -P</literal> command can also
808 set parameters permanently using the same syntax as the
809 <literal>lctl set_param</literal> and <literal>lctl
810 get_param</literal> commands. Permanent parameter settings must be
811 issued on the MGS. The given parameter is then set on every host using an
812 <literal>lctl</literal> upcall. The <literal>lctl set_param -P</literal>
813 command uses the following syntax:</para>
815 lctl set_param -P <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>proc_file_name</replaceable>=<replaceable>value</replaceable>
817 <para>For example:</para>
819 mgs# lctl set_param -P timeout=40
820 mgs# lctl set_param -P mdt.testfs-MDT*.identity_upcall=NONE
821 mgs# lctl set_param -P llite.testfs-*.max_read_ahead_mb=16
822 mgs# lctl set_param -P osc.testfs-OST*.max_dirty_mb=29.15
823 mgs# lctl set_param -P ost.testfs-OST*.client_cache_seconds=15
825 <para>Use the <literal>-P -d</literal> option to delete permanent
826 parameters. Syntax:</para>
828 lctl set_param -P -d <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>parameter_name</replaceable>
830 <para>For example:</para>
832 mgs# lctl set_param -P -d osc.*.max_dirty_mb
834 <note condition='l2c'><para>Starting in Lustre 2.12, the
835 <literal>lctl get_param</literal> command can provide
836 <emphasis>tab completion</emphasis> when using an interactive shell
837 with <literal>bash-completion</literal> installed. This simplifies
838 the use of <literal>get_param</literal> significantly, since it
839 provides an interactive list of available parameters.
842 <section xml:id="persistent_params">
843 <title>Listing Persistent Parameters</title>
844 <para>To list tunable parameters stored in the <literal>params</literal>
845 log file by <literal>lctl set_param -P</literal> and applied to nodes at
846 mount, run the <literal>lctl --device MGS llog_print params</literal>
847 command on the MGS. For example:</para>
849 mgs# lctl --device MGS llog_print params
850 - { index: 2, event: set_param, device: general, parameter: osc.*.max_dirty_mb, value: 1024 }
853 <section xml:id="list_params">
854 <title>Listing All Tunable Parameters</title>
855 <para>To list Lustre or LNet parameters that are available to set, use
856 the <literal>lctl list_param</literal> command with this syntax:</para>
858 lctl list_param [-FR] <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>
860 <para>The following arguments are available for the
861 <literal>lctl list_param</literal> command.</para>
863 <literal>-F</literal> Add '
864 <literal>/</literal>', '
865 <literal>@</literal>' or '
866 <literal>=</literal>' for directories, symlinks and writeable files,
869 <literal>-R</literal> Recursively lists all parameters under the
870 specified path</para>
871 <para>For example:</para>
873 oss# lctl list_param obdfilter.lustre-OST0000
876 <section xml:id="reporting_current_param">
877 <title>Reporting Current Parameter Values</title>
878 <para>To report current Lustre parameter values, use the
879 <literal>lctl get_param</literal> command with this syntax:</para>
881 lctl get_param [-n] <replaceable>obdtype</replaceable>.<replaceable>obdname</replaceable>.<replaceable>proc_file_name</replaceable>
883 <note condition='l2c'><para>Starting in Lustre 2.12, the
884 <literal>lctl get_param</literal> command can provide
885 <emphasis>tab completion</emphasis> when using an interactive shell
886 with <literal>bash-completion</literal> installed. This simplifies
887 the use of <literal>get_param</literal> significantly, since it
888 provides an interactive list of available parameters.
890 <para>This example reports data on RPC service times.</para>
892 oss# lctl get_param -n ost.*.ost_io.timeouts
893 service : cur 1 worst 30 (at 1257150393, 85d23h58m54s ago) 1 1 1 1
895 <para>This example reports the amount of space this client has reserved
896 for writeback cache with each OST:</para>
898 client# lctl get_param osc.*.cur_grant_bytes
899 osc.myth-OST0000-osc-ffff8800376bdc00.cur_grant_bytes=2097152
900 osc.myth-OST0001-osc-ffff8800376bdc00.cur_grant_bytes=33890304
901 osc.myth-OST0002-osc-ffff8800376bdc00.cur_grant_bytes=35418112
902 osc.myth-OST0003-osc-ffff8800376bdc00.cur_grant_bytes=2097152
903 osc.myth-OST0004-osc-ffff8800376bdc00.cur_grant_bytes=33808384
908 <section xml:id="failover_nids">
911 <primary>operations</primary>
912 <secondary>failover</secondary>
913 </indexterm>Specifying NIDs and Failover</title>
914 <para>If a node has multiple network interfaces, it may have multiple NIDs,
915 which must all be identified so other nodes can choose the NID that is
916 appropriate for their network interfaces. Typically, NIDs are specified in
917 a list delimited by commas (
918 <literal>,</literal>). However, when failover nodes are specified, the NIDs
919 are delimited by a colon (
920 <literal>:</literal>) or by repeating a keyword such as
921 <literal>--mgsnode=</literal> or
922 <literal>--servicenode=</literal>.</para>
923 <para>To display the NIDs of all servers in networks configured to work
924 with the Lustre file system, run (while LNet is running):</para>
928 <para>In the example below,
929 <literal>mds0</literal> and
930 <literal>mds1</literal> are configured as a combined MGS/MDT failover pair
931 and <literal>oss0</literal> and
932 <literal>oss1</literal> are configured as an OST failover pair. The Ethernet
934 <literal>mds0</literal> is 192.168.10.1, and for
935 <literal>mds1</literal> is 192.168.10.2. The Ethernet addresses for
936 <literal>oss0</literal> and
937 <literal>oss1</literal> are 192.168.10.20 and 192.168.10.21
940 mds0# mkfs.lustre --fsname=testfs --mdt --mgs \
941 --servicenode=192.168.10.2@tcp0 \
942 --servicenode=192.168.10.1@tcp0 /dev/sda1
943 mds0# mount -t lustre /dev/sda1 /mnt/test/mdt
944 oss0# mkfs.lustre --fsname=testfs --servicenode=192.168.10.20@tcp0 \
945 --servicenode=192.168.10.21@tcp0 --ost --index=0 \
946 --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 \
948 oss0# mount -t lustre /dev/sdb /mnt/test/ost0
949 client# mount -t lustre 192.168.10.1@tcp0:192.168.10.2@tcp0:/testfs \
951 mds0# umount /mnt/test/mdt
952 mds1# mount -t lustre /dev/sda1 /mnt/test/mdt
953 mds1# lctl get_param mdt.testfs-MDT0000.recovery_status
955 <para>Where multiple NIDs are specified separated by commas (for example,
956 <literal>10.67.73.200@tcp,192.168.10.1@tcp</literal>), the two NIDs refer
957 to the same host, and the Lustre software chooses the
958 <emphasis>best</emphasis> one for communication. When a pair of NIDs is
959 separated by a colon (for example,
960 <literal>10.67.73.200@tcp:10.67.73.201@tcp</literal>), the two NIDs refer
961 to two different hosts and are treated as a failover pair (the Lustre
962 software tries the first one, and if that fails, it tries the second
965 <literal>mkfs.lustre</literal> can be used to specify failover nodes. The
966 <literal>--servicenode</literal> option is used to specify all service NIDs,
967 including those for primary nodes and failover nodes. When the
968 <literal>--servicenode</literal> option is used, the first service node to
969 load the target device becomes the primary service node, while nodes
970 corresponding to the other specified NIDs become failover locations for the
971 target device. An older option, <literal>--failnode</literal>, specifies
972 just the NIDs of failover nodes. For more information about the
973 <literal>--servicenode</literal> and
974 <literal>--failnode</literal> options, see
975 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
976 linkend="configuringfailover" />.</para>
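<para>The comma and colon rules described above can be illustrated with a
short shell sketch. The helper name <literal>parse_nid_spec</literal> is
hypothetical and is not part of the Lustre tools. It assumes that its argument
holds only the NID portion of a specification, without the trailing
<literal>:/fsname</literal> used in a client mount command.</para>
<screen>
```shell
#!/bin/sh
# parse_nid_spec is a hypothetical helper, not a Lustre command.
# It splits a NID specification into failover hosts (separated by colons)
# and the NIDs that belong to each host (separated by commas).
parse_nid_spec() {
    spec=$1
    host=0
    old_ifs=$IFS
    IFS=:
    # Each colon-separated group is a different host in the failover list.
    for group in $spec; do
        host=$((host + 1))
        IFS=,
        # Each comma-separated entry is another NID of the same host.
        for nid in $group; do
            printf 'host%d %s\n' "$host" "$nid"
        done
        IFS=:
    done
    IFS=$old_ifs
}
```
</screen>
<para>Given <literal>10.67.73.200@tcp,192.168.10.1@tcp:10.67.73.201@tcp</literal>,
this sketch reports two NIDs for the first host and one NID for the second
(failover) host.</para>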
978 <section xml:id="erasing_filesystem">
981 <primary>operations</primary>
982 <secondary>erasing a file system</secondary>
983 </indexterm>Erasing a File System</title>
984 <para>If you want to erase a file system and permanently delete all the
985 data in the file system, run this command on your targets:</para>
987 # mkfs.lustre --reformat
989 <para>If you are using a separate MGS and want to keep other file systems
990 defined on that MGS, then set the
991 <literal>writeconf</literal> flag on the MDT for that file system. The
992 <literal>writeconf</literal> flag causes the configuration logs to be
993 erased; they are regenerated the next time the servers start.</para>
994 <para>To set the <literal>writeconf</literal> flag on the MDT:</para>
997 <para>Unmount all clients and servers using this file system. On a client, for example, run:</para>
999 client# umount /mnt/lustre
1003 <para>To permanently erase the file system and, presumably, replace it
1004 with another file system, run:</para>
1006 mgs# mkfs.lustre --reformat --fsname spfs --mgs --mdt --index=0 /dev/<replaceable>mdsdev</replaceable>
1010 <para>If you have a separate MGS (that you do not want to reformat),
1011 then add the <literal>--writeconf</literal> flag to
1012 <literal>mkfs.lustre</literal> on the MDT:</para>
1014 mgs# mkfs.lustre --reformat --writeconf --fsname spfs --mgsnode=<replaceable>mgs_nid</replaceable> \
1015 --mdt --index=0 <replaceable>/dev/mds_device</replaceable>
1020 <para>If you have a combined MGS/MDT, reformatting the MDT reformats the
1021 MGS as well, causing all configuration information to be lost; you can
1022 then start building your new file system. Nothing needs to be done with old
1023 disks that will not be part of the new file system; just do not mount
1027 <section xml:id="reclaiming_reserved_disk_space">
1030 <primary>operations</primary>
1031 <secondary>reclaiming space</secondary>
1032 </indexterm>Reclaiming Reserved Disk Space</title>
1033 <para>All current Lustre installations run the ldiskfs file system
1034 internally on service nodes. By default, ldiskfs reserves 5% of the disk
1035 space to avoid file system fragmentation. In order to reclaim this space,
1036 run the following command on your OSS for each OST in the file
1039 # tune2fs [-m reserved_blocks_percent] /dev/<replaceable>ostdev</replaceable>
1041 <para>You do not need to shut down Lustre before running this command or
1042 restart it afterwards.</para>
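<para>To estimate how much space a given reservation percentage represents, a
rough calculation such as the following can help. The helper name
<literal>reserved_kib</literal> is hypothetical. The sketch assumes an integer
percentage and a device size given in KiB, and it rounds down.</para>
<screen>
```shell
#!/bin/sh
# reserved_kib is a hypothetical helper, not part of e2fsprogs or Lustre.
# Arguments: device size in KiB, reserved percentage (integer).
# Prints the approximate size of the reservation in KiB, rounded down.
reserved_kib() {
    size_kib=$1
    percent=$2
    printf '%d\n' "$((size_kib * percent / 100))"
}
```
</screen>
<para>For a 1 TiB OST (1073741824 KiB), the default 5% reservation comes to
roughly 51 GiB.</para>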
1044 <para>Reducing the space reservation can cause severe performance
1045 degradation as the OST file system becomes more than 95% full, due to
1046 difficulty in locating large areas of contiguous free space. This
1047 performance degradation may persist even if the space usage drops below
1048 95% again. It is recommended NOT to reduce the reserved disk space below
1052 <section xml:id="replacing_existing_ost_mdt">
1055 <primary>operations</primary>
1056 <secondary>replacing an OST or MDS</secondary>
1057 </indexterm>Replacing an Existing OST or MDT</title>
1058 <para>To copy the contents of an existing OST to a new OST (or an old MDT
1059 to a new MDT), follow either of the OST/MDT backup processes in
1060 <xref linkend='backup_device' /> or
1061 <xref linkend='backup_fs_level' />.
1062 For more information on removing an MDT, see
1063 <xref linkend='lustremaint.rmremotedir' />.</para>
1065 <section xml:id="identifying_file_objects">
1068 <primary>operations</primary>
1069 <secondary>identifying OSTs</secondary>
1070 </indexterm>Identifying To Which Lustre File an OST Object Belongs</title>
1071 <para>Use this procedure to identify the file containing a given object on
1075 <para>On the OST (as root), run
1076 <literal>debugfs</literal> to display the file identifier (
1077 <literal>FID</literal>) of the file associated with the object.</para>
1078 <para>For example, if the object is
1079 <literal>34976</literal> on
1080 <literal>/dev/lustre/ost_test2</literal>, the <literal>debugfs</literal> command is:
1082 # debugfs -c -R "stat /O/0/d$((34976 % 32))/34976" /dev/lustre/ost_test2
1084 <para>The command output is:
1086 debugfs 1.45.6.wc1 (20-Mar-2020)
1087 /dev/lustre/ost_test2: catastrophic mode - not reading inode or group bitmaps
1088 Inode: 352365 Type: regular Mode: 0666 Flags: 0x80000
1089 Generation: 2393149953 Version: 0x0000002a:00005f81
1090 User: 1000 Group: 1000 Size: 260096
1091 File ACL: 0 Directory ACL: 0
1092 Links: 1 Blockcount: 512
1093 Fragment: Address: 0 Number: 0 Size: 0
1094 ctime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
1095 atime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
1096 mtime: 0x4a216b48:00000000 -- Sat May 30 13:22:16 2009
1097 crtime: 0x4a216b3c:975870dc -- Sat May 30 13:22:04 2009
1098 Size of extra inode fields: 24
1099 Extended attributes stored in inode body:
1100 fid = "b9 da 24 00 00 00 00 00 6a fa 0d 3f 01 00 00 00 eb 5b 0b 00 00 00 00 00
1101 00 00 00 00 00 00 00 00 " (32)
1102 fid: objid=34976 seq=0 parent=[0x200000400:0x122:0x0] stripe=1
1104 (0-64):4620544-4620607
1109 <para>The parent FID will be of the form
1110 <literal>[0x200000400:0x122:0x0]</literal> and can be resolved directly
1111 using the command <literal>lfs fid2path [0x200000400:0x122:0x0]
1112 /mnt/lustre</literal> on any Lustre client, and the process is
1116 <para>In the case of an upgraded 1.x inode (if the first part of the
1117 FID is below 0x200000400), the MDT inode number is
1118 <literal>0x24dab9</literal> and the generation is
1119 <literal>0x3f0dfa6a</literal>; the pathname can also be resolved
1120 using <literal>debugfs</literal>.</para>
1123 <para>On the MDS (as root), use
1124 <literal>debugfs</literal> to find the file associated with the
1127 # debugfs -c -R "ncheck 0x24dab9" /dev/lustre/mdt_test
1128 debugfs 1.42.3.wc3 (15-Aug-2012)
1129 /dev/lustre/mdt_test: catastrophic mode - not reading inode or group bitmaps
1131 2415289 /ROOT/brian-laptop-guest/clients/client11/~dmtmp/PWRPNT/ZD16.BMP
1135 <para>The command lists the inode and pathname associated with the
1139 The <literal>debugfs</literal> <literal>ncheck</literal> command is a brute-force search that may
1140 take a long time to complete.</para>
1143 <para>To find the Lustre file from a disk LBA, follow the steps listed in
1144 the document at this URL:
1145 <link xl:href="https://www.smartmontools.org/wiki/BadBlockHowto">
1146 https://www.smartmontools.org/wiki/BadBlockHowto</link>. Then,
1147 follow the steps above to resolve the Lustre filename.</para>
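<para>The two calculations used in the procedure above can be sketched in
shell. Both helper names are hypothetical. <literal>ost_object_path</literal>
follows the modulo-32 directory layout used in the
<literal>debugfs</literal> command above. <literal>le_hex</literal> assumes
that the first eight bytes of the raw <literal>fid</literal> attribute hold
the MDT inode number and the next four bytes hold the generation, both stored
little-endian, which is consistent with the values quoted above.</para>
<screen>
```shell
#!/bin/sh
# ost_object_path and le_hex are hypothetical helpers, not Lustre commands.

# Print the debugfs path of an OST object, using the same
# "object ID modulo 32" subdirectory rule as the debugfs example.
ost_object_path() {
    objid=$1
    printf '/O/0/d%d/%d\n' "$((objid % 32))" "$objid"
}

# Combine little-endian bytes (one hex byte per argument) into a single
# hexadecimal number by reversing their order.
le_hex() {
    result=
    for byte in "$@"; do
        result=$byte$result
    done
    printf '0x%x\n' "$((0x$result))"
}
```
</screen>
<para>For object <literal>34976</literal>,
<literal>ost_object_path 34976</literal> prints
<literal>/O/0/d0/34976</literal>. Running
<literal>le_hex b9 da 24 00 00 00 00 00</literal> and
<literal>le_hex 6a fa 0d 3f</literal> on the first twelve bytes of the
<literal>fid</literal> attribute prints <literal>0x24dab9</literal> and
<literal>0x3f0dfa6a</literal>.</para>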
1152 vim:expandtab:shiftwidth=2:tabstop=8: