1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
4 xml:id="managingfilesystemio">
5 <title xml:id="managingfilesystemio.title">Managing the File System and
7 <section xml:id="dbdoclet.50438211_17536">
10 <primary>I/O</primary>
13 <primary>I/O</primary>
14 <secondary>full OSTs</secondary>
15 </indexterm>Handling Full OSTs</title>
16 <para>Sometimes a Lustre file system becomes unbalanced, often due to
incorrectly specified stripe settings, or when very large files are created
18 that are not striped over all of the OSTs. Lustre will automatically avoid
19 allocating new files on OSTs that are full. If an OST is completely full and
20 more data is written to files already located on that OST, an error occurs.
21 The procedures below describe how to handle a full OST.</para>
22 <para>The MDS will normally handle space balancing automatically at file
23 creation time, and this procedure is normally not needed, but manual data
24 migration may be desirable in some cases (e.g. creating very large files
25 that would consume more than the total free space of the full OSTs).</para>
29 <primary>I/O</primary>
30 <secondary>OST space usage</secondary>
31 </indexterm>Checking OST Space Usage</title>
32 <para>The example below shows an unbalanced file system:</para>
35 UUID bytes Used Available \
37 testfs-MDT0000_UUID 4.4G 214.5M 3.9G \
39 testfs-OST0000_UUID 2.0G 751.3M 1.1G \
40 37% /mnt/testfs[OST:0]
41 testfs-OST0001_UUID 2.0G 755.3M 1.1G \
42 37% /mnt/testfs[OST:1]
43 testfs-OST0002_UUID 2.0G 1.7G 155.1M \
44 86% /mnt/testfs[OST:2] ****
45 testfs-OST0003_UUID 2.0G 751.3M 1.1G \
46 37% /mnt/testfs[OST:3]
47 testfs-OST0004_UUID 2.0G 747.3M 1.1G \
48 37% /mnt/testfs[OST:4]
49 testfs-OST0005_UUID 2.0G 743.3M 1.1G \
50 36% /mnt/testfs[OST:5]
52 filesystem summary: 11.8G 5.4G 5.8G \
55 <para>In this case, OST0002 is almost full and when an attempt is made to
56 write additional information to the file system (even with uniform
57 striping over all the OSTs), the write command fails as follows:</para>
client# lfs setstripe -s 4M -o 0 -c -1 /mnt/testfs
60 client# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100
61 dd: writing '/mnt/testfs/test_3': No space left on device
64 1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s
70 <primary>I/O</primary>
71 <secondary>disabling OST creates</secondary>
72 </indexterm>Disabling creates on a Full OST</title>
<para>If OST usage is imbalanced and one or more OSTs are close to full
while others still have plenty of free space, the MDS will normally avoid
allocating new files on the full OST(s) automatically, so the file system
does not run out of space. To guarantee that no new objects are allocated
on the full OST(s), they may also be deactivated manually on the
MDS.</para>
81 <para>Log into the MDS server and use the <literal>lctl</literal>
82 command to stop new object creation on the full OST(s):
85 mds# lctl set_param osp.<replaceable>fsname</replaceable>-OST<replaceable>nnnn</replaceable>*.max_create_count=0
89 <para>When new files are created in the file system, they will only use
90 the remaining OSTs. Either manual space rebalancing can be done by
91 migrating data to other OSTs, as shown in the next section, or normal
92 file deletion and creation can passively rebalance the space usage.</para>
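<para>For example, to stop new object allocation on the full OST shown
above (<literal>OST0002</literal> in the <literal>testfs</literal> file
system) and then confirm the setting, the following illustrative commands
could be run on the MDS:</para>
<screen>
mds# lctl set_param osp.testfs-OST0002*.max_create_count=0
mds# lctl get_param osp.testfs-OST0002*.max_create_count
</screen>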
97 <primary>I/O</primary>
98 <secondary>migrating data</secondary>
101 <primary>maintenance</primary>
102 <secondary>full OSTs</secondary>
103 </indexterm>Migrating Data within a File System</title>
105 <para>If there is a need to move the file data from the current
106 OST(s) to new OST(s), the data must be migrated (copied)
107 to the new location. The simplest way to do this is to use the
108 <literal>lfs_migrate</literal> command, as described in
109 <xref linkend="lustremaint.adding_new_ost" />.</para>
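<para>As an illustrative sketch only (the OST name, mount point, and size
threshold are examples), files residing on the full OST can be located
with <literal>lfs find</literal> and passed to
<literal>lfs_migrate</literal>:</para>
<screen>
client# lfs find /mnt/testfs --ost testfs-OST0002_UUID --size +1G | lfs_migrate -y
</screen>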
114 <primary>I/O</primary>
115 <secondary>bringing OST online</secondary>
118 <primary>maintenance</primary>
119 <secondary>bringing OST online</secondary>
120 </indexterm>Returning an Inactive OST Back Online</title>
<para>Once the full OST(s) are no longer severely imbalanced, due to
either active or passive data redistribution, they should be reactivated
so that new files will again be allocated on them.</para>
mds# lctl set_param osp.testfs-OST0002.max_create_count=20000
130 <primary>migrating metadata</primary>
131 </indexterm>Migrating Metadata within a Filesystem</title>
132 <section remap="h3" condition='l28'>
134 <primary>migrating metadata</primary>
135 </indexterm>Whole Directory Migration</title>
136 <para>Lustre software version 2.8 includes a feature
137 to migrate metadata (directories and inodes therein) between MDTs.
138 This migration can only be performed on whole directories. Striped
139 directories are not supported until Lustre 2.12. For example, to
140 migrate the contents of the <literal>/testfs/remotedir</literal>
directory from the MDT where it is currently located to MDT0000, so that
the original MDT can be removed, the sequence of commands is as follows:
145 $ lfs getdirstripe -m ./remotedir <lineannotation>which MDT is dir on?</lineannotation>
147 $ touch ./remotedir/file.{1,2,3}.txt<lineannotation>create test files</lineannotation>
148 $ lfs getstripe -m ./remotedir/file.*.txt<lineannotation>check files are on MDT0001</lineannotation>
$ lfs migrate -m 0 ./remotedir <lineannotation>migrate remotedir to MDT0000</lineannotation>
153 $ lfs getdirstripe -m ./remotedir <lineannotation>which MDT is dir on now?</lineannotation>
155 $ lfs getstripe -m ./remotedir/file.*.txt<lineannotation>check files are on MDT0000</lineannotation>
159 <para>For more information, see <literal>man lfs-migrate</literal>.
161 <warning><para>During migration each file receives a new identifier
162 (FID). As a consequence, the file will report a new inode number to
163 userspace applications. Some system tools (for example, backup and
164 archiving tools, NFS, Samba) that identify files by inode number may
165 consider the migrated files to be new, even though the contents are
unchanged. If a Lustre file system is re-exported via NFS, the migrated
files may become inaccessible during and after migration if the NFS
client or server is caching a stale file handle with the old FID.
169 Restarting the NFS service will flush the local file handle cache,
170 but clients may also need to be restarted as they may cache stale
171 file handles as well.
174 <section remap="h3" condition='l2C'>
176 <primary>migrating metadata</primary>
177 </indexterm>Striped Directory Migration</title>
<para>Lustre 2.8 included a feature to migrate metadata (directories
and the inodes therein) between MDTs; however, it did not support
migration of striped directories or changing the stripe count of an
existing directory. Lustre 2.12 adds support for migrating and
restriping directories. The <literal>lfs migrate -m</literal> command
can only be performed on whole directories, though it migrates both
the specified directory and its sub-entries recursively.
185 For example, to migrate the contents of a large directory
186 <literal>/testfs/largedir</literal> from its current location on
187 MDT0000 to MDT0001 and MDT0003, run the following command:</para>
188 <screen>$ lfs migrate -m 1,3 /testfs/largedir</screen>
<para>Metadata migration moves the file directory entries (dirents) and
inodes to other MDTs, but does not touch the file data. During migration,
the directory and its sub-files can be accessed normally, though the
warning above still applies to tools that depend on the file inode
number. Migration may fail for various reasons, such as an MDS restart
or an MDT running out of space. In those cases, some of the sub-files
may have been migrated to the new MDTs while others remain on the
original MDT; the files can still be accessed normally. Once the
underlying issue is fixed, run the same
<literal>lfs migrate -m</literal> command again to finish the migration.
However, a failed migration cannot be aborted, nor can the remaining
entries be migrated to MDTs other than those specified in the original
command.</para>
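<para>After the migration completes, the new directory layout can be
verified with <literal>lfs getdirstripe</literal>, for example:</para>
<screen>
$ lfs getdirstripe /testfs/largedir <lineannotation>show the stripe count and MDT indices of the directory</lineannotation>
</screen>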
203 <section xml:id="dbdoclet.50438211_75549">
206 <primary>I/O</primary>
207 <secondary>pools</secondary>
210 <primary>maintenance</primary>
211 <secondary>pools</secondary>
214 <primary>pools</primary>
215 </indexterm>Creating and Managing OST Pools</title>
216 <para>The OST pools feature enables users to group OSTs together to make
217 object placement more flexible. A 'pool' is the name associated with an
218 arbitrary subset of OSTs in a Lustre cluster.</para>
219 <para>OST pools follow these rules:</para>
222 <para>An OST can be a member of multiple pools.</para>
225 <para>No ordering of OSTs in a pool is defined or implied.</para>
228 <para>Stripe allocation within a pool follows the same rules as the
229 normal stripe allocator.</para>
232 <para>OST membership in a pool is flexible, and can change over
236 <para>When an OST pool is defined, it can be used to allocate files. When
237 file or directory striping is set to a pool, only OSTs in the pool are
238 candidates for striping. If a stripe_index is specified which refers to an
239 OST that is not a member of the pool, an error is returned.</para>
240 <para>OST pools are used only at file creation. If the definition of a pool
241 changes (an OST is added or removed or the pool is destroyed),
242 already-created files are not affected.</para>
245 <literal>EINVAL</literal>) results if you create a file using an empty
249 <para>If a directory has pool striping set and the pool is subsequently
250 removed, the new files created in this directory have the (non-pool)
251 default striping pattern for that directory applied and no error is
255 <title>Working with OST Pools</title>
256 <para>OST pools are defined in the configuration log on the MGS. Use the
257 lctl command to:</para>
260 <para>Create/destroy a pool</para>
263 <para>Add/remove OSTs in a pool</para>
266 <para>List pools and OSTs in a specific pool</para>
<para>The <literal>lctl</literal> command MUST be run on the MGS. In
addition, managing OST pools requires either that the MDT and MGS are on
the same node, or that a Lustre client is mounted on the MGS node if it
is separate from the MDS. This is needed to validate that the pool
commands being run are
<literal>writeconf</literal> command on the MDS erases all pool
277 information (as well as any other parameters set using
278 <literal>lctl conf_param</literal>). We recommend that the pools
280 <literal>conf_param</literal> settings) be executed using a script, so
281 they can be reproduced easily after a
282 <literal>writeconf</literal> is performed.</para>
284 <para>To create a new pool, run:</para>
287 <replaceable>fsname</replaceable>.
288 <replaceable>poolname</replaceable>
291 <para>The pool name is an ASCII string up to 15 characters.</para>
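<para>For example, to create a pool named <literal>pool1</literal> in the
<literal>testfs</literal> file system (the names used elsewhere in this
section), run:</para>
<screen>
mgs# lctl pool_new testfs.pool1
</screen>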
293 <para>To add the named OST to a pool, run:</para>
296 <replaceable>fsname</replaceable>.
297 <replaceable>poolname</replaceable>
298 <replaceable>ost_list</replaceable>
<replaceable>ost_list</replaceable> is
306 <replaceable>fsname</replaceable>-OST
307 <replaceable>index_range</replaceable></literal>
<replaceable>index_range</replaceable> is
314 <replaceable>ost_index_start</replaceable>-
315 <replaceable>ost_index_end[,index_range]</replaceable></literal> or
317 <replaceable>ost_index_start</replaceable>-
318 <replaceable>ost_index_end/step</replaceable></literal></para>
323 <replaceable>fsname</replaceable>
324 </literal> and/or ending
325 <literal>_UUID</literal> are missing, they are automatically added.</para>
326 <para>For example, to add even-numbered OSTs to
327 <literal>pool1</literal> on file system
328 <literal>testfs</literal>, run a single command (
329 <literal>pool_add</literal>) to add many OSTs to the pool at one
mgs# lctl pool_add testfs.pool1 OST[0-10/2]
<para>Each time an OST is added to a pool, a new
<literal>llog</literal> configuration record is created. For
convenience, add multiple OSTs in a single command, as shown above,
rather than one at a time.</para>
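<para>The membership of the pool can then be verified on the MGS, for
example:</para>
<screen>
mgs# lctl pool_list testfs.pool1
</screen>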
341 <para>To remove a named OST from a pool, run:</para>
343 mgs# lctl pool_remove
344 <replaceable>fsname</replaceable>.
345 <replaceable>poolname</replaceable>
346 <replaceable>ost_list</replaceable>
348 <para>To destroy a pool, run:</para>
350 mgs# lctl pool_destroy
351 <replaceable>fsname</replaceable>.
352 <replaceable>poolname</replaceable>
355 <para>All OSTs must be removed from a pool before it can be
358 <para>To list pools in the named file system, run:</para>
361 <replaceable>fsname|pathname</replaceable>
363 <para>To list OSTs in a named pool, run:</para>
366 <replaceable>fsname</replaceable>.
367 <replaceable>poolname</replaceable>
370 <title>Using the lfs Command with OST Pools</title>
371 <para>Several lfs commands can be run with OST pools. Use the
372 <literal>lfs setstripe</literal> command to associate a directory with
373 an OST pool. This causes all new regular files and directories in the
374 directory to be created in the pool. The lfs command can be used to
375 list pools in a file system and OSTs in a named pool.</para>
376 <para>To associate a directory with a pool, so all new files and
377 directories will be created in the pool, run:</para>
379 client# lfs setstripe --pool|-p pool_name
380 <replaceable>filename|dirname</replaceable>
382 <para>To set striping patterns, run:</para>
384 client# lfs setstripe [--size|-s stripe_size] [--offset|-o start_ost]
385 [--stripe-count|-c stripe_count] [--overstripe-count|-C stripe_count]
386 [--pool|-p pool_name]
388 <replaceable>dir|filename</replaceable>
391 <para>If you specify striping with an invalid pool name, because the
392 pool does not exist or the pool name was mistyped,
393 <literal>lfs setstripe</literal> returns an error. Run
394 <literal>lfs pool_list</literal> to make sure the pool exists and the
395 pool name is entered correctly.</para>
399 <literal>--pool</literal> option for lfs setstripe is compatible with
400 other modifiers. For example, you can set striping on a directory to
401 use an explicit starting index.</para>
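<para>For example, the following command (directory name and OST index
are illustrative) restricts new files to <literal>pool1</literal> while
also starting allocation at an explicit OST index that is a member of
the pool:</para>
<screen>
client# lfs setstripe --pool pool1 --offset 2 --stripe-count -1 /mnt/testfs/pooldir
</screen>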
408 <primary>pools</primary>
409 <secondary>usage tips</secondary>
410 </indexterm>Tips for Using OST Pools</title>
411 <para>Here are several suggestions for using OST pools.</para>
<para>A directory or file can be given an extended attribute (EA) that
restricts striping to a pool.</para>
418 <para>Pools can be used to group OSTs with the same technology or
419 performance (slower or faster), or that are preferred for certain
420 jobs. Examples are SATA OSTs versus SAS OSTs or remote OSTs versus
424 <para>A file created in an OST pool tracks the pool by keeping the
425 pool name in the file LOV EA.</para>
430 <section xml:id="dbdoclet.50438211_11204">
433 <primary>I/O</primary>
434 <secondary>adding an OST</secondary>
435 </indexterm>Adding an OST to a Lustre File System</title>
<para>To add an OST to an existing Lustre file system:</para>
<para>Add a new OST by running the following commands on the OSS:</para>
441 oss# mkfs.lustre --fsname=testfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
442 oss# mkdir -p /mnt/testfs/ost12
443 oss# mount -t lustre /dev/sda /mnt/testfs/ost12
447 <para>Migrate the data (possibly).</para>
448 <para>The file system is quite unbalanced when new empty OSTs are
449 added. New file creations are automatically balanced. If this is a
450 scratch file system or files are pruned at a regular interval, then no
451 further work may be needed. Files existing prior to the expansion can
452 be rebalanced with an in-place copy, which can be done with a simple
454 <para>The basic method is to copy existing files to a temporary file,
455 then move the temp file over the old one. This should not be attempted
456 with files which are currently being written to by users or
457 applications. This operation redistributes the stripes over the entire
459 <para>A very clever migration script would do the following:</para>
462 <para>Examine the current distribution of data.</para>
465 <para>Calculate how much data should move from each full OST to the
469 <para>Search for files on a given full OST (using
470 <literal>lfs getstripe</literal>).</para>
473 <para>Force the new destination OST (using
474 <literal>lfs setstripe</literal>).</para>
477 <para>Copy only enough files to address the imbalance.</para>
<para>If a Lustre file system administrator wants to explore this
approach further, per-OST space usage can be checked with
<literal>lfs df</literal> from a client, and per-OST RPC statistics are
available under
<literal>/proc/fs/lustre/osc/*/rpc_stats</literal>. A brief sketch of
one manual rebalancing step is shown below.</para>
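<para>A minimal sketch of one such manual rebalancing step for a single
file is shown below (the file name and OST index are illustrative;
<literal>lfs_migrate</literal> automates the same steps more
safely):</para>
<screen>
client# lfs getstripe /mnt/testfs/file1           <lineannotation>confirm the file is on the full OST</lineannotation>
client# lfs setstripe -o 3 /mnt/testfs/file1.tmp  <lineannotation>create an empty temporary file on a less-full OST</lineannotation>
client# cp -a /mnt/testfs/file1 /mnt/testfs/file1.tmp <lineannotation>copy the data into the temporary file</lineannotation>
client# mv /mnt/testfs/file1.tmp /mnt/testfs/file1    <lineannotation>replace the original with the rebalanced copy</lineannotation>
</screen>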
486 <section xml:id="dbdoclet.50438211_80295">
489 <primary>I/O</primary>
490 <secondary>direct</secondary>
491 </indexterm>Performing Direct I/O</title>
492 <para>The Lustre software supports the
<literal>O_DIRECT</literal> flag to <literal>open()</literal>.</para>
494 <para>Applications using the
495 <literal>read()</literal> and
496 <literal>write()</literal> calls must supply buffers aligned on a page
boundary (usually 4 KB). If the alignment is not correct, the call returns
498 <literal>-EINVAL</literal>. Direct I/O may help performance in cases where
499 the client is doing a large amount of I/O and is CPU-bound (CPU utilization
502 <title>Making File System Objects Immutable</title>
503 <para>An immutable file or directory is one that cannot be modified,
renamed, or removed. To make a file or directory immutable, run:
507 <replaceable>file</replaceable>
509 <para>To remove this flag, use
510 <literal>chattr -i</literal></para>
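<para>For example (the path is illustrative):</para>
<screen>
client# chattr +i /mnt/testfs/important_file
client# chattr -i /mnt/testfs/important_file
</screen>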
513 <section xml:id="dbdoclet.50438211_61024">
514 <title>Other I/O Options</title>
515 <para>This section describes other I/O options, including checksums, and
516 the ptlrpcd thread pool.</para>
518 <title>Lustre Checksums</title>
519 <para>To guard against network data corruption, a Lustre client can
520 perform two types of data checksums: in-memory (for data in client
521 memory) and wire (for data sent over the network). For each checksum
522 type, a 32-bit checksum of the data read or written on both the client
523 and server is computed, to ensure that the data has not been corrupted in
524 transit over the network. The
525 <literal>ldiskfs</literal> backing file system does NOT do any persistent
526 checksumming, so it does not detect corruption of data in the OST file
528 <para>The checksumming feature is enabled, by default, on individual
529 client nodes. If the client or OST detects a checksum mismatch, then an
530 error is logged in the syslog of the form:</para>
532 LustreError: BAD WRITE CHECKSUM: changed in transit before arrival at OST: \
533 from 192.168.1.1@tcp inum 8991479/2386814769 object 1127239/0 extent [10240\
536 <para>If this happens, the client will re-read or re-write the affected
537 data up to five times to get a good copy of the data over the network. If
538 it is still not possible, then an I/O error is returned to the
540 <para>To enable both types of checksums (in-memory and wire), run:</para>
542 lctl set_param llite.*.checksum_pages=1
544 <para>To disable both types of checksums (in-memory and wire),
547 lctl set_param llite.*.checksum_pages=0
549 <para>To check the status of a wire checksum, run:</para>
551 lctl get_param osc.*.checksums
554 <title>Changing Checksum Algorithms</title>
555 <para>By default, the Lustre software uses the adler32 checksum
556 algorithm, because it is robust and has a lower impact on performance
557 than crc32. The Lustre file system administrator can change the
558 checksum algorithm via
<literal>lctl set_param</literal>, depending on what is supported in
561 <para>To check which checksum algorithm is being used by the Lustre
562 software, run:</para>
564 $ lctl get_param osc.*.checksum_type
566 <para>To change the wire checksum algorithm, run:</para>
568 $ lctl set_param osc.*.checksum_type=
569 <replaceable>algorithm</replaceable>
572 <para>The in-memory checksum always uses the adler32 algorithm, if
573 available, and only falls back to crc32 if adler32 cannot be
576 <para>In the following example, the
577 <literal>lctl get_param</literal> command is used to determine that the
578 Lustre software is using the adler32 checksum algorithm. Then the
579 <literal>lctl set_param</literal> command is used to change the checksum
580 algorithm to crc32. A second
581 <literal>lctl get_param</literal> command confirms that the crc32
582 checksum algorithm is now in use.</para>
584 $ lctl get_param osc.*.checksum_type
585 osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler]
586 $ lctl set_param osc.*.checksum_type=crc32
587 osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32
588 $ lctl get_param osc.*.checksum_type
589 osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=[crc32] adler
594 <title>Ptlrpc Thread Pool</title>
595 <para>Releases prior to Lustre software release 2.2 used two portal RPC
596 daemons for each client/server pair. One daemon handled all synchronous
597 IO requests, and the second daemon handled all asynchronous (non-IO)
598 RPCs. The increasing use of large SMP nodes for Lustre servers exposed
599 some scaling issues. The lack of threads for large SMP nodes resulted in
600 cases where a single CPU would be 100% utilized and other CPUs would be
relatively idle. This is especially noticeable when a single client
602 traverses a large directory.</para>
603 <para>Lustre software release 2.2.x implements a ptlrpc thread pool, so
604 that multiple threads can be created to serve asynchronous RPC requests.
605 The number of threads spawned is controlled at module load time using
606 module options. By default one thread is spawned per CPU, with a minimum
607 of 2 threads spawned irrespective of module options.</para>
608 <para>One of the issues with thread operations is the cost of moving a
609 thread context from one CPU to another with the resulting loss of CPU
610 cache warmth. To reduce this cost, ptlrpc threads can be bound to a CPU.
611 However, if the CPUs are busy, a bound thread may not be able to respond
612 quickly, as the bound CPU may be busy with other tasks and the thread
613 must wait to schedule.</para>
614 <para>Because of these considerations, the pool of ptlrpc threads can be
615 a mixture of bound and unbound threads. The system operator can balance
616 the thread mixture based on system size and workload.</para>
618 <title>ptlrpcd parameters</title>
619 <para>These parameters should be set in
620 <literal>/etc/modprobe.conf</literal> or in the
<literal>/etc/modprobe.d</literal> directory, as options for the ptlrpc
624 options ptlrpcd max_ptlrpcds=XXX
626 <para>Sets the number of ptlrpcd threads created at module load time.
627 The default if not specified is one thread per CPU, including
hyper-threaded CPUs. The lower bound is 2 (the old ptlrpcd behaviour)
630 options ptlrpcd ptlrpcd_bind_policy=[1-4]
632 <para>Controls the binding of threads to CPUs. There are four policy
637 <literal role="bold">
638 PDB_POLICY_NONE</literal>(ptlrpcd_bind_policy=1) All threads are
643 <literal role="bold">
644 PDB_POLICY_FULL</literal>(ptlrpcd_bind_policy=2) All threads
645 attempt to bind to a CPU.</para>
649 <literal role="bold">
650 PDB_POLICY_PAIR</literal>(ptlrpcd_bind_policy=3) This is the
651 default policy. Threads are allocated as a bound/unbound pair. Each
652 thread (bound or free) has a partner thread. The partnering is used
653 by the ptlrpcd load policy, which determines how threads are
654 allocated to CPUs.</para>
658 <literal role="bold">
659 PDB_POLICY_NEIGHBOR</literal>(ptlrpcd_bind_policy=4) Threads are
660 allocated as a bound/unbound pair. Each thread (bound or free) has
661 two partner threads.</para>
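<para>As an illustrative example only (the values are not
recommendations), both parameters could be set together in a modprobe
configuration file:</para>
<screen>
options ptlrpcd max_ptlrpcds=16 ptlrpcd_bind_policy=3
</screen>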