<?xml version='1.0' encoding='utf-8'?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
xml:id="managingfilesystemio">
<title xml:id="managingfilesystemio.title">Managing the File System and
I/O</title>
<section xml:id="dbdoclet.50438211_17536">
<primary>I/O</primary>
<primary>I/O</primary>
<secondary>full OSTs</secondary>
</indexterm>Handling Full OSTs</title>
<para>Sometimes a Lustre file system becomes unbalanced, often due to
incorrectly specified stripe settings, or when very large files are created
that are not striped over all of the OSTs. If an OST is full and an attempt
is made to write more data to the file system, an error occurs. The
procedures below describe how to handle a full OST.</para>
<para>The MDS normally handles space balancing automatically at file
creation time, so this procedure is not usually needed, but it may be
desirable in certain circumstances (e.g. when creating very large files
that would consume more than the total free space of the full OSTs).</para>
<primary>I/O</primary>
<secondary>OST space usage</secondary>
</indexterm>Checking OST Space Usage</title>
<para>The example below shows an unbalanced file system:</para>
UUID                   bytes   Used     Available \
testfs-MDT0000_UUID    4.4G    214.5M   3.9G      \
testfs-OST0000_UUID    2.0G    751.3M   1.1G      \
37% /mnt/testfs[OST:0]
testfs-OST0001_UUID    2.0G    755.3M   1.1G      \
37% /mnt/testfs[OST:1]
testfs-OST0002_UUID    2.0G    1.7G     155.1M    \
86% /mnt/testfs[OST:2] ****
testfs-OST0003_UUID    2.0G    751.3M   1.1G      \
37% /mnt/testfs[OST:3]
testfs-OST0004_UUID    2.0G    747.3M   1.1G      \
37% /mnt/testfs[OST:4]
testfs-OST0005_UUID    2.0G    743.3M   1.1G      \
36% /mnt/testfs[OST:5]
filesystem summary:    11.8G   5.4G     5.8G      \
<para>In this case, OST0002 is almost full, and when an attempt is made to
write additional data to the file system (even with uniform
striping over all the OSTs), the write command fails as follows:</para>
client# lfs setstripe -s 4M -o 0 -c -1 /mnt/testfs
client# dd if=/dev/zero of=/mnt/testfs/test_3 bs=10M count=100
dd: writing '/mnt/testfs/test_3': No space left on device
1017192448 bytes (1.0 GB) copied, 23.2411 seconds, 43.8 MB/s
<primary>I/O</primary>
<secondary>taking OST offline</secondary>
</indexterm>Taking a Full OST Offline</title>
<para>To avoid running out of space in the file system, if the OST usage
is imbalanced and one or more OSTs are close to being full while others
have ample free space, the full OSTs may optionally be deactivated at the
MDS to prevent the MDS from allocating new objects on them.</para>
<para>Log into the MDS server:</para>
client# ssh root@192.168.0.10
root@192.168.0.10's password:
Last login: Wed Nov 26 13:35:12 2008 from 192.168.0.6
<para>Use the <literal>lctl dl</literal> command to show the status of all
file system components:</para>
1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
2 UP mdt MDS MDS_uuid 3
3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5
8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
<para>Use the <literal>lctl deactivate</literal> command to take the full
OST offline:</para>
mds# lctl --device 7 deactivate
<para>Display the status of the file system components:</para>
1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
2 UP mdt MDS MDS_uuid 3
3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
7 IN osc testfs-OST0002-osc testfs-mdtlov_UUID 5
8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
<para>The device list shows that OST0002 is now inactive. When new files
are created in the file system, they will only use the remaining active
OSTs. Either manual space rebalancing can be done by migrating data to
other OSTs, as shown in the next section, or normal file deletion and
creation can be allowed to passively rebalance the space usage.</para>
<primary>I/O</primary>
<secondary>migrating data</secondary>
<primary>migrating metadata</primary>
<primary>maintenance</primary>
<secondary>full OSTs</secondary>
</indexterm>Migrating Data within a File System</title>
<para condition='l28'>Lustre software version 2.8 includes a feature
to migrate metadata (directories and inodes therein) between MDTs.
This migration can only be performed on whole directories. For example,
to migrate the contents of the <literal>/testfs/testremote</literal>
directory from the MDT it currently resides on to MDT0000, the
sequence of commands is as follows:</para>
$ lfs getdirstripe -M ./testremote <lineannotation>which MDT is dir on?</lineannotation>
$ for i in $(seq 3); do touch ./testremote/${i}.txt; done <lineannotation>create test files</lineannotation>
$ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done <lineannotation>check files are on MDT 1</lineannotation>
$ lfs migrate -m 0 ./testremote <lineannotation>migrate testremote to MDT 0</lineannotation>
$ lfs getdirstripe -M ./testremote <lineannotation>which MDT is dir on now?</lineannotation>
$ for i in $(seq 3); do lfs getstripe -M ./testremote/${i}.txt; done <lineannotation>check files are on MDT 0 too</lineannotation>
<para>For more information, see <literal>man lfs-migrate</literal>.</para>
<warning><para>Currently, only whole directories can be migrated
between MDTs. During migration each file receives a new identifier
(FID). As a consequence, the file receives a new inode number. Some
system tools (for example, backup and archiving tools) may consider
the migrated files to be new, even though the contents are unchanged.
<para>If there is a need to migrate the file <emphasis>data</emphasis>
from the current OST(s) to new OST(s), the data must be migrated (copied)
to the new location. The simplest way to do this is to use the
<literal>lfs_migrate</literal> command; see
<xref linkend="dbdoclet.lfs_migrate" />.</para>
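<para>As a minimal sketch (assuming the <literal>lfs_migrate</literal>
script shipped with the Lustre utilities is installed on the client, and
that the files are not in active use), the files under a directory can be
rewritten onto the currently active OSTs with:</para>
<screen>client# lfs_migrate -y /testfs/testremote</screen>
<para>The <literal>-y</literal> option answers 'yes' to the confirmation
prompt; consult <literal>man lfs_migrate</literal> for the options
supported by your release.</para>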
<primary>I/O</primary>
<secondary>bringing OST online</secondary>
<primary>maintenance</primary>
<secondary>bringing OST online</secondary>
</indexterm>Returning an Inactive OST Back Online</title>
<para>Once the deactivated OST(s) are no longer severely imbalanced, due
to either active or passive data redistribution, they should be
reactivated so they will again have new files allocated on them.</para>
mds# lctl --device 7 activate
1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5
2 UP mdt MDS MDS_uuid 3
3 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
4 UP mds testfs-MDT0000 testfs-MDT0000_UUID 5
5 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
6 UP osc testfs-OST0001-osc testfs-mdtlov_UUID 5
7 UP osc testfs-OST0002-osc testfs-mdtlov_UUID 5
8 UP osc testfs-OST0003-osc testfs-mdtlov_UUID 5
9 UP osc testfs-OST0004-osc testfs-mdtlov_UUID 5
10 UP osc testfs-OST0005-osc testfs-mdtlov_UUID 5
<section xml:id="dbdoclet.50438211_75549">
<primary>I/O</primary>
<secondary>pools</secondary>
<primary>maintenance</primary>
<secondary>pools</secondary>
<primary>pools</primary>
</indexterm>Creating and Managing OST Pools</title>
<para>The OST pools feature enables users to group OSTs together to make
object placement more flexible. A 'pool' is the name associated with an
arbitrary subset of OSTs in a Lustre cluster.</para>
<para>OST pools follow these rules:</para>
<para>An OST can be a member of multiple pools.</para>
<para>No ordering of OSTs in a pool is defined or implied.</para>
<para>Stripe allocation within a pool follows the same rules as the
normal stripe allocator.</para>
<para>OST membership in a pool is flexible, and can change over
time.</para>
<para>When an OST pool is defined, it can be used to allocate files. When
file or directory striping is set to a pool, only OSTs in the pool are
candidates for striping. If a stripe_index is specified that refers to an
OST that is not a member of the pool, an error is returned.</para>
<para>OST pools are used only at file creation. If the definition of a pool
changes (an OST is added or removed, or the pool is destroyed),
already-created files are not affected.</para>
<para>An error (<literal>EINVAL</literal>) results if you create a file
using an empty pool.</para>
<para>If a directory has pool striping set and the pool is subsequently
removed, the new files created in this directory have the (non-pool)
default striping pattern for that directory applied and no error is
returned.</para>
<title>Working with OST Pools</title>
<para>OST pools are defined in the configuration log on the MGS. Use the
<literal>lctl</literal> command to:</para>
<para>Create/destroy a pool</para>
<para>Add/remove OSTs in a pool</para>
<para>List pools and OSTs in a specific pool</para>
<para>The <literal>lctl</literal> command MUST be run on the MGS. Another
requirement for managing OST pools is to either have the MDT and MGS on
the same node or have a Lustre client mounted on the MGS node, if it is
separate from the MDS. This is needed to validate that the pool commands
being run are correct.</para>
<para>Running the <literal>writeconf</literal> command on the MDS erases
all pools information (as well as any other parameters set using
<literal>lctl conf_param</literal>). We recommend that the pool
definitions (and <literal>conf_param</literal> settings) be executed
using a script, so they can be reproduced easily after a
<literal>writeconf</literal> is performed.</para>
<para>To create a new pool, run:</para>
mgs# lctl pool_new <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable>
<para>The pool name is an ASCII string of up to 15 characters.</para>
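<para>For example, to create a pool named <literal>pool1</literal> in the
<literal>testfs</literal> file system, run:</para>
<screen>mgs# lctl pool_new testfs.pool1</screen>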
<para>To add the named OST to a pool, run:</para>
mgs# lctl pool_add <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable> <replaceable>ost_list</replaceable>
<para>Where:</para>
<para><literal><replaceable>ost_list</replaceable></literal> is
<literal><replaceable>fsname</replaceable>-OST<replaceable>index_range</replaceable></literal></para>
<para><literal><replaceable>index_range</replaceable></literal> is
<literal><replaceable>ost_index_start</replaceable>-<replaceable>ost_index_end[,index_range]</replaceable></literal> or
<literal><replaceable>ost_index_start</replaceable>-<replaceable>ost_index_end/step</replaceable></literal></para>
<para>If the leading
<literal><replaceable>fsname</replaceable></literal> and/or ending
<literal>_UUID</literal> are missing, they are automatically added.</para>
<para>For example, to add even-numbered OSTs to
<literal>pool1</literal> on file system
<literal>testfs</literal>, run a single
<literal>pool_add</literal> command to add many OSTs to the pool at one
time:</para>
mgs# lctl pool_add testfs.pool1 OST[0-10/2]
<para>Each time an OST is added to a pool, a new
<literal>llog</literal> configuration record is created. For
convenience, you can run a single command as shown above, rather than
adding each OST individually.</para>
<para>To remove a named OST from a pool, run:</para>
mgs# lctl pool_remove <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable> <replaceable>ost_list</replaceable>
<para>To destroy a pool, run:</para>
mgs# lctl pool_destroy <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable>
<para>All OSTs must be removed from a pool before it can be
destroyed.</para>
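<para>For example, continuing the example above (the OST list shown is
the one added earlier and is illustrative), empty and then destroy
<literal>pool1</literal> with:</para>
<screen>mgs# lctl pool_remove testfs.pool1 OST[0-10/2]
mgs# lctl pool_destroy testfs.pool1</screen>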
<para>To list pools in the named file system, run:</para>
lctl pool_list <replaceable>fsname|pathname</replaceable>
<para>To list OSTs in a named pool, run:</para>
lctl pool_list <replaceable>fsname</replaceable>.<replaceable>poolname</replaceable>
<title>Using the lfs Command with OST Pools</title>
<para>Several <literal>lfs</literal> commands can be run with OST pools.
Use the <literal>lfs setstripe</literal> command to associate a directory
with an OST pool. This causes all new regular files and directories in
the directory to be created in the pool. The <literal>lfs</literal>
command can be used to list pools in a file system and OSTs in a named
pool.</para>
<para>To associate a directory with a pool, so all new files and
directories will be created in the pool, run:</para>
client# lfs setstripe --pool|-p <replaceable>pool_name</replaceable> <replaceable>filename|dirname</replaceable>
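<para>For example, to cause new files under
<literal>/mnt/testfs/dir1</literal> to be striped over the OSTs in
<literal>pool1</literal> (the directory path is illustrative), run:</para>
<screen>client# lfs setstripe -p pool1 /mnt/testfs/dir1</screen>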
<para>To set striping patterns, run:</para>
client# lfs setstripe [--size|-s <replaceable>stripe_size</replaceable>] [--offset|-o <replaceable>start_ost</replaceable>]
[--count|-c <replaceable>stripe_count</replaceable>] [--pool|-p <replaceable>pool_name</replaceable>]
<replaceable>dir|filename</replaceable>
<para>If you specify striping with an invalid pool name, because the
pool does not exist or the pool name was mistyped,
<literal>lfs setstripe</literal> returns an error. Run
<literal>lfs pool_list</literal> to make sure the pool exists and the
pool name is entered correctly.</para>
<para>The <literal>--pool</literal> option for
<literal>lfs setstripe</literal> is compatible with other modifiers. For
example, you can set striping on a directory to use an explicit starting
index.</para>
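<para>For example, the following sketch (directory path and indexes are
illustrative) stripes new files in <literal>dir2</literal> over two OSTs
from <literal>pool1</literal>, starting at OST index 2, which must be a
member of the pool:</para>
<screen>client# lfs setstripe -p pool1 -c 2 -o 2 /mnt/testfs/dir2</screen>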
<primary>pools</primary>
<secondary>usage tips</secondary>
</indexterm>Tips for Using OST Pools</title>
<para>Here are several suggestions for using OST pools.</para>
<para>A directory or file can be given an extended attribute (EA)
that restricts striping to a pool.</para>
<para>Pools can be used to group OSTs with the same technology or
performance (slower or faster), or that are preferred for certain
jobs. Examples are SATA OSTs versus SAS OSTs, or remote OSTs versus
local OSTs.</para>
<para>A file created in an OST pool tracks the pool by keeping the
pool name in the file LOV EA.</para>
<section xml:id="dbdoclet.50438211_11204">
<primary>I/O</primary>
<secondary>adding an OST</secondary>
</indexterm>Adding an OST to a Lustre File System</title>
<para>To add an OST to an existing Lustre file system:</para>
<para>Add a new OST by running the following commands on the OSS:</para>
oss# mkfs.lustre --fsname=testfs --mgsnode=mds16@tcp0 --ost --index=12 /dev/sda
oss# mkdir -p /mnt/testfs/ost12
oss# mount -t lustre /dev/sda /mnt/testfs/ost12
<para>Migrate the data (possibly).</para>
<para>The file system is quite unbalanced when new empty OSTs are
added. New file creations are automatically balanced. If this is a
scratch file system or files are pruned at a regular interval, then no
further work may be needed. Files existing prior to the expansion can
be rebalanced with an in-place copy, which can be done with a simple
script.</para>
<para>The basic method is to copy existing files to a temporary file,
then move the temp file over the old one. This should not be attempted
with files which are currently being written to by users or
applications. This operation redistributes the stripes over the entire
set of OSTs.</para>
<para>A very clever migration script would do the following (a minimal
sketch appears after this list):</para>
<para>Examine the current distribution of data.</para>
<para>Calculate how much data should move from each full OST to the
empty ones.</para>
<para>Search for files on a given full OST (using
<literal>lfs getstripe</literal>).</para>
<para>Force the new destination OST (using
<literal>lfs setstripe</literal>).</para>
<para>Copy only enough files to address the imbalance.</para>
<para>If a Lustre file system administrator wants to explore this approach
further, per-OST disk-usage statistics can be found under
<literal>/proc/fs/lustre/osc/*/rpc_stats</literal>.</para>
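<para>The following is a minimal sketch of the search-and-copy steps (the
mount point, OST index, and file count are illustrative assumptions, and
it does no accounting of how much data moves); it must not be run against
files that are in active use:</para>
<screen>#!/bin/bash
# Sketch: rewrite some files that have objects on a nearly full OST
# (index 2) so their stripes land on the currently active OSTs.
MNT=/mnt/testfs
FULL_OST_UUID=testfs-OST0002_UUID

# Find regular files with at least one object on the full OST,
# then rewrite a limited batch of them with an in-place copy.
lfs find "$MNT" --obd "$FULL_OST_UUID" -type f | head -n 100 |
while read -r f; do
    tmp=$(mktemp "$f.migrate.XXXXXX")   # temp file in the same directory
    cp -a "$f" "$tmp" &amp;&amp; mv "$tmp" "$f" # copy, then move over the old file
done</screen>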
<section xml:id="dbdoclet.50438211_80295">
<primary>I/O</primary>
<secondary>direct</secondary>
</indexterm>Performing Direct I/O</title>
<para>The Lustre software supports the
<literal>O_DIRECT</literal> flag to <literal>open()</literal>.</para>
<para>Applications using the
<literal>read()</literal> and
<literal>write()</literal> calls must supply buffers aligned on a page
boundary (usually 4 KB). If the alignment is not correct, the call returns
<literal>-EINVAL</literal>. Direct I/O may help performance in cases where
the client is doing a large amount of I/O and is CPU-bound (CPU utilization
is at 100%).</para>
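<para>For example (a quick sketch using <literal>dd</literal>, which opens
the output file with <literal>O_DIRECT</literal> when given
<literal>oflag=direct</literal>; the 1 MB block size satisfies the page
alignment requirement described above):</para>
<screen>client# dd if=/dev/zero of=/mnt/testfs/directfile bs=1M count=100 oflag=direct</screen>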
<title>Making File System Objects Immutable</title>
<para>An immutable file or directory is one that cannot be modified,
renamed or removed. To set this flag, run:</para>
chattr +i <replaceable>file</replaceable>
<para>To remove this flag, use
<literal>chattr -i</literal>.</para>
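<para>For example (the path is illustrative; the exact error text may
vary), an attempt to remove an immutable file fails with "Operation not
permitted":</para>
<screen>client# chattr +i /mnt/testfs/config.dat
client# rm /mnt/testfs/config.dat
rm: cannot remove '/mnt/testfs/config.dat': Operation not permitted
client# chattr -i /mnt/testfs/config.dat</screen>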
<section xml:id="dbdoclet.50438211_61024">
<title>Other I/O Options</title>
<para>This section describes other I/O options, including checksums and
the ptlrpcd thread pool.</para>
<title>Lustre Checksums</title>
<para>To guard against network data corruption, a Lustre client can
perform two types of data checksums: in-memory (for data in client
memory) and wire (for data sent over the network). For each checksum
type, a 32-bit checksum of the data read or written on both the client
and server is computed, to ensure that the data has not been corrupted in
transit over the network. The
<literal>ldiskfs</literal> backing file system does NOT do any persistent
checksumming, so it does not detect corruption of data in the OST file
system.</para>
<para>The checksumming feature is enabled, by default, on individual
client nodes. If the client or OST detects a checksum mismatch, then an
error is logged in the syslog of the form:</para>
LustreError: BAD WRITE CHECKSUM: changed in transit before arrival at OST: \
from 192.168.1.1@tcp inum 8991479/2386814769 object 1127239/0 extent [10240\
<para>If this happens, the client will re-read or re-write the affected
data up to five times to get a good copy of the data over the network. If
it is still not possible, then an I/O error is returned to the
application.</para>
<para>To enable both types of checksums (in-memory and wire), run:</para>
lctl set_param llite.*.checksum_pages=1
<para>To disable both types of checksums (in-memory and wire),
run:</para>
lctl set_param llite.*.checksum_pages=0
<para>To check the status of a wire checksum, run:</para>
lctl get_param osc.*.checksums
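<para>For example (the device name shown is illustrative), a value of
<literal>1</literal> in the output indicates that wire checksums are
enabled for that OSC device:</para>
<screen>client# lctl get_param osc.*.checksums
osc.testfs-OST0000-osc-ffff81012b2c48e0.checksums=1</screen>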
<title>Changing Checksum Algorithms</title>
<para>By default, the Lustre software uses the adler32 checksum
algorithm, because it is robust and has a lower impact on performance
than crc32. The Lustre file system administrator can change the
checksum algorithm via
<literal>lctl set_param</literal>, depending on what is supported in
the kernel.</para>
<para>To check which checksum algorithm is being used by the Lustre
software, run:</para>
$ lctl get_param osc.*.checksum_type
<para>To change the wire checksum algorithm, run:</para>
$ lctl set_param osc.*.checksum_type=<replaceable>algorithm</replaceable>
<para>The in-memory checksum always uses the adler32 algorithm, if
available, and only falls back to crc32 if adler32 cannot be
used.</para>
<para>In the following example, the
<literal>lctl get_param</literal> command is used to determine that the
Lustre software is using the adler32 checksum algorithm. Then the
<literal>lctl set_param</literal> command is used to change the checksum
algorithm to crc32. A second
<literal>lctl get_param</literal> command confirms that the crc32
checksum algorithm is now in use.</para>
$ lctl get_param osc.*.checksum_type
osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32 [adler]
$ lctl set_param osc.*.checksum_type=crc32
osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=crc32
$ lctl get_param osc.*.checksum_type
osc.testfs-OST0000-osc-ffff81012b2c48e0.checksum_type=[crc32] adler
<title>Ptlrpc Thread Pool</title>
<para>Releases prior to Lustre software release 2.2 used two portal RPC
daemons for each client/server pair. One daemon handled all synchronous
I/O requests, and the second daemon handled all asynchronous (non-I/O)
RPCs. The increasing use of large SMP nodes for Lustre servers exposed
some scaling issues. The lack of threads for large SMP nodes resulted in
cases where a single CPU would be 100% utilized and other CPUs would be
relatively idle. This is especially noticeable when a single client
traverses a large directory.</para>
<para>Lustre software release 2.2.x implements a ptlrpc thread pool, so
that multiple threads can be created to serve asynchronous RPC requests.
The number of threads spawned is controlled at module load time using
module options. By default one thread is spawned per CPU, with a minimum
of 2 threads spawned irrespective of module options.</para>
<para>One of the issues with thread operations is the cost of moving a
thread context from one CPU to another with the resulting loss of CPU
cache warmth. To reduce this cost, ptlrpc threads can be bound to a CPU.
However, if the CPUs are busy, a bound thread may not be able to respond
quickly, as the bound CPU may be busy with other tasks and the thread
must wait to schedule.</para>
<para>Because of these considerations, the pool of ptlrpc threads can be
a mixture of bound and unbound threads. The system operator can balance
the thread mixture based on system size and workload.</para>
<title>ptlrpcd parameters</title>
<para>These parameters should be set in
<literal>/etc/modprobe.conf</literal> or in the
<literal>/etc/modprobe.d</literal> directory, as options for the ptlrpc
module:</para>
options ptlrpcd max_ptlrpcds=XXX
<para>Sets the number of ptlrpcd threads created at module load time.
The default if not specified is one thread per CPU, including
hyper-threaded CPUs. The lower bound is 2 (old ptlrpcd behavior).</para>
options ptlrpcd ptlrpcd_bind_policy=[1-4]
<para>Controls the binding of threads to CPUs. There are four policy
options:</para>
<literal role="bold">
PDB_POLICY_NONE</literal> (ptlrpcd_bind_policy=1) All threads are
unbound.</para>
<literal role="bold">
PDB_POLICY_FULL</literal> (ptlrpcd_bind_policy=2) All threads
attempt to bind to a CPU.</para>
<literal role="bold">
PDB_POLICY_PAIR</literal> (ptlrpcd_bind_policy=3) This is the
default policy. Threads are allocated as a bound/unbound pair. Each
thread (bound or free) has a partner thread. The partnering is used
by the ptlrpcd load policy, which determines how threads are
allocated to CPUs.</para>
<literal role="bold">
PDB_POLICY_NEIGHBOR</literal> (ptlrpcd_bind_policy=4) Threads are
allocated as a bound/unbound pair. Each thread (bound or free) has
two partner threads.</para>
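<para>For example (the file name and values are illustrative), to spawn
16 ptlrpcd threads allocated as bound/unbound pairs, add the following
line to a file under <literal>/etc/modprobe.d</literal>:</para>
<screen>options ptlrpcd max_ptlrpcds=16 ptlrpcd_bind_policy=3</screen>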