1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="managingstripingfreespace">
2 <title xml:id="managingstripingfreespace.title">Managing File Layout (Striping) and Free
4 <para>This chapter describes file layout (striping) and I/O options, and includes the following
8 <para><xref linkend="dbdoclet.50438209_79324"/></para>
11 <para><xref linkend="dbdoclet.50438209_48033"/></para>
14 <para><xref linkend="dbdoclet.50438209_78664"/></para>
17 <para><xref linkend="dbdoclet.50438209_44776"/></para>
20 <para><xref linkend="dbdoclet.50438209_10424"/></para>
23 <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping"/></para>
26 <section xml:id="dbdoclet.50438209_79324">
29 <primary>space</primary>
32 <primary>striping</primary>
33 <secondary>how it works</secondary>
36 <primary>striping</primary>
40 <primary>space</primary>
41 <secondary>striping</secondary>
42 </indexterm>How Lustre File System Striping Works</title>
43 <para>In a Lustre file system, the MDS allocates objects to OSTs using either a round-robin
44 algorithm or a weighted algorithm. When the amount of free space is well balanced (i.e., by
45 default, when the free space across OSTs differs by less than 17%), the round-robin algorithm
46 is used to select the next OST to which a stripe is to be written. Periodically, the MDS
adjusts the striping layout to eliminate some degenerate cases in which applications that
48 create very regular file layouts (striping patterns) preferentially use a particular OST in
50 <para> Normally the usage of OSTs is well balanced. However, if users create a small number of
51 exceptionally large files or incorrectly specify striping parameters, imbalanced OST usage may
52 result. When the free space across OSTs differs by more than a specific amount (17% by
53 default), the MDS then uses weighted random allocations with a preference for allocating
54 objects on OSTs with more free space. (This can reduce I/O performance until space usage is
55 rebalanced again.) For a more detailed description of how striping is allocated, see <xref
56 linkend="dbdoclet.50438209_10424"/>.</para>
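<para>The choice between the two allocation modes can be pictured with a small
toy model. This is only an illustrative sketch, not the MDS implementation: it
assumes the imbalance is measured as (max - min) / max of the per-OST free
space, and the names <literal>choose_allocator</literal> and
<literal>weighted_pick</literal> are hypothetical.</para>
<programlisting language="python">
import random

def choose_allocator(free_space, threshold=0.17):
    """Return "round-robin" when free space is balanced, else "weighted".

    Assumption: imbalance is (max - min) / max of the free space values.
    """
    largest = max(free_space)
    if largest == 0:
        return "round-robin"
    spread = (largest - min(free_space)) / largest
    return "weighted" if spread >= threshold else "round-robin"

def weighted_pick(free_space, rng):
    """Pick one OST index at random, favoring OSTs with more free space."""
    return rng.choices(range(len(free_space)), weights=free_space, k=1)[0]

free_gb = [900, 880, 400, 910]            # hypothetical free space per OST
print(choose_allocator(free_gb))          # weighted
print(choose_allocator([900, 880, 870]))  # round-robin
print(weighted_pick(free_gb, random.Random(0)))
</programlisting>
<para>In this model the third OST, with far less free space than the others,
pushes the spread past the 17% default and switches allocation to the weighted
mode.</para>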
57 <para>Files can only be striped over a finite number of OSTs, based on the
58 maximum size of the attributes that can be stored on the MDT. If the MDT
59 is ldiskfs-based without the <literal>ea_inode</literal> feature, a file
60 can be striped across at most 160 OSTs. With a ZFS-based MDT, or if the
61 <literal>ea_inode</literal> feature is enabled for an ldiskfs-based MDT,
62 a file can be striped across up to 2000 OSTs. For more information, see
63 <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping"/>.
66 <section xml:id="dbdoclet.50438209_48033">
68 <primary>file layout</primary>
69 <secondary>See striping</secondary>
70 </indexterm><indexterm>
71 <primary>striping</primary>
72 <secondary>considerations</secondary>
75 <primary>space</primary>
76 <secondary>considerations</secondary>
77 </indexterm> Lustre File Layout (Striping) Considerations</title>
<para>Whether you should set up file striping and what parameter values you select depend on
79 your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and
81 <para>Some reasons for using striping include:</para>
84 <para><emphasis role="bold">Providing high-bandwidth access.</emphasis> Many applications
85 require high-bandwidth access to a single file, which may be more bandwidth than can be
86 provided by a single OSS. Examples are a scientific application that writes to a single
87 file from hundreds of nodes, or a binary executable that is loaded by many nodes when an
88 application starts.</para>
<para>In cases like these, a file can be striped over as many OSSs as it takes to achieve
the required peak aggregate bandwidth for that file. Striping across a larger number of
OSSs should only be used when the file is very large or is accessed by many nodes
at a time. Currently, Lustre files can be striped across up to 2000 OSTs, the maximum
stripe count for an <literal>ldiskfs</literal>-based MDT with the
<literal>ea_inode</literal> feature enabled.</para>
96 <para><emphasis role="bold">Improving performance when OSS bandwidth is exceeded.</emphasis>
97 Striping across many OSSs can improve performance if the aggregate client bandwidth
98 exceeds the server bandwidth and the application reads and writes data fast enough to take
99 advantage of the additional OSS bandwidth. The largest useful stripe count is bounded by
100 the I/O rate of the clients/jobs divided by the performance per OSS.</para>
103 <para><emphasis role="bold">Providing space for very large files.</emphasis> Striping is
104 useful when a single OST does not have enough free space to hold the entire file.</para>
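<para>The bound on the useful stripe count mentioned above is simple arithmetic.
The following sketch is a rough estimate only: the function name and the example
rates are hypothetical, rounding up is an assumption, and the cap of 2000 is the
stripe limit quoted in this chapter.</para>
<programlisting language="python">
import math

def max_useful_stripe_count(client_io_rate, per_oss_rate, max_stripes=2000):
    """Estimate the largest stripe count worth using for one job.

    Both rates must use the same unit (for example GB/s). The result is
    rounded up (an assumption) and capped at max_stripes.
    """
    if not (client_io_rate > 0 and per_oss_rate > 0):
        raise ValueError("rates must be positive")
    return min(max_stripes, max(1, math.ceil(client_io_rate / per_oss_rate)))

# A job that can drive 40 GB/s against OSSs that each deliver 2.5 GB/s.
print(max_useful_stripe_count(40.0, 2.5))  # 16
</programlisting>
<para>Striping the file more widely than this estimate adds overhead without
adding bandwidth that the clients can use.</para>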
107 <para>Some reasons to minimize or avoid striping:</para>
110 <para><emphasis role="bold">Increased overhead.</emphasis> Striping results in more locks
111 and extra network operations during common operations such as <literal>stat</literal> and
112 <literal>unlink</literal>. Even when these operations are performed in parallel, one
113 network operation takes less time than 100 operations.</para>
114 <para>Increased overhead also results from server contention. Consider a cluster with 100
115 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load
116 is distributed evenly, there is no contention and the disks on each server can manage
117 sequential I/O. If each file has 100 objects, then the clients all compete with one
118 another for the attention of the servers, and the disks on each node seek in 100 different
119 directions resulting in needless contention.</para>
122 <para><emphasis role="bold">Increased risk.</emphasis> When files are striped across all
123 servers and one of the servers breaks down, a small part of each striped file is lost. By
124 comparison, if each file has exactly one stripe, fewer files are lost, but they are lost
in their entirety. Many users would prefer to lose some of their files entirely rather than all
126 of their files partially.</para>
130 <title><indexterm><primary>striping</primary><secondary>size</secondary></indexterm>
131 Choosing a Stripe Size</title>
132 <para>Choosing a stripe size is a balancing act, but reasonable defaults are described below.
133 The stripe size has no effect on a single-stripe file.</para>
136 <para><emphasis role="bold">The stripe size must be a multiple of the page
137 size.</emphasis> Lustre software tools enforce a multiple of 64 KB (the maximum page
138 size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not
139 accidentally create files that might cause problems for ia64 clients.</para>
142 <para><emphasis role="bold">The smallest recommended stripe size is 512 KB.</emphasis>
143 Although you can create files with a stripe size of 64 KB, the smallest practical stripe
size is 512 KB because the Lustre file system sends 1 MB chunks over the network.
145 Choosing a smaller stripe size may result in inefficient I/O to the disks and reduced
149 <para><emphasis role="bold">A good stripe size for sequential I/O using high-speed
150 networks is between 1 MB and 4 MB.</emphasis> In most situations, stripe sizes larger
151 than 4 MB may result in longer lock hold times and contention during shared file
155 <para><emphasis role="bold">The maximum stripe size is 4 GB.</emphasis> Using a large
156 stripe size can improve performance when accessing very large files. It allows each
157 client to have exclusive access to its own part of a file. However, a large stripe size
158 can be counterproductive in cases where it does not match your I/O pattern.</para>
161 <para><emphasis role="bold">Choose a stripe pattern that takes into account the write
162 patterns of your application.</emphasis> Writes that cross an object boundary are
163 slightly less efficient than writes that go entirely to one server. If the file is
164 written in a consistent and aligned way, make the stripe size a multiple of the
165 <literal>write()</literal> size.</para>
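<para>The guidelines in this list can be collected into a small checker. This is
a sketch under the assumptions stated in the list (64 KB granularity, 512 KB
recommended minimum, 4 MB soft upper bound for shared files, 4 GB maximum). The
function <literal>check_stripe_size</literal> is hypothetical and is not part of
the Lustre tools.</para>
<programlisting language="python">
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def check_stripe_size(size, write_size=None):
    """Return a list of warnings for a stripe size in bytes; raise if invalid."""
    if not (size > 0) or size % (64 * KIB) != 0:
        raise ValueError("stripe size must be a positive multiple of 64 KB")
    if size > 4 * GIB:
        raise ValueError("stripe size exceeds the 4 GB maximum")
    warnings = []
    if 512 * KIB > size:
        warnings.append("below the recommended minimum of 512 KB")
    if size > 4 * MIB:
        warnings.append("larger than 4 MB: possible lock contention on shared files")
    if write_size and size % write_size != 0:
        warnings.append("not a multiple of the application write() size")
    return warnings

print(check_stripe_size(1 * MIB))                      # []
print(check_stripe_size(64 * KIB))                     # one warning
print(check_stripe_size(4 * MIB, write_size=1 * MIB))  # []
</programlisting>
<para>Only the 64 KB granularity and the maximum size are treated as hard
errors here. The other rules are recommendations, so they are reported as
warnings.</para>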
170 <section xml:id="dbdoclet.50438209_78664">
172 <primary>striping</primary>
173 <secondary>configuration</secondary>
174 </indexterm>Setting the File Layout/Striping Configuration (<literal>lfs
175 setstripe</literal>)</title>
176 <para>Use the <literal>lfs setstripe</literal> command to create new files with a specific file layout (stripe pattern) configuration.</para>
177 <screen>lfs setstripe [--size|-s stripe_size] [--count|-c stripe_count] \
178 [--index|-i start_ost] [--pool|-p pool_name] <replaceable>filename|dirname</replaceable> </screen>
179 <para><emphasis role="bold">
180 <literal>stripe_size</literal>
183 <para>The <literal>stripe_size</literal> indicates how much data to write to one OST before
184 moving to the next OST. The default <literal>stripe_size</literal> is 1 MB. Passing a
185 <literal>stripe_size</literal> of 0 causes the default stripe size to be used. Otherwise,
186 the <literal>stripe_size</literal> value must be a multiple of 64 KB.</para>
187 <para><emphasis role="bold">
188 <literal>stripe_count</literal>
191 <para>The <literal>stripe_count</literal> indicates how many OSTs to use. The default <literal>stripe_count</literal> value is 1. Setting <literal>stripe_count</literal> to 0 causes the default stripe count to be used. Setting <literal>stripe_count</literal> to -1 means stripe over all available OSTs (full OSTs are skipped).</para>
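<para>Together, <literal>stripe_size</literal> and
<literal>stripe_count</literal> determine where each byte of a plain RAID-0
striped file lands. The following sketch shows the arithmetic. The function
<literal>locate</literal> is a hypothetical helper, and the stripe index it
returns is a position within the file's layout, not an OST index.</para>
<programlisting language="python">
def locate(offset, stripe_size, stripe_count):
    """Map a file offset to (stripe index, offset inside that stripe's object)."""
    chunk = offset // stripe_size   # which stripe-sized chunk of the file
    stripe = chunk % stripe_count   # which object (stripe) holds that chunk
    row = chunk // stripe_count     # full rounds over all stripes before it
    return stripe, row * stripe_size + offset % stripe_size

MIB = 1024 * 1024
print(locate(0, MIB, 4))             # (0, 0)
print(locate(5 * MIB + 10, MIB, 4))  # (1, 1048586)
</programlisting>
<para>With a 1 MB stripe size and four stripes, the sixth megabyte of the file
wraps around to the second object and is stored 1 MB into it.</para>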
192 <para><emphasis role="bold">
193 <literal>start_ost</literal>
196 <para>The start OST is the first OST to which files are written. The default value for
197 <literal>start_ost</literal> is -1, which allows the MDS to choose the starting index. This
198 setting is strongly recommended, as it allows space and load balancing to be done by the MDS
199 as needed. If the value of <literal>start_ost</literal> is set to a value other than -1, the
200 file starts on the specified OST index. OST index numbering starts at 0.</para>
202 <para>If the specified OST is inactive or in a degraded mode, the MDS will silently choose
203 another target.</para>
206 <para>If you pass a <literal>start_ost</literal> value of 0 and a
207 <literal>stripe_count</literal> value of <emphasis>1</emphasis>, all files are written to
208 OST 0, until space is exhausted. <emphasis role="italic">This is probably not what you meant
209 to do.</emphasis> If you only want to adjust the stripe count and keep the other
210 parameters at their default settings, do not specify any of the other parameters:</para>
211 <para><screen>client# lfs setstripe -c <replaceable>stripe_count</replaceable> <replaceable>filename</replaceable></screen></para>
213 <para><emphasis role="bold">
214 <literal>pool_name</literal>
217 <para>The <literal>pool_name</literal> specifies the OST pool to which the file will be written.
218 This allows limiting the OSTs used to a subset of all OSTs in the file system. For more
219 details about using OST pools, see <link xl:href="ManagingFileSystemIO.html#50438211_75549"
220 >Creating and Managing OST Pools</link>.</para>
222 <title>Specifying a File Layout (Striping Pattern) for a Single File</title>
<para>Use the <literal>lfs setstripe</literal> command to specify the file layout when a new file is created. This allows users to override the file system default parameters and tune the file layout for their application. The <literal>lfs setstripe</literal> command fails if the file already exists.</para>
224 <section xml:id="dbdoclet.50438209_60155">
225 <title>Setting the Stripe Size</title>
226 <para>The command to create a new file with a specified stripe size is similar to:</para>
227 <screen>[client]# lfs setstripe -s 4M /mnt/lustre/new_file</screen>
228 <para>This example command creates the new file <literal>/mnt/lustre/new_file</literal> with a stripe size of 4 MB.</para>
<para>The file is created on a single OST with a stripe size of 4 MB, as the <literal>lfs getstripe</literal> output shows:</para>
230 <screen> [client]# lfs getstripe /mnt/lustre/new_file
233 lmm_stripe_size: 4194304
237 obdidx objid objid group
238 1 690550 0xa8976 0 </screen>
239 <para>In this example, the stripe size is 4 MB.</para>
242 <title><indexterm><primary>striping</primary><secondary>count</secondary></indexterm>
243 Setting the Stripe Count</title>
244 <para>The command below creates a new file with a stripe count of <literal>-1</literal> to
245 specify striping over all available OSTs:</para>
246 <screen>[client]# lfs setstripe -c -1 /mnt/lustre/full_stripe</screen>
247 <para>The example below indicates that the file <literal>full_stripe</literal> is striped
248 over all six active OSTs in the configuration:</para>
249 <screen>[client]# lfs getstripe /mnt/lustre/full_stripe
250 /mnt/lustre/full_stripe
251 obdidx objid objid group
258 <para> This is in contrast to the output in <xref linkend="dbdoclet.50438209_60155"/>, which
259 shows only a single object for the file.</para>
264 <primary>striping</primary>
265 <secondary>per directory</secondary>
266 </indexterm>Setting the Striping Layout for a Directory</title>
267 <para>In a directory, the <literal>lfs setstripe</literal> command sets a default striping
268 configuration for files created in the directory. The usage is the same as <literal>lfs
269 setstripe</literal> for a regular file, except that the directory must exist prior to
270 setting the default striping configuration. If a file is created in a directory with a
271 default stripe configuration (without otherwise specifying striping), the Lustre file system
272 uses those striping parameters instead of the file system default for the new file.</para>
<para>To change the striping pattern for a sub-directory, create a directory with the desired file
274 layout as described above. Sub-directories inherit the file layout of the root/parent
279 <primary>striping</primary>
280 <secondary>per file system</secondary>
281 </indexterm>Setting the Striping Layout for a File System</title>
282 <para>Setting the striping specification on the <literal>root</literal> directory determines
283 the striping for all new files created in the file system unless an overriding striping
284 specification takes precedence (such as a striping layout specified by the application, or
285 set using <literal>lfs setstripe</literal>, or specified for the parent directory).</para>
287 <para>The striping settings for a <literal>root</literal> directory are, by default, applied
288 to any new child directories created in the root directory, unless striping settings have
289 been specified for the child directory.</para>
294 <primary>striping</primary>
295 <secondary>on specific OST</secondary>
296 </indexterm>Creating a File on a Specific OST</title>
297 <para>You can use <literal>lfs setstripe</literal> to create a file on a specific OST. In the
298 following example, the file <literal>file1</literal> is created on the first OST (OST index
300 <screen>$ lfs setstripe --count 1 --index 0 file1
301 $ dd if=/dev/zero of=file1 count=1 bs=100M
305 $ lfs getstripe file1
308 lmm_stripe_size: 1048576
312 obdidx objid objid group
313 0 37364 0x91f4 0</screen>
316 <section xml:id="dbdoclet.50438209_44776">
317 <title><indexterm><primary>striping</primary><secondary>getting information</secondary></indexterm>Retrieving File Layout/Striping Information (<literal>getstripe</literal>)</title>
<para>The <literal>lfs getstripe</literal> command displays information that shows
over which OSTs a file is distributed. For each OST, the index and UUID are displayed, along
with the OST index and object ID for each stripe in the file. For directories, the default
settings for files created in that directory are displayed.</para>
323 <title>Displaying the Current Stripe Size</title>
324 <para>To see the current stripe size for a Lustre file or directory, use the <literal>lfs
325 getstripe</literal> command. For example, to view information for a directory, enter a
326 command similar to:</para>
327 <screen>[client]# lfs getstripe /mnt/lustre </screen>
328 <para>This command produces output similar to:</para>
330 (Default) stripe_count: 1 stripe_size: 1M stripe_offset: -1</screen>
<para>In this example, the default stripe count is <literal>1</literal> (data blocks are
striped over a single OST), the default stripe size is 1 MB, and the stripe offset of
<literal>-1</literal> lets the MDS choose the starting OST, so objects are created
over all available OSTs.</para>
334 <para>To view information for a file, enter a command similar to:</para>
335 <screen>$ lfs getstripe /mnt/lustre/foo
338 lmm_stripe_size: 1048576
342 obdidx objid objid group
     2          835487        0xcbf9f             0 </screen>
344 <para>In this example, the file is located on <literal>obdidx 2</literal>, which corresponds
345 to the OST <literal>lustre-OST0002</literal>. To see which node is serving that OST, run:
346 <screen>$ lctl get_param osc.lustre-OST0002-osc.ost_conn_uuid
347 osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp</screen></para>
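<para>The step from an <literal>obdidx</literal> value to the parameter name
used above can be sketched as follows. This assumes that the OST name suffix is
the index written as four hexadecimal digits and that the parameter name has the
form shown in the example. Both helper functions are hypothetical.</para>
<programlisting language="python">
def ost_name(fsname, obdidx):
    """Build an OST target name, e.g. ("lustre", 2) gives "lustre-OST0002".

    Assumption: the suffix is the index as four hexadecimal digits.
    """
    return f"{fsname}-OST{obdidx:04x}"

def conn_uuid_param(fsname, obdidx):
    """Build the parameter name queried with lctl get_param in the example."""
    return f"osc.{ost_name(fsname, obdidx)}-osc.ost_conn_uuid"

print(ost_name("lustre", 2))         # lustre-OST0002
print(conn_uuid_param("lustre", 2))  # osc.lustre-OST0002-osc.ost_conn_uuid
</programlisting>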
350 <title>Inspecting the File Tree</title>
351 <para>To inspect an entire tree of files, use the <literal>lfs find</literal> command:</para>
352 <screen>lfs find [--recursive | -r] <replaceable>file|directory</replaceable> ...</screen>
356 <primary>striping</primary>
357 <secondary>remote directories</secondary>
</indexterm>Locating the MDT for a Remote Directory</title>
359 <para condition="l24">Lustre software release 2.4 can be configured with
360 multiple MDTs in the same file system. Each sub-directory can have a
361 different MDT. To identify on which MDT a given subdirectory is
362 located, pass the <literal>getstripe [--mdt-index|-M]</literal>
363 parameters to <literal>lfs</literal>. An example of this command is
364 provided in the section <xref linkend="dbdoclet.rmremotedir"/>.</para>
367 <section xml:id="pfl" condition='l2A'>
369 <primary>striping</primary>
370 <secondary>PFL</secondary>
</indexterm>Progressive File Layout (PFL)</title>
372 <para>The Lustre Progressive File Layout (PFL) feature simplifies the use
373 of Lustre so that users can expect reasonable performance for a variety of
374 normal file IO patterns without the need to explicitly understand their IO
375 model or Lustre usage details in advance. In particular, users do not
376 necessarily need to know the size or concurrency of output files in
377 advance of their creation and explicitly specify an optimal layout for
378 each file in order to achieve good performance for both highly concurrent
379 shared-single-large-file IO or parallel IO to many smaller per-process
<para>The layout of a PFL file is stored on disk as a <literal>composite
layout</literal>. A PFL file is essentially an array of
<literal>sub-layout components</literal>, with each sub-layout component
being a plain layout covering a different, non-overlapping extent of
the file. Because the file layout is composed of a series of
components, it is possible that some file extents are
not described by any component.</para>
388 <para>An example of how data blocks of PFL files are mapped to OST objects
389 of components is shown in the following PFL object mapping diagram:</para>
390 <figure xml:id="managinglayout.fig.pfl">
391 <title>PFL object mapping diagram</title>
394 <imagedata scalefit="1" width="100%"
395 fileref="figures/PFL_object_mapping_diagram.png" />
398 <phrase>PFL object mapping diagram</phrase>
<para>The PFL file in <xref linkend="managinglayout.fig.pfl"/> has 3
components and shows the mapping for the blocks of a 2055MB file.
The stripe size for the first two components is 1MB, while the stripe size
for the third component is 4MB. The stripe count increases with each
successive component. The first component holds only two 1MB blocks, so its
single object has a size of 2MB. The second component holds the next 254MB
of the file, striped in RAID-0 over 4 separate OST objects. Each of these
objects has a size of 256MB / 4 objects = 64MB. Note that the first two
objects, <literal>obj 2,0</literal> and <literal>obj 2,1</literal>,
have a 1MB hole at the start, where the data is stored in the first
component. The final component holds the remaining 1799MB spread over 32 OST
objects. Each of these objects has a 256MB / 32 = 8MB hole at the start for the
data stored in the first two components. Each object has a size of
2048MB / 32 objects = 64MB, except
<literal>obj 3,0</literal>, which holds an extra 4MB chunk, and
<literal>obj 3,1</literal>, which holds an extra 3MB chunk. If more data
were written to the file, only the objects in component 3 would increase
<para>When a file range covered by a defined but not yet instantiated component is
accessed, the client sends a Layout Intent RPC to the MDT, and the MDT
instantiates the objects of the components covering that range.
<para>The following sections introduce the commands used to work with PFL
files and illustrate some possible composite layouts.
Lustre provides the
<literal>lfs setstripe</literal> and <literal>lfs migrate</literal> commands for
working with PFL files. <literal>lfs setstripe</literal> commands
are used to create PFL files and to add components to or delete components from an
existing composite file. <literal>lfs migrate</literal> commands are used
to re-layout the data in existing files using new layout parameters by
copying the data from the existing OST(s) to the new OST(s). Also,
as introduced in the previous sections, <literal>lfs getstripe</literal>
commands can be used to list the striping/component information for a
given PFL file, and <literal>lfs find</literal> commands can be used to
search the directory tree rooted at the given directory or file name for
the files that match the given PFL component parameters.</para>
<note><para>Using PFL files requires both the client and server to
understand the PFL file layout, which is not supported in Lustre 2.9 and
earlier. This does not prevent older clients from accessing non-PFL
files in the file system.</para></note>
443 <title><literal>lfs setstripe</literal></title>
<para><literal>lfs setstripe</literal> commands are used to create PFL
files and to add components to or delete components from an existing composite file.
446 (Suppose we have 8 OSTs in the following examples and stripe size is 1MB
449 <title>Create a PFL file</title>
450 <para><emphasis role="bold">Command</emphasis></para>
451 <screen>lfs setstripe
452 [--component-end|-E end1] [STRIPE_OPTIONS]
453 [--component-end|-E end2] [STRIPE_OPTIONS] ... <replaceable>filename</replaceable></screen>
<para>The <literal>-E</literal> option specifies the end offset
(in bytes or using a suffix “kMGTP”, e.g. 256M) of each component, and
also indicates that the <literal>STRIPE_OPTIONS</literal> that follow
apply to this component. Each component defines the stripe pattern of the
file in the range [start, end). The first component must start at
offset 0 and all components must be adjacent to each other. No holes
are allowed, so each extent starts at the end of the previous extent.
An end offset of <literal>-1</literal> or <literal>eof</literal> indicates
that this is the last component and that it extends to the end of the file.</para>
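<para>The way a list of <literal>-E</literal> end offsets turns into adjacent
extents can be sketched in a few lines. This is an illustration of the rules
just described, not the parser used by <literal>lfs</literal>. It assumes
binary multipliers for the “kMGTP” suffixes, and the helper names
<literal>parse_end</literal> and <literal>extents</literal> are
hypothetical.</para>
<programlisting language="python">
UNITS = {"k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}
EOF = None  # marker for a component that extends to the end of the file

def parse_end(text):
    """Convert an -E argument such as "4M", "67108864", "-1" or "eof"."""
    if text in ("-1", "eof"):
        return EOF
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def extents(ends):
    """Turn component end offsets into adjacent (start, end) extents."""
    result, start = [], 0
    for i, text in enumerate(ends):
        end = parse_end(text)
        if end is EOF:
            if i != len(ends) - 1:
                raise ValueError("only the last component may extend to EOF")
        elif not (end > start):
            raise ValueError("component end must lie beyond its start")
        result.append((start, end))
        start = end
    return result

print(extents(["4M", "64M", "-1"]))
# [(0, 4194304), (4194304, 67108864), (67108864, None)]
</programlisting>
<para>Because each start is taken from the previous end, the extents cannot
contain a hole, and the first one always starts at offset 0.</para>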
463 <para><emphasis role="bold">Example</emphasis></para>
464 <screen>$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \
465 /mnt/testfs/create_comp</screen>
<para>This command creates a file with the composite layout illustrated in
the following figure. The first component has 1 stripe and covers
[0, 4M), the second component has 4 stripes and covers [4M, 64M), and
the last component starts at OST4, stripes across all available
OSTs, and covers [64M, EOF).</para>
471 <figure xml:id="managinglayout.fig.pfl_create">
472 <title>Example: create a composite file</title>
475 <imagedata scalefit="1" depth="2.75in" align="center"
476 fileref="figures/PFL_createfile.png" />
479 <phrase>Example: create a composite file</phrase>
<para>The composite layout can be displayed with the following command:</para>
484 <screen>$ lfs getstripe /mnt/testfs/create_comp
485 /mnt/testfs/create_comp
490 lcme_extent.e_start: 0
491 lcme_extent.e_end: 4194304
493 lmm_stripe_size: 1048576
498 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
502 lcme_extent.e_start: 4194304
503 lcme_extent.e_end: 67108864
505 lmm_stripe_size: 1048576
508 lmm_stripe_offset: -1
511 lcme_extent.e_start: 67108864
512 lcme_extent.e_end: EOF
514 lmm_stripe_size: 1048576
517 lmm_stripe_offset: 4</screen>
<note><para>Only the first component’s OST objects of the PFL file are
instantiated when the layout is being set. Instantiation of the other
components is delayed until later write/truncate operations.</para></note>
<para>If we write 128MB of data to this PFL file, the second and third
components will be instantiated:</para>
523 <screen>$ dd if=/dev/zero of=/mnt/testfs/create_comp bs=1M count=128
524 $ lfs getstripe /mnt/testfs/create_comp
525 /mnt/testfs/create_comp
530 lcme_extent.e_start: 0
531 lcme_extent.e_end: 4194304
533 lmm_stripe_size: 1048576
538 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
542 lcme_extent.e_start: 4194304
543 lcme_extent.e_end: 67108864
545 lmm_stripe_size: 1048576
550 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
551 - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
552 - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
553 - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
557 lcme_extent.e_start: 67108864
558 lcme_extent.e_end: EOF
560 lmm_stripe_size: 1048576
565 - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x3:0x0] }
566 - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
567 - 2: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
568 - 3: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
569 - 4: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
570 - 5: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] }
571 - 6: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }
572 - 7: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] }</screen>
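<para>Which components a write reaches follows from a simple overlap test
between the written range and each component's [start, end) extent. The sketch
below models only that test. The function
<literal>components_touched</literal> is hypothetical, and
<literal>None</literal> stands for an EOF end offset.</para>
<programlisting language="python">
def components_touched(layout, write_start, write_len):
    """Return the indexes of components whose extent overlaps a write."""
    write_end = write_start + write_len
    hit = []
    for i, (start, end) in enumerate(layout):
        ends_after_write_start = end is None or end > write_start
        if write_end > start and ends_after_write_start:
            hit.append(i)
    return hit

M = 1024 * 1024
layout = [(0, 4 * M), (4 * M, 64 * M), (64 * M, None)]  # None means EOF

print(components_touched(layout, 0, 128 * M))  # [0, 1, 2]
print(components_touched(layout, 0, 1 * M))    # [0]
</programlisting>
<para>A 128MB write starting at offset 0 overlaps all three extents, which is
why the second and third components gain objects in the output above. A 1MB
write stays within the first component.</para>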
575 <title>Add component(s) to an existing composite file</title>
576 <para><emphasis role="bold">Command</emphasis></para>
577 <screen>lfs setstripe --component-add
578 [--component-end|-E end1] [STRIPE_OPTIONS]
579 [--component-end|-E end2] [STRIPE_OPTIONS] ... <replaceable>filename</replaceable></screen>
<para>The <literal>--component-add</literal> option is used to add
components to an existing composite file. The extent start of
the first component to be added is equal to the extent end of the last
component in the existing file, and all components to be added must
be adjacent to each other.</para>
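<para>The rule for appending components can be modelled as follows. This is a
hedged sketch of the constraints described for this command, including the
requirement that a component ending at EOF be removed before another is added.
The function <literal>add_components</literal> is hypothetical, and
<literal>None</literal> stands for an EOF end offset.</para>
<programlisting language="python">
def add_components(layout, new_ends):
    """Append components to a list of (start, end) extents and return a new list."""
    layout = list(layout)
    for end in new_ends:
        start = layout[-1][1] if layout else 0
        if start is None:
            raise ValueError("last component extends to EOF; delete it first")
        if end is not None and not (end > start):
            raise ValueError("new component must end beyond its start")
        layout.append((start, end))
    return layout

M = 1024 * 1024
base = [(0, 4 * M), (4 * M, 64 * M)]
print(add_components(base, [None]))
# [(0, 4194304), (4194304, 67108864), (67108864, None)]
</programlisting>
<para>Calling the function again on the result raises an error, because the
layout now ends at EOF and leaves no room for a further component.</para>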
585 <note><para>If the last existing component is specified by
586 <literal>-E -1</literal> or <literal>-E eof</literal>, which covers
587 to the end of the file, it must be deleted before a new one is added.
589 <para><emphasis role="bold">Example</emphasis></para>
590 <screen>$ lfs setstripe -E 4M -c 1 -E 64M -c 4 /mnt/testfs/add_comp
591 $ lfs setstripe --component-add -E -1 -c 4 -o 6-7,0,5 \
592 /mnt/testfs/add_comp</screen>
<para>This command adds a new component that extends from the end of
the last existing component to the end of the file. The layout of this
example is illustrated in
<xref linkend="managinglayout.fig.pfl_addcomp"/>. The last component
stripes across 4 OSTs in the sequence OST6, OST7, OST0 and OST5, and covers
599 <figure xml:id="managinglayout.fig.pfl_addcomp">
600 <title>Example: add a component to an existing composite file</title>
603 <imagedata scalefit="1" depth="2.75in" align="center"
604 fileref="figures/PFL_addcomp.png" />
607 <phrase>Example: add a component to an existing composite file
612 <para>The layout can be printed out by the following command:</para>
613 <screen>$ lfs getstripe /mnt/testfs/add_comp
619 lcme_extent.e_start: 0
620 lcme_extent.e_end: 4194304
622 lmm_stripe_size: 1048576
627 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
631 lcme_extent.e_start: 4194304
632 lcme_extent.e_end: 67108864
634 lmm_stripe_size: 1048576
639 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
640 - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
641 - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
642 - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
646 lcme_extent.e_start: 67108864
647 lcme_extent.e_end: EOF
649 lmm_stripe_size: 1048576
652 lmm_stripe_offset: -1</screen>
653 <para>The component ID "lcme_id" changes as layout generation
654 changes. It is not necessarily sequential and does not imply ordering
655 of individual components.</para>
<note><para>As when specifying a full-file composite layout at file
creation time, <literal>--component-add</literal> does not instantiate
OST objects. Instantiation is delayed until later write/truncate
operations. For example, after writing beyond the 64MB start of the
file's last component, the new component has had objects allocated:
662 <screen>$ lfs getstripe -I5 /mnt/testfs/add_comp
668 lcme_extent.e_start: 67108864
669 lcme_extent.e_end: EOF
671 lmm_stripe_size: 1048576
676 - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] }
677 - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] }
678 - 2: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
679 - 3: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] }</screen>
682 <title>Delete component(s) from an existing file</title>
683 <para><emphasis role="bold">Command</emphasis></para>
684 <screen>lfs setstripe --component-del
685 [--component-id|-I comp_id | --component-flags comp_flags]
686 <replaceable>filename</replaceable></screen>
<para>The <literal>--component-del</literal> option is used to remove
the component(s) specified by component ID or flags from an existing
file. Any data stored in the deleted
component is lost.</para>
<para>The ID specified by the <literal>-I</literal> option is the unique
numerical ID of the component, which can be obtained with the
<literal>lfs getstripe -I</literal> command. The flag specified by the
<literal>--component-flags</literal> option selects a certain type of
component, which can be listed with the
<literal>lfs getstripe --component-flags</literal> command. Currently, only
two flags are available, <literal>init</literal> and <literal>^init</literal>,
for instantiated and un-instantiated components respectively.</para>
<note><para>Deletion must start with the last component because holes are
not allowed.</para></note>
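<para>The last-component rule can be expressed as a short check. The sketch
below is a model only. The component IDs and the
<literal>delete_component</literal> helper are hypothetical, and the error text
merely echoes the "Invalid argument" result that
<literal>lfs setstripe</literal> reports.</para>
<programlisting language="python">
def delete_component(components, comp_id):
    """Remove a component by ID; components are ordered by extent."""
    ids = [c["id"] for c in components]
    if comp_id not in ids:
        raise KeyError(f"no component with ID {comp_id}")
    if ids[-1] != comp_id:
        raise ValueError("Invalid argument: deletion would leave a hole")
    return components[:-1]

# Hypothetical layout: IDs need not be sequential.
layout = [
    {"id": 1, "end": "4M"},
    {"id": 2, "end": "64M"},
    {"id": 5, "end": "EOF"},
]
layout = delete_component(layout, 5)
print([c["id"] for c in layout])  # [1, 2]
</programlisting>
<para>After the component with ID 5 is removed, the component with ID 2 becomes
the last one and is the only remaining candidate for deletion.</para>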
701 <para><emphasis role="bold">Example</emphasis></para>
702 <screen>$ lfs getstripe -I /mnt/testfs/del_comp
706 $ lfs setstripe --component-del -I 5 /mnt/testfs/del_comp</screen>
<para>This example deletes the component with ID 5 from the file
<literal>/mnt/testfs/del_comp</literal>. Continuing the previous
example, the final result is illustrated in
<xref linkend="managinglayout.fig.pfl_delcomp"/>.</para>
711 <figure xml:id="managinglayout.fig.pfl_delcomp">
712 <title>Example: delete a component from an existing file</title>
715 <imagedata scalefit="1" depth="2.75in" align="center"
716 fileref="figures/PFL_delcomp.png" />
719 <phrase>Example: delete a component from an existing file</phrase>
723 <para>If you try to delete a non-last component, you will see the
724 following error:</para>
<screen>$ lfs setstripe --component-del -I 2 /mnt/testfs/del_comp
726 Delete component 0x2 from /mnt/testfs/del_comp failed. Invalid argument
727 error: setstripe: delete component of file '/mnt/testfs/del_comp' failed: Invalid argument</screen>
<title>Set a default PFL layout on an existing directory</title>
<para>Similar to creating a PFL file, you can set a default PFL layout on
an existing directory. After that, all files created in the directory inherit
this layout by default.</para>
734 <para><emphasis role="bold">Command</emphasis></para>
735 <screen>lfs setstripe
736 [--component-end|-E end1] [STRIPE_OPTIONS]
737 [--component-end|-E end2] [STRIPE_OPTIONS] ... <replaceable>dirname</replaceable></screen>
738 <para><emphasis role="bold">Example</emphasis></para>
740 $ mkdir /mnt/testfs/pfldir
741 $ lfs setstripe -E 256M -c 1 -E 16G -c 4 -E -1 -S 4M -c -1 /mnt/testfs/pfldir
743 <para>When you run <literal>lfs getstripe</literal>, you will see:
746 $ lfs getstripe /mnt/testfs/pfldir
752 lcme_extent.e_start: 0
753 lcme_extent.e_end: 268435456
754 stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
757 lcme_extent.e_start: 268435456
758 lcme_extent.e_end: 17179869184
759 stripe_count: 4 stripe_size: 1048576 stripe_offset: -1
762 lcme_extent.e_start: 17179869184
763 lcme_extent.e_end: EOF
764 stripe_count: -1 stripe_size: 4194304 stripe_offset: -1
766 <para>If you create a file under <literal>/mnt/testfs/pfldir</literal>,
767 the layout of that file will inherit the layout from its parent
770 $ touch /mnt/testfs/pfldir/pflfile
771 $ lfs getstripe /mnt/testfs/pfldir/pflfile
772 /mnt/testfs/pfldir/pflfile
777 lcme_extent.e_start: 0
778 lcme_extent.e_end: 268435456
780 lmm_stripe_size: 1048576
785 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0xa:0x0] }
789 lcme_extent.e_start: 268435456
790 lcme_extent.e_end: 17179869184
792 lmm_stripe_size: 1048576
795 lmm_stripe_offset: -1
799 lcme_extent.e_start: 17179869184
800 lcme_extent.e_end: EOF
802 lmm_stripe_size: 4194304
805 lmm_stripe_offset: -1
<literal>lfs setstripe --component-add/del</literal> cannot be run
on a directory, because the default layout of a directory is like a
configuration setting that can be changed arbitrarily by <literal>lfs setstripe</literal>,
while the layout of a file may have data (OST objects) attached. To
delete the default layout of a directory, run
<literal>lfs setstripe -d <replaceable>dirname</replaceable></literal>
to return the directory to the filesystem-wide defaults, for example:
816 $ lfs setstripe -d /mnt/testfs/pfldir
817 $ lfs getstripe -d /mnt/testfs/pfldir
819 stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
820 /mnt/testfs/pfldir/commonfile
822 lmm_stripe_size: 1048576
826 obdidx objid objid group
833 <title><literal>lfs migrate</literal></title>
<para>The <literal>lfs migrate</literal> command re-lays out the
data of existing files according to new layout parameters by copying the
data from the existing OST(s) to the new OST(s).</para>
837 <para><emphasis role="bold">Command</emphasis></para>
838 <screen>lfs migrate [--component-end|-E comp_end] [STRIPE_OPTIONS] ...
839 <replaceable>filename</replaceable></screen>
<para>The difference between <literal>migrate</literal> and
<literal>setstripe</literal> is that <literal>migrate</literal>
changes the layout of the data in existing files, while
<literal>setstripe</literal> creates new files with the specified
845 <para><emphasis role="bold">Example</emphasis></para>
846 <para><emphasis role="bold">Case1. Migrate a normal one to a composite
847 layout</emphasis></para>
848 <screen>$ lfs setstripe -c 1 -S 128K /mnt/testfs/norm_to_2comp
849 $ dd if=/dev/urandom of=/mnt/testfs/norm_to_2comp bs=1M count=5
850 $ lfs getstripe /mnt/testfs/norm_to_2comp --yaml
/mnt/testfs/norm_to_2comp
853 lmm_stripe_size: 131072
859 l_fid: 0x100070000:0x2:0x0
860 $ lfs migrate -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
861 /mnt/testfs/norm_to_2comp</screen>
<para>In this example, a 5MB file with 1 stripe and a 128K stripe size
is migrated to a composite layout file with 2 components, as illustrated in
<xref linkend="managinglayout.fig.pfl_norm_to_comp"/>.</para>
865 <figure xml:id="managinglayout.fig.pfl_norm_to_comp">
866 <title>Example: migrate normal to composite</title>
869 <imagedata scalefit="1" depth="2.75in" align="center"
870 fileref="figures/PFL_norm_to_comp.png" />
873 <phrase>Example: migrate normal to composite</phrase>
<para>After migration, the stripe information looks like this:</para>
878 <screen>$ lfs getstripe /mnt/testfs/norm_to_2comp
879 /mnt/testfs/norm_to_2comp
884 lcme_extent.e_start: 0
885 lcme_extent.e_end: 1048576
887 lmm_stripe_size: 524288
892 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
896 lcme_extent.e_start: 1048576
897 lcme_extent.e_end: EOF
899 lmm_stripe_size: 1048576
904 - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
905 - 1: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }</screen>
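<para>The extent boundaries reported by <literal>lfs getstripe</literal>
are the byte values of the sizes passed to <literal>-E</literal>. The
following is a minimal shell sketch, not part of Lustre, that converts such
sizes to bytes (binary units are assumed) and shows how much of the 5MB file
from this example falls into each of the two components. The helper names
<literal>to_bytes</literal> and <literal>bytes_in_extent</literal> are
hypothetical.</para>
<screen># Convert a size such as 512K, 1M or 16G to bytes (binary units assumed).
to_bytes() {
    local value=${1%[kKmMgG]}
    case "$1" in
        *[kK]) echo $((value * 1024)) ;;
        *[mM]) echo $((value * 1024 * 1024)) ;;
        *[gG]) echo $((value * 1024 * 1024 * 1024)) ;;
        *)     echo "$value" ;;
    esac
}

# Print how many bytes of a file of size $1 fall inside the extent [$2, $3).
# An end of -1 stands for EOF, as in "lfs setstripe -E -1".
bytes_in_extent() {
    local size=$1 start=$2 end=$3
    if [ "$end" -eq -1 ] || [ "$end" -gt "$size" ]; then
        end=$size
    fi
    if [ "$start" -ge "$end" ]; then
        echo 0
    else
        echo $((end - start))
    fi
}

file_size=$(to_bytes 5M)      # written by: dd bs=1M count=5
first_end=$(to_bytes 1M)      # boundary set by: -E 1M
echo "component 1: $(bytes_in_extent "$file_size" 0 "$first_end") bytes"
echo "component 2: $(bytes_in_extent "$file_size" "$first_end" -1) bytes"</screen>
<para>The first component holds the first 1048576 bytes, which matches the
<literal>lcme_extent.e_end</literal> value shown above, and the remaining
data lands in the second component.</para>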
906 <para><emphasis role="bold">Case2. Migrate a composite layout to another
907 composite layout</emphasis></para>
908 <screen>$ lfs setstripe -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
909 /mnt/testfs/2comp_to_3comp
$ dd if=/dev/urandom of=/mnt/testfs/2comp_to_3comp bs=1M count=5
911 $ lfs migrate -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
912 /mnt/testfs/2comp_to_3comp</screen>
<para>In this example, a composite layout file with 2 components is
migrated to a composite layout file with 3 components. Continuing from
the example in Case1, the migration process is illustrated in
<xref linkend="managinglayout.fig.pfl_comp_to_comp"/>.</para>
917 <figure xml:id="managinglayout.fig.pfl_comp_to_comp">
918 <title>Example: migrate composite to composite</title>
921 <imagedata scalefit="1" depth="2.75in" align="center"
922 fileref="figures/PFL_comp_to_comp.png" />
925 <phrase>Example: migrate composite to composite</phrase>
<para>The resulting stripe information looks like this:</para>
930 <screen>$ lfs getstripe /mnt/testfs/2comp_to_3comp
931 /mnt/testfs/2comp_to_3comp
936 lcme_extent.e_start: 0
937 lcme_extent.e_end: 1048576
939 lmm_stripe_size: 1048576
944 - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
945 - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
949 lcme_extent.e_start: 1048576
950 lcme_extent.e_end: 4194304
952 lmm_stripe_size: 1048576
957 - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
958 - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x3:0x0] }
962 lcme_extent.e_start: 4194304
963 lcme_extent.e_end: EOF
965 lmm_stripe_size: 3145728
970 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
971 - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
972 - 2: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }</screen>
973 <para><emphasis role="bold">Case3. Migrate a composite layout to a
974 normal one</emphasis></para>
<screen>$ lfs setstripe -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
/mnt/testfs/3comp_to_norm
$ dd if=/dev/urandom of=/mnt/testfs/3comp_to_norm bs=1M count=5
$ lfs migrate -c 2 -S 2M /mnt/testfs/3comp_to_norm</screen>
<para>In this example, a composite file with 3 components is migrated to
a normal file with 2 stripes and a 2M stripe size. Continuing from the
example in Case2, the migration process is illustrated in
<xref linkend="managinglayout.fig.pfl_comp_to_norm"/>.</para>
983 <figure xml:id="managinglayout.fig.pfl_comp_to_norm">
984 <title>Example: migrate composite to normal</title>
987 <imagedata scalefit="1" depth="2.75in" align="center"
988 fileref="figures/PFL_comp_to_norm.png" />
991 <phrase>Example: migrate composite to normal</phrase>
<para>The resulting stripe information looks like this:</para>
996 <screen>$ lfs getstripe /mnt/testfs/3comp_to_norm --yaml
997 /mnt/testfs/3comp_to_norm
999 lmm_stripe_size: 2097152
1002 lmm_stripe_offset: 4
1005 l_fid: 0x100040000:0x3:0x0
1007 l_fid: 0x100050000:0x3:0x0</screen>
1009 <section remap="h3">
1010 <title><literal>lfs getstripe</literal></title>
<para>The <literal>lfs getstripe</literal> command lists the
striping/component information for a given PFL file. Only the
parameters that are new for PFL files are shown here.</para>
1014 <para><emphasis role="bold">Command</emphasis></para>
1015 <screen>lfs getstripe
1016 [--component-id|-I [comp_id]]
1017 [--component-flags [comp_flags]]
1019 [--component-start [+-][N][kMGTPE]]
1020 [--component-end|-E [+-][N][kMGTPE]]
1021 <replaceable>dirname|filename</replaceable></screen>
1022 <para><emphasis role="bold">Example</emphasis></para>
1023 <para>Suppose we already have a composite file
1024 <literal>/mnt/testfs/3comp</literal>, created by the following
1026 <screen>$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \
1027 /mnt/testfs/3comp</screen>
<para>Then write some data to the file:</para>
1029 <screen>$ dd if=/dev/zero of=/mnt/testfs/3comp bs=1M count=5</screen>
1030 <para><emphasis role="bold">Case1. List component ID and its related
1031 information</emphasis></para>
<para>List the IDs of all components</para>
1035 <screen>$ lfs getstripe -I /mnt/testfs/3comp
1041 <para>List the detailed striping information of component ID=2</para>
1042 <screen>$ lfs getstripe -I2 /mnt/testfs/3comp
1048 lcme_extent.e_start: 4194304
1049 lcme_extent.e_end: 67108864
1051 lmm_stripe_size: 1048576
1054 lmm_stripe_offset: 5
1056 - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
1057 - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
1058 - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
1059 - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }</screen>
1062 <para>List the stripe offset and stripe count of component ID=2</para>
1063 <screen>$ lfs getstripe -I2 -i -c /mnt/testfs/3comp
1065 lmm_stripe_offset: 5</screen>
1068 <para><emphasis role="bold">Case2. List the component which contains the
1069 specified flag</emphasis></para>
1072 <para>List the flag of each component</para>
<screen>$ lfs getstripe --component-flags -I /mnt/testfs/3comp
1079 lcme_flags: 0</screen>
<para>List the component(s) that are not instantiated</para>
1083 <screen>$ lfs getstripe --component-flags=^init /mnt/testfs/3comp
1089 lcme_extent.e_start: 67108864
1090 lcme_extent.e_end: EOF
1091 lmm_stripe_count: -1
1092 lmm_stripe_size: 1048576
1095 lmm_stripe_offset: 4</screen>
1098 <para><emphasis role="bold">Case3. List the total number of all the
1099 component(s)</emphasis></para>
1102 <para>List the total number of all the components</para>
1103 <screen>$ lfs getstripe --component-count /mnt/testfs/3comp
1107 <para><emphasis role="bold">Case4. List the component with the specified
1108 extent start or end positions</emphasis></para>
1111 <para>List the start position in bytes of each component</para>
1112 <screen>$ lfs getstripe --component-start /mnt/testfs/3comp
1118 <para>List the start position in bytes of component ID=3</para>
1119 <screen>$ lfs getstripe --component-start -I3 /mnt/testfs/3comp
1123 <para>List the component with start = 64M</para>
1124 <screen>$ lfs getstripe --component-start=64M /mnt/testfs/3comp
1130 lcme_extent.e_start: 67108864
1131 lcme_extent.e_end: EOF
1132 lmm_stripe_count: -1
1133 lmm_stripe_size: 1048576
1136 lmm_stripe_offset: 4</screen>
1139 <para>List the component(s) with start > 5M</para>
1140 <screen>$ lfs getstripe --component-start=+5M /mnt/testfs/3comp
1146 lcme_extent.e_start: 67108864
1147 lcme_extent.e_end: EOF
1148 lmm_stripe_count: -1
1149 lmm_stripe_size: 1048576
1152 lmm_stripe_offset: 4</screen>
<para>List the component(s) with start &lt; 5M</para>
1156 <screen>$ lfs getstripe --component-start=-5M /mnt/testfs/3comp
1162 lcme_extent.e_start: 0
1163 lcme_extent.e_end: 4194304
1165 lmm_stripe_size: 1048576
1168 lmm_stripe_offset: 4
1170 - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
1174 lcme_extent.e_start: 4194304
1175 lcme_extent.e_end: 67108864
1177 lmm_stripe_size: 1048576
1180 lmm_stripe_offset: 5
1182 - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
1183 - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
1184 - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
1185 - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }</screen>
<para>List the component(s) with start &gt; 3M and end &lt; 70M</para>
1189 <screen>$ lfs getstripe --component-start=+3M --component-end=-70M \
1196 lcme_extent.e_start: 4194304
1197 lcme_extent.e_end: 67108864
1199 lmm_stripe_size: 1048576
1202 lmm_stripe_offset: 5
1204 - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
1205 - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
1206 - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
1207 - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }</screen>
1211 <section remap="h3">
1212 <title><literal>lfs find</literal></title>
<para>The <literal>lfs find</literal> command searches the
directory tree rooted at the given directory or file name for files
that match the given PFL component parameters. Only the
parameters that are new for PFL files are shown here. Their usage is
similar to that of the <literal>lfs getstripe</literal> command.</para>
1218 <para><emphasis role="bold">Command</emphasis></para>
1219 <screen>lfs find <replaceable>directory|filename</replaceable>
1220 [[!] --component-count [+-=]<replaceable>comp_cnt</replaceable>]
1221 [[!] --component-start [+-=]<replaceable>N</replaceable>[kMGTPE]]
1222 [[!] --component-end|-E [+-=]<replaceable>N</replaceable>[kMGTPE]]
1223 [[!] --component-flags=<replaceable>comp_flags</replaceable>]</screen>
1224 <note><para>If you use <literal>--component-xxx</literal> options, only
1225 the composite files will be searched; but if you use
1226 <literal>! --component-xxx</literal> options, all the files will be
1227 searched.</para></note>
1228 <para><emphasis role="bold">Example</emphasis></para>
1229 <para>We use the following directory and composite files to show how
1230 <literal>lfs find</literal> works.</para>
1231 <screen>$ mkdir /mnt/testfs/testdir
1232 $ lfs setstripe -E 1M -E 10M -E eof /mnt/testfs/testdir/3comp
1233 $ lfs setstripe -E 4M -E 20M -E 30M -E eof /mnt/testfs/testdir/4comp
1234 $ mkdir -p /mnt/testfs/testdir/dir_3comp
1235 $ lfs setstripe -E 6M -E 30M -E eof /mnt/testfs/testdir/dir_3comp
1236 $ lfs setstripe -E 8M -E eof /mnt/testfs/testdir/dir_3comp/2comp
$ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commonfile</screen>
1238 <para><emphasis role="bold">Case1. Find the files that match the specified
1239 component count condition</emphasis></para>
1240 <para>Find the files under directory /mnt/testfs/testdir whose number of
1241 components is not equal to 3.</para>
1242 <screen>$ lfs find /mnt/testfs/testdir ! --component-count=3
1244 /mnt/testfs/testdir/4comp
1245 /mnt/testfs/testdir/dir_3comp/2comp
1246 /mnt/testfs/testdir/dir_3comp/commonfile</screen>
1247 <para><emphasis role="bold">Case2. Find the files/dirs that match the
1248 specified component start/end condition</emphasis></para>
<para>Find the file(s) under directory /mnt/testfs/testdir with component
start = 4M and end &lt; 30M</para>
1251 <screen>$ lfs find /mnt/testfs/testdir --component-start=4M -E -30M
1252 /mnt/testfs/testdir/4comp</screen>
1253 <para><emphasis role="bold">Case3. Find the files/dirs that match the
1254 specified component flag condition</emphasis></para>
1255 <para>Find the file(s) under directory /mnt/testfs/testdir whose component
1256 flags contain <literal>init</literal></para>
<screen>$ lfs find /mnt/testfs/testdir --component-flags=init
1258 /mnt/testfs/testdir/3comp
1259 /mnt/testfs/testdir/4comp
1260 /mnt/testfs/testdir/dir_3comp/2comp</screen>
<note><para>Because <literal>lfs find</literal> uses
"<literal>!</literal>" for negative searches, the
<literal>^init</literal> flag is not supported here.</para></note>
1266 <section xml:id="dbdoclet.50438209_10424">
1268 <primary>space</primary>
1269 <secondary>free space</secondary>
1270 </indexterm><indexterm>
1271 <primary>striping</primary>
1272 <secondary>round-robin algorithm</secondary>
1273 </indexterm><indexterm>
1274 <primary>striping</primary>
1275 <secondary>weighted algorithm</secondary>
1276 </indexterm><indexterm>
1277 <primary>round-robin algorithm</primary>
1278 </indexterm><indexterm>
1279 <primary>weighted algorithm</primary>
1280 </indexterm>Managing Free Space</title>
1281 <para>To optimize file system performance, the MDT assigns file stripes to OSTs based on two
1282 allocation algorithms. The <emphasis role="italic">round-robin</emphasis> allocator gives
1283 preference to location (spreading out stripes across OSSs to increase network bandwidth
1284 utilization) and the weighted allocator gives preference to available space (balancing loads
1285 across OSTs). Threshold and weighting factors for these two algorithms can be adjusted by the
user. The MDT reserves 0.1 percent of total OST space and 32 inodes for each OST. The MDT
stops object allocation for an OST if its available space is less than the reserved space or
the OST has fewer than 32 free inodes. The MDT resumes object allocation when available space
is at least twice the reserved space and the OST has more than 64 free inodes. Note that
clients can still append to existing files regardless of the object allocation state.</para>
<para condition="l29"> The reserved space for each OST can be adjusted by the user with the
<literal>lctl set_param</literal> command; for example, the following command reserves 1GB of space
1294 <screen>lctl set_param -P osp.*.reserved_mb_low=1024</screen></para>
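<para>As a rough worked example of the default thresholds described above,
the following shell sketch computes the reserved space for a single OST. The
OST size is an illustrative value and is not read from a live system; the
exact rounding used by the MDT may differ.</para>
<screen># Size of one OST in 1K-blocks (illustrative value only).
ost_total_kb=94181368

# Default reservation: 0.1 percent of the total OST space.
reserved_kb=$((ost_total_kb / 1000))

# Allocation resumes once available space is twice the reserved space.
resume_kb=$((reserved_kb * 2))

echo "allocation stops below:   $((reserved_kb / 1024)) MB available"
echo "allocation resumes above: $((resume_kb / 1024)) MB available"</screen>
<para>For an OST of this size the default reservation is roughly 92 MB, so
setting <literal>reserved_mb_low</literal> to 1024 raises it by about a
factor of eleven.</para>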
1295 <para>This section describes how to check available free space on disks and how free space is
1296 allocated. It then describes how to set the threshold and weighting factors for the allocation
1298 <section xml:id="dbdoclet.50438209_35838">
1299 <title>Checking File System Free Space</title>
1300 <para>Free space is an important consideration in assigning file stripes. The <literal>lfs
1301 df</literal> command can be used to show available disk space on the mounted Lustre file
1302 system and space consumption per OST. If multiple Lustre file systems are mounted, a path
1303 may be specified, but is not required. Options to the <literal>lfs df</literal> command are
1305 <informaltable frame="all">
1307 <colspec colname="c1" colwidth="50*"/>
1308 <colspec colname="c2" colwidth="50*"/>
1312 <para><emphasis role="bold">Option</emphasis></para>
1315 <para><emphasis role="bold">Description</emphasis></para>
1322 <para> <literal>-h</literal></para>
1325 <para> Displays sizes in human readable format (for example: 1K, 234M, 5G).</para>
1330 <para> <literal role="bold">-i, --inodes</literal></para>
1333 <para> Lists inodes instead of block usage.</para>
1340 <para>The <literal>df -i</literal> and <literal>lfs df -i</literal> commands show the
1341 <emphasis role="italic">minimum</emphasis> number of inodes that can be created in the
1342 file system at the current time. If the total number of objects available across all of
1343 the OSTs is smaller than those available on the MDT(s), taking into account the default
1344 file striping, then <literal>df -i</literal> will also report a smaller number of inodes
1345 than could be created. Running <literal>lfs df -i</literal> will report the actual number
1346 of inodes that are free on each target.</para>
1347 <para>For ZFS file systems, the number of inodes that can be created is dynamic and depends
1348 on the free space in the file system. The Free and Total inode counts reported for a ZFS
1349 file system are only an estimate based on the current usage for each target. The Used
1350 inode count is the actual number of inodes used by the file system.</para>
1352 <para><emphasis role="bold">Examples</emphasis></para>
1353 <screen>[client1] $ lfs df
UUID 1K-blocks Used Available Use% Mounted on
1355 mds-lustre-0_UUID 9174328 1020024 8154304 11% /mnt/lustre[MDT:0]
1356 ost-lustre-0_UUID 94181368 56330708 37850660 59% /mnt/lustre[OST:0]
1357 ost-lustre-1_UUID 94181368 56385748 37795620 59% /mnt/lustre[OST:1]
1358 ost-lustre-2_UUID 94181368 54352012 39829356 57% /mnt/lustre[OST:2]
filesystem summary: 282544104 167068468 115475636 59% /mnt/lustre
1361 [client1] $ lfs df -h
1362 UUID bytes Used Available Use% Mounted on
1363 mds-lustre-0_UUID 8.7G 996.1M 7.8G 11% /mnt/lustre[MDT:0]
1364 ost-lustre-0_UUID 89.8G 53.7G 36.1G 59% /mnt/lustre[OST:0]
1365 ost-lustre-1_UUID 89.8G 53.8G 36.0G 59% /mnt/lustre[OST:1]
1366 ost-lustre-2_UUID 89.8G 51.8G 38.0G 57% /mnt/lustre[OST:2]
1367 filesystem summary: 269.5G 159.3G 110.1G 59% /mnt/lustre
1369 [client1] $ lfs df -i
1370 UUID Inodes IUsed IFree IUse% Mounted on
1371 mds-lustre-0_UUID 2211572 41924 2169648 1% /mnt/lustre[MDT:0]
1372 ost-lustre-0_UUID 737280 12183 725097 1% /mnt/lustre[OST:0]
1373 ost-lustre-1_UUID 737280 12232 725048 1% /mnt/lustre[OST:1]
1374 ost-lustre-2_UUID 737280 12214 725066 1% /mnt/lustre[OST:2]
filesystem summary: 2211572 41924 2169648 1% /mnt/lustre</screen>
1377 <section remap="h3">
1379 <primary>striping</primary>
1380 <secondary>allocations</secondary>
1381 </indexterm> Stripe Allocation Methods</title>
1382 <para>Two stripe allocation methods are provided:</para>
1385 <para><emphasis role="bold">Round-robin allocator</emphasis> - When the OSTs have
1386 approximately the same amount of free space, the round-robin allocator alternates
1387 stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is
1388 evenly distributed among OSTs, regardless of the stripe count. In a simple example with
1389 eight OSTs numbered 0-7, objects would be allocated like this:</para>
1391 <screen>File 1: OST1, OST2, OST3, OST4
1392 File 2: OST5, OST6, OST7
1393 File 3: OST0, OST1, OST2, OST3, OST4, OST5
1394 File 4: OST6, OST7, OST0</screen>
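<para>The wrap-around in the listing above can be reproduced with a short
shell sketch. It is only a simplified model of the round-robin ordering: it
assumes a single shared "next OST" counter and ignores OSS placement. The
function name <literal>allocate</literal> and the variable names are
hypothetical.</para>
<screen>ost_count=8
next=1                      # first OST to use, as in the listing above

# Print the OSTs used by a new file with $1 stripes and advance the counter.
allocate() {
    local osts="" i
    for i in $(seq 1 "$1"); do
        osts="${osts:+$osts, }OST$next"
        next=$(( (next + 1) % ost_count ))
    done
    echo "$osts"
}

file=1
for stripes in 4 3 6 3; do
    printf 'File %d: ' "$file"
    allocate "$stripes"
    file=$((file + 1))
done</screen>
<para>Each file starts where the previous one stopped, and the counter wraps
from OST7 back to OST0.</para>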
1396 <para>Here are several more sample round-robin stripe orders (each letter represents a
1397 different OST on a single OSS):</para>
1398 <informaltable frame="none">
1400 <colspec colname="c1" colwidth="50*"/>
1401 <colspec colname="c2" colwidth="50*"/>
1405 <para> 3: AAA</para>
1408 <para> One 3-OST OSS</para>
1413 <para> 3x3: ABABAB</para>
1416 <para> Two 3-OST OSSs</para>
1421 <para> 3x4: BBABABA</para>
1424 <para> One 3-OST OSS (A) and one 4-OST OSS (B)</para>
1429 <para> 3x5: BBABBABA</para>
1432 <para> One 3-OST OSS (A) and one 5-OST OSS (B)</para>
1437 <para> 3x3x3: ABCABCABC</para>
1440 <para> Three 3-OST OSSs</para>
1448 <para><emphasis role="bold">Weighted allocator</emphasis> - When the free space difference
1449 between the OSTs becomes significant, the weighting algorithm is used to influence OST
1450 ordering based on size (amount of free space available on each OST) and location
1451 (stripes evenly distributed across OSTs). The weighted allocator fills the emptier OSTs
1452 faster, but uses a weighted random algorithm, so the OST with the most free space is not
1453 necessarily chosen each time.</para>
1456 <para>The allocation method is determined by the amount of free-space imbalance on the OSTs.
1457 When free space is relatively balanced across OSTs, the faster round-robin allocator is
1458 used, which maximizes network balancing. The weighted allocator is used when any two OSTs
1459 are out of balance by more than the specified threshold (17% by default). The threshold
1460 between the two allocation methods is defined in the file
1461 <literal>/proc/fs/<replaceable>fsname</replaceable>/lov/<replaceable>fsname</replaceable>-mdtlov/qos_threshold_rr</literal>. </para>
<para>To set the <literal>qos_threshold_rr</literal> to <literal>25</literal>, enter this
1464 MGS:<screen>lctl set_param lov.<replaceable>fsname</replaceable>-mdtlov.qos_threshold_rr=25</screen></para>
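<para>To illustrate how the threshold is applied, the following shell sketch
compares sample per-OST free space values against the default threshold of
17%. The way the imbalance is measured here (the gap between the largest and
smallest free space, as a percentage of the largest) is an assumption made
for illustration; the calculation performed by the MDS may differ in
detail.</para>
<screen># Available space per OST in 1K-blocks (sample values, not from a live system).
avail_kb="37850660 37795620 39829356"
threshold=17    # default qos_threshold_rr, in percent

min=""
max=""
for kb in $avail_kb; do
    if [ -z "$min" ] || [ "$kb" -lt "$min" ]; then min=$kb; fi
    if [ -z "$max" ] || [ "$kb" -gt "$max" ]; then max=$kb; fi
done

# Assumed measure of imbalance: gap between the emptiest and fullest OST,
# as a percentage of the largest free-space value.
imbalance=$(( (max - min) * 100 / max ))

if [ "$imbalance" -lt "$threshold" ]; then
    allocator="round-robin"
else
    allocator="weighted"
fi
echo "imbalance ${imbalance}%: ${allocator} allocator"</screen>
<para>With these sample values the OSTs differ by about 5%, so the
round-robin allocator would be used under this model.</para>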
1466 <section remap="h3">
1468 <primary>space</primary>
1469 <secondary>location weighting</secondary>
1470 </indexterm>Adjusting the Weighting Between Free Space and Location</title>
1471 <para>The weighting priority used by the weighted allocator is set in the file
1472 <literal>/proc/fs/<replaceable>fsname</replaceable>/lov/<replaceable>fsname</replaceable>-mdtlov/qos_prio_free</literal>.
1473 Increasing the value of <literal>qos_prio_free</literal> puts more weighting on the amount
1474 of free space available on each OST and less on how stripes are distributed across OSTs. The
1475 default value is <literal>91</literal> (percent). When the free space priority is set to
1476 <literal>100</literal> (percent), weighting is based entirely on free space and location
1477 is no longer used by the striping algorithm. </para>
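<para>When weighting is based entirely on free space, the selection can be
pictured as a weighted random draw. The following shell sketch uses two
hypothetical OSTs and assumes that the chance of choosing an OST is
proportional to its share of the total free space; it is a simplified model,
not the allocator's actual code.</para>
<screen># Free space per OST in arbitrary units (hypothetical values).
ost1_free=100
ost2_free=200
total=$((ost1_free + ost2_free))

# Assumption: selection chance is proportional to the share of free space.
echo "OST1 chance: $((ost1_free * 100 / total))%"
echo "OST2 chance: $((ost2_free * 100 / total))%"

# Map a random number in the range 0 .. total-1 to an OST.
pick_ost() {
    if [ "$1" -lt "$ost1_free" ]; then
        echo OST1
    else
        echo OST2
    fi
}

pick_ost $((RANDOM % total))</screen>
<para>In this model the OST with twice the free space is chosen about twice
as often, but any single draw can still select the fuller OST.</para>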
1478 <para>To change the allocator weighting to <literal>100</literal>, enter this command on the
1480 <screen>lctl conf_param <replaceable>fsname</replaceable>-MDT0000.lov.qos_prio_free=100</screen>
1483 <para>When <literal>qos_prio_free</literal> is set to <literal>100</literal>, a weighted
1484 random algorithm is still used to assign stripes, so, for example, if OST2 has twice as
1485 much free space as OST1, OST2 is twice as likely to be used, but it is not guaranteed to
1490 <section xml:id="wide_striping">
1492 <primary>striping</primary>
1493 <secondary>wide striping</secondary>
1494 </indexterm><indexterm>
1495 <primary>wide striping</primary>
1496 </indexterm>Lustre Striping Internals</title>
1497 <para>Individual files can only be striped over a finite number of OSTs,
1498 based on the maximum size of the attributes that can be stored on the MDT.
1499 If the MDT is ldiskfs-based without the <literal>ea_inode</literal>
1500 feature, a file can be striped across at most 160 OSTs. With ZFS-based
1501 MDTs, or if the <literal>ea_inode</literal> feature is enabled for an
1502 ldiskfs-based MDT, a file can be striped across up to 2000 OSTs.
1504 <para>Lustre inodes use an extended attribute to record on which OST each
object is located, and the identifier of each object on that OST. The size of
1506 the extended attribute is a function of the number of stripes.</para>
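<para>The growth of the extended attribute with the stripe count can be
estimated with a short shell sketch. The sizes used here (a 32-byte header
plus 24 bytes per stripe for a plain, non-composite layout) are assumptions
made for illustration and should be checked against the Lustre version in
use; the function name <literal>layout_bytes</literal> is hypothetical.</para>
<screen># Assumed sizes, in bytes, of a plain (non-PFL) layout attribute:
# a fixed header plus one fixed-size entry per stripe.
header_bytes=32
per_stripe_bytes=24

# Print the estimated attribute size for a file with $1 stripes.
layout_bytes() {
    echo $((header_bytes + per_stripe_bytes * $1))
}

for stripes in 1 4 160 2000; do
    echo "$stripes stripes: $(layout_bytes "$stripes") bytes"
done</screen>
<para>Under these assumptions a 160-stripe layout needs just under 4KB,
while a 2000-stripe layout needs roughly 47KB, which is consistent with the
need for a larger attribute store at high stripe counts.</para>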
1507 <para>If using an ldiskfs-based MDT, the maximum number of OSTs over which
files can be striped can be raised to 2000 by enabling the
1509 <literal>ea_inode</literal> feature on the MDT:
1510 <screen>tune2fs -O ea_inode /dev/<replaceable>mdtdev</replaceable></screen>
1512 <note><para>The maximum stripe count for a single file does not limit the
1513 maximum number of OSTs that are in the filesystem as a whole, only the
1514 maximum possible size and maximum aggregate bandwidth for the file.