1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="managingstripingfreespace">
2 <title xml:id="managingstripingfreespace.title">Managing File Layout (Striping) and Free
4 <para>This chapter describes file layout (striping) and I/O options, and includes the following
8 <para><xref linkend="dbdoclet.50438209_79324"/></para>
11 <para><xref linkend="dbdoclet.50438209_48033"/></para>
14 <para><xref linkend="dbdoclet.50438209_78664"/></para>
17 <para><xref linkend="dbdoclet.50438209_44776"/></para>
20 <para><xref linkend="dbdoclet.50438209_10424"/></para>
23 <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping"/></para>
26 <section xml:id="dbdoclet.50438209_79324">
29 <primary>space</primary>
32 <primary>striping</primary>
33 <secondary>how it works</secondary>
36 <primary>striping</primary>
40 <primary>space</primary>
41 <secondary>striping</secondary>
42 </indexterm>How Lustre File System Striping Works</title>
43 <para>In a Lustre file system, the MDS allocates objects to OSTs using either a round-robin
44 algorithm or a weighted algorithm. When the amount of free space is well balanced (i.e., by
45 default, when the free space across OSTs differs by less than 17%), the round-robin algorithm
46 is used to select the next OST to which a stripe is to be written. Periodically, the MDS
adjusts the striping layout to eliminate some degenerate cases in which applications that
48 create very regular file layouts (striping patterns) preferentially use a particular OST in
50 <para> Normally the usage of OSTs is well balanced. However, if users create a small number of
51 exceptionally large files or incorrectly specify striping parameters, imbalanced OST usage may
52 result. When the free space across OSTs differs by more than a specific amount (17% by
53 default), the MDS then uses weighted random allocations with a preference for allocating
54 objects on OSTs with more free space. (This can reduce I/O performance until space usage is
55 rebalanced again.) For a more detailed description of how striping is allocated, see <xref
56 linkend="dbdoclet.50438209_10424"/>.</para>
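<para>The switch between the two allocators can be pictured with a small sketch. This is
not the MDS implementation: the imbalance metric (largest minus smallest free space, divided by
the largest) and the function names <literal>choose_allocator</literal> and
<literal>weighted_pick</literal> are assumptions made for illustration, and the real weighted
allocator considers more than raw free space.</para>

```python
import random


def choose_allocator(free_space_by_ost, threshold=0.17):
    """Return "round-robin" or "weighted" for a list of per-OST free space.

    Assumption: imbalance is (max - min) / max. The chapter only says that
    free space "differs by" 17%, so this metric is illustrative.
    """
    largest = max(free_space_by_ost)
    smallest = min(free_space_by_ost)
    if largest == 0:
        return "round-robin"  # every OST is full; nothing to weigh
    imbalance = (largest - smallest) / largest
    return "weighted" if imbalance > threshold else "round-robin"


def weighted_pick(free_space_by_ost, rng):
    """Pick one OST index at random, with probability proportional to free space."""
    indices = range(len(free_space_by_ost))
    return rng.choices(indices, weights=free_space_by_ost, k=1)[0]


print(choose_allocator([900, 880, 910, 905]))  # round-robin
print(choose_allocator([900, 400, 910, 905]))  # weighted
print(weighted_pick([900, 400, 910, 905], random.Random()))
```

<para>In the second call one OST has far less free space than the others, so the sketch
switches to weighted selection, which makes that OST less likely to be chosen.</para>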
57 <para>Files can only be striped over a finite number of OSTs, based on the
58 maximum size of the attributes that can be stored on the MDT. If the MDT
59 is ldiskfs-based without the <literal>ea_inode</literal> feature, a file
60 can be striped across at most 160 OSTs. With a ZFS-based MDT, or if the
61 <literal>ea_inode</literal> feature is enabled for an ldiskfs-based MDT,
62 a file can be striped across up to 2000 OSTs. For more information, see
63 <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping"/>.
66 <section xml:id="dbdoclet.50438209_48033">
68 <primary>file layout</primary>
69 <secondary>See striping</secondary>
70 </indexterm><indexterm>
71 <primary>striping</primary>
72 <secondary>considerations</secondary>
75 <primary>space</primary>
76 <secondary>considerations</secondary>
77 </indexterm> Lustre File Layout (Striping) Considerations</title>
<para>Whether you should set up file striping, and which parameter values you select, depend on
your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and
81 <para>Some reasons for using striping include:</para>
84 <para><emphasis role="bold">Providing high-bandwidth access.</emphasis> Many applications
85 require high-bandwidth access to a single file, which may be more bandwidth than can be
86 provided by a single OSS. Examples are a scientific application that writes to a single
87 file from hundreds of nodes, or a binary executable that is loaded by many nodes when an
88 application starts.</para>
<para>In cases like these, a file can be striped over as many OSSs as it takes to achieve
the required peak aggregate bandwidth for that file. Striping across a larger number of
OSSs should only be used when the file is very large or is accessed by many nodes
at a time. Currently, Lustre files can be striped across up to 2000 OSTs.</para>
95 <para><emphasis role="bold">Improving performance when OSS bandwidth is exceeded.</emphasis>
96 Striping across many OSSs can improve performance if the aggregate client bandwidth
97 exceeds the server bandwidth and the application reads and writes data fast enough to take
98 advantage of the additional OSS bandwidth. The largest useful stripe count is bounded by
99 the I/O rate of the clients/jobs divided by the performance per OSS.</para>
<para condition="l2D"><emphasis role="bold">Matching stripes to I/O
pattern.</emphasis> When writing to a single file from multiple nodes,
having more than one client writing to a stripe can lead to issues
with lock exchange, where clients contend over writing to that stripe,
even if their I/Os do not overlap. This can be avoided if I/O can be
stripe aligned so that each stripe is accessed by only one client.
Since Lustre 2.13, the 'overstriping' feature is available, allowing more
than one stripe per OST. This is particularly helpful when the
thread count exceeds the OST count, because the stripe count can still be
matched to the thread count.</para>
114 <para><emphasis role="bold">Providing space for very large files.</emphasis> Striping is
115 useful when a single OST does not have enough free space to hold the entire file.</para>
118 <para>Some reasons to minimize or avoid striping:</para>
121 <para><emphasis role="bold">Increased overhead.</emphasis> Striping results in more locks
122 and extra network operations during common operations such as <literal>stat</literal> and
123 <literal>unlink</literal>. Even when these operations are performed in parallel, one
124 network operation takes less time than 100 operations.</para>
125 <para>Increased overhead also results from server contention. Consider a cluster with 100
126 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load
127 is distributed evenly, there is no contention and the disks on each server can manage
128 sequential I/O. If each file has 100 objects, then the clients all compete with one
129 another for the attention of the servers, and the disks on each node seek in 100 different
130 directions resulting in needless contention.</para>
<para><emphasis role="bold">Increased risk.</emphasis> When files are striped across all
servers and one of the servers breaks down, a small part of each striped file is lost. By
comparison, if each file has exactly one stripe, fewer files are lost, but they are lost
in their entirety. Many users would prefer to lose some of their files entirely rather than all
of their files partially.</para>
141 <title><indexterm><primary>striping</primary><secondary>size</secondary></indexterm>
142 Choosing a Stripe Size</title>
143 <para>Choosing a stripe size is a balancing act, but reasonable defaults are described below.
144 The stripe size has no effect on a single-stripe file.</para>
147 <para><emphasis role="bold">The stripe size must be a multiple of the page
148 size.</emphasis> Lustre software tools enforce a multiple of 64 KB (the maximum page
149 size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not
150 accidentally create files that might cause problems for ia64 clients.</para>
153 <para><emphasis role="bold">The smallest recommended stripe size is 512 KB.</emphasis>
154 Although you can create files with a stripe size of 64 KB, the smallest practical stripe
155 size is 512 KB because the Lustre file system sends 1MB chunks over the network.
156 Choosing a smaller stripe size may result in inefficient I/O to the disks and reduced
160 <para><emphasis role="bold">A good stripe size for sequential I/O using high-speed
161 networks is between 1 MB and 4 MB.</emphasis> In most situations, stripe sizes larger
162 than 4 MB may result in longer lock hold times and contention during shared file
166 <para><emphasis role="bold">The maximum stripe size is 4 GB.</emphasis> Using a large
167 stripe size can improve performance when accessing very large files. It allows each
168 client to have exclusive access to its own part of a file. However, a large stripe size
169 can be counterproductive in cases where it does not match your I/O pattern.</para>
172 <para><emphasis role="bold">Choose a stripe pattern that takes into account the write
173 patterns of your application.</emphasis> Writes that cross an object boundary are
174 slightly less efficient than writes that go entirely to one server. If the file is
175 written in a consistent and aligned way, make the stripe size a multiple of the
176 <literal>write()</literal> size.</para>
181 <section xml:id="dbdoclet.50438209_78664">
183 <primary>striping</primary>
184 <secondary>configuration</secondary>
185 </indexterm>Setting the File Layout/Striping Configuration (<literal>lfs
186 setstripe</literal>)</title>
187 <para>Use the <literal>lfs setstripe</literal> command to create new files with a specific file layout (stripe pattern) configuration.</para>
<screen>lfs setstripe [--stripe-size|-S stripe_size] [--stripe-count|-c stripe_count] [--overstripe-count|-C stripe_count] \
189 [--index|-i start_ost] [--pool|-p pool_name] <replaceable>filename|dirname</replaceable> </screen>
190 <para><emphasis role="bold">
191 <literal>stripe_size</literal>
194 <para>The <literal>stripe_size</literal> indicates how much data to write to one OST before
195 moving to the next OST. The default <literal>stripe_size</literal> is 1 MB. Passing a
196 <literal>stripe_size</literal> of 0 causes the default stripe size to be used. Otherwise,
197 the <literal>stripe_size</literal> value must be a multiple of 64 KB.</para>
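<para>To make the role of <literal>stripe_size</literal> concrete, the following sketch maps a
file offset to the stripe that holds it, assuming the plain RAID-0 placement described in this
chapter. The function name <literal>locate</literal> is hypothetical and is not part of any
Lustre tool.</para>

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (stripe index, offset inside that stripe's object).

    Assumes a plain RAID-0 layout: stripe_size bytes go to one stripe, then
    the next stripe_size bytes go to the next stripe, and so on in rotation.
    """
    chunk = offset // stripe_size          # which stripe_size chunk of the file
    stripe_index = chunk % stripe_count    # chunks rotate over the stripes
    full_rounds = chunk // stripe_count    # completed passes over all stripes
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return stripe_index, object_offset


MiB = 1024 * 1024
print(locate(0, MiB, 4))             # (0, 0)
print(locate(5 * MiB + 10, MiB, 4))  # (1, 1048586)
```

<para>With a 1 MB stripe size and four stripes, the sixth megabyte of the file lands on the
second stripe, just past the first megabyte already stored in that object.</para>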
198 <para><emphasis role="bold">
199 <literal>stripe_count (--stripe-count, --overstripe-count)</literal>
<para>The <literal>stripe_count</literal> indicates how many stripes to use.
The default <literal>stripe_count</literal> value is 1. Setting
<literal>stripe_count</literal> to 0 causes the default stripe count to be
used. Setting <literal>stripe_count</literal> to -1 means stripe over all
available OSTs (full OSTs are skipped). When <literal>--overstripe-count</literal> is used,
more than one stripe is placed per OST if necessary.</para>
208 <para><emphasis role="bold">
209 <literal>start_ost</literal>
212 <para>The start OST is the first OST to which files are written. The default value for
213 <literal>start_ost</literal> is -1, which allows the MDS to choose the starting index. This
214 setting is strongly recommended, as it allows space and load balancing to be done by the MDS
215 as needed. If the value of <literal>start_ost</literal> is set to a value other than -1, the
216 file starts on the specified OST index. OST index numbering starts at 0.</para>
218 <para>If the specified OST is inactive or in a degraded mode, the MDS will silently choose
219 another target.</para>
222 <para>If you pass a <literal>start_ost</literal> value of 0 and a
223 <literal>stripe_count</literal> value of <emphasis>1</emphasis>, all files are written to
224 OST 0, until space is exhausted. <emphasis role="italic">This is probably not what you meant
225 to do.</emphasis> If you only want to adjust the stripe count and keep the other
226 parameters at their default settings, do not specify any of the other parameters:</para>
227 <para><screen>client# lfs setstripe -c <replaceable>stripe_count</replaceable> <replaceable>filename</replaceable></screen></para>
229 <para><emphasis role="bold">
230 <literal>pool_name</literal>
233 <para>The <literal>pool_name</literal> specifies the OST pool to which the file will be written.
234 This allows limiting the OSTs used to a subset of all OSTs in the file system. For more
235 details about using OST pools, see <link xl:href="ManagingFileSystemIO.html#50438211_75549"
236 >Creating and Managing OST Pools</link>.</para>
238 <title>Specifying a File Layout (Striping Pattern) for a Single File</title>
<para>It is possible to specify the file layout when a new file is created using the command <literal>lfs setstripe</literal>. This allows users to override the file system default parameters and tune the file layout for their application. Execution of an <literal>lfs setstripe</literal> command fails if the file already exists.</para>
240 <section xml:id="dbdoclet.50438209_60155">
241 <title>Setting the Stripe Size</title>
242 <para>The command to create a new file with a specified stripe size is similar to:</para>
<screen>[client]# lfs setstripe -S 4M /mnt/lustre/new_file</screen>
244 <para>This example command creates the new file <literal>/mnt/lustre/new_file</literal> with a stripe size of 4 MB.</para>
<para>Running <literal>lfs getstripe</literal> on the new file shows that it was created on a single OST with a stripe size of 4 MB:</para>
246 <screen> [client]# lfs getstripe /mnt/lustre/new_file
249 lmm_stripe_size: 4194304
253 obdidx objid objid group
254 1 690550 0xa8976 0 </screen>
255 <para>In this example, the stripe size is 4 MB.</para>
258 <title><indexterm><primary>striping</primary><secondary>count</secondary></indexterm>
259 Setting the Stripe Count</title>
260 <para>The command below creates a new file with a stripe count of <literal>-1</literal> to
261 specify striping over all available OSTs:</para>
262 <screen>[client]# lfs setstripe -c -1 /mnt/lustre/full_stripe</screen>
263 <para>The example below indicates that the file <literal>full_stripe</literal> is striped
264 over all six active OSTs in the configuration:</para>
265 <screen>[client]# lfs getstripe /mnt/lustre/full_stripe
266 /mnt/lustre/full_stripe
267 obdidx objid objid group
274 <para> This is in contrast to the output in <xref linkend="dbdoclet.50438209_60155"/>, which
275 shows only a single object for the file.</para>
280 <primary>striping</primary>
281 <secondary>per directory</secondary>
282 </indexterm>Setting the Striping Layout for a Directory</title>
283 <para>In a directory, the <literal>lfs setstripe</literal> command sets a default striping
284 configuration for files created in the directory. The usage is the same as <literal>lfs
285 setstripe</literal> for a regular file, except that the directory must exist prior to
286 setting the default striping configuration. If a file is created in a directory with a
287 default stripe configuration (without otherwise specifying striping), the Lustre file system
288 uses those striping parameters instead of the file system default for the new file.</para>
<para>To change the striping pattern for a sub-directory, create a directory with the desired file
layout as described above. Sub-directories inherit the file layout of the root/parent
295 <primary>striping</primary>
296 <secondary>per file system</secondary>
297 </indexterm>Setting the Striping Layout for a File System</title>
298 <para>Setting the striping specification on the <literal>root</literal> directory determines
299 the striping for all new files created in the file system unless an overriding striping
300 specification takes precedence (such as a striping layout specified by the application, or
301 set using <literal>lfs setstripe</literal>, or specified for the parent directory).</para>
303 <para>The striping settings for a <literal>root</literal> directory are, by default, applied
304 to any new child directories created in the root directory, unless striping settings have
305 been specified for the child directory.</para>
310 <primary>striping</primary>
311 <secondary>on specific OST</secondary>
312 </indexterm>Creating a File on a Specific OST</title>
313 <para>You can use <literal>lfs setstripe</literal> to create a file on a specific OST. In the
314 following example, the file <literal>file1</literal> is created on the first OST (OST index
316 <screen>$ lfs setstripe --stripe-count 1 --index 0 file1
317 $ dd if=/dev/zero of=file1 count=1 bs=100M
321 $ lfs getstripe file1
324 lmm_stripe_size: 1048576
328 obdidx objid objid group
329 0 37364 0x91f4 0</screen>
332 <section xml:id="dbdoclet.50438209_44776">
333 <title><indexterm><primary>striping</primary><secondary>getting information</secondary></indexterm>Retrieving File Layout/Striping Information (<literal>getstripe</literal>)</title>
<para>The <literal>lfs getstripe</literal> command is used to display information that shows
over which OSTs a file is distributed. For each OST, the index and UUID are displayed, along
with the OST index and object ID for each stripe in the file. For directories, the default
settings for files created in that directory are displayed.</para>
339 <title>Displaying the Current Stripe Size</title>
340 <para>To see the current stripe size for a Lustre file or directory, use the <literal>lfs
341 getstripe</literal> command. For example, to view information for a directory, enter a
342 command similar to:</para>
343 <screen>[client]# lfs getstripe /mnt/lustre </screen>
344 <para>This command produces output similar to:</para>
346 (Default) stripe_count: 1 stripe_size: 1M stripe_offset: -1</screen>
347 <para>In this example, the default stripe count is <literal>1</literal> (data blocks are
348 striped over a single OST), the default stripe size is 1 MB, and the objects are created
349 over all available OSTs.</para>
350 <para>To view information for a file, enter a command similar to:</para>
351 <screen>$ lfs getstripe /mnt/lustre/foo
354 lmm_stripe_size: 1048576
358 obdidx objid objid group
2 835487 0xcbf9f 0 </screen>
360 <para>In this example, the file is located on <literal>obdidx 2</literal>, which corresponds
361 to the OST <literal>lustre-OST0002</literal>. To see which node is serving that OST, run:
362 <screen>$ lctl get_param osc.lustre-OST0002-osc.ost_conn_uuid
363 osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp</screen></para>
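<para>The step from an <literal>obdidx</literal> value to the parameter name used above can be
sketched as follows. The helper names are hypothetical, and the sketch assumes that the OST
index is written as four hexadecimal digits and that the parameter name follows the same
pattern as the example in this section.</para>

```python
def ost_name(fsname, obdidx):
    """Build the target name for an OST index, e.g. ("lustre", 2) -> "lustre-OST0002".

    Assumption: the index is written as four hexadecimal digits.
    """
    return f"{fsname}-OST{obdidx:04x}"


def conn_uuid_param(fsname, obdidx):
    """Name of the parameter queried with `lctl get_param` in the example above."""
    return f"osc.{ost_name(fsname, obdidx)}-osc.ost_conn_uuid"


print(ost_name("lustre", 2))         # lustre-OST0002
print(conn_uuid_param("lustre", 2))  # osc.lustre-OST0002-osc.ost_conn_uuid
```

<para>The resulting string is what would be passed to <literal>lctl get_param</literal>;
the sketch only builds the name and does not query a running file system.</para>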
366 <title>Inspecting the File Tree</title>
367 <para>To inspect an entire tree of files, use the <literal>lfs find</literal> command:</para>
368 <screen>lfs find [--recursive | -r] <replaceable>file|directory</replaceable> ...</screen>
372 <primary>striping</primary>
373 <secondary>remote directories</secondary>
</indexterm>Locating the MDT for a Remote Directory</title>
<para>Lustre can be configured with multiple MDTs in the same file
system. Each directory and file could be located on a different MDT.
To identify the MDT on which a given subdirectory is located, pass the
<literal>getstripe [--mdt-index|-M]</literal> parameter to
<literal>lfs</literal>. An example of this command is provided in
the section <xref linkend="lustremaint.rmremotedir"/>.</para>
383 <section xml:id="pfl" condition='l2A'>
385 <primary>striping</primary>
386 <secondary>PFL</secondary>
</indexterm>Progressive File Layout (PFL)</title>
388 <para>The Lustre Progressive File Layout (PFL) feature simplifies the use
389 of Lustre so that users can expect reasonable performance for a variety of
390 normal file IO patterns without the need to explicitly understand their IO
391 model or Lustre usage details in advance. In particular, users do not
392 necessarily need to know the size or concurrency of output files in
393 advance of their creation and explicitly specify an optimal layout for
394 each file in order to achieve good performance for both highly concurrent
shared-single-large-file IO and parallel IO to many smaller per-process
<para>The layout of a PFL file is stored on disk as a <literal>composite
layout</literal>. A PFL file is essentially an array of
<literal>sub-layout components</literal>, with each sub-layout component
being a plain layout covering a different, non-overlapping extent of
the file. Because the file layout is composed of a series of
components, it is possible that some file extents are
not described by any component.</para>
404 <para>An example of how data blocks of PFL files are mapped to OST objects
405 of components is shown in the following PFL object mapping diagram:</para>
406 <figure xml:id="managinglayout.fig.pfl">
407 <title>PFL object mapping diagram</title>
410 <imagedata scalefit="1" width="100%"
411 fileref="figures/PFL_object_mapping_diagram.png" />
414 <phrase>PFL object mapping diagram</phrase>
<para>The PFL file in <xref linkend="managinglayout.fig.pfl"/> has 3
components and shows the mapping for the blocks of a 2055MB file.
The stripe size for the first two components is 1MB, while the stripe size
for the third component is 4MB. The stripe count increases for each
successive component. The first component only has two 1MB blocks and the
single object has a size of 2MB. The second component holds the next 254MB
of the file spread over 4 separate OST objects in RAID-0; each one will
have a size of 256MB / 4 objects = 64MB per object. Note that the first two
objects <literal>obj 2,0</literal> and <literal>obj 2,1</literal>
have a 1MB hole at the start where the data is stored in the first
component. The final component holds the remaining 1799MB spread over 32 OST
objects. There is a 256MB / 32 = 8MB hole at the start of each one for the
data stored in the first two components. Each object will be
2048MB / 32 objects = 64MB per object, except
<literal>obj 3,0</literal>, which holds an extra 4MB chunk, and
<literal>obj 3,1</literal>, which holds an extra 3MB chunk. If more data
was written to the file, only the objects in component 3 would increase
<para>When a file range covered by a component that is defined but not yet instantiated is
accessed, the client sends a Layout Intent RPC to the MDT, and the MDT
instantiates the objects of the components covering that range.
<para>The following sections introduce the commands used to operate on PFL files
and illustrate some possible composite layouts.
Lustre provides the
<literal>lfs setstripe</literal> and <literal>lfs migrate</literal> commands for
this purpose. <literal>lfs setstripe</literal> is
used to create PFL files and to add components to or delete components from an
existing composite file; <literal>lfs migrate</literal> is used
to re-layout the data in existing files using new layout parameters by
copying the data from the existing OST(s) to the new OST(s). Also,
as introduced in the previous sections, <literal>lfs getstripe</literal>
can be used to list the striping/component information for a
given PFL file, and <literal>lfs find</literal> can be used to
search the directory tree rooted at the given directory or file name for
the files that match the given PFL component parameters.</para>
<note><para>Using PFL files requires both the client and server to
understand the PFL file layout, which is not available in Lustre 2.9 and
earlier. However, this does not prevent older clients from accessing non-PFL
files in the filesystem.</para></note>
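<para>The object sizes and holes quoted for the PFL object mapping diagram can be reproduced
with plain RAID-0 arithmetic. The sketch below is only a check of those numbers: the function
name <literal>object_sizes</literal> and the tuple layout are hypothetical, all values are in
MB, and each object's size is counted from the start of the object, including any hole.</para>

```python
def object_sizes(end, stripe_size, stripe_count):
    """Apparent size of each object when offsets [0, end) are striped RAID-0.

    All values share one unit (MB here). The size includes any hole at the
    start of an object, which is how the figure counts it.
    """
    full_chunks, tail = divmod(end, stripe_size)
    rounds, extra = divmod(full_chunks, stripe_count)
    sizes = [rounds * stripe_size] * stripe_count
    for i in range(extra):      # objects holding one more full chunk
        sizes[i] += stripe_size
    if tail:                    # partial chunk at the end of the file
        sizes[extra] += tail
    return sizes


FILE_SIZE = 2055  # MB, as in the figure
# (start, end, stripe_size, stripe_count) per component; None means EOF
components = [(0, 2, 1, 1), (2, 256, 1, 4), (256, None, 4, 32)]

for start, end, stripe_size, stripe_count in components:
    data_end = FILE_SIZE if end is None else min(end, FILE_SIZE)
    holes = object_sizes(start, stripe_size, stripe_count)
    sizes = object_sizes(data_end, stripe_size, stripe_count)
    print(f"holes={holes[:2]} sizes={sizes[:2]}")
```

<para>For the third component the first two objects come out at 68MB and 67MB, that is, 64MB
plus the extra 4MB and 3MB chunks, and every object in that component starts with an 8MB
hole.</para>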
459 <title><literal>lfs setstripe</literal></title>
460 <para><literal>lfs setstripe</literal> commands are used to create PFL
461 files, add or delete components to or from an existing composite file.
462 (Suppose we have 8 OSTs in the following examples and stripe size is 1MB
465 <title>Create a PFL file</title>
466 <para><emphasis role="bold">Command</emphasis></para>
467 <screen>lfs setstripe
468 [--component-end|-E end1] [STRIPE_OPTIONS]
469 [--component-end|-E end2] [STRIPE_OPTIONS] ... <replaceable>filename</replaceable></screen>
<para>The <literal>-E</literal> option is used to specify the end offset
(in bytes or using a suffix “kMGTP”, e.g. 256M) of each component, and
it also indicates that the following <literal>STRIPE_OPTIONS</literal> are
for this component. Each component defines the stripe pattern of the
file in the range of [start, end). The first component must start from
offset 0 and all components must be adjacent to each other. No holes
are allowed, so each extent starts at the end of the previous extent.
A <literal>-1</literal> end offset or <literal>eof</literal> indicates
that this is the last component, extending to the end of file.</para>
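<para>The extent rules above can be sketched as a small helper that turns a list of
<literal>-E</literal> arguments into [start, end) pairs. The names
<literal>parse_end</literal> and <literal>extents</literal> are hypothetical, the suffixes are
assumed to be powers of 1024 (consistent with the <literal>lfs getstripe</literal> output in
this chapter, where 4M is reported as 4194304), and the checks are illustrative and not the
validation that <literal>lfs</literal> itself performs.</para>

```python
UNITS = {"k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}
EOF = -1


def parse_end(text):
    """Convert an -E argument such as "256M", "4194304", "-1" or "eof" to bytes."""
    if text in ("-1", "eof"):
        return EOF
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def extents(ends):
    """Turn a list of -E arguments into adjacent [start, end) extents."""
    result = []
    start = 0  # the first component must start at offset 0
    for position, text in enumerate(ends):
        end = parse_end(text)
        if end == EOF:
            if position != len(ends) - 1:
                raise ValueError("an EOF component must be the last one")
        elif start >= end:
            raise ValueError(f"end offset {text} does not extend past {start}")
        result.append((start, end))
        start = end  # no holes: the next extent begins where this one ends
    return result


print(extents(["4M", "64M", "-1"]))
# [(0, 4194304), (4194304, 67108864), (67108864, -1)]
```

<para>Because each start offset is derived from the previous end offset, a layout built this
way cannot contain a hole; only the end offsets need to be supplied.</para>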
479 <para><emphasis role="bold">Example</emphasis></para>
480 <screen>$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \
481 /mnt/testfs/create_comp</screen>
<para>This command creates a file with the composite layout illustrated in
the following figure. The first component has 1 stripe and covers
[0, 4M), the second component has 4 stripes and covers [4M, 64M), and
the last component starts at OST4, stripes across all available
OSTs and covers [64M, EOF).</para>
487 <figure xml:id="managinglayout.fig.pfl_create">
488 <title>Example: create a composite file</title>
491 <imagedata scalefit="1" depth="2.75in" align="center"
492 fileref="figures/PFL_createfile.png" />
495 <phrase>Example: create a composite file</phrase>
499 <para>The composite layout can be output by the following command:</para>
500 <screen>$ lfs getstripe /mnt/testfs/create_comp
501 /mnt/testfs/create_comp
506 lcme_extent.e_start: 0
507 lcme_extent.e_end: 4194304
509 lmm_stripe_size: 1048576
514 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
518 lcme_extent.e_start: 4194304
519 lcme_extent.e_end: 67108864
521 lmm_stripe_size: 1048576
524 lmm_stripe_offset: -1
527 lcme_extent.e_start: 67108864
528 lcme_extent.e_end: EOF
530 lmm_stripe_size: 1048576
533 lmm_stripe_offset: 4</screen>
<note><para>Only the first component’s OST objects of the PFL file are
instantiated when the layout is being set. Instantiation of the other components is
delayed until later write/truncate operations.</para></note>
<para>If we write 128 MB of data to this PFL file, the second and third
components will be instantiated:</para>
539 <screen>$ dd if=/dev/zero of=/mnt/testfs/create_comp bs=1M count=128
540 $ lfs getstripe /mnt/testfs/create_comp
541 /mnt/testfs/create_comp
546 lcme_extent.e_start: 0
547 lcme_extent.e_end: 4194304
549 lmm_stripe_size: 1048576
554 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
558 lcme_extent.e_start: 4194304
559 lcme_extent.e_end: 67108864
561 lmm_stripe_size: 1048576
566 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
567 - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
568 - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
569 - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
573 lcme_extent.e_start: 67108864
574 lcme_extent.e_end: EOF
576 lmm_stripe_size: 1048576
581 - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x3:0x0] }
582 - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
583 - 2: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
584 - 3: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
585 - 4: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
586 - 5: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] }
587 - 6: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }
588 - 7: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] }</screen>
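<para>Which components a write reaches is a matter of extent overlap. The sketch below shows
only that arithmetic for the layout used in this example; the function name
<literal>components_touched</literal> is hypothetical, and the actual instantiation is
performed by the MDT when the client accesses the range.</para>

```python
def components_touched(extents, offset, length):
    """Indices of components whose [start, end) extent overlaps the written range.

    `extents` is a list of (start, end) byte pairs; end == -1 means EOF.
    """
    write_end = offset + length
    touched = []
    for index, (start, end) in enumerate(extents):
        reaches_start = write_end > start
        before_end = end == -1 or end > offset
        if reaches_start and before_end:
            touched.append(index)
    return touched


MiB = 1024 * 1024
layout = [(0, 4 * MiB), (4 * MiB, 64 * MiB), (64 * MiB, -1)]
print(components_touched(layout, 0, 128 * MiB))  # [0, 1, 2]
print(components_touched(layout, 0, 1 * MiB))    # [0]
```

<para>A 128 MB write starting at offset 0 overlaps all three extents, which matches the
<literal>lfs getstripe</literal> output above, whereas a 1 MB write stays within the first
component.</para>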
591 <title>Add component(s) to an existing composite file</title>
592 <para><emphasis role="bold">Command</emphasis></para>
593 <screen>lfs setstripe --component-add
594 [--component-end|-E end1] [STRIPE_OPTIONS]
595 [--component-end|-E end2] [STRIPE_OPTIONS] ... <replaceable>filename</replaceable></screen>
<para>The option <literal>--component-add</literal> is used to add
components to an existing composite file. The extent start of
the first component to be added is equal to the extent end of the last
component in the existing file, and all components to be added must
be adjacent to each other.</para>
601 <note><para>If the last existing component is specified by
602 <literal>-E -1</literal> or <literal>-E eof</literal>, which covers
603 to the end of the file, it must be deleted before a new one is added.
605 <para><emphasis role="bold">Example</emphasis></para>
606 <screen>$ lfs setstripe -E 4M -c 1 -E 64M -c 4 /mnt/testfs/add_comp
607 $ lfs setstripe --component-add -E -1 -c 4 -o 6-7,0,5 \
608 /mnt/testfs/add_comp</screen>
<para>This command adds a new component that extends from the end of
the last existing component to the end of file. The layout of this
example is illustrated in
<xref linkend="managinglayout.fig.pfl_addcomp"/>. The last component
stripes across 4 OSTs in the sequence OST6, OST7, OST0 and OST5, and covers
615 <figure xml:id="managinglayout.fig.pfl_addcomp">
616 <title>Example: add a component to an existing composite file</title>
619 <imagedata scalefit="1" depth="2.75in" align="center"
620 fileref="figures/PFL_addcomp.png" />
623 <phrase>Example: add a component to an existing composite file
628 <para>The layout can be printed out by the following command:</para>
629 <screen>$ lfs getstripe /mnt/testfs/add_comp
635 lcme_extent.e_start: 0
636 lcme_extent.e_end: 4194304
638 lmm_stripe_size: 1048576
643 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
647 lcme_extent.e_start: 4194304
648 lcme_extent.e_end: 67108864
650 lmm_stripe_size: 1048576
655 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
656 - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
657 - 2: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
658 - 3: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
662 lcme_extent.e_start: 67108864
663 lcme_extent.e_end: EOF
665 lmm_stripe_size: 1048576
668 lmm_stripe_offset: -1</screen>
669 <para>The component ID "lcme_id" changes as layout generation
670 changes. It is not necessarily sequential and does not imply ordering
671 of individual components.</para>
<note><para>Similar to specifying a full-file composite layout at file
creation time, <literal>--component-add</literal> does not instantiate
OST objects; the instantiation is delayed to later write/truncate
operations. For example, after writing beyond the 64MB start of the
file's last component, the new component has had objects allocated:
678 <screen>$ lfs getstripe -I5 /mnt/testfs/add_comp
684 lcme_extent.e_start: 67108864
685 lcme_extent.e_end: EOF
687 lmm_stripe_size: 1048576
692 - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x4:0x0] }
693 - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x4:0x0] }
694 - 2: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
695 - 3: { l_ost_idx: 5, l_fid: [0x100050000:0x4:0x0] }</screen>
698 <title>Delete component(s) from an existing file</title>
699 <para><emphasis role="bold">Command</emphasis></para>
700 <screen>lfs setstripe --component-del
701 [--component-id|-I comp_id | --component-flags comp_flags]
702 <replaceable>filename</replaceable></screen>
<para>The option <literal>--component-del</literal> is used to remove
the component(s) specified by component ID or flags from an existing
file. Any data stored in the deleted
component will be lost.</para>
<para>The ID specified by the <literal>-I</literal> option is the unique numerical
ID of the component, which can be obtained with the
<literal>lfs getstripe -I</literal> command. The flag specified by the
<literal>--component-flags</literal> option selects a certain type of
component, which can be listed with
<literal>lfs getstripe --component-flags</literal>. For now, there are only
two flags, <literal>init</literal> and <literal>^init</literal>,
for instantiated and un-instantiated components respectively.</para>
<note><para>Deletion must start with the last component because holes are
not allowed.</para></note>
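<para>The last-component-first rule can be modelled with a toy function. This is a sketch of
the rule only, not of how <literal>lfs setstripe --component-del</literal> is implemented; the
function name <literal>delete_component</literal> and the use of a plain list of component IDs
in extent order are assumptions made for illustration.</para>

```python
def delete_component(component_ids, comp_id):
    """Return the ID list after deleting `comp_id`, mimicking the last-first rule.

    `component_ids` lists the component IDs in extent order.
    """
    if comp_id not in component_ids:
        raise KeyError(f"no component with ID {comp_id}")
    if component_ids[-1] != comp_id:
        # Removing a middle component would leave a hole in the layout.
        raise ValueError("only the last component can be deleted")
    return component_ids[:-1]


print(delete_component([1, 2, 5], 5))  # [1, 2]
```

<para>Calling <literal>delete_component([1, 2, 5], 2)</literal> raises
<literal>ValueError</literal> in this model, because removing the middle component would leave
a hole between the first and last extents.</para>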
717 <para><emphasis role="bold">Example</emphasis></para>
718 <screen>$ lfs getstripe -I /mnt/testfs/del_comp
722 $ lfs setstripe --component-del -I 5 /mnt/testfs/del_comp</screen>
<para>This example deletes the component with ID 5 from file
<literal>/mnt/testfs/del_comp</literal>. Continuing with the previous
example, the final result is illustrated in
<xref linkend="managinglayout.fig.pfl_delcomp"/>.</para>
727 <figure xml:id="managinglayout.fig.pfl_delcomp">
728 <title>Example: delete a component from an existing file</title>
731 <imagedata scalefit="1" depth="2.75in" align="center"
732 fileref="figures/PFL_delcomp.png" />
735 <phrase>Example: delete a component from an existing file</phrase>
<para>If you try to delete a component other than the last one, you will
see the following error:</para>
<screen>$ lfs setstripe --component-del -I 2 /mnt/testfs/del_comp
742 Delete component 0x2 from /mnt/testfs/del_comp failed. Invalid argument
743 error: setstripe: delete component of file '/mnt/testfs/del_comp' failed: Invalid argument</screen>
746 <title>Set default PFL layout to an existing directory</title>
<para>As when creating a PFL file, you can set a default PFL layout on
an existing directory. All files subsequently created in that directory
inherit this layout by default.</para>
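<para>To see what such a default layout means for a file created in the
directory, consider a layout like
<literal>-E 256M -c 1 -E 16G -c 4 -E -1 -S 4M -c -1</literal>. The sketch
below maps a byte offset to the component that covers it. It is an
illustration only: the function name
<literal>component_for_offset</literal> is hypothetical, and it assumes
that each extent includes its start offset and excludes its end
offset.</para>
<screen>
```shell
MIB=1048576
GIB=$((1024 * MIB))

# Hypothetical helper: report which component of the example layout
# covers a given byte offset (extents assumed to exclude their end).
component_for_offset() {
    offset=$1
    if [ "$offset" -lt $((256 * MIB)) ]; then
        echo "component 1: stripe_count 1"
    elif [ "$offset" -lt $((16 * GIB)) ]; then
        echo "component 2: stripe_count 4"
    else
        echo "component 3: stripe_count -1"
    fi
}

component_for_offset 0
component_for_offset $((300 * MIB))
component_for_offset $((20 * GIB))
```
</screen>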
750 <para><emphasis role="bold">Command</emphasis></para>
751 <screen>lfs setstripe
752 [--component-end|-E end1] [STRIPE_OPTIONS]
753 [--component-end|-E end2] [STRIPE_OPTIONS] ... <replaceable>dirname</replaceable></screen>
754 <para><emphasis role="bold">Example</emphasis></para>
756 $ mkdir /mnt/testfs/pfldir
757 $ lfs setstripe -E 256M -c 1 -E 16G -c 4 -E -1 -S 4M -c -1 /mnt/testfs/pfldir
759 <para>When you run <literal>lfs getstripe</literal>, you will see:
762 $ lfs getstripe /mnt/testfs/pfldir
768 lcme_extent.e_start: 0
769 lcme_extent.e_end: 268435456
770 stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
773 lcme_extent.e_start: 268435456
774 lcme_extent.e_end: 17179869184
775 stripe_count: 4 stripe_size: 1048576 stripe_offset: -1
778 lcme_extent.e_start: 17179869184
779 lcme_extent.e_end: EOF
780 stripe_count: -1 stripe_size: 4194304 stripe_offset: -1
782 <para>If you create a file under <literal>/mnt/testfs/pfldir</literal>,
783 the layout of that file will inherit the layout from its parent
786 $ touch /mnt/testfs/pfldir/pflfile
787 $ lfs getstripe /mnt/testfs/pfldir/pflfile
788 /mnt/testfs/pfldir/pflfile
793 lcme_extent.e_start: 0
794 lcme_extent.e_end: 268435456
796 lmm_stripe_size: 1048576
801 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0xa:0x0] }
805 lcme_extent.e_start: 268435456
806 lcme_extent.e_end: 17179869184
808 lmm_stripe_size: 1048576
811 lmm_stripe_offset: -1
815 lcme_extent.e_start: 17179869184
816 lcme_extent.e_end: EOF
818 lmm_stripe_size: 4194304
821 lmm_stripe_offset: -1
<literal>lfs setstripe --component-add/del</literal> cannot be run
on a directory, because the default layout of a directory is like a
configuration setting that can be changed arbitrarily by
<literal>lfs setstripe</literal>, whereas the layout of a file may have
data (OST objects) attached. To delete the default layout of a directory, run
<literal>lfs setstripe -d <replaceable>dirname</replaceable></literal>
to return the directory to the filesystem-wide defaults, for example:
832 $ lfs setstripe -d /mnt/testfs/pfldir
833 $ lfs getstripe -d /mnt/testfs/pfldir
835 stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
836 /mnt/testfs/pfldir/commonfile
838 lmm_stripe_size: 1048576
842 obdidx objid objid group
849 <title><literal>lfs migrate</literal></title>
<para>The <literal>lfs migrate</literal> command changes the layout of
an existing file to the given layout parameters by copying the
data from the existing OST(s) to the new OST(s).</para>
853 <para><emphasis role="bold">Command</emphasis></para>
854 <screen>lfs migrate [--component-end|-E comp_end] [STRIPE_OPTIONS] ...
855 <replaceable>filename</replaceable></screen>
<para>The difference between <literal>migrate</literal> and
<literal>setstripe</literal> is that <literal>migrate</literal>
changes the layout of the data in existing files, while
<literal>setstripe</literal> creates new files with the specified
861 <para><emphasis role="bold">Example</emphasis></para>
<para><emphasis role="bold">Case1. Migrate a normal layout to a composite
layout</emphasis></para>
864 <screen>$ lfs setstripe -c 1 -S 128K /mnt/testfs/norm_to_2comp
865 $ dd if=/dev/urandom of=/mnt/testfs/norm_to_2comp bs=1M count=5
866 $ lfs getstripe /mnt/testfs/norm_to_2comp --yaml
/mnt/testfs/norm_to_2comp
869 lmm_stripe_size: 131072
875 l_fid: 0x100070000:0x2:0x0
876 $ lfs migrate -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
877 /mnt/testfs/norm_to_2comp</screen>
<para>In this example, a 5MB file with 1 stripe and a 128K stripe size
is migrated to a composite layout with 2 components, as illustrated in
<xref linkend="managinglayout.fig.pfl_norm_to_comp"/>.</para>
881 <figure xml:id="managinglayout.fig.pfl_norm_to_comp">
882 <title>Example: migrate normal to composite</title>
885 <imagedata scalefit="1" depth="2.75in" align="center"
886 fileref="figures/PFL_norm_to_comp.png" />
889 <phrase>Example: migrate normal to composite</phrase>
<para>After migration, the stripe information looks like this:</para>
894 <screen>$ lfs getstripe /mnt/testfs/norm_to_2comp
895 /mnt/testfs/norm_to_2comp
900 lcme_extent.e_start: 0
901 lcme_extent.e_end: 1048576
903 lmm_stripe_size: 524288
908 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
912 lcme_extent.e_start: 1048576
913 lcme_extent.e_end: EOF
915 lmm_stripe_size: 1048576
920 - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
921 - 1: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }</screen>
922 <para><emphasis role="bold">Case2. Migrate a composite layout to another
923 composite layout</emphasis></para>
924 <screen>$ lfs setstripe -E 1M -S 512K -c 1 -E -1 -S 1M -c 2 \
925 /mnt/testfs/2comp_to_3comp
$ dd if=/dev/urandom of=/mnt/testfs/2comp_to_3comp bs=1M count=5
927 $ lfs migrate -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
928 /mnt/testfs/2comp_to_3comp</screen>
<para>In this example, a composite layout file with 2 components is
migrated to a composite layout file with 3 components. Continuing
the example from Case1, the migration process is illustrated in
<xref linkend="managinglayout.fig.pfl_comp_to_comp"/>.</para>
933 <figure xml:id="managinglayout.fig.pfl_comp_to_comp">
934 <title>Example: migrate composite to composite</title>
937 <imagedata scalefit="1" depth="2.75in" align="center"
938 fileref="figures/PFL_comp_to_comp.png" />
941 <phrase>Example: migrate composite to composite</phrase>
<para>The resulting stripe information looks like this:</para>
946 <screen>$ lfs getstripe /mnt/testfs/2comp_to_3comp
947 /mnt/testfs/2comp_to_3comp
952 lcme_extent.e_start: 0
953 lcme_extent.e_end: 1048576
955 lmm_stripe_size: 1048576
960 - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
961 - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
965 lcme_extent.e_start: 1048576
966 lcme_extent.e_end: 4194304
968 lmm_stripe_size: 1048576
973 - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
974 - 1: { l_ost_idx: 7, l_fid: [0x100070000:0x3:0x0] }
978 lcme_extent.e_start: 4194304
979 lcme_extent.e_end: EOF
981 lmm_stripe_size: 3145728
986 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
987 - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
988 - 2: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }</screen>
989 <para><emphasis role="bold">Case3. Migrate a composite layout to a
990 normal one</emphasis></para>
<screen>$ lfs setstripe -E 1M -S 1M -c 2 -E 4M -S 1M -c 2 -E -1 -S 3M -c 3 \
/mnt/testfs/3comp_to_norm
$ dd if=/dev/urandom of=/mnt/testfs/3comp_to_norm bs=1M count=5
$ lfs migrate -c 2 -S 2M /mnt/testfs/3comp_to_norm</screen>
<para>In this example, a composite file with 3 components is migrated to
a normal file with 2 stripes and a 2M stripe size. Continuing the
example from Case2, the migration process is illustrated in
<xref linkend="managinglayout.fig.pfl_comp_to_norm"/>.</para>
999 <figure xml:id="managinglayout.fig.pfl_comp_to_norm">
1000 <title>Example: migrate composite to normal</title>
1003 <imagedata scalefit="1" depth="2.75in" align="center"
1004 fileref="figures/PFL_comp_to_norm.png" />
1007 <phrase>Example: migrate composite to normal</phrase>
<para>The resulting stripe information looks like this:</para>
1012 <screen>$ lfs getstripe /mnt/testfs/3comp_to_norm --yaml
1013 /mnt/testfs/3comp_to_norm
1015 lmm_stripe_size: 2097152
1018 lmm_stripe_offset: 4
1021 l_fid: 0x100040000:0x3:0x0
1023 l_fid: 0x100050000:0x3:0x0</screen>
1025 <section remap="h3">
1026 <title><literal>lfs getstripe</literal></title>
<para>The <literal>lfs getstripe</literal> command can be used to list the
striping/component information for a given PFL file. Here, only the
parameters that are new for PFL files are shown.</para>
1030 <para><emphasis role="bold">Command</emphasis></para>
1031 <screen>lfs getstripe
1032 [--component-id|-I [comp_id]]
1033 [--component-flags [comp_flags]]
1035 [--component-start [+-][N][kMGTPE]]
1036 [--component-end|-E [+-][N][kMGTPE]]
1037 <replaceable>dirname|filename</replaceable></screen>
1038 <para><emphasis role="bold">Example</emphasis></para>
1039 <para>Suppose we already have a composite file
1040 <literal>/mnt/testfs/3comp</literal>, created by the following
1042 <screen>$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -i 4 \
1043 /mnt/testfs/3comp</screen>
<para>Then write some data:</para>
1045 <screen>$ dd if=/dev/zero of=/mnt/testfs/3comp bs=1M count=5</screen>
1046 <para><emphasis role="bold">Case1. List component ID and its related
1047 information</emphasis></para>
<para>List all component IDs</para>
1051 <screen>$ lfs getstripe -I /mnt/testfs/3comp
1057 <para>List the detailed striping information of component ID=2</para>
1058 <screen>$ lfs getstripe -I2 /mnt/testfs/3comp
1064 lcme_extent.e_start: 4194304
1065 lcme_extent.e_end: 67108864
1067 lmm_stripe_size: 1048576
1070 lmm_stripe_offset: 5
1072 - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
1073 - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
1074 - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
1075 - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }</screen>
1078 <para>List the stripe offset and stripe count of component ID=2</para>
1079 <screen>$ lfs getstripe -I2 -i -c /mnt/testfs/3comp
1081 lmm_stripe_offset: 5</screen>
1084 <para><emphasis role="bold">Case2. List the component which contains the
1085 specified flag</emphasis></para>
1088 <para>List the flag of each component</para>
<screen>$ lfs getstripe --component-flags -I /mnt/testfs/3comp
1095 lcme_flags: 0</screen>
<para>List the component(s) that are not instantiated</para>
1099 <screen>$ lfs getstripe --component-flags=^init /mnt/testfs/3comp
1105 lcme_extent.e_start: 67108864
1106 lcme_extent.e_end: EOF
1107 lmm_stripe_count: -1
1108 lmm_stripe_size: 1048576
1111 lmm_stripe_offset: 4</screen>
1114 <para><emphasis role="bold">Case3. List the total number of all the
1115 component(s)</emphasis></para>
1118 <para>List the total number of all the components</para>
1119 <screen>$ lfs getstripe --component-count /mnt/testfs/3comp
1123 <para><emphasis role="bold">Case4. List the component with the specified
1124 extent start or end positions</emphasis></para>
1127 <para>List the start position in bytes of each component</para>
1128 <screen>$ lfs getstripe --component-start /mnt/testfs/3comp
1134 <para>List the start position in bytes of component ID=3</para>
1135 <screen>$ lfs getstripe --component-start -I3 /mnt/testfs/3comp
1139 <para>List the component with start = 64M</para>
1140 <screen>$ lfs getstripe --component-start=64M /mnt/testfs/3comp
1146 lcme_extent.e_start: 67108864
1147 lcme_extent.e_end: EOF
1148 lmm_stripe_count: -1
1149 lmm_stripe_size: 1048576
1152 lmm_stripe_offset: 4</screen>
<para>List the component(s) with start &gt; 5M</para>
1156 <screen>$ lfs getstripe --component-start=+5M /mnt/testfs/3comp
1162 lcme_extent.e_start: 67108864
1163 lcme_extent.e_end: EOF
1164 lmm_stripe_count: -1
1165 lmm_stripe_size: 1048576
1168 lmm_stripe_offset: 4</screen>
<para>List the component(s) with start &lt; 5M</para>
1172 <screen>$ lfs getstripe --component-start=-5M /mnt/testfs/3comp
1178 lcme_extent.e_start: 0
1179 lcme_extent.e_end: 4194304
1181 lmm_stripe_size: 1048576
1184 lmm_stripe_offset: 4
1186 - 0: { l_ost_idx: 4, l_fid: [0x100040000:0x2:0x0] }
1190 lcme_extent.e_start: 4194304
1191 lcme_extent.e_end: 67108864
1193 lmm_stripe_size: 1048576
1196 lmm_stripe_offset: 5
1198 - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
1199 - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
1200 - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
1201 - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }</screen>
<para>List the component(s) with start &gt; 3M and end &lt; 70M</para>
1205 <screen>$ lfs getstripe --component-start=+3M --component-end=-70M \
1212 lcme_extent.e_start: 4194304
1213 lcme_extent.e_end: 67108864
1215 lmm_stripe_size: 1048576
1218 lmm_stripe_offset: 5
1220 - 0: { l_ost_idx: 5, l_fid: [0x100050000:0x2:0x0] }
1221 - 1: { l_ost_idx: 6, l_fid: [0x100060000:0x2:0x0] }
1222 - 2: { l_ost_idx: 7, l_fid: [0x100070000:0x2:0x0] }
1223 - 3: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }</screen>
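<para>The <literal>+</literal> and <literal>-</literal> prefixes used in
Case4 can be modelled as plain comparisons. The sketch below is only a
model of the filters as this section describes them, not the
<literal>lfs</literal> implementation. The function name
<literal>size_matches</literal> is hypothetical, and sizes are assumed to
be whole MiB values.</para>
<screen>
```shell
MIB=1048576

# Hypothetical helper: succeed when VALUE (in bytes) satisfies FILTER.
# FILTER is a size in MiB written as N (equal), +N (greater than)
# or -N (less than).
size_matches() {
    filter=$1
    value=$2
    case $filter in
        +*) [ "$value" -gt $(( ${filter#+} * MIB )) ] ;;
        -*) [ "$value" -lt $(( ${filter#-} * MIB )) ] ;;
        *)  [ "$value" -eq $(( filter * MIB )) ] ;;
    esac
}

# Start offsets (in bytes) of the three components of /mnt/testfs/3comp.
for start in 0 4194304 67108864; do
    if size_matches +5 "$start"; then
        echo "start $start matches +5M"
    fi
done
```
</screen>
<para>Only the component starting at 67108864 is reported, which agrees
with the <literal>--component-start=+5M</literal> example above.</para>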
1227 <section remap="h3">
1228 <title><literal>lfs find</literal></title>
<para>The <literal>lfs find</literal> command can be used to search the
directory tree rooted at the given directory or file name for files
that match the given PFL component parameters. Here, only the
parameters that are new for PFL files are shown. Their usage is similar
to that of the <literal>lfs getstripe</literal> command.</para>
1234 <para><emphasis role="bold">Command</emphasis></para>
1235 <screen>lfs find <replaceable>directory|filename</replaceable>
1236 [[!] --component-count [+-=]<replaceable>comp_cnt</replaceable>]
1237 [[!] --component-start [+-=]<replaceable>N</replaceable>[kMGTPE]]
1238 [[!] --component-end|-E [+-=]<replaceable>N</replaceable>[kMGTPE]]
1239 [[!] --component-flags=<replaceable>comp_flags</replaceable>]</screen>
1240 <note><para>If you use <literal>--component-xxx</literal> options, only
1241 the composite files will be searched; but if you use
1242 <literal>! --component-xxx</literal> options, all the files will be
1243 searched.</para></note>
1244 <para><emphasis role="bold">Example</emphasis></para>
1245 <para>We use the following directory and composite files to show how
1246 <literal>lfs find</literal> works.</para>
1247 <screen>$ mkdir /mnt/testfs/testdir
1248 $ lfs setstripe -E 1M -E 10M -E eof /mnt/testfs/testdir/3comp
1249 $ lfs setstripe -E 4M -E 20M -E 30M -E eof /mnt/testfs/testdir/4comp
1250 $ mkdir -p /mnt/testfs/testdir/dir_3comp
1251 $ lfs setstripe -E 6M -E 30M -E eof /mnt/testfs/testdir/dir_3comp
1252 $ lfs setstripe -E 8M -E eof /mnt/testfs/testdir/dir_3comp/2comp
$ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commonfile</screen>
1254 <para><emphasis role="bold">Case1. Find the files that match the specified
1255 component count condition</emphasis></para>
1256 <para>Find the files under directory /mnt/testfs/testdir whose number of
1257 components is not equal to 3.</para>
1258 <screen>$ lfs find /mnt/testfs/testdir ! --component-count=3
1260 /mnt/testfs/testdir/4comp
1261 /mnt/testfs/testdir/dir_3comp/2comp
1262 /mnt/testfs/testdir/dir_3comp/commonfile</screen>
1263 <para><emphasis role="bold">Case2. Find the files/dirs that match the
1264 specified component start/end condition</emphasis></para>
<para>Find the file(s) under directory /mnt/testfs/testdir with component
start = 4M and end &lt; 30M</para>
1267 <screen>$ lfs find /mnt/testfs/testdir --component-start=4M -E -30M
1268 /mnt/testfs/testdir/4comp</screen>
1269 <para><emphasis role="bold">Case3. Find the files/dirs that match the
1270 specified component flag condition</emphasis></para>
1271 <para>Find the file(s) under directory /mnt/testfs/testdir whose component
1272 flags contain <literal>init</literal></para>
<screen>$ lfs find /mnt/testfs/testdir --component-flags=init
1274 /mnt/testfs/testdir/3comp
1275 /mnt/testfs/testdir/4comp
1276 /mnt/testfs/testdir/dir_3comp/2comp</screen>
<note><para>Since <literal>lfs find</literal> uses
"<literal>!</literal>" for negative searches, the
<literal>^init</literal> flag is not supported here.</para></note>
1283 <section xml:id="striping.sel" condition='l2D'>
1285 <indexterm><primary>striping</primary><secondary>SEL</secondary>
1286 </indexterm>Self-Extending Layout (SEL)</title>
1287 <para>The Lustre Self-Extending Layout (SEL) feature is an extension of the
1288 <xref linkend="pfl"/> feature, which allows the MDS to change the defined
1289 PFL layout dynamically. With this feature, the MDS monitors the used space
1290 on OSTs and swaps the OSTs for the current file when they are low on space.
1291 This avoids <literal>ENOSPC</literal> problems for SEL files when
1292 applications are writing to them.</para>
1293 <para>Whereas PFL delays the instantiation of some components until an IO
1294 operation occurs on this region, SEL allows splitting such non-instantiated
1295 components in two parts: an “extendable” component and an “extension”
1296 component. The extendable component is a regular PFL component, covering
1297 just a part of the region, which is small originally. The extension (or SEL)
1298 component is a new component type which is always non-instantiated and
1299 unassigned, covering the other part of the region. When a write reaches this
1300 unassigned space, and the client calls the MDS to have it instantiated, the
1301 MDS makes a decision as to whether to grant additional space to the extendable
1302 component. The granted region moves from the head of the extension
1303 component to the tail of the extendable component, thus the extendable
1304 component grows and the SEL one is shortened. Therefore, it allows the file
1305 to continue on the same OSTs, or in the case where space is low on one of
1306 the current OSTs, to modify the layout to switch to a new component on new
1307 OSTs. In particular, it lets IO automatically spill over to a large HDD OST
1308 pool once a small SSD OST pool is getting low on space.</para>
1309 <para>The default extension policy modifies the layout in the following
1311 <orderedlist numeration="arabic">
1313 <para>Extension: continue on the same OSTs – used when not low on space
1314 on any of the OSTs of the current component; a particular extent is
1315 granted to the extendable component.</para>
<para>Spill over: switch to the next component's OSTs – used only for
components other than the last one, when <emphasis>at least one</emphasis>
1320 of the current OSTs is low on space; the whole region of the SEL
1321 component moves to the next component and the SEL component is removed
1325 <para>Repeating: create a new component with the same layout but on
1326 free OSTs – it is used only for the last component when <emphasis>
1327 at least one</emphasis> of the current OSTs is low on space; a new
1328 component has the same layout but instantiated on different OSTs (from
1329 the same pool) which have enough space.</para>
<para>Forced extension: continue with the current component OSTs despite
the low-on-space condition – used only for the last component when
a repeating attempt also detects a low-on-space condition, so that
spillover is impossible and repeating would serve no purpose.</para>
<note><para>The SEL feature does not require clients to understand the SEL
format of already created files; only MDS support is needed, which was
introduced in Lustre 2.13. However, old clients will have some limitations,
because their Lustre tools do not support SEL.</para></note>
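<para>The four cases of the default extension policy can be summarized
as a small decision function. This is only a paraphrase of the list
above, not the MDS code. The function name
<literal>sel_policy</literal> and its yes/no arguments are
hypothetical.</para>
<screen>
```shell
# Hypothetical summary of the default extension policy.
# Usage: sel_policy IS_LAST CURRENT_LOW OTHERS_LOW   (each "yes" or "no")
#   IS_LAST     - the component being extended is the last one
#   CURRENT_LOW - at least one OST of the current component is low on space
#   OTHERS_LOW  - a repeating attempt finds the other OSTs low on space too
sel_policy() {
    is_last=$1
    current_low=$2
    others_low=$3
    if [ "$current_low" = no ]; then
        echo extension
    elif [ "$is_last" = no ]; then
        echo spillover
    elif [ "$others_low" = no ]; then
        echo repeating
    else
        echo forced-extension
    fi
}

sel_policy no  no  no    # enough space: keep extending
sel_policy no  yes no    # not the last component: spill over
sel_policy yes yes no    # last component: repeat on free OSTs
sel_policy yes yes yes   # last component, nowhere else to go
```
</screen>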
1343 <title><literal>lfs setstripe</literal></title>
1344 <para>The <literal>lfs setstripe</literal> command is used to create files
1345 with composite layouts, as well as add or delete components to or from an
1346 existing file. It is extended to support SEL components.</para>
1348 <title>Create a SEL file</title>
1349 <para><emphasis role="bold">Command</emphasis></para>
1350 <screen>lfs setstripe
1351 [--component-end|-E end1] [STRIPE_OPTIONS] ... <replaceable>filename</replaceable>
--extension-size, --ext-size, -z &lt;ext_size&gt;</screen>
<para>The <literal>-z</literal> option specifies the size of
the region that is granted to the extendable component on each
iteration. When given while declaring a component, this option turns
that component into a pair of components: an extendable component and an
extension component.</para>
1359 <para><emphasis role="bold">Example</emphasis></para>
1360 <para>The following command creates 2 pairs of extendable and
1361 extension components:
1362 <screen># lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file</screen>
1363 <figure xml:id="managinglayout.fig.sel_createfile">
1364 <title>Example: create a SEL file</title>
1367 <imagedata scalefit="1" depth="0.8in" align="center"
1368 fileref="figures/SEL_Createfile.png" />
1371 <phrase>Example: create a SEL file</phrase>
<note><para>As usual, only the first PFL component is instantiated at
creation time, so it is immediately extended to the extension
size (64M for the first component), whereas the third component is left
zero-length.</para></note>
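<para>The initial extents of this file follow directly from the
<literal>-E</literal> and <literal>-z</literal> values. The arithmetic
below is a sketch with hypothetical variable names. It reproduces the
byte offsets that <literal>lfs getstripe</literal> reports for the
file.</para>
<screen>
```shell
MIB=1048576
first_end=$((1024 * MIB))   # -E 1G
first_ext=$((64 * MIB))     # -z 64M

# The first pair splits [0, 1G] at the extension size; the second
# pair starts at 1G with a zero-length extendable component.
echo "component 1 (extendable): [0, $first_ext]"
echo "component 2 (extension): [$first_ext, $first_end]"
echo "component 3 (extendable): [$first_end, $first_end]"
echo "component 4 (extension): [$first_end, EOF]"
```
</screen>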
1380 <screen># lfs getstripe /mnt/lustre/file
1388 lcme_extent.e_start: 0
1389 lcme_extent.e_end: 67108864
1391 lmm_stripe_size: 1048576
1394 lmm_stripe_offset: 0
1396 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
1400 lcme_flags: extension
1401 lcme_extent.e_start: 67108864
1402 lcme_extent.e_end: 1073741824
1404 lmm_extension_size: 67108864
1407 lmm_stripe_offset: -1
1412 lcme_extent.e_start: 1073741824
1413 lcme_extent.e_end: 1073741824
1415 lmm_stripe_size: 1048576
1418 lmm_stripe_offset: -1
1422 lcme_flags: extension
1423 lcme_extent.e_start: 1073741824
1424 lcme_extent.e_end: EOF
1426 lmm_extension_size: 268435456
1429 lmm_stripe_offset: -1</screen>
1432 <title>Create a SEL layout template</title>
<para>As with PFL, it is possible to set a SEL layout template on
a directory. All files subsequently created under it inherit this
layout by default.</para>
1436 <screen># lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/dir
# lfs getstripe /mnt/lustre/dir
1445 lcme_extent.e_start: 0
1446 lcme_extent.e_end: 67108864
1447 stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
1451 lcme_flags: extension
1452 lcme_extent.e_start: 67108864
1453 lcme_extent.e_end: 1073741824
1454 stripe_count: 1 extension_size: 67108864 pattern: raid0 stripe_offset: -1
1459 lcme_extent.e_start: 1073741824
1460 lcme_extent.e_end: 1073741824
1461 stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
1465 lcme_flags: extension
1466 lcme_extent.e_start: 1073741824
1467 lcme_extent.e_end: EOF
1468 stripe_count: 1 extension_size: 268435456 pattern: raid0 stripe_offset: -1
1473 <title><literal>lfs getstripe</literal></title>
<para>The <literal>lfs getstripe</literal> command can be used to list the
striping/component information for a given SEL file. Here, only the
parameters that are new for SEL files are shown.</para>
1477 <para><emphasis role="bold">Command</emphasis></para>
1478 <screen>lfs getstripe
1479 [--extension-size|--ext-size|-z] <replaceable>filename</replaceable></screen>
<para>The <literal>-z</literal> option prints the extension
size in bytes. For composite files this is the extension size of the
first extension component. If a particular component is identified by
other options (<literal>--component-id</literal>,
<literal>--component-start</literal>, etc.), the extension size of that
component is printed.</para>
1485 <para><emphasis role="bold">Example 1: List a SEL component information
1487 <para>Suppose we already have a composite file
1488 <literal>/mnt/lustre/file</literal>, created by the following command:</para>
1489 <screen># lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file</screen>
1490 <para>The 2nd component could be listed with the following command:</para>
1491 <screen># lfs getstripe -I2 /mnt/lustre/file
1498 lcme_flags: extension
1499 lcme_extent.e_start: 67108864
1500 lcme_extent.e_end: 1073741824
1502 lmm_extension_size: 67108864
1505 lmm_stripe_offset: -1
<note><para>As you can see, the SEL components are marked by the
<literal>extension</literal> flag, and the
<literal>lmm_extension_size</literal> field keeps the specified extension
size.</para></note>
1510 <para><emphasis role="bold">Example 2: List the extension size</emphasis></para>
<para>Using the same file as in the above example, the extension size of
the second component can be listed with:</para>
1513 <screen># lfs getstripe -z -I2 /mnt/lustre/file
1515 <para><emphasis role="bold">Example 3: Extension</emphasis></para>
<para>Using the same file as in the above example, suppose there is a
write which crosses the end of the first component (64M), and then another
write which crosses the end of the first component (128M) again.
The layout changes as follows:</para>
1520 <figure xml:id="managinglayout.fig.sel_extension">
1521 <title>Example: an extension of a SEL file</title>
1524 <imagedata scalefit="1" depth="3.5in" align="center"
1525 fileref="figures/SEL_extension.png" />
1528 <phrase>Example: an extension of a SEL file</phrase>
1532 <para>The layout can be printed out by the following command:</para>
1533 <screen># lfs getstripe /mnt/lustre/file
1541 lcme_extent.e_start: 0
1542 lcme_extent.e_end: 201326592
1544 lmm_stripe_size: 1048576
1547 lmm_stripe_offset: 0
1549 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
1553 lcme_flags: extension
1554 lcme_extent.e_start: 201326592
1555 lcme_extent.e_end: 1073741824
1557 lmm_extension_size: 67108864
1560 lmm_stripe_offset: -1
1565 lcme_extent.e_start: 1073741824
1566 lcme_extent.e_end: 1073741824
1568 lmm_stripe_size: 1048576
1571 lmm_stripe_offset: -1
1575 lcme_flags: extension
1576 lcme_extent.e_start: 1073741824
1577 lcme_extent.e_end: EOF
1579 lmm_extension_size: 268435456
1582 lmm_stripe_offset: -1</screen>
1583 <para><emphasis role="bold">Example 4: Spillover</emphasis></para>
<para>When <literal>OST0</literal> is low on space and an IO reaches
a SEL component, a spillover happens: the full region of the
SEL component is added to the next component. Continuing the example
above, the next layout modification looks like this:</para>
1588 <figure xml:id="managinglayout.fig.sel_spillover">
1589 <title>Example: a spillover in a SEL file</title>
1592 <imagedata scalefit="1" depth="2.25in" align="center"
1593 fileref="figures/SEL_spillover.png" />
1596 <phrase>Example: a spillover in a SEL file</phrase>
<note><para>Although the third component was originally
<literal>[1G, 1G]</literal>, it had not been instantiated. Instead of being
extended backward, it is moved back to the start of the previous SEL
component (192M) and extended by its extension size (256M) from that
position, so it becomes <literal>[192M, 448M]</literal>.</para></note>
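<para>The new extent can be checked with a little arithmetic. The sketch
below uses hypothetical variable names and the sizes from this
example.</para>
<screen>
```shell
MIB=1048576
grant=$((64 * MIB))              # -z 64M of the first component
sel_start=$((grant + 2 * grant)) # initial 64M plus two more grants: 192M
ext_size=$((256 * MIB))          # -z 256M of the spilled-over component

# The moved component starts where the removed SEL component started
# and is extended by its own extension size.
new_end=$((sel_start + ext_size))
echo "third component after spillover: [$sel_start, $new_end]"
```
</screen>
<para>The result, <literal>[201326592, 469762048]</literal>, is the
<literal>[192M, 448M]</literal> extent expressed in bytes.</para>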
1605 <screen># lfs getstripe /mnt/lustre/file
1613 lcme_extent.e_start: 0
1614 lcme_extent.e_end: 201326592
1616 lmm_stripe_size: 1048576
1619 lmm_stripe_offset: 0
1621 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
1626 lcme_extent.e_start: 201326592
1627 lcme_extent.e_end: 469762048
1629 lmm_stripe_size: 1048576
1632 lmm_stripe_offset: 1
1634 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] }
1638 lcme_flags: extension
1639 lcme_extent.e_start: 469762048
1640 lcme_extent.e_end: EOF
1642 lmm_extension_size: 268435456
1645 lmm_stripe_offset: -1</screen>
1646 <para><emphasis role="bold">Example 5: Repeating</emphasis></para>
<para>Suppose that, in the example above, <literal>OST0</literal> has
regained enough free space but <literal>OST1</literal> is low on space.
The next write to the last SEL component then leads to the allocation of
a new component before the SEL component; it repeats the layout of the
previous component but is instantiated on free OSTs:</para>
1652 <figure xml:id="managinglayout.fig.sel_repeat">
1653 <title>Example: repeat a SEL component</title>
1656 <imagedata scalefit="1" depth="2.25in" align="center"
1657 fileref="figures/SEL_repeating.png" />
1660 <phrase>Example: repeat a SEL component
1665 <screen># lfs getstripe /mnt/lustre/file
1673 lcme_extent.e_start: 0
1674 lcme_extent.e_end: 201326592
1676 lmm_stripe_size: 1048576
1679 lmm_stripe_offset: 0
1681 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
1686 lcme_extent.e_start: 201326592
1687 lcme_extent.e_end: 469762048
1689 lmm_stripe_size: 1048576
1692 lmm_stripe_offset: 1
1694 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] }
1699 lcme_extent.e_start: 469762048
1700 lcme_extent.e_end: 738197504
1702 lmm_stripe_size: 1048576
1704 lmm_layout_gen: 65535
1705 lmm_stripe_offset: 0
1707 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] }
1711 lcme_flags: extension
1712 lcme_extent.e_start: 738197504
1713 lcme_extent.e_end: EOF
1715 lmm_extension_size: 268435456
1718 lmm_stripe_offset: -1</screen>
1719 <para><emphasis role="bold">Example 6: Forced extension</emphasis></para>
1720 <para>Suppose in the example above, both <literal>OST0</literal> and
1721 <literal>OST1</literal> are low on space, the following write to the
1722 last SEL component will behave as an extension as there is no sense to
1724 <figure xml:id="managinglayout.fig.pfl_forced">
1725 <title>Example: forced extension in a SEL file</title>
1728 <imagedata scalefit="1" depth="2.25in" align="center"
1729 fileref="figures/SEL_forced.png" />
1732 <phrase>Example: forced extension in a SEL file.
1737 <screen># lfs getstripe /mnt/lustre/file
1745 lcme_extent.e_start: 0
1746 lcme_extent.e_end: 201326592
1748 lmm_stripe_size: 1048576
1751 lmm_stripe_offset: 0
1753 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
1758 lcme_extent.e_start: 201326592
1759 lcme_extent.e_end: 469762048
1761 lmm_stripe_size: 1048576
1764 lmm_stripe_offset: 1
1766 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] }
1771 lcme_extent.e_start: 469762048
1772 lcme_extent.e_end: 1006632960
1774 lmm_stripe_size: 1048576
1776 lmm_layout_gen: 65535
1777 lmm_stripe_offset: 0
1779 - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] }
1783 lcme_flags: extension
1784 lcme_extent.e_start: 1006632960
1785 lcme_extent.e_end: EOF
1787 lmm_extension_size: 268435456
1790 lmm_stripe_offset: -1</screen>
1793 <title><literal>lfs find</literal></title>
<para>The <literal>lfs find</literal> command can be used to search for
files that match the given SEL component parameters. Here, only
the parameters that are new for SEL files are shown.</para>
[[!] --extension-size|--ext-size|-z [+-]ext-size[KMG]]
1799 [[!] --component-flags=extension]</screen>
<para>The <literal>-z</literal> option specifies the extension
size to search for. Files that have any component whose extension size
matches the given criteria are printed. As usual, the “+” and “-” signs
can be used to specify a minimum or maximum size, respectively.
1805 <para>A new <literal>extension</literal> component flag is added. Only
1806 files which have at least one SEL component are printed.</para>
1807 <note><para>The negative search for flags searches the files which
1808 <emphasis role="strong">have</emphasis> a non-SEL component (not files
1809 which <emphasis role="strong">do not have</emphasis> any SEL component).
1811 <para><emphasis role="bold">Example</emphasis></para>
1812 <screen># lfs setstripe --extension-size 64M -c 1 -E -1 /mnt/lustre/file
1814 # lfs find --comp-flags extension /mnt/lustre/*
1817 # lfs find ! --comp-flags extension /mnt/lustre/*
1820 # lfs find -z 64M /mnt/lustre/*
1823 # lfs find -z +64M /mnt/lustre/*
1825 # lfs find -z -64M /mnt/lustre/*
1827 # lfs find -z +63M /mnt/lustre/*
1830 # lfs find -z -65M /mnt/lustre/*
1833 # lfs find -z 65M /mnt/lustre/*
1835 # lfs find ! -z 64M /mnt/lustre/*
1837 # lfs find ! -z +64M /mnt/lustre/*
1840 # lfs find ! -z -64M /mnt/lustre/*
1843 # lfs find ! -z +63M /mnt/lustre/*
1845 # lfs find ! -z -65M /mnt/lustre/*
1847 # lfs find ! -z 65M /mnt/lustre/*
1848 /mnt/lustre/file</screen>
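<para>The matching rules illustrated by the examples above can be summarized
in a short Python sketch. It is a simplified model and not the
<literal>lfs find</literal> implementation: the helper names
<literal>parse_size</literal> and <literal>ext_size_matches</literal> are
hypothetical, a single extension size stands in for a whole file, and any
unit rounding performed by the real command is ignored.</para>
<programlisting language="python"><![CDATA[
```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as '64M' to bytes (a bare number means bytes here)."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def ext_size_matches(ext_size, criterion, negate=False):
    """Model of '-z [+-]size': '+' is greater than, '-' is less than, else equal."""
    sign = criterion[0] if criterion[0] in "+-" else ""
    wanted = parse_size(criterion[len(sign):])
    if sign == "+":
        hit = ext_size > wanted
    elif sign == "-":
        hit = ext_size < wanted
    else:
        hit = ext_size == wanted
    # A leading '!' on the command line inverts the result.
    return hit != negate


if __name__ == "__main__":
    size = parse_size("64M")
    for crit in ("64M", "+64M", "-64M", "+63M", "-65M", "65M"):
        print(crit, ext_size_matches(size, crit),
              ext_size_matches(size, crit, negate=True))
```
]]></programlisting>
<para>In this model a file with a 64M extension size matches
<literal>-z 64M</literal>, <literal>-z +63M</literal> and
<literal>-z -65M</literal>, and the negated forms match exactly the
remaining criteria.</para>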
1852 <section xml:id="foreign_layout" condition='l2D'>
1854 <indexterm><primary>striping</primary><secondary>Foreign</secondary>
1855 </indexterm>Foreign Layout</title>
1856 <para>The Lustre Foreign Layout feature is an extension of both the
1857 LOV and LMV formats that allows the creation of empty files and directories
1858 with the necessary specifications to point to corresponding objects outside
1859 the Lustre namespace.</para>
1860 <para>The new LOV/LMV foreign internal format can be represented as:</para>
1861 <figure xml:id="managinglayout.fig.foreign_format">
1862 <title>LOV/LMV foreign format</title>
1865 <imagedata scalefit="1" width="100%"
1866 fileref="figures/Foreign_Format.png" />
1869 <phrase>LOV/LMV foreign format</phrase>
1874 <title><literal>lfs set[dir]stripe</literal></title>
1875 <para>The <literal>lfs set[dir]stripe</literal> commands are used to
1876 create files or directories with foreign layouts by calling the
1877 corresponding API, which in turn invokes the appropriate ioctl().</para>
1879 <title>Create a Foreign file/dir</title>
1880 <para><emphasis role="bold">Command</emphasis></para>
1881 <screen>lfs set[dir]stripe \
1882 --foreign[=&lt;foreign_type&gt;] --xattr|-x &lt;layout_string&gt; \
1883 [--flags &lt;hex_bitmask&gt;] [--mode &lt;mode_bits&gt;] \
1884 <replaceable>{file,dir}name</replaceable></screen>
1885 <para>Both the <literal>--foreign</literal> and
1886 <literal>--xattr|-x</literal> options are mandatory.
1887 The <literal>&lt;foreign_type&gt;</literal> argument (default "none", meaning
1888 no special behavior) and the <literal>--flags</literal> and
1889 <literal>--mode</literal> (default 0666) options are optional.</para>
1890 <para><emphasis role="bold">Example</emphasis></para>
1891 <para>The following command creates a foreign file of type "none"
1892 with "foo@bar" as its LOV content and with a specific mode and flags:
1893 <screen># lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \
1894 --xattr=foo@bar /mnt/lustre/file</screen>
1895 <figure xml:id="managinglayout.fig.foreign_createfile">
1896 <title>Example: create a foreign file</title>
1899 <imagedata scalefit="1" width="100%" align="center"
1900 fileref="figures/Foreign_Createfile.png" />
1903 <phrase>Example: create a foreign file</phrase>
1911 <title><literal>lfs get[dir]stripe</literal></title>
1912 <para><literal>lfs get[dir]stripe</literal> commands can be used to
1913 retrieve foreign LOV/LMV information and content.</para>
1914 <para><emphasis role="bold">Command</emphasis></para>
1915 <screen>lfs get[dir]stripe [-v] <replaceable>filename</replaceable></screen>
1916 <para><emphasis role="bold">List foreign layout information
1918 <para>Suppose we already have a foreign file
1919 <literal>/mnt/lustre/file</literal>, created by the following command:</para>
1920 <screen># lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \
1921 --xattr=foo@bar /mnt/lustre/file</screen>
1922 <para>The full foreign layout information can be listed using the
1923 following command:</para>
1924 <screen># lfs getstripe -v /mnt/lustre/file
1926 lfm_magic: 0x0BD70BD0
1929 lfm_flags: 0x0000DA08
1932 <note><para>The value of the <literal>lfm_length</literal> field
1933 is the number of characters in the variable-length
1934 <literal>lfm_value</literal> field.</para></note>
1937 <title><literal>lfs find</literal></title>
1938 <para><literal>lfs find</literal> commands can be used to search for
1939 all foreign files and directories, or for those that match the given
1940 selection parameters.</para>
1942 [[!] --foreign[=&lt;foreign_type&gt;]</screen>
1943 <para>The <literal>--foreign[=&lt;foreign_type&gt;]</literal> option
1944 selects files and/or directories that have a foreign layout, optionally
1945 restricted to those of type <literal>&lt;foreign_type&gt;</literal>.
1946 Prefixing the option with <literal>!</literal> inverts the match.</para>
1947 <para><emphasis role="bold">Example</emphasis></para>
1948 <screen># lfs setstripe --foreign=none --xattr=foo@bar /mnt/lustre/file
1949 # touch /mnt/lustre/file2
1951 # lfs find --foreign /mnt/lustre/*
1954 # lfs find ! --foreign /mnt/lustre/*
1957 # lfs find --foreign=none /mnt/lustre/*
1958 /mnt/lustre/file</screen>
1962 <section xml:id="dbdoclet.50438209_10424">
1964 <primary>space</primary>
1965 <secondary>free space</secondary>
1966 </indexterm><indexterm>
1967 <primary>striping</primary>
1968 <secondary>round-robin algorithm</secondary>
1969 </indexterm><indexterm>
1970 <primary>striping</primary>
1971 <secondary>weighted algorithm</secondary>
1972 </indexterm><indexterm>
1973 <primary>round-robin algorithm</primary>
1974 </indexterm><indexterm>
1975 <primary>weighted algorithm</primary>
1976 </indexterm>Managing Free Space</title>
1977 <para>To optimize file system performance, the MDT assigns file stripes to OSTs using one of two
1978 allocation algorithms. The <emphasis role="italic">round-robin</emphasis> allocator gives
1979 preference to location (spreading stripes across OSSs to increase network bandwidth
1980 utilization), and the weighted allocator gives preference to available space (balancing load
1981 across OSTs). The threshold and weighting factors for these two algorithms can be adjusted by the
1982 user. The MDT reserves 0.1 percent of total OST space and 32 inodes on each OST. The MDT
1983 stops allocating objects on an OST if its available space is less than the reserved space or if
1984 it has fewer than 32 free inodes. The MDT resumes object allocation once the available space is
1985 more than twice the reserved space and the OST has more than 64 free inodes. Note that clients
1986 can still append to existing files regardless of the object allocation state.</para>
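<para>The stop and resume rules described above form a simple hysteresis,
which the following Python sketch illustrates. It is an illustration under
stated assumptions rather than the MDT code: the <literal>OstState</literal>
class and its <literal>update</literal> method are hypothetical names, and the
reserved space is taken as exactly 0.1 percent of the OST size.</para>
<programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass

RESERVED_FRACTION = 0.001  # 0.1 percent of total OST space
RESERVED_INODES = 32       # inodes reserved on each OST


@dataclass
class OstState:
    """Hypothetical per-OST record of whether new objects may be allocated."""
    total_bytes: int
    allocating: bool = True

    def update(self, avail_bytes, free_inodes):
        reserved = self.total_bytes * RESERVED_FRACTION
        if self.allocating:
            # Stop when space or inodes drop below the reserved amounts.
            if avail_bytes < reserved or free_inodes < RESERVED_INODES:
                self.allocating = False
        else:
            # Resume only after recovering to twice the reserved amounts.
            if (avail_bytes > 2 * reserved
                    and free_inodes > 2 * RESERVED_INODES):
                self.allocating = True
        return self.allocating


if __name__ == "__main__":
    ost = OstState(total_bytes=1_000_000_000)  # reserved space: 1,000,000 bytes
    for avail, inodes in [(500_000, 1000), (1_500_000, 1000), (2_500_000, 1000)]:
        print(avail, inodes, ost.update(avail, inodes))
```
]]></programlisting>
<para>The gap between the stop and resume limits keeps an OST that hovers
near its reserved space from switching between the two states on every
update.</para>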
1987 <para condition="l29"> The reserved space for each OST can be adjusted by the user with the
1988 <literal>lctl set_param</literal> command. For example, the following command reserves 1 GB of space
1990 <screen>lctl set_param -P osp.*.reserved_mb_low=1024</screen></para>
1991 <para>This section describes how to check available free space on disks and how free space is
1992 allocated. It then describes how to set the threshold and weighting factors for the allocation
1994 <section xml:id="dbdoclet.checking_free_space">
1995 <title>Checking File System Free Space</title>
1996 <para>Free space is an important consideration in assigning file stripes. The <literal>lfs
1997 df</literal> command can be used to show available disk space on the mounted Lustre file
1998 system and space consumption per OST. If multiple Lustre file systems are mounted, a path
1999 may be specified, but is not required. Options to the <literal>lfs df</literal> command are
2001 <informaltable frame="all">
2003 <colspec colname="c1" colwidth="50*"/>
2004 <colspec colname="c2" colwidth="50*"/>
2008 <para><emphasis role="bold">Option</emphasis></para>
2011 <para><emphasis role="bold">Description</emphasis></para>
2018 <para> <literal>-h</literal></para>
2021 <para> Displays sizes in human readable format (for example: 1K, 234M, 5G).</para>
2026 <para> <literal role="bold">-i, --inodes</literal></para>
2029 <para> Lists inodes instead of block usage.</para>
2036 <para>The <literal>df -i</literal> and <literal>lfs df -i</literal> commands show the
2037 <emphasis role="italic">minimum</emphasis> number of inodes that can be created in the
2038 file system at the current time. If the total number of objects available across all of
2039 the OSTs is smaller than those available on the MDT(s), taking into account the default
2040 file striping, then <literal>df -i</literal> will also report a smaller number of inodes
2041 than could be created. Running <literal>lfs df -i</literal> will report the actual number
2042 of inodes that are free on each target.</para>
2043 <para>For ZFS file systems, the number of inodes that can be created is dynamic and depends
2044 on the free space in the file system. The Free and Total inode counts reported for a ZFS
2045 file system are only an estimate based on the current usage for each target. The Used
2046 inode count is the actual number of inodes used by the file system.</para>
2048 <para><emphasis role="bold">Examples</emphasis></para>
2049 <screen>[client1] $ lfs df
2050 UUID 1K-blocks Used Available Use% Mounted on
2051 mds-lustre-0_UUID 9174328 1020024 8154304 11% /mnt/lustre[MDT:0]
2052 ost-lustre-0_UUID 94181368 56330708 37850660 59% /mnt/lustre[OST:0]
2053 ost-lustre-1_UUID 94181368 56385748 37795620 59% /mnt/lustre[OST:1]
2054 ost-lustre-2_UUID 94181368 54352012 39829356 57% /mnt/lustre[OST:2]
2055 filesystem summary: 282544104 167068468 115475636 59% /mnt/lustre
2057 [client1] $ lfs df -h
2058 UUID bytes Used Available Use% Mounted on
2059 mds-lustre-0_UUID 8.7G 996.1M 7.8G 11% /mnt/lustre[MDT:0]
2060 ost-lustre-0_UUID 89.8G 53.7G 36.1G 59% /mnt/lustre[OST:0]
2061 ost-lustre-1_UUID 89.8G 53.8G 36.0G 59% /mnt/lustre[OST:1]
2062 ost-lustre-2_UUID 89.8G 51.8G 38.0G 57% /mnt/lustre[OST:2]
2063 filesystem summary: 269.5G 159.3G 110.1G 59% /mnt/lustre
2065 [client1] $ lfs df -i
2066 UUID Inodes IUsed IFree IUse% Mounted on
2067 mds-lustre-0_UUID 2211572 41924 2169648 1% /mnt/lustre[MDT:0]
2068 ost-lustre-0_UUID 737280 12183 725097 1% /mnt/lustre[OST:0]
2069 ost-lustre-1_UUID 737280 12232 725048 1% /mnt/lustre[OST:1]
2070 ost-lustre-2_UUID 737280 12214 725066 1% /mnt/lustre[OST:2]
2071 filesystem summary: 2211572 41924 2169648 1% /mnt/lustre</screen>
2073 <section remap="h3">
2075 <primary>striping</primary>
2076 <secondary>allocations</secondary>
2077 </indexterm> Stripe Allocation Methods</title>
2078 <para>Two stripe allocation methods are provided:</para>
2081 <para><emphasis role="bold">Round-robin allocator</emphasis> - When the OSTs have
2082 approximately the same amount of free space, the round-robin allocator alternates
2083 stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is
2084 evenly distributed among OSTs, regardless of the stripe count. In a simple example with
2085 eight OSTs numbered 0-7, objects would be allocated like this:</para>
2087 <screen>File 1: OST1, OST2, OST3, OST4
2088 File 2: OST5, OST6, OST7
2089 File 3: OST0, OST1, OST2, OST3, OST4, OST5
2090 File 4: OST6, OST7, OST0</screen>
2092 <para>Here are several more sample round-robin stripe orders (each letter represents an
2093 OST, and OSTs that share a letter are on the same OSS):</para>
2094 <informaltable frame="none">
2096 <colspec colname="c1" colwidth="50*"/>
2097 <colspec colname="c2" colwidth="50*"/>
2101 <para> 3: AAA</para>
2104 <para> One 3-OST OSS</para>
2109 <para> 3x3: ABABAB</para>
2112 <para> Two 3-OST OSSs</para>
2117 <para> 3x4: BBABABA</para>
2120 <para> One 3-OST OSS (A) and one 4-OST OSS (B)</para>
2125 <para> 3x5: BBABBABA</para>
2128 <para> One 3-OST OSS (A) and one 5-OST OSS (B)</para>
2133 <para> 3x3x3: ABCABCABC</para>
2136 <para> Three 3-OST OSSs</para>
2144 <para><emphasis role="bold">Weighted allocator</emphasis> - When the free space difference
2145 between the OSTs becomes significant, the weighting algorithm is used to influence OST
2146 ordering based on size (amount of free space available on each OST) and location
2147 (stripes evenly distributed across OSTs). The weighted allocator fills the emptier OSTs
2148 faster, but uses a weighted random algorithm, so the OST with the most free space is not
2149 necessarily chosen each time.</para>
2152 <para>The allocation method is determined by the amount of free-space
2153 imbalance on the OSTs. When free space is relatively balanced across
2154 OSTs, the faster round-robin allocator is used, which maximizes network
2155 balancing. The weighted allocator is used when any two OSTs are out of
2156 balance by more than the specified threshold (17% by default). The
2157 threshold between the two allocation methods is defined by the
2158 <literal>qos_threshold_rr</literal> parameter. </para>
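<para>The choice between the two allocators can be sketched in Python as
follows. The exact imbalance formula used by the MDS is not given in this
section, so the sketch assumes it is the difference between the largest and
smallest free space, expressed as a percentage of the largest. The function
name <literal>choose_allocator</literal> is hypothetical.</para>
<programlisting language="python"><![CDATA[
```python
def choose_allocator(free_bytes, qos_threshold_rr=17):
    """Return the allocator name for a list of per-OST free space values.

    Assumption: imbalance is (max - min) / max, compared as a percentage
    against qos_threshold_rr.
    """
    largest = max(free_bytes)
    smallest = min(free_bytes)
    if largest == 0:
        return "round-robin"  # nothing to balance in this model
    imbalance_pct = 100.0 * (largest - smallest) / largest
    return "weighted" if imbalance_pct > qos_threshold_rr else "round-robin"


if __name__ == "__main__":
    # Available 1K-blocks of the three OSTs in the lfs df example.
    print(choose_allocator([37850660, 37795620, 39829356]))
    # A 20 percent gap exceeds the default threshold but not a threshold of 25.
    print(choose_allocator([100, 80]))
    print(choose_allocator([100, 80], qos_threshold_rr=25))
```
]]></programlisting>
<para>With the free space values from the <literal>lfs df</literal> example,
the OSTs differ by about 5 percent under this assumption, so the round-robin
allocator would be selected.</para>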
2159 <para>To temporarily set the <literal>qos_threshold_rr</literal> to
2160 <literal>25</literal>, enter the following on each MDS:
2161 <screen>mds# lctl set_param lod.<replaceable>fsname</replaceable>*.qos_threshold_rr=25</screen></para>
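<para>The eight-OST round-robin example given earlier in this section can be
reproduced with a few lines of Python. This is a minimal sketch only: the
function name <literal>round_robin</literal> is hypothetical, the starting OST
is chosen to match the example, and both the OSS-aware ordering and the
weighted allocator are left out.</para>
<programlisting language="python"><![CDATA[
```python
def round_robin(num_osts, stripe_counts, start=0):
    """Assign consecutive OST indices to each file, wrapping around at num_osts."""
    cursor = start
    layouts = []
    for count in stripe_counts:
        layouts.append([(cursor + i) % num_osts for i in range(count)])
        # The next file continues where the previous one stopped.
        cursor = (cursor + count) % num_osts
    return layouts


if __name__ == "__main__":
    files = round_robin(8, [4, 3, 6, 3], start=1)
    for number, osts in enumerate(files, 1):
        print(f"File {number}: " + ", ".join(f"OST{i}" for i in osts))
```
]]></programlisting>
<para>Because each file starts where the previous one ended, the OST used for
stripe 0 moves through the OSTs regardless of the stripe counts
requested.</para>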
2163 <section remap="h3">
2165 <primary>space</primary>
2166 <secondary>location weighting</secondary>
2167 </indexterm>Adjusting the Weighting Between Free Space and Location</title>
2168 <para>The weighting priority used by the weighted allocator is set by
2169 the <literal>qos_prio_free</literal> parameter.
2170 Increasing the value of <literal>qos_prio_free</literal> puts more
2171 weighting on the amount of free space available on each OST and less
2172 on how stripes are distributed across OSTs. The default value is
2173 <literal>91</literal> (percent). When the free space priority is set to
2174 <literal>100</literal> (percent), weighting is based entirely on free space and location
2175 is no longer used by the striping algorithm. </para>
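<para>When weighting is based entirely on free space, the selection behaves
like a weighted random draw. The Python sketch below models only that case.
The function name <literal>pick_ost</literal> and the free space values are
hypothetical, and the location weighting that applies at lower
<literal>qos_prio_free</literal> values is not modeled.</para>
<programlisting language="python"><![CDATA[
```python
import random


def pick_ost(free_bytes, rng=random):
    """Return one OST name, chosen with probability proportional to free space."""
    names = list(free_bytes)
    weights = [free_bytes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    free = {"OST1": 100, "OST2": 200}  # OST2 has twice the free space
    picks = [pick_ost(free, rng) for _ in range(30000)]
    print({name: picks.count(name) for name in free})
```
]]></programlisting>
<para>Over many draws the OST with twice the free space is selected about
twice as often, but any single draw can still return the fuller OST.</para>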
2176 <para>To permanently change the allocator weighting to <literal>100</literal>, enter this command on the
2178 <screen>lctl conf_param <replaceable>fsname</replaceable>-MDT0000-*.lod.qos_prio_free=100</screen>
2181 <para>When <literal>qos_prio_free</literal> is set to <literal>100</literal>, a weighted
2182 random algorithm is still used to assign stripes, so, for example, if OST2 has twice as
2183 much free space as OST1, OST2 is twice as likely to be used, but it is not guaranteed to
2188 <section xml:id="wide_striping">
2190 <primary>striping</primary>
2191 <secondary>wide striping</secondary>
2192 </indexterm><indexterm>
2193 <primary>wide striping</primary>
2194 </indexterm>Lustre Striping Internals</title>
2195 <para>Individual files can only be striped over a finite number of OSTs,
2196 based on the maximum size of the attributes that can be stored on the MDT.
2197 If the MDT is ldiskfs-based without the <literal>ea_inode</literal>
2198 feature, a file can be striped across at most 160 OSTs. With ZFS-based
2199 MDTs, or if the <literal>ea_inode</literal> feature is enabled for an
2200 ldiskfs-based MDT, a file can be striped across up to 2000 OSTs.
2202 <para>Lustre inodes use an extended attribute to record on which OST each
2203 object is located, and the identifier of each object on that OST. The size of
2204 the extended attribute is a function of the number of stripes.</para>
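<para>The growth of the extended attribute with the stripe count can be
estimated with the Python sketch below. The sizes used are assumptions that
are not stated in this section: a 32-byte fixed header and 24 bytes for each
stripe, as for a plain (non-composite) layout without a pool name. The
function name <literal>layout_xattr_size</literal> is hypothetical.</para>
<programlisting language="python"><![CDATA[
```python
LOV_HEADER_BYTES = 32      # assumed size of the fixed layout header
LOV_PER_STRIPE_BYTES = 24  # assumed size of one per-stripe object entry


def layout_xattr_size(stripe_count):
    """Estimate the layout extended attribute size in bytes."""
    if stripe_count < 1:
        raise ValueError("stripe_count must be at least 1")
    return LOV_HEADER_BYTES + stripe_count * LOV_PER_STRIPE_BYTES


if __name__ == "__main__":
    for count in (1, 160, 2000):
        print(f"{count:5d} stripes: {layout_xattr_size(count)} bytes")
```
]]></programlisting>
<para>Under these assumptions a 160-stripe layout needs 3872 bytes and a
2000-stripe layout needs 48032 bytes, which is consistent with wide striping
requiring more attribute space on the MDT.</para>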
2205 <para>If using an ldiskfs-based MDT, the maximum number of OSTs over which
2206 files can be striped can be raised to 2000 by enabling the
2207 <literal>ea_inode</literal> feature on the MDT:
2208 <screen>tune2fs -O ea_inode /dev/<replaceable>mdtdev</replaceable></screen>
2210 <note><para>The maximum stripe count for a single file does not limit the
2211 maximum number of OSTs that are in the filesystem as a whole, only the
2212 maximum possible size and maximum aggregate bandwidth for the file.