X-Git-Url: https://git.whamcloud.com/?a=blobdiff_plain;f=ManagingStripingFreeSpace.xml;h=d213d1a12b619eab4ccf5f0c2e7287d7abd1f782;hb=921417b64f0c185323c17390edfc128eba980259;hp=8114b2225799361cbd8514197308158d1667c192;hpb=5e609ea889c9626dfe558e170d78ab74dda42230;p=doc%2Fmanual.git diff --git a/ManagingStripingFreeSpace.xml b/ManagingStripingFreeSpace.xml index 8114b22..d213d1a 100644 --- a/ManagingStripingFreeSpace.xml +++ b/ManagingStripingFreeSpace.xml @@ -1,29 +1,32 @@ - + + Managing File Layout (Striping) and Free Space This chapter describes file layout (striping) and I/O options, and includes the following sections: - + - + - + - + - + -
+
<indexterm> <primary>space</primary> @@ -53,17 +56,18 @@ default), the MDS then uses weighted random allocations with a preference for allocating objects on OSTs with more free space. (This can reduce I/O performance until space usage is rebalanced again.) For a more detailed description of how striping is allocated, see <xref - linkend="dbdoclet.50438209_10424"/>.</para> + linkend="file_striping.managing_free_space"/>.</para> <para>Files can only be striped over a finite number of OSTs, based on the maximum size of the attributes that can be stored on the MDT. If the MDT is ldiskfs-based without the <literal>ea_inode</literal> feature, a file can be striped across at most 160 OSTs. With a ZFS-based MDT, or if the - <literal>ea_inode</literal> feature is enabled for an ldiskfs-based MDT, + <literal>ea_inode</literal> feature is enabled for an ldiskfs-based MDT + (the default since Lustre 2.13.0), a file can be striped across up to 2000 OSTs. For more information, see <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="wide_striping"/>. </para> </section> - <section xml:id="dbdoclet.50438209_48033"> + <section xml:id="file_striping.considerations"> <title><indexterm> <primary>file layout</primary> <secondary>See striping</secondary> @@ -89,8 +93,7 @@ <para>In cases like these, a file can be striped over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. Striping across a larger number of OSSs should only be used when the file size is very large and/or is accessed by many nodes - at a time. Currently, Lustre files can be striped across up to 2000 OSTs, the maximum - stripe count for an <literal>ldiskfs</literal> file system.</para> + at a time. 
Currently, Lustre files can be striped across up to 2000 OSTs.</para> </listitem> <listitem> <para><emphasis role="bold">Improving performance when OSS bandwidth is exceeded.</emphasis> @@ -100,6 +103,18 @@ the I/O rate of the clients/jobs divided by the performance per OSS.</para> </listitem> <listitem> + <para condition="l2D"><emphasis role="bold">Matching stripes to I/O + pattern.</emphasis> When writing to a single file from multiple nodes, + having more than one client writing to a stripe can lead to issues + with lock exchange, where clients contend over writing to that stripe, + even if their I/Os do not overlap. This can be avoided if I/O can be + stripe aligned so that each stripe is accessed by only one client. + Since Lustre 2.13, the 'overstriping' feature is available, allowing more + than one stripe per OST. This is particularly helpful for the case where + thread count exceeds OST count, making it possible to match stripe count + to thread count even in this case.</para> + </listitem> + <listitem> <para><emphasis role="bold">Providing space for very large files.</emphasis> Striping is useful when a single OST does not have enough free space to hold the entire file.</para> </listitem> @@ -167,14 +182,14 @@ </itemizedlist> </section> </section> - <section xml:id="dbdoclet.50438209_78664"> + <section xml:id="file_striping.lfs_setstripe"> <title><indexterm> <primary>striping</primary> <secondary>configuration</secondary> </indexterm>Setting the File Layout/Striping Configuration (<literal>lfs setstripe</literal>) Use the lfs setstripe command to create new files with a specific file layout (stripe pattern) configuration. - lfs setstripe [--size|-s stripe_size] [--count|-c stripe_count] \ + lfs setstripe [--size|-s stripe_size] [--stripe-count|-c stripe_count] [--overstripe-count|-C stripe_count] \ [--index|-i start_ost] [--pool|-p pool_name] filename|dirname stripe_size @@ -185,10 +200,15 @@ stripe_size of 0 causes the default stripe size to be used. 
Otherwise, the stripe_size value must be a multiple of 64 KB. - stripe_count + stripe_count (--stripe-count, --overstripe-count) - - The stripe_count indicates how many OSTs to use. The default stripe_count value is 1. Setting stripe_count to 0 causes the default stripe count to be used. Setting stripe_count to -1 means stripe over all available OSTs (full OSTs are skipped). + + The stripe_count indicates how many stripes to use. + The default stripe_count value is 1. Setting + stripe_count to 0 causes the default stripe count to be + used. Setting stripe_count to -1 means stripe over all + available OSTs (full OSTs are skipped). When --overstripe-count is used, + more than one stripe is created per OST if necessary. start_ost @@ -214,14 +234,16 @@ pool_name - The pool_name specifies the OST pool to which the file will be written. - This allows limiting the OSTs used to a subset of all OSTs in the file system. For more - details about using OST pools, see Creating and Managing OST Pools. + The pool_name specifies the OST pool to which the + file will be written. This allows limiting the OSTs used to a subset of + all OSTs in the file system. For more details about using OST pools, see + + Creating and Managing OST Pools + .
Specifying a File Layout (Striping Pattern) for a Single File It is possible to specify the file layout when a new file is created using the command lfs setstripe. This allows users to override the file system default parameters to tune the file layout more optimally for their application. Execution of an lfs setstripe command fails if the file already exists. -
+
Setting the Stripe Size The command to create a new file with a specified stripe size is similar to: [client]# lfs setstripe -s 4M /mnt/lustre/new_file @@ -244,7 +266,8 @@ obdidx objid objid group The command below creates a new file with a stripe count of -1 to specify striping over all available OSTs: [client]# lfs setstripe -c -1 /mnt/lustre/full_stripe - The example below indicates that the file full_stripe is striped + The example below indicates that the file + full_stripe is striped over all six active OSTs in the configuration: [client]# lfs getstripe /mnt/lustre/full_stripe /mnt/lustre/full_stripe @@ -255,8 +278,9 @@ obdidx objid objid group 3 5 0x5 0 4 4 0x4 0 5 2 0x2 0 - This is in contrast to the output in , which - shows only a single object for the file. + This is in contrast to the output in + , + which shows only a single object for the file.
@@ -297,7 +321,7 @@ obdidx objid objid group You can use lfs setstripe to create a file on a specific OST. In the following example, the file file1 is created on the first OST (OST index is 0). - $ lfs setstripe --count 1 --index 0 file1 + $ lfs setstripe --stripe-count 1 --index 0 file1 $ dd if=/dev/zero of=file1 count=1 bs=100M 1+0 records in 1+0 records out @@ -308,12 +332,12 @@ lmm_stripe_count: 1 lmm_stripe_size: 1048576 lmm_pattern: 1 lmm_layout_gen: 0 -lmm_stripe_offset: 0 - obdidx objid objid group +lmm_stripe_offset: 0 + obdidx objid objid group 0 37364 0x91f4 0
-
+
<indexterm><primary>striping</primary><secondary>getting information</secondary></indexterm>Retrieving File Layout/Striping Information (<literal>getstripe</literal>) The lfs getstripe command is used to display information that shows over which OSTs a file is distributed. For each OST, the index and UUID are displayed, along @@ -356,12 +380,12 @@ osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp striping remote directories Locating the MDT for a remote directory - Lustre software release 2.4 can be configured with - multiple MDTs in the same file system. Each sub-directory can have a - different MDT. To identify on which MDT a given subdirectory is - located, pass the getstripe [--mdt-index|-M] - parameters to lfs. An example of this command is - provided in the section . + Lustre can be configured with multiple MDTs in the same file + system. Each directory and file could be located on a different MDT. + To identify on which MDT a given subdirectory is located, pass the + getstripe [--mdt-index|-M] parameter to + lfs. An example of this command is provided in + the section .
@@ -1263,7 +1287,687 @@ $ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commnfile flag ^init here.
-
+ +
+ + <indexterm><primary>striping</primary><secondary>SEL</secondary> + </indexterm>Self-Extending Layout (SEL) + The Lustre Self-Extending Layout (SEL) feature is an extension of the + feature, which allows the MDS to change the defined + PFL layout dynamically. With this feature, the MDS monitors the used space + on OSTs and swaps the OSTs for the current file when they are low on space. + This avoids ENOSPC problems for SEL files when + applications are writing to them. + Whereas PFL delays the instantiation of some components until an IO + operation occurs on this region, SEL allows splitting such non-instantiated + components in two parts: an “extendable” component and an “extension” + component. The extendable component is a regular PFL component, covering + just a part of the region, which is small originally. The extension (or SEL) + component is a new component type which is always non-instantiated and + unassigned, covering the other part of the region. When a write reaches this + unassigned space, and the client calls the MDS to have it instantiated, the + MDS makes a decision as to whether to grant additional space to the extendable + component. The granted region moves from the head of the extension + component to the tail of the extendable component, thus the extendable + component grows and the SEL one is shortened. Therefore, it allows the file + to continue on the same OSTs, or in the case where space is low on one of + the current OSTs, to modify the layout to switch to a new component on new + OSTs. In particular, it lets IO automatically spill over to a large HDD OST + pool once a small SSD OST pool is getting low on space. + The default extension policy modifies the layout in the following + ways: + + + Extension: continue on the same OSTs – used when not low on space + on any of the OSTs of the current component; a particular extent is + granted to the extendable component. 
+ + + Spill over: switch to the next component's OSTs – used only for + a component that is not the last one, when at least one + of the current OSTs is low on space; the whole region of the SEL + component moves to the next component and the SEL component is then + removed. + + + Repeating: create a new component with the same layout but on + free OSTs – used only for the last component when + at least one of the current OSTs is low on space; the new + component has the same layout but is instantiated on different OSTs (from + the same pool) which have enough space. + + + Forced extension: continue with the current component's OSTs despite + the low-on-space condition – used only for the last component when + a repeating attempt also detected a low-on-space condition, so spillover + is impossible and repeating would not help. + + + The SEL feature does not require clients to understand the SEL + format of already created files; only MDS support is needed, which was + introduced in Lustre 2.13. However, old clients will have some limitations + as their Lustre tools will not support it. +
+ <literal>lfs setstripe</literal> + The lfs setstripe command is used to create files + with composite layouts, as well as add or delete components to or from an + existing file. It is extended to support SEL components. +
+ Create a SEL file + Command + lfs setstripe +[--component-end|-E end1] [STRIPE_OPTIONS] ... filename + +STRIPE OPTIONS: +--extension-size, --ext-size, -z <ext_size> + The -z option is added to specify the size of + the region which is granted to the extendable component on each + iteration. When declaring any component, this option turns the declared + component into a pair of components: an extendable one and an extension one. + Example + The following command creates 2 pairs of extendable and + extension components: + # lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file +
+ Example: create a SEL file + + + + + + Example: create a SEL file + + +
+
+ As usual, only the first PFL component is instantiated at + the creation time, thus it is immediately extended to the extension + size (64M for the first component), whereas the third component is left + zero-length. + # lfs getstripe /mnt/lustre/file +/mnt/lustre/file + lcm_layout_gen: 4 + lcm_mirror_count: 1 + lcm_entry_count: 4 + lcme_id: 1 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 0 + lcme_extent.e_end: 67108864 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 0 + lmm_objects: + - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } + + lcme_id: 2 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 67108864 + lcme_extent.e_end: 1073741824 + lmm_stripe_count: 0 + lmm_extension_size: 67108864 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + + lcme_id: 3 + lcme_mirror_id: 0 + lcme_flags: 0 + lcme_extent.e_start: 1073741824 + lcme_extent.e_end: 1073741824 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + + lcme_id: 4 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 1073741824 + lcme_extent.e_end: EOF + lmm_stripe_count: 0 + lmm_extension_size: 268435456 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 +
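The extents in the listing above can be reproduced with a little arithmetic. The following Python sketch is a toy model only, not Lustre code: the helper names `parse_size` and `initial_sel_layout` are hypothetical, and the sketch assumes (as the text states) that only the first component is instantiated at creation time and is therefore the only one extended by its extension size.

```python
# Toy model of the initial SEL layout; names are illustrative, not Lustre APIs.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert '64M' or '1G' to bytes; '-1' means EOF and is returned as None."""
    if text == "-1":
        return None
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def initial_sel_layout(spec):
    """spec is a list of (component_end, extension_size) strings, in order.

    Each declared component becomes a pair: an extendable component
    followed by an extension component covering the rest of the region.
    """
    layout = []
    start = 0
    for index, (end_text, ext_text) in enumerate(spec):
        end = parse_size(end_text)
        ext = parse_size(ext_text)
        # Assumption: only the first component is instantiated at creation,
        # so only it grows by one extension size; the others stay zero-length.
        split = start + ext if index == 0 else start
        if end is not None:
            split = min(split, end)
        layout.append(("extendable", start, split))
        layout.append(("extension", split, end))
        start = end
    return layout


# Mirrors: lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file
layout = initial_sel_layout([("1G", "64M"), ("-1", "256M")])
for kind, e_start, e_end in layout:
    print(kind, e_start, "EOF" if e_end is None else e_end)
```

The four printed extents correspond to components 1 to 4 in the `lfs getstripe` output, with `None` standing in for `EOF`.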
+
+ Create a SEL layout template + Similar to PFL, it is possible to set a SEL layout template to + a directory. After that, all the files created under it will inherit this + layout by default. + # lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/dir +# ./lustre/utils/lfs getstripe /mnt/lustre/dir +/mnt/lustre/dir + lcm_layout_gen: 0 + lcm_mirror_count: 1 + lcm_entry_count: 4 + lcme_id: N/A + lcme_mirror_id: N/A + lcme_flags: 0 + lcme_extent.e_start: 0 + lcme_extent.e_end: 67108864 + stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1 + + lcme_id: N/A + lcme_mirror_id: N/A + lcme_flags: extension + lcme_extent.e_start: 67108864 + lcme_extent.e_end: 1073741824 + stripe_count: 1 extension_size: 67108864 pattern: raid0 stripe_offset: -1 + + lcme_id: N/A + lcme_mirror_id: N/A + lcme_flags: 0 + lcme_extent.e_start: 1073741824 + lcme_extent.e_end: 1073741824 + stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1 + + lcme_id: N/A + lcme_mirror_id: N/A + lcme_flags: extension + lcme_extent.e_start: 1073741824 + lcme_extent.e_end: EOF + stripe_count: 1 extension_size: 268435456 pattern: raid0 stripe_offset: -1 + +
+
+
+ <literal>lfs getstripe</literal> + lfs getstripe commands can be used to list the + striping/component information for a given SEL file. Here, only those parameters + new for SEL files are shown. + Command + lfs getstripe +[--extension-size|--ext-size|-z] filename + The -z option is added to print the extension + size in bytes. For composite files this is the extension size of the + first extension component. If a particular component is identified by + other options (--component-id, --component-start, + etc.), that component's extension size is printed. + Example 1: List SEL component information + + Suppose we already have a composite file + /mnt/lustre/file, created by the following command: + # lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file + The 2nd component could be listed with the following command: + # lfs getstripe -I2 /mnt/lustre/file +/mnt/lustre/file + lcm_layout_gen: 4 + lcm_mirror_count: 1 + lcm_entry_count: 4 + lcme_id: 2 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 67108864 + lcme_extent.e_end: 1073741824 + lmm_stripe_count: 0 + lmm_extension_size: 67108864 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + + As you can see, the SEL components are marked by the + extension flag and the lmm_extension_size field + keeps the specified extension size. + Example 2: List the extension size + Having the same file as in the above example, the extension size of + the second component could be listed with: + # lfs getstripe -z -I2 /mnt/lustre/file +67108864 + Example 3: Extension + Having the same file as in the above example, suppose there is a + write which crosses the end of the first component (64M), and then another + write which crosses the new end of the first component (128M). + The layout changes as follows: +
+ Example: an extension of a SEL file + + + + + + Example: an extension of a SEL file + + +
+ The layout can be printed out by the following command: + # lfs getstripe /mnt/lustre/file +/mnt/lustre/file + lcm_layout_gen: 6 + lcm_mirror_count: 1 + lcm_entry_count: 4 + lcme_id: 1 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 0 + lcme_extent.e_end: 201326592 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 0 + lmm_objects: + - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } + + lcme_id: 2 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 201326592 + lcme_extent.e_end: 1073741824 + lmm_stripe_count: 0 + lmm_extension_size: 67108864 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + + lcme_id: 3 + lcme_mirror_id: 0 + lcme_flags: 0 + lcme_extent.e_start: 1073741824 + lcme_extent.e_end: 1073741824 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + + lcme_id: 4 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 1073741824 + lcme_extent.e_end: EOF + lmm_stripe_count: 0 + lmm_extension_size: 268435456 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + Example 4: Spillover + In case where OST0 is low on space and an IO + happens to a SEL component, a spillover happens: the full region of the + SEL component is added to the next component, e.g. in the example above + the next layout modification will look like: +
+ Example: a spillover in a SEL file + + + + + + Example: a spillover in a SEL file + + +
+ Despite the fact the third component was [1G, 1G] originally, + while it is not instantiated, instead of getting extended backward, it is + moved backward to the start of the previous SEL component (192M) and + extended on its extension size (256M) from that position, thus it becomes + [192M, 448M]. + # lfs getstripe /mnt/lustre/file +/mnt/lustre/file + lcm_layout_gen: 7 + lcm_mirror_count: 1 + lcm_entry_count: 3 + lcme_id: 1 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 0 + lcme_extent.e_end: 201326592 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 0 + lmm_objects: + - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } + + lcme_id: 3 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 201326592 + lcme_extent.e_end: 469762048 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 1 + lmm_objects: + - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] } + + lcme_id: 4 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 469762048 + lcme_extent.e_end: EOF + lmm_stripe_count: 0 + lmm_extension_size: 268435456 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + Example 5: Repeating + Suppose in the example above, OST0 got + enough free space back but OST1 is low on space, + the following write to the last SEL component leads to a new component + allocation before the SEL component, which repeats the previous + component layout but instantiated on free OSTs: +
+ Example: repeat a SEL component + + + + + + Example: repeat a SEL component + + + +
+ # lfs getstripe /mnt/lustre/file +/mnt/lustre/file + lcm_layout_gen: 9 + lcm_mirror_count: 1 + lcm_entry_count: 4 + lcme_id: 1 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 0 + lcme_extent.e_end: 201326592 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 0 + lmm_objects: + - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } + + lcme_id: 3 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 201326592 + lcme_extent.e_end: 469762048 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 1 + lmm_objects: + - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] } + + lcme_id: 8 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 469762048 + lcme_extent.e_end: 738197504 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 65535 + lmm_stripe_offset: 0 + lmm_objects: + - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] } + + lcme_id: 4 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 738197504 + lcme_extent.e_end: EOF + lmm_stripe_count: 0 + lmm_extension_size: 268435456 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 + Example 6: Forced extension + Suppose in the example above, both OST0 and + OST1 are low on space, the following write to the + last SEL component will behave as an extension as there is no sense to + repeat. +
+ Example: forced extension in a SEL file + + + + + + Example: forced extension in a SEL file. + + + +
+ # lfs getstripe /mnt/lustre/file +/mnt/lustre/file + lcm_layout_gen: 11 + lcm_mirror_count: 1 + lcm_entry_count: 4 + lcme_id: 1 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 0 + lcme_extent.e_end: 201326592 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 0 + lmm_objects: + - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] } + + lcme_id: 3 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 201326592 + lcme_extent.e_end: 469762048 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: 1 + lmm_objects: + - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x8:0x0] } + + lcme_id: 8 + lcme_mirror_id: 0 + lcme_flags: init + lcme_extent.e_start: 469762048 + lcme_extent.e_end: 1006632960 + lmm_stripe_count: 1 + lmm_stripe_size: 1048576 + lmm_pattern: raid0 + lmm_layout_gen: 65535 + lmm_stripe_offset: 0 + lmm_objects: + - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x6:0x0] } + + lcme_id: 4 + lcme_mirror_id: 0 + lcme_flags: extension + lcme_extent.e_start: 1006632960 + lcme_extent.e_end: EOF + lmm_stripe_count: 0 + lmm_extension_size: 268435456 + lmm_pattern: raid0 + lmm_layout_gen: 0 + lmm_stripe_offset: -1 +
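Examples 3 to 6 can be replayed with a small simulation. The Python sketch below is a toy model under stated assumptions, not the MDS implementation: the function names, the dictionary fields and the "first OST that is not low on space" selection rule are all hypothetical, every component is assumed to use a single OST, and the component ID 8 is simply copied from the sample output rather than computed.

```python
# Toy replay of the SEL extension policy; illustrative only, not Lustre code.
M = 1 << 20
G = 1 << 30
EOF = None  # open-ended extent


def comp(cid, start, end, ost=None):
    return {"id": cid, "kind": "extendable", "start": start, "end": end, "ost": ost}


def sel(cid, start, end, ext_size):
    return {"id": cid, "kind": "extension", "start": start, "end": end,
            "ext_size": ext_size}


def pick_free_ost(all_osts, low_osts):
    """Assumed selection rule: first OST that is not low on space."""
    for ost in all_osts:
        if ost not in low_osts:
            return ost
    return None


def grant(layout, i):
    """Move one extension-size region from the SEL component to layout[i]."""
    cur, ext = layout[i], layout[i + 1]
    new_end = cur["end"] + ext["ext_size"]
    if ext["end"] is not EOF:
        new_end = min(new_end, ext["end"])
    cur["end"] = ext["start"] = new_end


def write_past_end(layout, i, all_osts, low_osts, new_id=None):
    """A write reaches the SEL component that follows extendable layout[i]."""
    cur = layout[i]
    is_last = i + 2 == len(layout)
    free_ost = pick_free_ost(all_osts, low_osts)
    if cur["ost"] not in low_osts:
        grant(layout, i)
        return "extension"
    if not is_last:
        removed = layout.pop(i + 1)          # the SEL component is removed
        nxt = layout[i + 1]
        nxt["start"] = nxt["end"] = removed["start"]
        nxt["ost"] = free_ost
        grant(layout, i + 1)                 # next component extends itself
        return "spill over"
    if free_ost is not None:
        ext = layout[i + 1]
        layout.insert(i + 1, comp(new_id, ext["start"], ext["start"], free_ost))
        grant(layout, i + 1)
        return "repeating"
    grant(layout, i)
    return "forced extension"


# Layout right after: lfs setstripe -E 1G -z 64M -E -1 -z 256M /mnt/lustre/file
layout = [comp(1, 0, 64 * M, ost=0), sel(2, 64 * M, G, 64 * M),
          comp(3, G, G), sel(4, G, EOF, 256 * M)]
osts = [0, 1]
steps = [
    write_past_end(layout, 0, osts, set()),            # Example 3, first write
    write_past_end(layout, 0, osts, set()),            # Example 3, second write
    write_past_end(layout, 0, osts, {0}),              # Example 4
    write_past_end(layout, 1, osts, {1}, new_id=8),    # Example 5
    write_past_end(layout, 2, osts, {0, 1}),           # Example 6
]
for step in steps:
    print(step)
for c in layout:
    print(c["id"], c["start"], "EOF" if c["end"] is EOF else c["end"])
```

The final extents match the `lcme_extent` values in the Example 6 listing, which suggests that each step grants exactly one extension-size region; the real policy may take further inputs that this sketch ignores.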
+
+ <literal>lfs find</literal> + lfs find commands can be used to search for + the files that match the given SEL component parameters. Here, only + those parameters new for the SEL files are shown. + lfs find +[[!] --extension-size|--ext-size|-z [+-]ext-size[KMG]] +[[!] --component-flags=extension] + The -z option is added to specify the extension + size to search for. The files which have any component with an + extension size matching the given criteria are printed out. As always, + “+” and “-” signs are allowed to specify a lower or an upper bound on the size. + + A new extension component flag is added. Only + files which have at least one SEL component are printed. + The negative search for flags searches the files which + have a non-SEL component (not files + which do not have any SEL component). + + Example + # lfs setstripe --extension-size 64M -c 1 -E -1 /mnt/lustre/file + +# lfs find --comp-flags extension /mnt/lustre/* +/mnt/lustre/file + +# lfs find ! --comp-flags extension /mnt/lustre/* +/mnt/lustre/file + +# lfs find -z 64M /mnt/lustre/* +/mnt/lustre/file + +# lfs find -z +64M /mnt/lustre/* + +# lfs find -z -64M /mnt/lustre/* + +# lfs find -z +63M /mnt/lustre/* +/mnt/lustre/file + +# lfs find -z -65M /mnt/lustre/* +/mnt/lustre/file + +# lfs find -z 65M /mnt/lustre/* + +# lfs find ! -z 64M /mnt/lustre/* + +# lfs find ! -z +64M /mnt/lustre/* +/mnt/lustre/file + +# lfs find ! -z -64M /mnt/lustre/* +/mnt/lustre/file + +# lfs find ! -z +63M /mnt/lustre/* + +# lfs find ! -z -65M /mnt/lustre/* + +# lfs find ! -z 65M /mnt/lustre/* +/mnt/lustre/file +
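The `-z` results in the example above are consistent with a simple reading of the sign prefixes: no sign means "equal to", `+` means "strictly greater than" and `-` means "strictly less than". The Python sketch below encodes only that reading. It is an assumption drawn from this one example, the names `size_matches` and `find_ext_size` are hypothetical, only the `M` suffix is handled, and the real `lfs find` may round or compare sizes differently.

```python
# Toy model of "lfs find [!] -z [+-]size"; illustrative only.
M = 1 << 20


def size_matches(actual, criterion):
    """criterion looks like '64M', '+63M' or '-65M' (M suffix only here)."""
    sign = criterion[0] if criterion[0] in "+-" else ""
    wanted = int(criterion.lstrip("+-").rstrip("M")) * M
    if sign == "+":
        return actual > wanted
    if sign == "-":
        return actual < wanted
    return actual == wanted


def find_ext_size(ext_sizes, criterion, negate=False):
    """ext_sizes holds the extension sizes of the file's SEL components."""
    hit = any(size_matches(size, criterion) for size in ext_sizes)
    return hit != negate


# The file from the example has a single SEL component of 64M.
file_ext_sizes = [64 * M]
for crit in ["64M", "+64M", "-64M", "+63M", "-65M", "65M"]:
    print(crit, find_ext_size(file_ext_sizes, crit),
          find_ext_size(file_ext_sizes, crit, negate=True))
```

Each `True` in the second column corresponds to a command in the example that prints `/mnt/lustre/file`, and each `True` in the third column to the negated (`!`) form of that command.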
+
+ +
+ + <indexterm><primary>striping</primary><secondary>Foreign</secondary> + </indexterm>Foreign Layout + The Lustre Foreign Layout feature is an extension of both the + LOV and LMV formats which allows the creation of empty files and directories + with the necessary specifications to point to corresponding objects outside + the Lustre namespace. + The new LOV/LMV foreign internal format can be represented as: +
+ LOV/LMV foreign format + + + + + + LOV/LMV foreign format + + +
+
+ <literal>lfs set[dir]stripe</literal> + The lfs set[dir]stripe commands are used to + create files or directories with foreign layouts, by calling the + corresponding API, itself invoking the appropriate ioctl(). +
+ Create a Foreign file/dir + Command + lfs set[dir]stripe \ +--foreign[=<foreign_type>] --xattr|-x <layout_string> \ +[--flags <hex_bitmask>] [--mode <mode_bits>] \ +{file,dir}name + Both the --foreign and + --xattr|-x options are mandatory. + The <foreign_type> (default is "none", meaning + no special behavior), and both --flags and + --mode (default is 0666) options are optional. + Example + The following command creates a foreign file of "none" type and + with "foo@bar" LOV content and specific mode and flags: + # lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \ +--xattr=foo@bar /mnt/lustre/file +
+ Example: create a foreign file + + + + + + Example: create a foreign file + + +
+
+
+
+
+ <literal>lfs get[dir]stripe</literal> + lfs get[dir]stripe commands can be used to + retrieve foreign LOV/LMV information and content. + Command + lfs get[dir]stripe [-v] filename + List foreign layout information + + Suppose we already have a foreign file + /mnt/lustre/file, created by the following command: + # lfs setstripe --foreign=none --flags=0xda08 --mode=0640 \ +--xattr=foo@bar /mnt/lustre/file + The full foreign layout information can be listed using the + following command: + # lfs getstripe -v /mnt/lustre/file +/mnt/lustre/file + lfm_magic: 0x0BD70BD0 + lfm_length: 7 + lfm_type: none + lfm_flags: 0x0000DA08 + lfm_value: foo@bar + + As you can see, the lfm_length field + value is the number of characters in the variable-length + lfm_value field. +
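As a quick check of the listing above, the Python sketch below derives the displayed `lfm_length` and `lfm_flags` values from the `--xattr` and `--flags` arguments. It only mirrors how the fields are printed in this example; the helper name `foreign_fields` is hypothetical, the length is assumed to be counted in bytes of the UTF-8 string, and nothing here reflects the on-disk encoding of the foreign layout.

```python
# Toy rendering of the foreign layout fields shown by "lfs getstripe -v".
def foreign_fields(value, flags, foreign_type="none"):
    data = value.encode("utf-8")
    return {
        "lfm_length": len(data),            # assumed: byte length of lfm_value
        "lfm_type": foreign_type,
        "lfm_flags": f"0x{flags:08X}",      # same zero-padded form as the listing
        "lfm_value": value,
    }


fields = foreign_fields("foo@bar", 0xDA08)
for name, shown in fields.items():
    print(f"{name}: {shown}")
```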
+
+ <literal>lfs find</literal> + lfs find commands can be used to search for + all the foreign files/directories or those that match the given + selection parameters. + lfs find +[[!] --foreign[=<foreign_type>]] + The --foreign[=<foreign_type>] option + selects all files and/or directories with a foreign layout, optionally + restricted to those of <foreign_type>. Prefixing the option with + ! negates the match. + Example + # lfs setstripe --foreign=none --xattr=foo@bar /mnt/lustre/file +# touch /mnt/lustre/file2 + +# lfs find --foreign /mnt/lustre/* +/mnt/lustre/file + +# lfs find ! --foreign /mnt/lustre/* +/mnt/lustre/file2 + +# lfs find --foreign=none /mnt/lustre/* +/mnt/lustre/file +
+
+ +
<indexterm> <primary>space</primary> <secondary>free space</secondary> @@ -1292,16 +1996,17 @@ $ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commnfile</screen> <literal>lctl set_param</literal> command, for example the next command reserve 1GB space for all OSTs. <screen>lctl set_param -P osp.*.reserved_mb_low=1024</screen></para> - <para>This section describes how to check available free space on disks and how free space is - allocated. It then describes how to set the threshold and weighting factors for the allocation - algorithms.</para> - <section xml:id="dbdoclet.checking_free_space"> + <para>This section describes how to check available free space on disks + and how free space is allocated. It then describes how to set the + threshold and weighting factors for the allocation algorithms.</para> + <section xml:id="file_striping.checking_free_space"> <title>Checking File System Free Space - Free space is an important consideration in assigning file stripes. The lfs - df command can be used to show available disk space on the mounted Lustre file - system and space consumption per OST. If multiple Lustre file systems are mounted, a path - may be specified, but is not required. Options to the lfs df command are - shown below. + Free space is an important consideration in assigning file stripes. + The lfs df command can be used to show available + disk space on the mounted Lustre file system and space consumption per + OST. If multiple Lustre file systems are mounted, a path may be + specified, but is not required. Options to the lfs df + command are shown below. @@ -1319,10 +2024,25 @@ $ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commnfile - -h + + -h, --human-readable + + + + Displays sizes in human readable format (for example: 1K, + 234M, 5G) using base-2 (binary) values (i.e. 1G = 1024M). + + + + + + -H, --si + - Displays sizes in human readable format (for example: 1K, 234M, 5G). 
+ Like -h, this displays counts in human + readable format, but using base-10 (decimal) values + (i.e. 1G = 1000M). @@ -1333,46 +2053,151 @@ $ lfs setstripe -c 1 /mnt/testfs/testdir/dir_3comp/commnfile Lists inodes instead of block usage. + + + -l, --lazy + + + Do not attempt to contact any OST or MDT not currently + connected to the client. This avoids blocking the + lfs df output if a target is offline or + unreachable, and only returns the space on OSTs that can + currently be accessed. + + + + + -p, --pool + + + Limit the usage to report only OSTs that are in the + specified pool. If multiple + Lustre filesystems are mounted, list the OSTs in + pool for each filesystem, or + limit the display to only a pool for a specific filesystem + if fsname.pool is given. + Specifying both fsname and + pool is equivalent to providing + a specific mountpoint. + + + + + + + -v, --verbose + + + + Display verbose status of MDTs and OSTs. This may + include one or more optional flags at the end of each line. + + + + + lfs df may also report additional target status + as the last column in the display, if there are issues with that target. + Target states include: + + + + D: OST/MDT is Degraded. + The target has a failed drive in the RAID device, or is + undergoing RAID reconstruction. This state is marked on + the server automatically for ZFS targets via + zed, or a (user-supplied) script that + monitors the target device and sets + "lctl set_param obdfilter.target.degraded=1" + on the OST. This target will be avoided for new + allocations, but will still be used to read existing files + located there or if there are not enough non-degraded OSTs + to make up a widely-striped file. + + + R: OST/MDT is Read-only. + The target filesystem is marked read-only due to filesystem + corruption detected by ldiskfs or ZFS. No modifications + are allowed on this OST, and it needs to be unmounted and + e2fsck or zpool scrub + run to repair the underlying filesystem. 
+ + + N: OST/MDT is No-precreate. + The target is configured to deny object precreation, as set by the + "lctl set_param obdfilter.target.no_precreate=1" + parameter or the "-o no_precreate" mount option. + This may be done to add an OST to the filesystem without allowing + objects to be allocated on it yet, or for other reasons. + + + S: OST/MDT is out of Space. + The target filesystem has less than the minimum required + free space and will not be used for new object allocations + until it has more free space. + + + I: OST/MDT is out of Inodes. + The target filesystem has less than the minimum required + free inodes and will not be used for new object allocations + until it has more free inodes. + + + f: OST/MDT is on flash. + The target filesystem is using a flash (non-rotational) + storage device. This is normally detected from the + underlying Linux block device, but can be set manually + with "lctl set_param osd-*.*.nonrotational=1" + on the respective OSTs. This lower-case status is only + shown in conjunction with the -v option, + since it is not an error condition. + + - The df -i and lfs df -i commands show the - minimum number of inodes that can be created in the - file system at the current time. If the total number of objects available across all of - the OSTs is smaller than those available on the MDT(s), taking into account the default - file striping, then df -i will also report a smaller number of inodes - than could be created. Running lfs df -i will report the actual number - of inodes that are free on each target. - For ZFS file systems, the number of inodes that can be created is dynamic and depends - on the free space in the file system. The Free and Total inode counts reported for a ZFS - file system are only an estimate based on the current usage for each target. The Used - inode count is the actual number of inodes used by the file system.
+ The df -i and lfs df -i + commands show the minimum number + of inodes that can be created in the file system at the current time. + If the total number of objects available across all of the OSTs is + smaller than those available on the MDT(s), taking into account the + default file striping, then df -i will also + report a smaller number of inodes than could be created. Running + lfs df -i will report the actual number of inodes + that are free on each target. + + For ZFS file systems, the number of inodes that can be created + is dynamic and depends on the free space in the file system. The + Free and Total inode counts reported for a ZFS file system are only + an estimate based on the current usage for each target. The Used + inode count is the actual number of inodes used by the file system. + Examples - [client1] $ lfs df -UUID 1K-blockS Used Available Use% Mounted on -mds-lustre-0_UUID 9174328 1020024 8154304 11% /mnt/lustre[MDT:0] -ost-lustre-0_UUID 94181368 56330708 37850660 59% /mnt/lustre[OST:0] -ost-lustre-1_UUID 94181368 56385748 37795620 59% /mnt/lustre[OST:1] -ost-lustre-2_UUID 94181368 54352012 39829356 57% /mnt/lustre[OST:2] -filesystem summary: 282544104 167068468 39829356 57% /mnt/lustre + client$ lfs df +UUID 1K-blocks Used Available Use% Mounted on +testfs-MDT0000_UUID 9174328 1020024 8154304 11% /mnt/lustre[MDT:0] +testfs-OST0000_UUID 94181368 56330708 37850660 59% /mnt/lustre[OST:0] +testfs-OST0001_UUID 94181368 56385748 37795620 59% /mnt/lustre[OST:1] +testfs-OST0002_UUID 94181368 54352012 39829356 57% /mnt/lustre[OST:2] +filesystem summary: 282544104 167068468 115475636 59% /mnt/lustre -[client1] $ lfs df -h -UUID bytes Used Available Use% Mounted on -mds-lustre-0_UUID 8.7G 996.1M 7.8G 11% /mnt/lustre[MDT:0] -ost-lustre-0_UUID 89.8G 53.7G 36.1G 59% /mnt/lustre[OST:0] -ost-lustre-1_UUID 89.8G 53.8G 36.0G 59% /mnt/lustre[OST:1] -ost-lustre-2_UUID 89.8G 51.8G 38.0G 57% /mnt/lustre[OST:2] -filesystem summary: 269.5G 159.3G 110.1G 59% 
/mnt/lustre +[client1] $ lfs df -hv +UUID bytes Used Available Use% Mounted on +testfs-MDT0000_UUID 8.7G 996.1M 7.8G 11% /mnt/lustre[MDT:0] +testfs-OST0000_UUID 89.8G 53.7G 36.1G 59% /mnt/lustre[OST:0] f +testfs-OST0001_UUID 89.8G 53.8G 36.0G 59% /mnt/lustre[OST:1] f +testfs-OST0002_UUID 89.8G 51.8G 38.0G 57% /mnt/lustre[OST:2] f +filesystem summary: 269.5G 159.3G 110.1G 59% /mnt/lustre -[client1] $ lfs df -i -UUID Inodes IUsed IFree IUse% Mounted on -mds-lustre-0_UUID 2211572 41924 2169648 1% /mnt/lustre[MDT:0] -ost-lustre-0_UUID 737280 12183 725097 1% /mnt/lustre[OST:0] -ost-lustre-1_UUID 737280 12232 725048 1% /mnt/lustre[OST:1] -ost-lustre-2_UUID 737280 12214 725066 1% /mnt/lustre[OST:2] -filesystem summary: 2211572 41924 2169648 1% /mnt/lustre[OST:2] +[client1] $ lfs df -iH +UUID Inodes IUsed IFree IUse% Mounted on +testfs-MDT0000_UUID 2.21M 41.9k 2.17M 1% /mnt/lustre[MDT:0] +testfs-OST0000_UUID 737.3k 12.1k 725.1k 1% /mnt/lustre[OST:0] +testfs-OST0001_UUID 737.3k 12.2k 725.0k 1% /mnt/lustre[OST:1] +testfs-OST0002_UUID 737.3k 12.2k 725.0k 1% /mnt/lustre[OST:2] +filesystem summary: 2.21M 41.9k 2.17M 1% /mnt/lustre +
<indexterm> @@ -1453,31 +2278,33 @@ File 4: OST6, OST7, OST0</screen> necessarily chosen each time.</para> </listitem> </itemizedlist> - <para>The allocation method is determined by the amount of free-space imbalance on the OSTs. - When free space is relatively balanced across OSTs, the faster round-robin allocator is - used, which maximizes network balancing. The weighted allocator is used when any two OSTs - are out of balance by more than the specified threshold (17% by default). The threshold - between the two allocation methods is defined in the file - <literal>/proc/fs/<replaceable>fsname</replaceable>/lov/<replaceable>fsname</replaceable>-mdtlov/qos_threshold_rr</literal>. </para> - <para>To set the <literal>qos_threshold_r</literal> to <literal>25</literal>, enter this - command on the - MGS:<screen>lctl set_param lov.<replaceable>fsname</replaceable>-mdtlov.qos_threshold_rr=25</screen></para> + <para>The allocation method is determined by the amount of free-space + imbalance on the OSTs. When free space is relatively balanced across + OSTs, the faster round-robin allocator is used, which maximizes network + balancing. The weighted allocator is used when any two OSTs are out of + balance by more than the specified threshold (17% by default). The + threshold between the two allocation methods is defined by the + <literal>qos_threshold_rr</literal> parameter. </para> + <para>To temporarily set the <literal>qos_threshold_rr</literal> to + <literal>25</literal>, enter the following on each MDS: + <screen>mds# lctl set_param lod.<replaceable>fsname</replaceable>*.qos_threshold_rr=25</screen></para> </section> <section remap="h3"> <title><indexterm> <primary>space</primary> <secondary>location weighting</secondary> </indexterm>Adjusting the Weighting Between Free Space and Location - The weighting priority used by the weighted allocator is set in the file - /proc/fs/fsname/lov/fsname-mdtlov/qos_prio_free. 
- Increasing the value of qos_prio_free puts more weighting on the amount - of free space available on each OST and less on how stripes are distributed across OSTs. The - default value is 91 (percent). When the free space priority is set to + The weighting priority used by the weighted allocator is set by + the qos_prio_free parameter. + Increasing the value of qos_prio_free puts more + weighting on the amount of free space available on each OST and less + on how stripes are distributed across OSTs. The default value is + 91 (percent). When the free space priority is set to 100 (percent), weighting is based entirely on free space and location is no longer used by the striping algorithm. - To change the allocator weighting to 100, enter this command on the + To permanently change the allocator weighting to 100, enter this command on the MGS: - lctl conf_param fsname-MDT0000.lov.qos_prio_free=100 + lctl conf_param fsname-MDT0000-*.lod.qos_prio_free=100 . When qos_prio_free is set to 100, a weighted @@ -1509,9 +2336,15 @@ File 4: OST6, OST7, OST0 ea_inode feature on the MDT: tune2fs -O ea_inode /dev/mdtdev + Since Lustre 2.13, the + ea_inode feature is enabled by default on all newly + formatted ldiskfs MDT filesystems. The maximum stripe count for a single file does not limit the maximum number of OSTs that are in the filesystem as a whole, only the maximum possible size and maximum aggregate bandwidth for the file. 
+
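As a rough illustration of the qos_threshold_rr check described earlier in this patch, the sketch below compares the free space on the emptiest and fullest OSTs and reports which allocator that rule would suggest. It is a simplified reading of the sentence "any two OSTs are out of balance by more than the specified threshold", not the MDS's actual computation, which may take other factors into account. The function names free_space_imbalance and pick_allocator are hypothetical, and the input values are the Available column of the lfs df example above.

```python
# Simplified sketch (an assumption, not Lustre's implementation): estimate the
# free-space imbalance between OSTs and compare it to qos_threshold_rr.

def free_space_imbalance(avail_kb):
    """Return the relative gap (0.0 to 1.0) between the OST with the most
    free space and the OST with the least free space."""
    most = max(avail_kb)
    least = min(avail_kb)
    if most == 0:
        # No OST has any free space, so there is nothing to compare.
        return 0.0
    return (most - least) / most


def pick_allocator(avail_kb, threshold_pct=17):
    """Return "weighted" if the imbalance exceeds the threshold (17% is the
    documented default for qos_threshold_rr), otherwise "round-robin"."""
    imbalance_pct = free_space_imbalance(avail_kb) * 100
    return "weighted" if imbalance_pct > threshold_pct else "round-robin"


if __name__ == "__main__":
    # Available 1K-blocks for OST0000..OST0002 from the "lfs df" example.
    avail = [37850660, 37795620, 39829356]
    print(f"imbalance: {free_space_imbalance(avail) * 100:.1f}%")
    print(f"allocator: {pick_allocator(avail)}")
```

With the three OSTs from the example the gap is only a few percent, so under this reading the round-robin allocator would remain in use. Raising the threshold, as in the set_param example with 25, makes the switch to the weighted allocator happen later.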