1 <?xml version='1.0' encoding='UTF-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="settinguplustresystem">
3 <title xml:id="settinguplustresystem.title">Determining Hardware Configuration Requirements and
4 Formatting Options</title>
5 <para>This chapter describes hardware configuration requirements for a Lustre file system
10 <xref linkend="dbdoclet.50438256_49017"/>
15 <xref linkend="dbdoclet.space_requirements"/>
20 <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>
25 <xref linkend="dbdoclet.50438256_26456"/>
30 <xref linkend="dbdoclet.50438256_78272"/>
34 <section xml:id="dbdoclet.50438256_49017">
35 <title><indexterm><primary>setup</primary></indexterm>
36 <indexterm><primary>setup</primary><secondary>hardware</secondary></indexterm>
37 <indexterm><primary>design</primary><see>setup</see></indexterm>
38 Hardware Considerations</title>
<para>A Lustre file system can use any kind of block storage device, such as single disks, software RAID, hardware RAID, or a logical volume manager. In contrast to some networked file
41 systems, the block devices are only attached to the MDS and OSS nodes in a Lustre file system
42 and are not accessed by the clients directly.</para>
43 <para>Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)</para>
44 <para>For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.</para>
<para>For best performance in a production environment, dedicated clients are required. For testing or other non-production use, a Lustre client and server can run on the same machine; however, only configurations with dedicated clients are supported for production.</para>
46 <warning><para>Performance and recovery issues can occur if you put a client on an MDS or OSS:</para>
49 <para>Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.</para>
52 <para>Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.</para>
<para>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are typically used for testing to match expected customer usage and avoid limitations due to the 4 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs. Also, due to kernel API limitations, performing backups of Lustre software release 2.x file systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit inode number.</para>
62 <para>The storage attached to the servers typically uses RAID to provide fault tolerance and can
63 optionally be organized with logical volume management (LVM), which is then formatted as a
64 Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format
65 imposed by the file system.</para>
66 <para>The Lustre file system uses journaling file system technology on both the MDTs and OSTs.
For an MDT, as much as a 20 percent performance gain can be obtained by placing the journal on
a separate device.</para>
<para>The MDS can effectively utilize many CPU cycles. A minimum of four processor cores is recommended, and more are advisable for file systems with many clients.</para>
<para>Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be at least as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64 KiB) can run with x86 servers (4 KiB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4 KiB PAGE_SIZE (so the server page size is not larger than the client page size). </para>
75 <primary>setup</primary>
76 <secondary>MDT</secondary>
77 </indexterm> MGT and MDT Storage Hardware Considerations</title>
78 <para>MGT storage requirements are small (less than 100 MB even in the
79 largest Lustre file systems), and the data on an MGT is only accessed
80 on a server/client mount, so disk performance is not a consideration.
81 However, this data is vital for file system access, so
82 the MGT should be reliable storage, preferably mirrored RAID1.</para>
<para>MDS storage is accessed in a database-like access pattern with
many seeks and reads and writes of small amounts of data.
Storage types that provide much lower seek times, such as SSD or NVMe
flash, are strongly preferred for the MDT; high-RPM SAS is acceptable.</para>
87 <para>For maximum performance, the MDT should be configured as RAID1 with
88 an internal journal and two disks from different controllers.</para>
89 <para>If you need a larger MDT, create multiple RAID1 devices from pairs
90 of disks, and then make a RAID0 array of the RAID1 devices. For ZFS,
91 use <literal>mirror</literal> VDEVs for the MDT. This ensures
92 maximum reliability because multiple disk failures only have a small
93 chance of hitting both disks in the same RAID1 device.</para>
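<para>The RAID1+0 layout described above can be sketched with <literal>mdadm</literal>; the device names below are hypothetical and must be adapted to the actual hardware:</para>

```shell
# Hypothetical sketch: two RAID1 pairs (disks on different controllers),
# combined into a RAID0 array to form a larger MDT block device.
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda /dev/sdc
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdb /dev/sdd
mdadm --create /dev/md20 --level=0 --raid-devices=2 /dev/md10 /dev/md11
# For a ZFS MDT, mirrored VDEVs give the equivalent layout in one step
# (with Lustre, the pool is normally created by mkfs.lustre --backfstype=zfs):
zpool create mdt0pool mirror sda sdc mirror sdb sdd
```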
94 <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50%
95 chance that even two disk failures can cause the loss of the whole MDT
96 device. The first failure disables an entire half of the mirror and the
97 second failure has a 50% chance of disabling the remaining mirror.</para>
<para>If multiple MDTs will be present in the
system, each MDT should be sized for its anticipated usage and load.
100 For details on how to add additional MDTs to the filesystem, see
101 <xref linkend="lustremaint.adding_new_mdt"/>.</para>
102 <warning><para>MDT0000 contains the root of the Lustre file system. If
103 MDT0000 is unavailable for any reason, the file system cannot be used.
105 <note><para>Using the DNE feature it is possible to dedicate additional
106 MDTs to sub-directories off the file system root directory stored on
107 MDT0000, or arbitrarily for lower-level subdirectories, using the
108 <literal>lfs mkdir -i <replaceable>mdt_index</replaceable></literal>
109 command. If an MDT serving a subdirectory becomes unavailable, any
110 subdirectories on that MDT and all directories beneath it will also
111 become inaccessible. This is typically useful for top-level directories
112 to assign different users or projects to separate MDTs, or to distribute
113 other large working sets of files to multiple MDTs.</para></note>
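<para>For example, a top-level project directory could be placed on a separate MDT with <literal>lfs mkdir</literal> (the mount point and MDT index below are illustrative):</para>

```shell
# Create a remote directory served by MDT0001 rather than MDT0000:
lfs mkdir -i 1 /mnt/lustre/project1
# Confirm which MDT holds the new directory:
lfs getdirstripe /mnt/lustre/project1
```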
114 <note condition='l28'><para>Starting in the 2.8 release it is possible
115 to spread a single large directory across multiple MDTs using the DNE
116 striped directory feature by specifying multiple stripes (or shards)
117 at creation time using the
118 <literal>lfs mkdir -c <replaceable>stripe_count</replaceable></literal>
119 command, where <replaceable>stripe_count</replaceable> is often the
120 number of MDTs in the filesystem. Striped directories should typically
121 not be used for all directories in the filesystem, since this incurs
122 extra overhead compared to non-striped directories, but is useful for
larger directories (over 50k entries) where many output files are being
written concurrently.</para></note>
128 <title><indexterm><primary>setup</primary><secondary>OST</secondary></indexterm>OST Storage Hardware Considerations</title>
129 <para>The data access pattern for the OSS storage is a streaming I/O
130 pattern that is dependent on the access patterns of applications being
131 used. Each OSS can manage multiple object storage targets (OSTs), one
132 for each volume with I/O traffic load-balanced between servers and
133 targets. An OSS should be configured to have a balance between the
134 network bandwidth and the attached storage bandwidth to prevent
135 bottlenecks in the I/O path. Depending on the server hardware, an OSS
typically serves between 2 and 8 targets, with each target between
24 TB and 48 TB, but a target may be up to 256 TB in size.</para>
138 <para>Lustre file system capacity is the sum of the capacities provided
139 by the targets. For example, 64 OSSs, each with two 8 TB OSTs,
140 provide a file system with a capacity of nearly 1 PB. If each OST uses
141 ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID-6
142 configuration), it may be possible to get 50 MB/sec from each drive,
143 providing up to 400 MB/sec of disk bandwidth per OST. If this system
144 is used as storage backend with a system network, such as the InfiniBand
145 network, that provides a similar bandwidth, then each OSS could provide
146 800 MB/sec of end-to-end I/O throughput. (Although the architectural
147 constraints described here are simple, in practice it takes careful
hardware selection, benchmarking and integration to obtain such
performance.)</para>
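<para>The throughput estimate above can be reproduced with simple shell arithmetic; the drive count, per-drive rate, and targets-per-OSS figures are the assumptions from the example:</para>

```shell
# Per-OST bandwidth: 8 data drives (RAID-6 parity drives excluded)
# at an assumed 50 MB/s sustained rate per drive.
data_drives=8
mb_per_drive=50
ost_bw=$((data_drives * mb_per_drive))   # 400 MB/s per OST
# Per-OSS bandwidth: two such OSTs per OSS, as in the example.
osts_per_oss=2
oss_bw=$((ost_bw * osts_per_oss))        # 800 MB/s per OSS
echo "OST: ${ost_bw} MB/s, OSS: ${oss_bw} MB/s"
```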
152 <section xml:id="dbdoclet.space_requirements">
153 <title><indexterm><primary>setup</primary><secondary>space</secondary></indexterm>
154 <indexterm><primary>space</primary><secondary>determining requirements</secondary></indexterm>
155 Determining Space Requirements</title>
156 <para>The desired performance characteristics of the backing file systems
157 on the MDT and OSTs are independent of one another. The size of the MDT
158 backing file system depends on the number of inodes needed in the total
159 Lustre file system, while the aggregate OST space depends on the total
160 amount of data stored on the file system. If MGS data is to be stored
161 on the MDT device (co-located MGT and MDT), add 100 MB to the required
162 size estimate for the MDT.</para>
<para>Each time a file is created on a Lustre file system, it consumes
one inode on the MDT and one object on each OST over which the file is
striped. Normally, each file's stripe count is based on the system-wide
default stripe count. However, this can be changed for individual files
using the <literal>lfs setstripe</literal> command. For more details,
see <xref linkend="managingstripingfreespace"/>.</para>
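<para>For instance, a file's stripe count can be set at creation time with <literal>lfs setstripe</literal> (the path and stripe count below are illustrative):</para>

```shell
# Create an empty file striped over 4 OSTs instead of the default count:
lfs setstripe -c 4 /mnt/lustre/output.dat
# Display the resulting layout:
lfs getstripe /mnt/lustre/output.dat
```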
169 <para>In a Lustre ldiskfs file system, all the MDT inodes and OST
170 objects are allocated when the file system is first formatted. When
171 the file system is in use and a file is created, metadata associated
172 with that file is stored in one of the pre-allocated inodes and does
173 not consume any of the free space used to store file data. The total
174 number of inodes on a formatted ldiskfs MDT or OST cannot be easily
175 changed. Thus, the number of inodes created at format time should be
generous enough to cover near-term expected usage, with some room
for growth, to avoid the effort of adding storage later.</para>
178 <para>By default, the ldiskfs file system used by Lustre servers to store
179 user-data objects and system data reserves 5% of space that cannot be used
180 by the Lustre file system. Additionally, an ldiskfs Lustre file system
reserves up to 400 MB on each OST, and up to 4 GB on each MDT, for journal
182 use and a small amount of space outside the journal to store accounting
183 data. This reserved space is unusable for general storage. Thus, at least
184 this much space will be used per OST before any file object data is saved.
186 <para>With a ZFS backing filesystem for the MDT or OST,
187 the space allocation for inodes and file data is dynamic, and inodes are
188 allocated as needed. A minimum of 4kB of usable space (before mirroring)
189 is needed for each inode, exclusive of other overhead such as directories,
190 internal log files, extended attributes, ACLs, etc. ZFS also reserves
191 approximately 3% of the total storage space for internal and redundant
192 metadata, which is not usable by Lustre.
193 Since the size of extended attributes and ACLs is highly dependent on
194 kernel versions and site-specific policies, it is best to over-estimate
195 the amount of space needed for the desired number of inodes, and any
196 excess space will be utilized to store more inodes.
200 <primary>setup</primary>
201 <secondary>MGT</secondary>
204 <primary>space</primary>
205 <secondary>determining MGT requirements</secondary>
206 </indexterm> Determining MGT Space Requirements</title>
207 <para>Less than 100 MB of space is typically required for the MGT.
208 The size is determined by the total number of servers in the Lustre
209 file system cluster(s) that are managed by the MGS.</para>
211 <section xml:id="dbdoclet.mdt_space_requirements">
213 <primary>setup</primary>
214 <secondary>MDT</secondary>
217 <primary>space</primary>
218 <secondary>determining MDT requirements</secondary>
219 </indexterm> Determining MDT Space Requirements</title>
<para>When calculating the MDT size, the important factor to consider
is the number of files to be stored in the file system, since each file
requires at least 2 KiB of usable space on the MDT for its inode. Since
MDTs typically use RAID-1+0 mirroring, the total storage needed will be
double this.</para>
225 <para>Please note that the actual used space per MDT depends on the number
226 of files per directory, the number of stripes per file, whether files
227 have ACLs or user xattrs, and the number of hard links per file. The
228 storage required for Lustre file system metadata is typically 1-2
229 percent of the total file system capacity depending upon file size.
230 If the <xref linkend="dataonmdt"/> feature is in use for Lustre
231 2.11 or later, MDT space should typically be 5 percent or more of the
232 total space, depending on the distribution of small files within the
233 filesystem and the <literal>lod.*.dom_stripesize</literal> limit on
234 the MDT and file layout used.</para>
235 <para>For ZFS-based MDT filesystems, the number of inodes created on
236 the MDT and OST is dynamic, so there is less need to determine the
237 number of inodes in advance, though there still needs to be some thought
238 given to the total MDT space compared to the total filesystem size.</para>
<para>For example, if the average file size is 5 MB and you have
500 TB of usable OST space, then you can calculate the
241 <emphasis>minimum</emphasis> total number of inodes for MDTs and OSTs
244 <para>(500 TB * 1000000 MB/TB) / 5 MB/inode = 100M inodes</para>
246 <para>It is recommended that the MDT(s) have at least twice the minimum
247 number of inodes to allow for future expansion and allow for an average
248 file size smaller than expected. Thus, the minimum space for ldiskfs
249 MDT(s) should be approximately:
252 <para>2 KiB/inode x 100 million inodes x 2 = 400 GiB ldiskfs MDT</para>
254 <para>For details about formatting options for ldiskfs MDT and OST file
255 systems, see <xref linkend="dbdoclet.ldiskfs_mdt_mkfs"/>.</para>
257 <para>If the median file size is very small, 4 KB for example, the
258 MDT would use as much space for each file as the space used on the OST,
259 so the use of Data-on-MDT is strongly recommended in that case.
260 The MDT space per inode should be increased correspondingly to
261 account for the extra data space usage for each inode:
263 <para>6 KiB/inode x 100 million inodes x 2 = 1200 GiB ldiskfs MDT</para>
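<para>The sizing arithmetic above can be checked with shell arithmetic, using the same assumed figures (500 TB of OST space, 5 MB average file size, and a factor of two for headroom):</para>

```shell
# Minimum inode count: usable OST space divided by average file size.
ost_space_tb=500
avg_file_mb=5
min_inodes=$((ost_space_tb * 1000000 / avg_file_mb))   # 100,000,000 inodes
# 2 KiB of MDT space per inode, doubled for expansion headroom:
mdt_kib=$((min_inodes * 2 * 2))                        # 400,000,000 KiB, i.e. ~400 GB
echo "minimum inodes: ${min_inodes}, MDT size: ${mdt_kib} KiB"
```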
268 <para>If the MDT has too few inodes, this can cause the space on the
269 OSTs to be inaccessible since no new files can be created. In this
270 case, the <literal>lfs df -i</literal> and <literal>df -i</literal>
271 commands will limit the number of available inodes reported for the
272 filesystem to match the total number of available objects on the OSTs.
273 Be sure to determine the appropriate MDT size needed to support the
274 filesystem before formatting. It is possible to increase the
275 number of inodes after the file system is formatted, depending on the
276 storage. For ldiskfs MDT filesystems the <literal>resize2fs</literal>
277 tool can be used if the underlying block device is on a LVM logical
278 volume and the underlying logical volume size can be increased.
279 For ZFS new (mirrored) VDEVs can be added to the MDT pool to increase
280 the total space available for inode storage.
281 Inodes will be added approximately in proportion to space added.
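<para>A minimal sketch of both expansion approaches, assuming an LVM-backed ldiskfs MDT and a ZFS pool named <literal>mdt0pool</literal> (all device and volume names are hypothetical, and the MDT must be unmounted before resizing an ldiskfs filesystem):</para>

```shell
# ldiskfs on LVM: grow the logical volume, then grow the filesystem,
# which adds inodes in proportion to the added space.
lvextend -L +500G /dev/vg_mdt/lv_mdt0
resize2fs /dev/vg_mdt/lv_mdt0
# ZFS: add another mirrored VDEV to the MDT pool instead.
zpool add mdt0pool mirror /dev/sde /dev/sdf
```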
285 <para>Note that the number of total and free inodes reported by
286 <literal>lfs df -i</literal> for ZFS MDTs and OSTs is estimated based
287 on the current average space used per inode. When a ZFS filesystem is
288 first formatted, this free inode estimate will be very conservative
289 (low) due to the high ratio of directories to regular files created for
290 internal Lustre metadata storage, but this estimate will improve as
291 more files are created by regular users and the average file size will
292 better reflect actual site usage.
296 <para>Using the DNE remote directory feature
297 it is possible to increase the total number of inodes of a Lustre
298 filesystem, as well as increasing the aggregate metadata performance,
299 by configuring additional MDTs into the filesystem, see
300 <xref linkend="lustremaint.adding_new_mdt"/> for details.
306 <primary>setup</primary>
307 <secondary>OST</secondary>
310 <primary>space</primary>
311 <secondary>determining OST requirements</secondary>
312 </indexterm> Determining OST Space Requirements</title>
313 <para>For the OST, the amount of space taken by each object depends on
314 the usage pattern of the users/applications running on the system. The
315 Lustre software defaults to a conservative estimate for the average
316 object size (between 64 KiB per object for 10 GiB OSTs, and 1 MiB per
317 object for 16 TiB and larger OSTs). If you are confident that the average
318 file size for your applications will be different than this, you can
319 specify a different average file size (number of total inodes for a given
OST size) to reduce file system overhead and minimize file system check
time. See <xref linkend="dbdoclet.ldiskfs_ost_mkfs"/> for more details.</para>
325 <section xml:id="dbdoclet.ldiskfs_mkfs_opts">
328 <primary>ldiskfs</primary>
329 <secondary>formatting options</secondary>
332 <primary>setup</primary>
333 <secondary>ldiskfs</secondary>
335 Setting ldiskfs File System Formatting Options
337 <para>By default, the <literal>mkfs.lustre</literal> utility applies these
338 options to the Lustre backing file system used to store data and metadata
339 in order to enhance Lustre file system performance and scalability. These
340 options include:</para>
343 <para><literal>flex_bg</literal> - When the flag is set to enable
344 this flexible-block-groups feature, block and inode bitmaps for
345 multiple groups are aggregated to minimize seeking when bitmaps
346 are read or written and to reduce read/modify/write operations
347 on typical RAID storage (with 1 MiB RAID stripe widths). This flag
348 is enabled on both OST and MDT file systems. On MDT file systems
349 the <literal>flex_bg</literal> factor is left at the default value
350 of 16. On OSTs, the <literal>flex_bg</literal> factor is set
351 to 256 to allow all of the block or inode bitmaps in a single
352 <literal>flex_bg</literal> to be read or written in a single
353 1MiB I/O typical for RAID storage.</para>
356 <para><literal>huge_file</literal> - Setting this flag allows
357 files on OSTs to be larger than 2 TiB in size.</para>
<para><literal>lazy_journal_init</literal> - This extended option
is enabled to skip the full overwrite that would otherwise zero out
the large journal allocated by default in a Lustre file system
(up to 400 MiB for OSTs, up to 4 GiB for MDTs), reducing the
formatting time.</para>
367 <para>To override the default formatting options, use arguments to
368 <literal>mkfs.lustre</literal> to pass formatting options to the backing file system:</para>
369 <screen>--mkfsoptions='backing fs options'</screen>
370 <para>For other <literal>mkfs.lustre</literal> options, see the Linux man page for
371 <literal>mke2fs(8)</literal>.</para>
372 <section xml:id="dbdoclet.ldiskfs_mdt_mkfs">
374 <primary>inodes</primary>
375 <secondary>MDS</secondary>
376 </indexterm><indexterm>
377 <primary>setup</primary>
378 <secondary>inodes</secondary>
379 </indexterm>Setting Formatting Options for an ldiskfs MDT</title>
380 <para>The number of inodes on the MDT is determined at format time
381 based on the total size of the file system to be created. The default
382 <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
for an ldiskfs MDT is optimized at one inode for every 2048 bytes of file
system space.</para>
385 <para>This setting takes into account the space needed for additional
386 ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB),
387 bitmaps, and directories, as well as files that Lustre uses internally
388 to maintain cluster consistency. There is additional per-file metadata
389 such as file layout for files with a large number of stripes, Access
390 Control Lists (ACLs), and user extended attributes.</para>
391 <para condition="l2B"> Starting in Lustre 2.11, the <xref linkend=
392 "dataonmdt.title"/> (DoM) feature allows storing small files on the MDT
393 to take advantage of high-performance flash storage, as well as reduce
394 space and network overhead. If you are planning to use the DoM feature
395 with an ldiskfs MDT, it is recommended to <emphasis>increase</emphasis>
the bytes-per-inode ratio to have enough space on the MDT for small files.</para>
399 <para>It is possible to change the recommended 2048 bytes
400 per inode for an ldiskfs MDT when it is first formatted by adding the
401 <literal>--mkfsoptions="-i bytes-per-inode"</literal> option to
402 <literal>mkfs.lustre</literal>. Decreasing the inode ratio tunable
403 <literal>bytes-per-inode</literal> will create more inodes for a given
404 MDT size, but will leave less space for extra per-file metadata and is
405 not recommended. The inode ratio must always be strictly larger than
406 the MDT inode size, which is 1024 bytes by default. It is recommended
407 to use an inode ratio at least 1024 bytes larger than the inode size to
408 ensure the MDT does not run out of space. Increasing the inode ratio
409 to include enough space for the most common file data (e.g. 5120 or 65560
410 bytes if 4KB or 64KB files are widely used) is recommended for DoM.</para>
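<para>For example, an MDT intended for Data-on-MDT use with mostly 64 KB files might be formatted as follows (the device name and filesystem name are hypothetical; the inode ratio is taken from the guidance above):</para>

```shell
# One inode per 65560 bytes leaves room for roughly 64KB of DoM file
# data per inode in addition to the inode itself:
mkfs.lustre --mdt --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp --mkfsoptions="-i 65560" /dev/mdt0_device
```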
411 <para>The size of the inode may be changed at format time by adding the
412 <literal>--stripe-count-hint=N</literal> to have
413 <literal>mkfs.lustre</literal> automatically calculate a reasonable
414 inode size based on the default stripe count that will be used by the
415 filesystem, or directly by specifying the
416 <literal>--mkfsoptions="-I inode-size"</literal> option. Increasing
417 the inode size will provide more space in the inode for a larger Lustre
418 file layout, ACLs, user and system extended attributes, SELinux and
419 other security labels, and other internal metadata and DoM data. However,
420 if these features or other in-inode xattrs are not needed, a larger inode
421 size may hurt metadata performance as 2x, 4x, or 8x as much data would be
422 read or written for each MDT inode access.
425 <section xml:id="dbdoclet.ldiskfs_ost_mkfs">
427 <primary>inodes</primary>
428 <secondary>OST</secondary>
429 </indexterm>Setting Formatting Options for an ldiskfs OST</title>
430 <para>When formatting an OST file system, it can be beneficial
431 to take local file system usage into account, for example by running
432 <literal>df</literal> and <literal>df -i</literal> on a current filesystem
433 to get the used bytes and used inodes respectively, then computing the
434 average bytes-per-inode value. When deciding on the ratio for a new
435 filesystem, try to avoid having too many inodes on each OST, while keeping
436 enough margin to allow for future usage of smaller files. This helps
437 reduce the format and e2fsck time and makes more space available for data.
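<para>The average bytes-per-inode of an existing filesystem is simply used bytes divided by used inodes; the figures below are illustrative stand-ins for the output of <literal>df</literal> and <literal>df -i</literal>:</para>

```shell
used_kb=52428800     # used 1K-blocks, as reported by `df`
used_inodes=204800   # used inodes, as reported by `df -i`
bytes_per_inode=$((used_kb * 1024 / used_inodes))
echo "average bytes-per-inode: ${bytes_per_inode}"   # 262144 (256 KiB)
```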
439 <para>The table below shows the default
440 <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
441 used for OSTs of various sizes when they are formatted.</para>
443 <table frame="all" xml:id="settinguplustresystem.tab1">
444 <title>Default Inode Ratios Used for Newly Formatted OSTs</title>
446 <colspec colname="c1" colwidth="3*"/>
447 <colspec colname="c2" colwidth="2*"/>
448 <colspec colname="c3" colwidth="4*"/>
452 <para><emphasis role="bold">LUN/OST size</emphasis></para>
455 <para><emphasis role="bold">Default Inode ratio</emphasis></para>
458 <para><emphasis role="bold">Total inodes</emphasis></para>
465 <para>under 10GiB </para>
468 <para>1 inode/16KiB </para>
471 <para>640 - 655k </para>
476 <para>10GiB - 1TiB </para>
479 <para>1 inode/68KiB </para>
482 <para>153k - 15.7M </para>
487 <para>1TiB - 8TiB </para>
490 <para>1 inode/256KiB </para>
493 <para>4.2M - 33.6M </para>
498 <para>over 8TiB </para>
501 <para>1 inode/1MiB </para>
504 <para>8.4M - 268M </para>
511 <para>In environments with few small files, the default inode ratio
512 may result in far too many inodes for the average file size. In this
513 case, performance can be improved by increasing the number of
514 <emphasis role="italic">bytes-per-inode</emphasis>. To set the inode
515 ratio, use the <literal>--mkfsoptions="-i <replaceable>bytes-per-inode</replaceable>"</literal>
516 argument to <literal>mkfs.lustre</literal> to specify the expected
517 average (mean) size of OST objects. For example, to create an OST
518 with an expected average object size of 8 MiB run:
519 <screen>[oss#] mkfs.lustre --ost --mkfsoptions="-i $((8192 * 1024))" ...</screen>
522 <para>OSTs formatted with ldiskfs can use a maximum of approximately
320 million objects per OST, up to a maximum of 4 billion inodes.
524 Specifying a very small bytes-per-inode ratio for a large OST that
525 exceeds this limit can cause either premature out-of-space errors and prevent
526 the full OST space from being used, or will waste space and slow down
527 e2fsck more than necessary. The default inode ratios are chosen to
ensure that the total number of inodes remains below this limit.
532 <para>File system check time on OSTs is affected by a number of
533 variables in addition to the number of inodes, including the size of
534 the file system, the number of allocated blocks, the distribution of
535 allocated blocks on the disk, disk speed, CPU speed, and the amount
536 of RAM on the server. Reasonable file system check times for valid
537 filesystems are 5-30 minutes per TiB, but may increase significantly
538 if substantial errors are detected and need to be repaired.</para>
540 <para>For further details about optimizing MDT and OST file systems,
541 see <xref linkend="dbdoclet.ldiskfs_raid_opts"/>.</para>
546 <primary>setup</primary>
547 <secondary>limits</secondary>
548 </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
549 <primary>wide striping</primary>
550 </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
551 <primary>xattr</primary>
552 <secondary><emphasis role="italic">See</emphasis> wide striping</secondary>
553 </indexterm><indexterm>
554 <primary>large_xattr</primary>
555 <secondary>ea_inode</secondary>
556 </indexterm><indexterm>
557 <primary>wide striping</primary>
558 <secondary>large_xattr</secondary>
559 <tertiary>ea_inode</tertiary>
560 </indexterm>File and File System Limits</title>
562 <para><xref linkend="settinguplustresystem.tab2"/> describes
563 current known limits of Lustre. These limits are imposed by either
564 the Lustre architecture or the Linux virtual file system (VFS) and
565 virtual memory subsystems. In a few cases, a limit is defined within
566 the code and can be changed by re-compiling the Lustre software.
567 Instructions to install from source code are beyond the scope of this
568 document, and can be found elsewhere online. In these cases, the
569 indicated limit was used for testing of the Lustre software. </para>
571 <table frame="all" xml:id="settinguplustresystem.tab2">
572 <title>File and file system limits</title>
574 <colspec colname="c1" colwidth="3*"/>
575 <colspec colname="c2" colwidth="2*"/>
576 <colspec colname="c3" colwidth="4*"/>
580 <para><emphasis role="bold">Limit</emphasis></para>
583 <para><emphasis role="bold">Value</emphasis></para>
586 <para><emphasis role="bold">Description</emphasis></para>
593 <para>Maximum number of MDTs</para>
599 <para>A single MDS can host
600 multiple MDTs, either for separate file systems, or up to 255
601 additional MDTs can be added to the filesystem and attached into
602 the namespace with DNE remote or striped directories.</para>
607 <para>Maximum number of OSTs</para>
613 <para>The maximum number of OSTs is a constant that can be
614 changed at compile time. Lustre file systems with up to
615 4000 OSTs have been tested. Multiple OST file systems can
616 be configured on a single OSS node.</para>
621 <para>Maximum OST size</para>
624 <para>512TiB (ldiskfs), 512TiB (ZFS)</para>
627 <para>This is not a <emphasis>hard</emphasis> limit. Larger
628 OSTs are possible but most production systems do not
629 typically go beyond the stated limit per OST because Lustre
630 can add capacity and performance with additional OSTs, and
631 having more OSTs improves aggregate I/O performance,
632 minimizes contention, and allows parallel recovery (e2fsck
633 for ldiskfs OSTs, scrub for ZFS OSTs).
With 32-bit kernels, due to page cache limits, 16 TB is the
maximum block device size, which in turn limits the
size of an OST. It is strongly recommended to run Lustre
639 clients and servers with 64-bit kernels.</para>
644 <para>Maximum number of clients</para>
650 <para>The maximum number of clients is a constant that can
651 be changed at compile time. Up to 30000 clients have been
652 used in production accessing a single filesystem.</para>
657 <para>Maximum size of a single file system</para>
660 <para>at least 1EiB</para>
663 <para>Each OST can have a file system up to the
664 Maximum OST size limit, and the Maximum number of OSTs
665 can be combined into a single filesystem.
671 <para>Maximum stripe count</para>
677 <para>This limit is imposed by the size of the layout that
678 needs to be stored on disk and sent in RPC requests, but is
679 not a hard limit of the protocol. The number of OSTs in the
680 filesystem can exceed the stripe count, but this limits the
681 number of OSTs across which a single file can be striped.</para>
686 <para>Maximum stripe size</para>
<para>&lt; 4 GiB</para>
692 <para>The amount of data written to each object before moving
693 on to next object.</para>
698 <para>Minimum stripe size</para>
704 <para>Due to the use of 64 KiB PAGE_SIZE on some CPU
705 architectures such as ARM and POWER, the minimum stripe
706 size is 64 KiB so that a single page is not split over
707 multiple servers.</para>
712 <para>Maximum single object size</para>
715 <para>16TiB (ldiskfs), 256TiB (ZFS)</para>
718 <para>The amount of data that can be stored in a single object.
719 An object corresponds to a stripe. The ldiskfs limit of 16 TB
720 for a single object applies. For ZFS the limit is the size of
721 the underlying OST. Files can consist of up to 2000 stripes,
722 each stripe can be up to the maximum object size. </para>
<para>Maximum <anchor xml:id="dbdoclet.50438256_marker-1290761" xreflabel=""/>file size</para>
<para>16 TiB on 32-bit systems</para>
<para>31.25 PiB on 64-bit ldiskfs systems,
8 EiB on 64-bit ZFS systems</para>
<para>Individual files have a hard limit of nearly 16 TiB on
32-bit systems imposed by the kernel memory subsystem. On
64-bit systems this limit does not exist. Hence, files can
be 2^63 bytes (8 EiB) in size if the backing filesystem can
support large enough objects and/or the files are sparse.</para>
<para>A single file can have a maximum of 2000 stripes, which
gives an upper single file data capacity of 31.25 PiB for 64-bit
ldiskfs systems. The actual amount of data that can be stored
in a file depends upon the amount of free space in each OST
on which the file is striped.</para>
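<para>The 31.25 PiB figure follows directly from the two limits above (2000 stripes, 16 TiB per ldiskfs object); a quick arithmetic check, sketched here with <literal>awk</literal>:</para>

```shell
# Upper bound on a single ldiskfs-backed file:
# 2000 stripes x 16 TiB per object, converted to PiB (1 PiB = 1024 TiB).
awk 'BEGIN {
    max_stripes    = 2000   # maximum stripes per file
    tib_per_object = 16     # ldiskfs single-object limit, in TiB
    printf "%.2f PiB\n", max_stripes * tib_per_object / 1024
}'
# prints "31.25 PiB"
```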
<para>Maximum number of files or subdirectories in a single directory</para>
<para>10 million files (ldiskfs), 2^48 (ZFS)</para>
<para>The Lustre software uses the ldiskfs hashed directory
code, which has a limit of about 10 million files, depending
on the length of the file name. The limit on subdirectories
is the same as the limit on regular files.</para>
<note condition='l28'><para>Starting in the 2.8 release it is
possible to exceed this limit by striping a single directory
over multiple MDTs with the <literal>lfs mkdir -c</literal>
command, which increases the single directory limit by a
factor of the number of directory stripes used.</para></note>
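<para>For example, a directory striped across four MDTs could be created as follows (the stripe count and mount point here are illustrative):</para>

```
client# lfs mkdir -c 4 /mnt/lustre/newdir
```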
<para>Lustre file systems are tested with ten million files
in a single directory.</para>
<para>Maximum number of files in the file system</para>
<para>4 billion (ldiskfs), 256 trillion (ZFS) per MDT</para>
<para>The ldiskfs filesystem imposes an upper limit of
4 billion inodes per filesystem. By default, the MDT
filesystem is formatted with one inode per 2 KB of space,
meaning 512 million inodes per TiB of MDT space. This can be
increased initially at the time of MDT filesystem creation.
For more information, see
<xref linkend="settinguplustresystem"/>.</para>
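<para>For example, the inode ratio can be set when the MDT is formatted by passing the ldiskfs <literal>-i</literal> (bytes-per-inode) option through <literal>--mkfsoptions</literal>; the device name, filesystem name, MGS NID, and ratio below are illustrative only:</para>

```
mds# mkfs.lustre --mdt --fsname=testfs --mgsnode=mgs@tcp0 \
     --mkfsoptions="-i 1024" /dev/sda
```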
<para>The ZFS filesystem dynamically allocates
inodes and does not have a fixed ratio of inodes per unit of MDT
space, but consumes approximately 4 KiB of mirrored space per
inode, depending on the configuration.</para>
<para>Each additional MDT can hold up to the
above maximum number of additional files, depending on
available space and the distribution of directories and files
in the filesystem.</para>
<para>Maximum length of a filename</para>
<para>255 bytes (filename)</para>
<para>This limit is 255 bytes for a single filename, the
same as the limit in the underlying filesystems.</para>
<para>Maximum length of a pathname</para>
<para>4096 bytes (pathname)</para>
<para>The Linux VFS imposes a full pathname length of 4096 bytes.</para>
<para>Maximum number of open files for a Lustre file system</para>
<para>No limit</para>
<para>The Lustre software does not impose a maximum for the number
of open files, but the practical limit depends on the amount of
RAM on the MDS. No "tables" for open files exist on the
MDS, as they are only linked in a list to a given client's
export. Each client process has a limit of several
thousand open files, which depends on its ulimit.</para>
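<para>On a client, the per-process limit can be checked with the shell's <literal>ulimit</literal> built-in:</para>

```shell
# Per-process open-file limits for the current shell:
ulimit -n     # soft limit (the effective cap for new processes)
ulimit -Hn    # hard limit (the ceiling an unprivileged process may raise to)
```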
<note><para>By default for ldiskfs MDTs the maximum stripe count for a
<emphasis>single file</emphasis> is limited to 160 OSTs. In order to
increase the maximum file stripe count, use
<literal>--mkfsoptions="-O ea_inode"</literal> when formatting the MDT,
or use <literal>tune2fs -O ea_inode</literal> to enable it after the
MDT has been formatted.</para>
<section xml:id="dbdoclet.50438256_26456">
<title><indexterm><primary>setup</primary><secondary>memory</secondary></indexterm>Determining Memory Requirements</title>
<para>This section describes the memory requirements for each Lustre file system component.</para>
<indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>client</tertiary></indexterm>
Client Memory Requirements</title>
<para>A minimum of 2 GB RAM is recommended for clients.</para>
<title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>MDS</tertiary></indexterm>MDS Memory Requirements</title>
<para>MDS memory requirements are determined by the following factors:</para>
<para>Number of clients</para>
<para>Size of the directories</para>
<para>Load placed on server</para>
<para>The amount of memory used by the MDS is a function of how many clients are on
the system, and how many files they are using in their working set. This is driven
primarily by the number of locks a client can hold at one time. The number of locks
held by clients varies by load and memory availability on the server. Interactive
clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is
approximately 2 KB per file, including the Lustre distributed lock manager (LDLM)
lock and kernel data structures for the files currently in use. Having file data
in cache can improve metadata performance by a factor of 10x or more compared to
reading it from storage.</para>
<para>MDS memory requirements include:</para>
<para><emphasis role="bold">File system metadata</emphasis>:
A reasonable amount of RAM needs to be available for file system metadata.
While no hard limit can be placed on the amount of file system metadata,
if more RAM is available, then the disk I/O is needed less often to retrieve
<para><emphasis role="bold">Network transport</emphasis>:
If you are using TCP or another network transport that uses system memory for
send/receive buffers, this memory requirement must also be taken into
consideration.</para>
<para><emphasis role="bold">Journal size</emphasis>:
By default, the journal size is 4096 MB for each MDT ldiskfs file system.
This can pin up to an equal amount of RAM on the MDS node per file system.</para>
<para><emphasis role="bold">Failover configuration</emphasis>:
If the MDS node will be used for failover from another node, then the RAM
for each journal should be doubled, so the backup server can handle the
additional load if the primary server fails.</para>
<title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>MDS</tertiary></indexterm>Calculating MDS Memory Requirements</title>
<para>By default, 4096 MB are used for the ldiskfs filesystem journal. Additional
RAM is used for caching file data for the larger working set, which is not
actively in use by clients but should be kept "hot" for improved
access times. Approximately 1.5 KB per file is needed to keep a file in cache
without a lock.</para>
<para>For example, for a single MDT on an MDS with 1,024 clients, 12 interactive
login nodes, and a 6 million file working set (of which 4M files are cached
on the clients):</para>
<para>Operating system overhead = 1024 MB</para>
<para>File system journal = 4096 MB</para>
<para>1024 * 4-core clients * 1024 files/core * 2kB = 8192 MB</para>
<para>12 interactive clients * 100,000 files * 2kB = 2400 MB</para>
<para>2M file extra working set * 1.5kB/file = 3096 MB</para>
<para>Thus, the minimum requirement for an MDT with this configuration is at least
19 GB of RAM. Additional memory may significantly improve performance.</para>
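<para>The example totals above can be reproduced with plain shell arithmetic; this is a sketch, and the client counts and per-file costs are the example's assumptions rather than fixed constants:</para>

```shell
os_overhead=1024   # MB, operating system overhead
journal=4096       # MB, default MDT ldiskfs journal

# 1024 four-core clients caching 1024 files per core, at 2 KB of MDS memory each
clients=$(( 1024 * 4 * 1024 * 2 / 1024 ))     # 8192 MB

# 12 interactive nodes holding 100,000 files each, at 2 KB per file
interactive=$(( 12 * 100000 * 2 / 1024 ))     # ~2343 MB (quoted above as 2400 MB)

# 2M-file extra working set at 1.5 KB per file
extra=$(( 2 * 1024 * 1024 * 3 / 2 / 1024 ))   # 3072 MB (quoted above as 3096 MB)

echo "MDS total: $(( os_overhead + journal + clients + interactive + extra )) MB"
```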
<para>For directories containing 1 million or more files, more memory can provide
a significant benefit. For example, in an environment where clients randomly
access one of 10 million files, having extra memory for the cache significantly
improves performance.</para>
<title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>OSS Memory Requirements</title>
<para>When planning the hardware for an OSS node, consider the memory usage of
several components in the Lustre file system (i.e., journal, service threads,
file system metadata, etc.). Also, consider the effect of the OSS read cache
feature, which consumes memory as it caches data on the OSS node.</para>
<para>In addition to the MDS memory requirements mentioned above,
the OSS requirements also include:</para>
<para><emphasis role="bold">Service threads</emphasis>:
The service threads on the OSS node pre-allocate an RPC-sized I/O buffer
for each ost_io service thread, so these buffers do not need to be allocated
and freed for each I/O request.</para>
<para><emphasis role="bold">OSS read cache</emphasis>:
OSS read cache provides read-only caching of data on an OSS, using the regular
Linux page cache to store the data. Just like caching from a regular file
system in the Linux operating system, OSS read cache uses as much physical
memory as is available.</para>
<para>The same calculation applies to files accessed from the OSS as for the MDS,
but the load is distributed over many more OSS nodes, so the amount of memory
required for locks, inode cache, etc. listed under MDS is spread out over the
<para>Because of these memory requirements, the following calculations should be
taken as determining the absolute minimum RAM required in an OSS node.</para>
<title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>Calculating OSS Memory Requirements</title>
<para>The minimum recommended RAM size for an OSS with eight OSTs is:</para>
<para>Linux kernel and userspace daemon memory = 1024 MB</para>
<para>Network send/receive buffers (16 MB * 512 threads) = 8192 MB</para>
<para>1024 MB ldiskfs journal size * 8 OST devices = 8192 MB</para>
<para>16 MB read/write buffer per OST IO thread * 512 threads = 8192 MB</para>
<para>2048 MB file system read cache * 8 OSTs = 16384 MB</para>
<para>1024 * 4-core clients * 1024 files/core * 2kB/file = 8192 MB</para>
<para>12 interactive clients * 100,000 files * 2kB/file = 2400 MB</para>
<para>2M file extra working set * 2kB/file = 4096 MB</para>
<para>DLM locks + file cache TOTAL = 31072 MB</para>
<para>Per OSS DLM locks + file system metadata = 31072 MB/4 OSS = 7768 MB (approx.)</para>
<para>Per OSS RAM minimum requirement = 32 GB (approx.)</para>
<para>This consumes about 16 GB just for pre-allocated buffers, and an
additional 1 GB for minimal file system and kernel usage. Therefore, for a
non-failover configuration, the minimum RAM would be about 32 GB for an OSS node
with eight OSTs. Adding additional memory on the OSS will improve the performance
of reading smaller, frequently accessed files.</para>
<para>For a failover configuration, the minimum RAM would be at least 48 GB,
as some of the memory is per-node. When the OSS is not handling any failed-over
OSTs, the extra RAM will be used as a read cache.</para>
<para>As a reasonable rule of thumb, about 8 GB of base memory plus 3 GB per OST
can be used. In failover configurations, about 6 GB per OST is needed.</para>
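<para>Similarly, the OSS example figures above can be checked with shell arithmetic (a sketch using the example's own thread and OST counts):</para>

```shell
# One OSS with 8 OSTs and 512 ost_io service threads; all figures in MB.
net_buffers=$(( 16 * 512 ))    # network send/receive buffers = 8192
journals=$(( 1024 * 8 ))       # ldiskfs journals             = 8192
io_buffers=$(( 16 * 512 ))     # per-thread I/O buffers       = 8192
read_cache=$(( 2048 * 8 ))     # file system read cache       = 16384

# DLM locks + file cache, shared across the example's 4 OSS nodes
lock_cache=$(( 16384 + 8192 + 2400 + 4096 ))    # 31072 MB
echo "per-OSS share: $(( lock_cache / 4 )) MB"  # 7768 MB
```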
<section xml:id="dbdoclet.50438256_78272">
<primary>setup</primary>
<secondary>network</secondary>
</indexterm>Implementing Networks To Be Used by the Lustre File System</title>
<para>As a high performance file system, the Lustre file system places heavy loads on networks.
Thus, a network interface in each Lustre server and client is commonly dedicated to Lustre
file system traffic. This is often a dedicated TCP/IP subnet, although other network hardware
can also be used.</para>
<para>A typical Lustre file system implementation may include the following:</para>
<para>A high-performance backend network for the Lustre servers, typically an InfiniBand (IB) network.</para>
<para>A larger client network.</para>
<para>Lustre routers to connect the two networks.</para>
<para>Lustre networks and routing are configured and managed by specifying parameters to the
Lustre Networking (<literal>lnet</literal>) module in
<literal>/etc/modprobe.d/lustre.conf</literal>.</para>
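<para>For example, a minimal <literal>lustre.conf</literal> entry that restricts LNet to a single TCP interface might look like the following (the interface name is illustrative):</para>

```
options lnet networks=tcp0(eth0)
```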
<para>To prepare to configure Lustre networking, complete the following steps:</para>
<para><emphasis role="bold">Identify all machines that will be running Lustre software and
the network interfaces they will use to run Lustre file system traffic. These machines
will form the Lustre network.</emphasis></para>
<para>A network is a group of nodes that communicate directly with one another. The Lustre
software includes Lustre network drivers (LNDs) to support a variety of network types and
hardware (see <xref linkend="understandinglustrenetworking"/> for a complete list). The
standard rules for specifying networks apply to Lustre networks. For example, two TCP
networks on two different subnets (<literal>tcp0</literal> and <literal>tcp1</literal>)
are considered to be two different Lustre networks.</para>
<para><emphasis role="bold">If routing is needed, identify the nodes to be used to route traffic between networks.</emphasis></para>
<para>If you are using multiple network types, then you will need a router. Any node with
appropriate interfaces can route Lustre networking (LNet) traffic between different
network hardware types or topologies; the node may be a server, a client, or a standalone
router. LNet can route messages between different network types (such as
TCP-to-InfiniBand) or across different topologies (such as bridging two InfiniBand or
TCP/IP networks). Routing will be configured in <xref linkend="configuringlnet"/>.</para>
<para><emphasis role="bold">Identify the network interfaces to include
in or exclude from LNet.</emphasis></para>
<para>If not explicitly specified, LNet uses either the first available
interface or a pre-defined default for a given network type. Interfaces
that LNet should not use (such as an administrative network or
IP-over-IB) can be excluded.</para>
<para>Network interfaces to be used or excluded will be specified using
the lnet kernel module parameters <literal>networks</literal> and
<literal>ip2nets</literal> as described in
<xref linkend="configuringlnet"/>.</para>
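<para>For example, <literal>ip2nets</literal> can assign nodes to networks by IP address pattern, so one <literal>lustre.conf</literal> can serve the whole cluster (the addresses and interfaces below are illustrative):</para>

```
options lnet ip2nets="tcp0(eth0) 192.168.0.*; o2ib0 10.2.[1-3].*"
```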
<para><emphasis role="bold">To ease the setup of networks with complex
network configurations, determine a cluster-wide module configuration.
<para>For large clusters, you can configure the networking setup for
all nodes by using a single, unified set of parameters in the
<literal>lustre.conf</literal> file on each node. Cluster-wide
configuration is described in <xref linkend="configuringlnet"/>.</para>
<para>We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.</para>