1 <?xml version='1.0' encoding='UTF-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="settinguplustresystem">
3 <title xml:id="settinguplustresystem.title">Determining Hardware Configuration Requirements and
4 Formatting Options</title>
<para>This chapter describes hardware configuration requirements for a Lustre file system,
including:</para>
<itemizedlist>
<listitem><para><xref linkend="dbdoclet.50438256_49017"/></para></listitem>
<listitem><para><xref linkend="dbdoclet.50438256_31079"/></para></listitem>
<listitem><para><xref linkend="dbdoclet.50438256_84701"/></para></listitem>
<listitem><para><xref linkend="dbdoclet.50438256_26456"/></para></listitem>
<listitem><para><xref linkend="dbdoclet.50438256_78272"/></para></listitem>
</itemizedlist>
34 <section xml:id="dbdoclet.50438256_49017">
35 <title><indexterm><primary>setup</primary></indexterm>
36 <indexterm><primary>setup</primary><secondary>hardware</secondary></indexterm>
37 <indexterm><primary>design</primary><see>setup</see></indexterm>
38 Hardware Considerations</title>
39 <para>A Lustre file system can utilize any kind of block storage device such as single disks,
40 software RAID, hardware RAID, or a logical volume manager. In contrast to some networked file
41 systems, the block devices are only attached to the MDS and OSS nodes in a Lustre file system
42 and are not accessed by the clients directly.</para>
43 <para>Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)</para>
44 <para>For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.</para>
<para>For best performance in a production environment, dedicated clients are required. For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine; however, dedicated clients are the only supported configuration.</para>
46 <warning><para>Performance and recovery issues can occur if you put a client on an MDS or OSS:</para>
49 <para>Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.</para>
52 <para>Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.</para>
56 <para>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are
57 typically used for testing to match expected customer usage and avoid limitations due to the 4
58 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs.
Also, due to kernel API limitations, performing backups of Lustre software release 2.x file
systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit
inode number.</para>
62 <para>The storage attached to the servers typically uses RAID to provide fault tolerance and can
63 optionally be organized with logical volume management (LVM), which is then formatted as a
64 Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format
65 imposed by the file system.</para>
66 <para>The Lustre file system uses journaling file system technology on both the MDTs and OSTs.
For an MDT, as much as a 20 percent performance gain can be obtained by placing the journal on
68 a separate device.</para>
<para>The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores is recommended; more are advisable for file systems with many clients.</para>
<para>Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be at least as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size).</para>
75 <primary>setup</primary>
76 <secondary>MDT</secondary>
77 </indexterm> MGT and MDT Storage Hardware Considerations</title>
78 <para>MGT storage requirements are small (less than 100 MB even in the largest Lustre file
79 systems), and the data on an MGT is only accessed on a server/client mount, so disk
80 performance is not a consideration. However, this data is vital for file system access, so
81 the MGT should be reliable storage, preferably mirrored RAID1.</para>
82 <para>MDS storage is accessed in a database-like access pattern with many seeks and
83 read-and-writes of small amounts of data. High throughput to MDS storage is not important.
Storage types that provide much lower seek times, such as high-RPM SAS or SSD drives,
can be used for the MDT.</para>
86 <para>For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.</para>
87 <para>If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.</para>
88 <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.</para>
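<para>As an illustrative sketch only (the device names are assumptions, not a hardware
recommendation), such a RAID0-over-RAID1 array could be assembled with
<literal>mdadm</literal>:</para>
<screen># mirror pairs of disks from different controllers (RAID1)
mds# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdc
mds# mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb /dev/sdd
# stripe across the mirrored pairs (RAID0) to form the MDT device
mds# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2</screen>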
<para condition='l24'>If multiple MDTs are going to be present in the
system, each MDT should be sized for the anticipated usage and load.
91 For details on how to add additional MDTs to the filesystem, see
92 <xref linkend="dbdoclet.addingamdt"/>.</para>
93 <warning condition='l24'><para>MDT0 contains the root of the Lustre file
94 system. If MDT0 is unavailable for any reason, the file system cannot be
95 used.</para></warning>
96 <note condition='l24'><para>Using the DNE feature it is possible to
97 dedicate additional MDTs to sub-directories off the file system root
directory stored on MDT0, or arbitrarily for lower-level subdirectories,
using the <literal>lfs mkdir -i <replaceable>mdt_index</replaceable></literal> command.
100 If an MDT serving a subdirectory becomes unavailable, any subdirectories
101 on that MDT and all directories beneath it will also become inaccessible.
102 Configuring multiple levels of MDTs is an experimental feature for the
103 2.4 release, and is fully functional in the 2.8 release. This is
104 typically useful for top-level directories to assign different users
105 or projects to separate MDTs, or to distribute other large working sets
106 of files to multiple MDTs.</para></note>
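<para condition='l24'>For example, assuming a file system mounted at
<literal>/mnt/lustre</literal> (the mount point and directory name here are
illustrative), a new directory could be placed on the MDT with index 1 as follows:</para>
<screen condition='l24'>client# lfs mkdir -i 1 /mnt/lustre/remote_dir</screen>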
107 <note condition='l28'><para>Starting in the 2.8 release it is possible
108 to spread a single large directory across multiple MDTs using the DNE
109 striped directory feature by specifying multiple stripes (or shards)
110 at creation time using the
111 <literal>lfs mkdir -c <replaceable>stripe_count</replaceable></literal>
112 command, where <replaceable>stripe_count</replaceable> is often the
number of MDTs in the filesystem. Striped directories should typically
not be used for all directories in the filesystem, since they incur
extra overhead compared to non-striped directories, but they are useful
for larger directories (over 50k entries) where many output files are
being created at one time.</para></note>
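<para condition='l28'>As a sketch (the path and stripe count are illustrative), a
directory striped across four MDTs could be created with:</para>
<screen condition='l28'>client# lfs mkdir -c 4 /mnt/lustre/big_dir</screen>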
121 <title><indexterm><primary>setup</primary><secondary>OST</secondary></indexterm>OST Storage Hardware Considerations</title>
<para>The data access pattern for the OSS storage is a streaming I/O pattern that is dependent on the access patterns of applications being used. Each OSS can manage multiple object storage targets (OSTs), one for each volume, with I/O traffic load-balanced between servers and targets. An OSS should be configured to have a balance between the network bandwidth and the attached storage bandwidth to prevent bottlenecks in the I/O path. Depending on the server hardware, an OSS typically serves between 2 and 8 targets, with each target up to 128 terabytes (TB) in size.</para>
123 <para>Lustre file system capacity is the sum of the capacities provided by the targets. For
124 example, 64 OSSs, each with two 8 TB targets, provide a file system with a capacity of
125 nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks in a
126 RAID 6 configuration), it may be possible to get 50 MB/sec from each drive, providing up to
127 400 MB/sec of disk bandwidth per OST. If this system is used as storage backend with a
128 system network, such as the InfiniBand network, that provides a similar bandwidth, then each
129 OSS could provide 800 MB/sec of end-to-end I/O throughput. (Although the architectural
130 constraints described here are simple, in practice it takes careful hardware selection,
131 benchmarking and integration to obtain such results.)</para>
134 <section xml:id="dbdoclet.50438256_31079">
135 <title><indexterm><primary>setup</primary><secondary>space</secondary></indexterm>
136 <indexterm><primary>space</primary><secondary>determining requirements</secondary></indexterm>
137 Determining Space Requirements</title>
138 <para>The desired performance characteristics of the backing file systems on the MDT and OSTs
139 are independent of one another. The size of the MDT backing file system depends on the number
140 of inodes needed in the total Lustre file system, while the aggregate OST space depends on the
141 total amount of data stored on the file system. If MGS data is to be stored on the MDT device
142 (co-located MGT and MDT), add 100 MB to the required size estimate for the MDT.</para>
<para>Each time a file is created on a Lustre file system, it consumes one inode on the MDT and one inode for each OST object over which the file is striped. Normally, each file's stripe count is based on the system-wide default stripe count. However, this can be changed for individual files using the <literal>lfs setstripe</literal> command. For more details, see <xref linkend="managingstripingfreespace"/>.</para>
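<para>For example (the file name and stripe count here are illustrative), a file
striped across two OSTs, and therefore consuming one MDT inode and two OST objects,
could be created with:</para>
<screen>client# lfs setstripe -c 2 /mnt/lustre/twostripe_file</screen>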
144 <para>In a Lustre ldiskfs file system, all the inodes are allocated on the MDT and OSTs when the file system is first formatted. The total number of inodes on a formatted MDT or OST cannot be easily changed, although it is possible to add OSTs with additional space and corresponding inodes. Thus, the number of inodes created at format time should be generous enough to anticipate future expansion.</para>
145 <para>When the file system is in use and a file is created, the metadata associated with that file is stored in one of the pre-allocated inodes and does not consume any of the free space used to store file data.</para>
147 <para>By default, the ldiskfs file system used by Lustre servers to store user-data objects
148 and system data reserves 5% of space that cannot be used by the Lustre file system.
149 Additionally, a Lustre file system reserves up to 400 MB on each OST for journal use and a
150 small amount of space outside the journal to store accounting data. This reserved space is
151 unusable for general storage. Thus, at least 400 MB of space is used on each OST before any
152 file object data is saved.</para>
154 <para condition="l24">With a ZFS backing filesystem for the MDT or OST,
155 the space allocation for inodes and file data is dynamic, and inodes are
156 allocated as needed. A minimum of 2kB of usable space (before RAID) is
157 needed for each inode, exclusive of other overhead such as directories,
158 internal log files, extended attributes, ACLs, etc.
159 Since the size of extended attributes and ACLs is highly dependent on
160 kernel versions and site-specific policies, it is best to over-estimate
161 the amount of space needed for the desired number of inodes, and any
162 excess space will be utilized to store more inodes.</para>
165 <primary>setup</primary>
166 <secondary>MGT</secondary>
169 <primary>space</primary>
170 <secondary>determining MGT requirements</secondary>
171 </indexterm> Determining MGT Space Requirements</title>
172 <para>Less than 100 MB of space is required for the MGT. The size is determined by the number
173 of servers in the Lustre file system cluster(s) that are managed by the MGS.</para>
175 <section xml:id="dbdoclet.50438256_87676">
177 <primary>setup</primary>
178 <secondary>MDT</secondary>
181 <primary>space</primary>
182 <secondary>determining MDT requirements</secondary>
183 </indexterm> Determining MDT Space Requirements</title>
184 <para>When calculating the MDT size, the important factor to consider
185 is the number of files to be stored in the file system. This determines
186 the number of inodes needed, which drives the MDT sizing. To be on the
187 safe side, plan for 2 KB per ldiskfs inode on the MDT, which is the
188 default value. Attached storage required for Lustre file system metadata
is typically 1-2 percent of the file system capacity, depending upon
the average file size.</para>
191 <note condition='l24'><para>Starting in release 2.4, using the DNE
192 remote directory feature it is possible to increase the metadata
capacity of a single filesystem by configuring additional MDTs into
194 the filesystem, see <xref linkend="dbdoclet.addingamdt"/>. In order
195 to start creating new files and directories on the new MDT(s) they
196 need to be attached into the namespace at one or more subdirectories
197 using the <literal>lfs mkdir</literal> command.</para></note>
198 <para>For example, if the average file size is 5 MB and you have
199 100 TB of usable OST space, then you can calculate the minimum number
200 of inodes as follows:</para>
202 <para>(100 TB * 1024 GB/TB * 1024 MB/GB) / 5 MB/inode = 20 million inodes</para>
204 <para>It is recommended that the MDT have at least twice the minimum
205 number of inodes to allow for future expansion and allow for an average
206 file size smaller than expected. Thus, the required space is:</para>
208 <para>2 KB/inode x 20 million inodes x 2 = 80 GB</para>
210 <para>If the average file size is small, 4 KB for example, the Lustre
211 file system is not very efficient as the MDT will use as much space
212 for each file as the space used on the OST. However, this is not a
213 common configuration for a Lustre environment.</para>
215 <para>If the MDT is too small, this can cause the space on the OSTs
216 to be inaccessible since no new files can be created. Be sure to
217 determine the appropriate size of the MDT needed to support the file
218 system before formatting the file system. It is possible to increase the
219 number of inodes after the file system is formatted, depending on the
220 storage. For ldiskfs MDT filesystems the <literal>resize2fs</literal>
tool can be used if the underlying block device is on an LVM logical
222 volume. For ZFS new (mirrored) VDEVs can be added to the MDT pool.
223 Inodes will be added approximately in proportion to space added.</para>
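<para>As a minimal sketch, assuming an unmounted ldiskfs MDT on a hypothetical LVM
logical volume <literal>/dev/vg_mdt/mdt0</literal>, the expansion could look like:</para>
<screen>mds# lvextend -L +500G /dev/vg_mdt/mdt0
mds# e2fsck -f /dev/vg_mdt/mdt0
mds# resize2fs /dev/vg_mdt/mdt0</screen>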
225 <note condition='l24'><para>It is also possible to increase the number
226 of inodes available, as well as increasing the aggregate metadata
227 performance, by adding additional MDTs using the DNE remote directory
228 feature available in Lustre release 2.4 and later, see
229 <xref linkend="dbdoclet.addingamdt"/>.</para>
234 <primary>setup</primary>
235 <secondary>OST</secondary>
238 <primary>space</primary>
239 <secondary>determining OST requirements</secondary>
240 </indexterm> Determining OST Space Requirements</title>
241 <para>For the OST, the amount of space taken by each object depends on the usage pattern of
242 the users/applications running on the system. The Lustre software defaults to a conservative
243 estimate for the object size (16 KB per object). If you are confident that the average file
244 size for your applications will be larger than this, you can specify a larger average file
245 size (fewer total inodes) to reduce file system overhead and minimize file system check
246 time. See <xref linkend="dbdoclet.50438256_53886"/> for more details.</para>
249 <section xml:id="dbdoclet.50438256_84701">
251 <indexterm><primary>file system</primary><secondary>formatting options</secondary></indexterm>
252 <indexterm><primary>setup</primary><secondary>file system</secondary></indexterm>
253 Setting File System Formatting Options</title>
254 <para>By default, the <literal>mkfs.lustre</literal> utility applies these options to the Lustre
255 backing file system used to store data and metadata in order to enhance Lustre file system
256 performance and scalability. These options include:</para>
259 <para><literal>flex_bg</literal> - When the flag is set to enable this
260 flexible-block-groups feature, block and inode bitmaps for multiple groups are aggregated
261 to minimize seeking when bitmaps are read or written and to reduce read/modify/write
262 operations on typical RAID storage (with 1 MB RAID stripe widths). This flag is enabled on
263 both OST and MDT file systems. On MDT file systems the <literal>flex_bg</literal> factor
264 is left at the default value of 16. On OSTs, the <literal>flex_bg</literal> factor is set
265 to 256 to allow all of the block or inode bitmaps in a single <literal>flex_bg</literal>
266 to be read or written in a single I/O on typical RAID storage.</para>
269 <para><literal>huge_file</literal> - Setting this flag allows files on OSTs to be
270 larger than 2 TB in size.</para>
273 <para><literal>lazy_journal_init</literal> - This extended option is enabled to
274 prevent a full overwrite of the 400 MB journal that is allocated by default in a Lustre
275 file system, which reduces the file system format time.</para>
278 <para>To override the default formatting options, use arguments to
279 <literal>mkfs.lustre</literal> to pass formatting options to the backing file system:</para>
280 <screen>--mkfsoptions='backing fs options'</screen>
281 <para>For other <literal>mkfs.lustre</literal> options, see the Linux man page for
282 <literal>mke2fs(8)</literal>.</para>
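<para>For example, a sketch (the file system name, MGS NID, and device are assumptions)
that passes a backing file system option at format time, reducing the ldiskfs
reserved-blocks percentage mentioned earlier from the 5% default to 1% on an OST:</para>
<screen>oss# mkfs.lustre --fsname=testfs --mgsnode=mgs@tcp0 --ost --index=0 \
      --mkfsoptions="-m 1" /dev/sdb</screen>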
283 <section xml:id="dbdoclet.50438256_pgfId-1293228">
285 <primary>inodes</primary>
286 <secondary>MDS</secondary>
287 </indexterm><indexterm>
288 <primary>setup</primary>
289 <secondary>inodes</secondary>
290 </indexterm>Setting Formatting Options for an MDT</title>
291 <para>The number of inodes on the MDT is determined at format time based on the total size of
292 the file system to be created. The default <emphasis role="italic"
293 >bytes-per-inode</emphasis> ratio ("inode ratio") for an MDT is optimized at one inode for
every 2048 bytes of file system space. It is recommended that this value not be changed for MDTs.</para>
296 <para>This setting takes into account the space needed for additional metadata, such as the
297 journal (up to 400 MB), bitmaps and directories, and a few files that the Lustre file system
298 uses to maintain cluster consistency.</para>
300 <section xml:id="dbdoclet.50438256_53886">
302 <primary>inodes</primary>
303 <secondary>OST</secondary>
304 </indexterm>Setting Formatting Options for an OST</title>
305 <para>When formatting OST file systems, it is normally advantageous to take local file system
306 usage into account. When doing so, try to minimize the number of inodes on each OST, while
307 keeping enough margin for potential variations in future usage. This helps reduce the format
308 and file system check time and makes more space available for data.</para>
309 <para>The table below shows the default <emphasis role="italic">bytes-per-inode
310 </emphasis>ratio ("inode ratio") used for OSTs of various sizes when they are formatted. </para>
<title xml:id="settinguplustresystem.tab1">Inode Ratios Used for Newly Formatted OSTs</title>
316 <colspec colname="c1" colwidth="3*"/>
317 <colspec colname="c2" colwidth="2*"/>
318 <colspec colname="c3" colwidth="4*"/>
322 <para><emphasis role="bold">LUN/OST size</emphasis></para>
325 <para><emphasis role="bold">Inode ratio</emphasis></para>
328 <para><emphasis role="bold">Total inodes</emphasis></para>
<para> under 10GB </para>
338 <para> 1 inode/16KB </para>
341 <para> 640 - 655k </para>
346 <para> 10GB - 1TB </para>
349 <para> 1 inode/68kiB </para>
352 <para> 153k - 15.7M </para>
357 <para> 1TB - 8TB </para>
360 <para> 1 inode/256kB </para>
363 <para> 4.2M - 33.6M </para>
368 <para> over 8TB </para>
371 <para> 1 inode/1MB </para>
374 <para> 8.4M - 134M </para>
381 <para>In environments with few small files, the default inode ratio may result in far too many
382 inodes for the average file size. In this case, performance can be improved by increasing
the number of <emphasis role="italic">bytes-per-inode</emphasis>. To set the inode ratio, use
384 the <literal>-i</literal> argument to <literal>mkfs.lustre</literal> to specify the
385 <emphasis role="italic">bytes-per-inode</emphasis> value. </para>
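<para>For example, a sketch (the names and device are assumptions) that formats an OST
with one inode per 1 MB of space, suitable for a file system expected to hold mostly
large files:</para>
<screen>oss# mkfs.lustre --fsname=testfs --mgsnode=mgs@tcp0 --ost --index=1 \
      --mkfsoptions="-i 1048576" /dev/sdc</screen>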
387 <para>File system check time on OSTs is affected by a number of variables in addition to
388 the number of inodes, including the size of the file system, the number of allocated
389 blocks, the distribution of allocated blocks on the disk, disk speed, CPU speed, and the
amount of RAM on the server. Reasonable file system check times are 5-30 minutes per
TB.</para>
393 <para>For more details about formatting MDT and OST file systems, see <xref
394 linkend="dbdoclet.50438208_51921"/>.</para>
398 <primary>setup</primary>
399 <secondary>limits</secondary>
400 </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
401 <primary>wide striping</primary>
402 </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
403 <primary>xattr</primary>
404 <secondary><emphasis role="italic">See</emphasis> wide striping</secondary>
405 </indexterm><indexterm>
406 <primary>large_xattr</primary>
407 <secondary>ea_inode</secondary>
408 </indexterm><indexterm>
409 <primary>wide striping</primary>
410 <secondary>large_xattr</secondary>
411 <tertiary>ea_inode</tertiary>
412 </indexterm>File and File System Limits</title>
414 <para><xref linkend="settinguplustresystem.tab2"/> describes
415 file and file system size limits. These limits are imposed by either
416 the Lustre architecture or the Linux virtual file system (VFS) and
417 virtual memory subsystems. In a few cases, a limit is defined within
418 the code and can be changed by re-compiling the Lustre software.
419 Instructions to install from source code are beyond the scope of this
420 document, and can be found elsewhere online. In these cases, the
421 indicated limit was used for testing of the Lustre software. </para>
424 <title xml:id="settinguplustresystem.tab2">File and file system limits</title>
426 <colspec colname="c1" colwidth="3*"/>
427 <colspec colname="c2" colwidth="2*"/>
428 <colspec colname="c3" colwidth="4*"/>
432 <para><emphasis role="bold">Limit</emphasis></para>
435 <para><emphasis role="bold">Value</emphasis></para>
438 <para><emphasis role="bold">Description</emphasis></para>
445 <para> Maximum number of MDTs</para>
449 <para condition='l24'>4096</para>
452 <para>The Lustre software release 2.3 and earlier allows a maximum of 1 MDT per file
system, but a single MDS can host multiple MDTs, each one for a separate file
system.</para>
455 <para condition="l24">The Lustre software release 2.4 and later requires one MDT for
456 the filesystem root. Up to 4095 additional MDTs can be added to the file system and attached
457 into the namespace with remote directories.</para>
462 <para> Maximum number of OSTs</para>
468 <para>The maximum number of OSTs is a constant that can be changed at compile time.
469 Lustre file systems with up to 4000 OSTs have been tested.</para>
474 <para> Maximum OST size</para>
477 <para> 128TB (ldiskfs), 256TB (ZFS)</para>
480 <para>This is not a <emphasis>hard</emphasis> limit. Larger OSTs are possible but
481 today typical production systems do not go beyond the stated limit per OST. </para>
486 <para> Maximum number of clients</para>
492 <para>The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production.</para>
497 <para> Maximum size of a file system</para>
500 <para> 512 PB (ldiskfs), 1EB (ZFS)</para>
503 <para>Each OST or MDT on 64-bit kernel servers can have a file system up to the above limit. On 32-bit systems, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST on 32-bit kernel servers.</para>
504 <para>You can have multiple OST file systems on a single OSS node.</para>
509 <para> Maximum stripe count</para>
515 <para>This limit is imposed by the size of the layout that needs to be stored on disk and sent in RPC requests, but is not a hard limit of the protocol.</para>
520 <para> Maximum stripe size</para>
<para> &lt; 4 GB</para>
526 <para>The amount of data written to each object before moving on to next object.</para>
531 <para> Minimum stripe size</para>
537 <para>Due to the 64 KB PAGE_SIZE on some 64-bit machines, the minimum stripe size is set to 64 KB.</para>
541 <para> Maximum object size</para> </entry>
543 <para> 16TB (ldiskfs), 256TB (ZFS)</para>
546 <para>The amount of data that can be stored in a single object. An object
547 corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies.
548 For ZFS the limit is the size of the underlying OST.
Files can consist of up to 2000 stripes, each of which can contain up to the maximum object size.</para>
554 <para> Maximum <anchor xml:id="dbdoclet.50438256_marker-1290761" xreflabel=""/>file size</para>
557 <para> 16 TB on 32-bit systems</para>
559 <para> 31.25 PB on 64-bit ldiskfs systems, 8EB on 64-bit ZFS systems</para>
562 <para>Individual files have a hard limit of nearly 16 TB on 32-bit systems imposed
563 by the kernel memory subsystem. On 64-bit systems this limit does not exist.
Hence, files can be up to 2^63 bytes (8 EB) in size if the backing filesystem can support large enough objects.</para>
565 <para>A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.</para>
570 <para> Maximum number of files or subdirectories in a single directory</para>
573 <para> 10 million files (ldiskfs), 2^48 (ZFS)</para>
576 <para>The Lustre software uses the ldiskfs hashed directory
577 code, which has a limit of about 10 million files, depending
578 on the length of the file name. The limit on subdirectories
579 is the same as the limit on regular files.</para>
580 <note condition='l28'><para>Starting in the 2.8 release it is
581 possible to exceed this limit by striping a single directory
582 over multiple MDTs with the <literal>lfs mkdir -c</literal>
583 command, which increases the single directory limit by a
584 factor of the number of directory stripes used.</para></note>
585 <para>Lustre file systems are tested with ten million files
586 in a single directory.</para>
591 <para> Maximum number of files in the file system</para>
594 <para> 4 billion (ldiskfs), 256 trillion (ZFS)</para>
595 <para condition='l24'>up to 256 times the per-MDT limit</para>
598 <para>The ldiskfs filesystem imposes an upper limit of
599 4 billion inodes per filesystem. By default, the MDT
600 filesystem is formatted with one inode per 2KB of space,
601 meaning 512 million inodes per TB of MDT space. This can be
602 increased initially at the time of MDT filesystem creation.
603 For more information, see
604 <xref linkend="settinguplustresystem"/>.</para>
605 <para condition="l24">The ZFS filesystem
606 dynamically allocates inodes and does not have a fixed ratio
607 of inodes per unit of MDT space, but consumes approximately
608 4KB of space per inode, depending on the configuration.</para>
609 <para condition="l24">Each additional MDT can hold up to the
610 above maximum number of additional files, depending on
available space and the distribution of directories and files
612 in the filesystem.</para>
617 <para> Maximum length of a filename</para>
620 <para> 255 bytes (filename)</para>
623 <para>This limit is 255 bytes for a single filename, the
624 same as the limit in the underlying filesystems.</para>
629 <para> Maximum length of a pathname</para>
632 <para> 4096 bytes (pathname)</para>
635 <para>The Linux VFS imposes a full pathname length of 4096 bytes.</para>
640 <para> Maximum number of open files for a Lustre file system</para>
643 <para> No limit</para>
646 <para>The Lustre software does not impose a maximum for the number of open files,
647 but the practical limit depends on the amount of RAM on the MDS. No
648 "tables" for open files exist on the MDS, as they are only linked in a
list to a given client's export. Each client process typically has a limit of
several thousand open files, which depends on its ulimit.</para>
658 <para condition="l22">In Lustre software releases prior to release 2.2, the maximum stripe
659 count for a single file was limited to 160 OSTs. In Lustre software release 2.2, the large
660 <literal>xattr</literal> feature ("wide striping") was added to support up to 2000 OSTs.
661 This feature is disabled by default at <literal>mkfs.lustre</literal> time. In order to
662 enable this feature, set the "<literal>-O large_xattr</literal>" or "<literal>-O ea_inode</literal>"
663 option on the MDT either by using <literal>--mkfsoptions</literal> at format time or by using
664 <literal>tune2fs</literal>. Using either "<literal>large_xattr</literal>" or "<literal>ea_inode</literal>"
665 results in "<literal>ea_inode</literal>" in the file system feature list.</para>
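<para condition="l22">For example, a sketch (the device name is an assumption) of
enabling the feature on an existing MDT with <literal>tune2fs</literal>:</para>
<screen condition="l22">mds# tune2fs -O ea_inode /dev/mdtdev</screen>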
669 <section xml:id="dbdoclet.50438256_26456">
670 <title><indexterm><primary>setup</primary><secondary>memory</secondary></indexterm>Determining Memory Requirements</title>
671 <para>This section describes the memory requirements for each Lustre file system component.</para>
674 <indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>client</tertiary></indexterm>
675 Client Memory Requirements</title>
676 <para>A minimum of 2 GB RAM is recommended for clients.</para>
679 <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>MDS</tertiary></indexterm>MDS Memory Requirements</title>
680 <para>MDS memory requirements are determined by the following factors:</para>
683 <para>Number of clients</para>
686 <para>Size of the directories</para>
689 <para>Load placed on server</para>
692 <para>The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The number of locks held by clients varies by load and memory availability on the server. Interactive clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is approximately 2 KB per file, including the Lustre distributed lock manager (DLM) lock and kernel data structures for the files currently in use. Having file data in cache can improve metadata performance by a factor of 10x or more compared to reading it from disk.</para>
693 <para>MDS memory requirements include:</para>
696 <para><emphasis role="bold">File system metadata</emphasis> : A reasonable amount of RAM needs to be available for file system metadata. While no hard limit can be placed on the amount of file system metadata, if more RAM is available, then the disk I/O is needed less often to retrieve the metadata.</para>
699 <para><emphasis role="bold">Network transport</emphasis> : If you are using TCP or other network transport that uses system memory for send/receive buffers, this memory requirement must also be taken into consideration.</para>
702 <para><emphasis role="bold">Journal size</emphasis> : By default, the journal size is 400 MB for each Lustre ldiskfs file system. This can pin up to an equal amount of RAM on the MDS node per file system.</para>
705 <para><emphasis role="bold">Failover configuration</emphasis> : If the MDS node will be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.</para>
709 <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>MDS</tertiary></indexterm>Calculating MDS Memory Requirements</title>
710 <para>By default, 400 MB are used for the file system journal. Additional RAM is used for caching file data for the larger working set, which is not actively in use by clients but should be kept "hot" for improved access times. Approximately 1.5 KB per file is needed to keep a file in cache without a lock.</para>
711 <para>For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes, and a 2 million file working set (of which 400,000 files are cached on the clients):</para>
713 <para>Operating system overhead = 512 MB</para>
714 <para>File system journal = 400 MB</para>
715 <para>1000 * 4-core clients * 100 files/core * 2kB = 800 MB</para>
716 <para>16 interactive clients * 10,000 files * 2kB = 320 MB</para>
717 <para>1,600,000 file extra working set * 1.5kB/file = 2400 MB</para>
719 <para>Thus, the minimum requirement for a system with this configuration is at least 4 GB of RAM. However, additional memory may significantly improve performance.</para>
720 <para>For directories containing 1 million or more files, more memory may provide a significant benefit. For example, in an environment where clients randomly access one of 10 million files, having extra memory for the cache significantly improves performance.</para>
724 <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>OSS Memory Requirements</title>
725 <para>When planning the hardware for an OSS node, consider the memory usage of several
726 components in the Lustre file system (i.e., journal, service threads, file system metadata,
727 etc.). Also, consider the effect of the OSS read cache feature, which consumes memory as it
728 caches data on the OSS node.</para>
729 <para>In addition to the MDS memory requirements mentioned in <xref linkend="dbdoclet.50438256_87676"/>, the OSS requirements include:</para>
732 <para><emphasis role="bold">Service threads</emphasis> : The service threads on the OSS node pre-allocate a 4 MB I/O buffer for each ost_io service thread, so these buffers do not need to be allocated and freed for each I/O request.</para>
735 <para><emphasis role="bold">OSS read cache</emphasis> : OSS read cache provides read-only
736 caching of data on an OSS, using the regular Linux page cache to store the data. Just
737 like caching from a regular file system in the Linux operating system, OSS read cache
738 uses as much physical memory as is available.</para>
<para>The same calculation applies to files accessed from the OSS as for the MDS, but the load is distributed over many more OSS nodes, so the amount of memory required for locks, inode cache, etc. listed under MDS is spread out over the OSS nodes.</para>
742 <para>Because of these memory requirements, the following calculations should be taken as determining the absolute minimum RAM required in an OSS node.</para>
744 <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>Calculating OSS Memory Requirements</title>
745 <para>The minimum recommended RAM size for an OSS with two OSTs is computed below:</para>
747 <para>Ethernet/TCP send/receive buffers (4 MB * 512 threads) = 2048 MB</para>
748 <para>400 MB journal size * 2 OST devices = 800 MB</para>
749 <para>1.5 MB read/write per OST IO thread * 512 threads = 768 MB</para>
750 <para>600 MB file system read cache * 2 OSTs = 1200 MB</para>
751 <para>1000 * 4-core clients * 100 files/core * 2kB = 800MB</para>
752 <para>16 interactive clients * 10,000 files * 2kB = 320MB</para>
753 <para>1,600,000 file extra working set * 1.5kB/file = 2400MB</para>
754 <para> DLM locks + file system metadata TOTAL = 3520MB</para>
755 <para>Per OSS DLM locks + file system metadata = 3520MB/6 OSS = 600MB (approx.)</para>
756 <para>Per OSS RAM minimum requirement = 4096MB (approx.)</para>
758 <para>This consumes about 1,400 MB just for the pre-allocated buffers, and an additional 2 GB for minimal file system and kernel usage. Therefore, for a non-failover configuration, the minimum RAM would be 4 GB for an OSS node with two OSTs. Adding additional memory on the OSS will improve the performance of reading smaller, frequently-accessed files.</para>
759 <para>For a failover configuration, the minimum RAM would be at least 6 GB. For 4 OSTs on each OSS in a failover configuration 10GB of RAM is reasonable. When the OSS is not handling any failed-over OSTs the extra RAM will be used as a read cache.</para>
760 <para>As a reasonable rule of thumb, about 2 GB of base memory plus 1 GB per OST can be used. In failover configurations, about 2 GB per OST is needed.</para>
764 <section xml:id="dbdoclet.50438256_78272">
766 <primary>setup</primary>
767 <secondary>network</secondary>
768 </indexterm>Implementing Networks To Be Used by the Lustre File System</title>
769 <para>As a high performance file system, the Lustre file system places heavy loads on networks.
770 Thus, a network interface in each Lustre server and client is commonly dedicated to Lustre
771 file system traffic. This is often a dedicated TCP/IP subnet, although other network hardware
772 can also be used.</para>
773 <para>A typical Lustre file system implementation may include the following:</para>
776 <para>A high-performance backend network for the Lustre servers, typically an InfiniBand (IB) network.</para>
779 <para>A larger client network.</para>
782 <para>Lustre routers to connect the two networks.</para>
785 <para>Lustre networks and routing are configured and managed by specifying parameters to the
786 Lustre networking (<literal>lnet</literal>) module in
787 <literal>/etc/modprobe.d/lustre.conf</literal>.</para>
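<para>For example, a minimal sketch (the interface names are assumptions) of a
<literal>lustre.conf</literal> entry restricting LNET to one TCP network and one
InfiniBand network:</para>
<screen>options lnet networks=tcp0(eth1),o2ib0(ib0)</screen>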
788 <para>To prepare to configure Lustre networking, complete the following steps:</para>
791 <para><emphasis role="bold">Identify all machines that will be running Lustre software and
792 the network interfaces they will use to run Lustre file system traffic. These machines
will form the Lustre network.</emphasis></para>
794 <para>A network is a group of nodes that communicate directly with one another. The Lustre
795 software includes Lustre network drivers (LNDs) to support a variety of network types and
796 hardware (see <xref linkend="understandinglustrenetworking"/> for a complete list). The
standard rules for specifying networks apply to Lustre networks. For example, two TCP
798 networks on two different subnets (<literal>tcp0</literal> and <literal>tcp1</literal>)
799 are considered to be two different Lustre networks.</para>
802 <para><emphasis role="bold">If routing is needed, identify the nodes to be used to route traffic between networks.</emphasis></para>
803 <para>If you are using multiple network types, then you will need a router. Any node with
804 appropriate interfaces can route Lustre networking (LNET) traffic between different
network hardware types or topologies; the node may be a server, a client, or a standalone
806 router. LNET can route messages between different network types (such as
807 TCP-to-InfiniBand) or across different topologies (such as bridging two InfiniBand or
808 TCP/IP networks). Routing will be configured in <xref linkend="configuringlnet"/>.</para>
811 <para><emphasis role="bold">Identify the network interfaces to include in or exclude from LNET. </emphasis>
<para>If not explicitly specified, LNET uses either the first available interface or a pre-defined default for a given network type. Interfaces that LNET should not use (such as an administrative network or IP-over-IB) can be excluded.</para>
<para>Network interfaces to be used or excluded will be specified using the lnet kernel module parameters <literal>networks</literal> and <literal>ip2nets</literal>, as described in <xref linkend="configuringlnet"/>.</para>
817 <para><emphasis role="bold">To ease the setup of networks with complex network configurations, determine a cluster-wide module configuration.</emphasis></para>
818 <para>For large clusters, you can configure the networking setup for all nodes by using a single, unified set of parameters in the <literal>lustre.conf</literal> file on each node. Cluster-wide configuration is described in <xref linkend="configuringlnet"/>.</para>
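<para>For example, a cluster-wide sketch (the addresses and interface names are
assumptions) using <literal>ip2nets</literal> so that every node selects the correct
network from the same <literal>lustre.conf</literal>:</para>
<screen>options lnet 'ip2nets="tcp0(eth0) 192.168.0.*; o2ib0 10.10.[1-2].*"'</screen>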
822 <para>We recommend that you use 'dotted-quad' notation for IP addresses rather than host names to make it easier to read debug logs and debug configurations with multiple interfaces.</para>