X-Git-Url: https://git.whamcloud.com/?a=blobdiff_plain;f=SettingUpLustreSystem.xml;h=ffca6a8c7811dab39c8b82c8c62e521171af0d9e;hb=2c82bcd0e7356e9c228a95800789febec5fdc6b7;hp=b59f6c64379cbe8ad0316d1c1d39a4cff5fa9d14;hpb=0960c9869cb3514ba0b5c0c2ec74a1dbdf8f773c;p=doc%2Fmanual.git diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml index b59f6c6..ffca6a8 100644 --- a/SettingUpLustreSystem.xml +++ b/SettingUpLustreSystem.xml @@ -1,5 +1,7 @@ - + Determining Hardware Configuration Requirements and Formatting Options This chapter describes hardware configuration requirements for a Lustre file system @@ -7,7 +9,7 @@ - + @@ -22,16 +24,16 @@ - + - + -
+
<indexterm><primary>setup</primary></indexterm> <indexterm><primary>setup</primary><secondary>hardware</secondary></indexterm> <indexterm><primary>design</primary><see>setup</see></indexterm> @@ -53,12 +55,14 @@ </listitem> </itemizedlist> </warning> - <para>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are - typically used for testing to match expected customer usage and avoid limitations due to the 4 - GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs. - Also, due to kernel API limitations, performing backups of Lustre software release 2.x. file - systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit - inode number.</para> + <para>Only servers running on 64-bit CPUs are tested and supported. + 64-bit CPU clients are typically used for testing to match expected + customer usage and avoid limitations due to the 4 GB limit for RAM + size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit + CPUs. Also, due to kernel API limitations, performing backups of Lustre + filesystems on 32-bit clients may cause backup tools to confuse files + that report the same 32-bit inode number, if the backup tools depend + on the inode number for correct operation.</para> <para>The storage attached to the servers typically uses RAID to provide fault tolerance and can optionally be organized with logical volume management (LVM), which is then formatted as a Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format @@ -68,7 +72,11 @@ a separate device.</para> <para>The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores are recommended. More are advisable for files systems with many clients.</para> <note> - <para>Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size). </para> + <para>Lustre clients running on different CPU architectures is supported. + One limitation is that the PAGE_SIZE kernel macro on the client must be + as large as the PAGE_SIZE of the server. In particular, ARM or PPC + clients with large pages (up to 64kB pages) can run with x86 servers + (4kB pages).</para> </note> <section remap="h3"> <title><indexterm> @@ -95,24 +103,22 @@ chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.</para> - <para condition='l24'>If multiple MDTs are going to be present in the + <para>If multiple MDTs are going to be present in the system, each MDT should be specified for the anticipated usage and load. For details on how to add additional MDTs to the filesystem, see - <xref linkend="dbdoclet.adding_new_mdt"/>.</para> - <warning condition='l24'><para>MDT0 contains the root of the Lustre file - system. 
If MDT0 is unavailable for any reason, the file system cannot be - used.</para></warning> - <note condition='l24'><para>Using the DNE feature it is possible to - dedicate additional MDTs to sub-directories off the file system root - directory stored on MDT0, or arbitrarily for lower-level subdirectories. - using the <literal>lfs mkdir -i <replaceable>mdt_index</replaceable></literal> command. - If an MDT serving a subdirectory becomes unavailable, any subdirectories - on that MDT and all directories beneath it will also become inaccessible. - Configuring multiple levels of MDTs is an experimental feature for the - 2.4 release, and is fully functional in the 2.8 release. This is - typically useful for top-level directories to assign different users - or projects to separate MDTs, or to distribute other large working sets - of files to multiple MDTs.</para></note> + <xref linkend="lustremaint.adding_new_mdt"/>.</para> + <warning><para>MDT0000 contains the root of the Lustre file system. If + MDT0000 is unavailable for any reason, the file system cannot be used. + </para></warning> + <note><para>Using the DNE feature it is possible to dedicate additional + MDTs to sub-directories off the file system root directory stored on + MDT0000, or arbitrarily for lower-level subdirectories, using the + <literal>lfs mkdir -i <replaceable>mdt_index</replaceable></literal> + command. If an MDT serving a subdirectory becomes unavailable, any + subdirectories on that MDT and all directories beneath it will also + become inaccessible. This is typically useful for top-level directories + to assign different users or projects to separate MDTs, or to distribute + other large working sets of files to multiple MDTs.</para></note> <note condition='l28'><para>Starting in the 2.8 release it is possible to spread a single large directory across multiple MDTs using the DNE striped directory feature by specifying multiple stripes (or shards) @@ -185,7 +191,7 @@ data. This reserved space is unusable for general storage. Thus, at least this much space will be used per OST before any file object data is saved. </para> - <para condition="l24">With a ZFS backing filesystem for the MDT or OST, + <para>With a ZFS backing filesystem for the MDT or OST, the space allocation for inodes and file data is dynamic, and inodes are allocated as needed. A minimum of 4kB of usable space (before mirroring) is needed for each inode, exclusive of other overhead such as directories, @@ -210,7 +216,7 @@ The size is determined by the total number of servers in the Lustre file system cluster(s) that are managed by the MGS.</para> </section> - <section xml:id="dbdoclet.50438256_87676"> + <section xml:id="dbdoclet.mdt_space_requirements"> <title><indexterm> <primary>setup</primary> <secondary>MDT</secondary> @@ -283,7 +289,7 @@ Inodes will be added approximately in proportion to space added. </para> </note> - <note condition='l24'> + <note> <para>Note that the number of total and free inodes reported by <literal>lfs df -i</literal> for ZFS MDTs and OSTs is estimated based on the current average space used per inode. When a ZFS filesystem is @@ -294,12 +300,12 @@ better reflect actual site usage. 
</para> </note> - <note condition='l24'> - <para>Starting in release 2.4, using the DNE remote directory feature + <note> + <para>Using the DNE remote directory feature it is possible to increase the total number of inodes of a Lustre filesystem, as well as increasing the aggregate metadata performance, by configuring additional MDTs into the filesystem, see - <xref linkend="dbdoclet.adding_new_mdt"/> for details. + <xref linkend="lustremaint.adding_new_mdt"/> for details. </para> </note> </section> @@ -382,7 +388,7 @@ <para>The number of inodes on the MDT is determined at format time based on the total size of the file system to be created. The default <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio") - for an ldiskfs MDT is optimized at one inode for every 2048 bytes of file + for an ldiskfs MDT is optimized at one inode for every 2560 bytes of file system space.</para> <para>This setting takes into account the space needed for additional ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB), @@ -391,12 +397,14 @@ such as file layout for files with a large number of stripes, Access Control Lists (ACLs), and user extended attributes.</para> <para condition="l2B"> Starting in Lustre 2.11, the <xref linkend= - "dataonmdt.title"/> feature allows storing small files on the MDT + "dataonmdt.title"/> (DoM) feature allows storing small files on the MDT to take advantage of high-performance flash storage, as well as reduce space and network overhead. If you are planning to use the DoM feature with an ldiskfs MDT, it is recommended to <emphasis>increase</emphasis> - the inode ratio to have enough space on the MDT for small files.</para> - <para>It is possible to change the recommended 2048 bytes + the bytes-per-inode ratio to have enough space on the MDT for small files, + as described below. + </para> + <para>It is possible to change the recommended default of 2560 bytes per inode for an ldiskfs MDT when it is first formatted by adding the <literal>--mkfsoptions="-i bytes-per-inode"</literal> option to <literal>mkfs.lustre</literal>. Decreasing the inode ratio tunable @@ -404,11 +412,11 @@ MDT size, but will leave less space for extra per-file metadata and is not recommended. The inode ratio must always be strictly larger than the MDT inode size, which is 1024 bytes by default. It is recommended - to use an inode ratio at least 1024 bytes larger than the inode size to + to use an inode ratio at least 1536 bytes larger than the inode size to ensure the MDT does not run out of space. Increasing the inode ratio - to at least hold the most common file size (e.g. 5120 or 66560 bytes if - 4KB or 64KB files are widely used) is recommended for DoM.</para> - <para>The size of the inode may be changed by adding the + with enough space for the most commonly file size (e.g. 5632 or 66560 + bytes if 4KB or 64KB files are widely used) is recommended for DoM.</para> + <para>The size of the inode may be changed at format time by adding the <literal>--stripe-count-hint=N</literal> to have <literal>mkfs.lustre</literal> automatically calculate a reasonable inode size based on the default stripe count that will be used by the @@ -416,9 +424,9 @@ <literal>--mkfsoptions="-I inode-size"</literal> option. Increasing the inode size will provide more space in the inode for a larger Lustre file layout, ACLs, user and system extended attributes, SELinux and - other security labels, and other internal metadata. 
However, if these - features or other in-inode xattrs are not needed, the larger inode size - will hurt metadata performance as 2x, 4x, or 8x as much data would be + other security labels, and other internal metadata and DoM data. However, + if these features or other in-inode xattrs are not needed, a larger inode + size may hurt metadata performance as 2x, 4x, or 8x as much data would be read or written for each MDT inode access. </para> </section> @@ -428,10 +436,14 @@ <secondary>OST</secondary> </indexterm>Setting Formatting Options for an ldiskfs OST When formatting an OST file system, it can be beneficial - to take local file system usage into account. When doing so, try to - reduce the number of inodes on each OST, while keeping enough margin - for potential variations in future usage. This helps reduce the format - and file system check time and makes more space available for data. + to take local file system usage into account, for example by running + df and df -i on a current filesystem + to get the used bytes and used inodes respectively, then computing the + average bytes-per-inode value. When deciding on the ratio for a new + filesystem, try to avoid having too many inodes on each OST, while keeping + enough margin to allow for future usage of smaller files. This helps + reduce the format and e2fsck time and makes more space available for data. + The table below shows the default bytes-per-inode ratio ("inode ratio") used for OSTs of various sizes when they are formatted. @@ -515,13 +527,13 @@ [oss#] mkfs.lustre --ost --mkfsoptions="-i $((8192 * 1024))" ... - OSTs formatted with ldiskfs are limited to a maximum of - 320 million to 1 billion objects. Specifying a very small - bytes-per-inode ratio for a large OST that causes this limit to be - exceeded can cause either premature out-of-space errors and prevent - the full OST space from being used, or will waste space and slow down - e2fsck more than necessary. The default inode ratios are chosen to - ensure that the total number of inodes remain below this limit. + OSTs formatted with ldiskfs should preferably have fewer than + 320 million objects per MDT, and up to a maximum of 4 billion inodes. + Specifying a very small bytes-per-inode ratio for a large OST that + exceeds this limit can cause either premature out-of-space errors and + prevent the full OST space from being used, or will waste space and + slow down e2fsck more than necessary. The default inode ratios are + chosen to ensure the total number of inodes remain below this limit. @@ -533,7 +545,7 @@ filesystems are 5-30 minutes per TiB, but may increase significantly if substantial errors are detected and need to be repaired. - For more details about formatting MDT and OST file systems, + For further details about optimizing MDT and OST file systems, see .
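As a rough illustration of the survey described above, the average bytes-per-inode value of an existing filesystem can be computed from the df and df -i output and then passed to mkfs.lustre. This is only a sketch: the /scratch mount point, the /dev/ost_device name, and the numbers are hypothetical examples rather than recommended values.
[client#] df -P /scratch | awk 'NR==2 { print $3 * 1024 }'    # used space in bytes
[client#] df -Pi /scratch | awk 'NR==2 { print $3 }'          # used inodes
# If, for example, about 60 TiB is consumed by 15 million inodes, the average
# file uses roughly 4 MiB, so formatting the new OSTs with a smaller ratio such
# as 1 MiB per inode leaves a margin for future, smaller files:
[oss#] mkfs.lustre --ost --mkfsoptions="-i $((1024 * 1024))" ... /dev/ost_device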
@@ -556,13 +568,12 @@ File and File System Limits describes - current known limits of Lustre. These limits are imposed by either - the Lustre architecture or the Linux virtual file system (VFS) and - virtual memory subsystems. In a few cases, a limit is defined within - the code and can be changed by re-compiling the Lustre software. - Instructions to install from source code are beyond the scope of this - document, and can be found elsewhere online. In these cases, the - indicated limit was used for testing of the Lustre software. + current known limits of Lustre. These limits may be imposed by either + the Lustre architecture or the Linux virtual file system (VFS) and + virtual memory subsystems. In a few cases, a limit is defined within + the code Lustre based on tested values and could be changed by editing + and re-compiling the Lustre software. In these cases, the indicated + limit was used for testing of the Lustre software.
File and file system limits @@ -586,45 +597,45 @@ - Maximum number of MDTs + Maximum number of MDTs - 256 + 256 - The Lustre software release 2.3 and earlier allows a - maximum of 1 MDT per file system, but a single MDS can host - multiple MDTs, each one for a separate file system. - The Lustre software release 2.4 and later - requires one MDT for the filesystem root. At least 255 more - MDTs can be added to the filesystem and attached into - the namespace with DNE remote or striped directories. + A single MDS can host one or more MDTs, either for separate + filesystems, or aggregated into a single namespace. Each + filesystem requires a separate MDT for the filesystem root + directory. + Up to 255 more MDTs can be added to the filesystem and are + attached into the filesystem namespace with creation of DNE + remote or striped directories. - Maximum number of OSTs + Maximum number of OSTs 8150 The maximum number of OSTs is a constant that can be - changed at compile time. Lustre file systems with up to - 4000 OSTs have been tested. Multiple OST file systems can - be configured on a single OSS node. + changed at compile time. Lustre file systems with up to 4000 + OSTs have been configured in the past. Multiple OST targets + can be configured on a single OSS node. - Maximum OST size + Maximum OST size - 256TiB (ldiskfs), 256TiB (ZFS) + 1024TiB (ldiskfs), 1024TiB (ZFS) This is not a hard limit. Larger - OSTs are possible but most production systems do not + OSTs are possible, but most production systems do not typically go beyond the stated limit per OST because Lustre can add capacity and performance with additional OSTs, and having more OSTs improves aggregate I/O performance, @@ -634,13 +645,13 @@ With 32-bit kernels, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the - size of OST. It is strongly recommended to run Lustre - clients and servers with 64-bit kernels. + size of OST. It is strongly recommended + to run Lustre clients and servers with 64-bit kernels. - Maximum number of clients + Maximum number of clients 131072 @@ -653,21 +664,21 @@ - Maximum size of a single file system + Maximum size of a single file system - at least 1EiB + 2EiB or larger - Each OST can have a file system up to the - Maximum OST size limit, and the Maximum number of OSTs - can be combined into a single filesystem. + Each OST can have a file system up to the "Maximum OST + size" limit, and the Maximum number of OSTs can be combined + into a single filesystem. - Maximum stripe count + Maximum stripe count 2000 @@ -676,13 +687,22 @@ This limit is imposed by the size of the layout that needs to be stored on disk and sent in RPC requests, but is not a hard limit of the protocol. The number of OSTs in the - filesystem can exceed the stripe count, but this limits the - number of OSTs across which a single file can be striped. + filesystem can exceed the stripe count, but this is the maximum + number of OSTs on which a single file + can be striped. + Before 2.13, the default for ldiskfs + MDTs the maximum stripe count for a + single file is limited to 160 OSTs. In order to + increase the maximum file stripe count, use + --mkfsoptions="-O ea_inode" when formatting the MDT, + or use tune2fs -O ea_inode to enable it after the + MDT has been formatted. 
+ - Maximum stripe size + Maximum stripe size < 4 GiB @@ -694,7 +714,7 @@ - Minimum stripe size + Minimum stripe size 64 KiB @@ -703,12 +723,13 @@ Due to the use of 64 KiB PAGE_SIZE on some CPU architectures such as ARM and POWER, the minimum stripe size is 64 KiB so that a single page is not split over - multiple servers. + multiple servers. This is also the minimum Data-on-MDT + component size that can be specified. - Maximum object size + Maximum single object size 16TiB (ldiskfs), 256TiB (ZFS) @@ -723,7 +744,7 @@ - Maximum file size + Maximum file size 16 TiB on 32-bit systems @@ -736,9 +757,9 @@ 32-bit systems imposed by the kernel memory subsystem. On 64-bit systems this limit does not exist. Hence, files can be 2^63 bits (8EiB) in size if the backing filesystem can - support large enough objects. + support large enough objects and/or the files are sparse. A single file can have a maximum of 2000 stripes, which - gives an upper single file limit of 31.25 PiB for 64-bit + gives an upper single file data capacity of 31.25 PiB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped. @@ -746,14 +767,14 @@ - Maximum number of files or subdirectories in a single directory + Maximum number of files or subdirectories in a single directory - 10 million files (ldiskfs), 2^48 (ZFS) + 600M-3.8B files (ldiskfs), 16T (ZFS) The Lustre software uses the ldiskfs hashed directory - code, which has a limit of about 10 million files, depending + code, which has a limit of at least 600 million files, depending on the length of the file name. The limit on subdirectories is the same as the limit on regular files. Starting in the 2.8 release it is @@ -761,17 +782,19 @@ over multiple MDTs with the lfs mkdir -c command, which increases the single directory limit by a factor of the number of directory stripes used. - Lustre file systems are tested with ten million files - in a single directory. + Starting in the 2.14 release, the + large_dir feature of ldiskfs is enabled by + default to allow directories with more than 10M entries. In + the 2.12 release, the large_dir feature was + present but not enabled by default. - Maximum number of files in the file system + Maximum number of files in the file system - 4 billion (ldiskfs), 256 trillion (ZFS) - up to 256 times the per-MDT limit + 4 billion (ldiskfs), 256 trillion (ZFS) per MDT The ldiskfs filesystem imposes an upper limit of @@ -781,11 +804,11 @@ increased initially at the time of MDT filesystem creation. For more information, see . - The ZFS filesystem dynamically allocates + The ZFS filesystem dynamically allocates inodes and does not have a fixed ratio of inodes per unit of MDT space, but consumes approximately 4KiB of mirrored space per inode, depending on the configuration. - Each additional MDT can hold up to the + Each additional MDT can hold up to the above maximum number of additional files, depending on available space and the distribution directories and files in the filesystem. @@ -793,7 +816,7 @@ - Maximum length of a filename + Maximum length of a filename 255 bytes (filename) @@ -805,7 +828,7 @@ - Maximum length of a pathname + Maximum length of a pathname 4096 bytes (pathname) @@ -816,7 +839,7 @@ - Maximum number of open files for a Lustre file system + Maximum number of open files for a Lustre file system No limit @@ -826,23 +849,16 @@ of open files, but the practical limit depends on the amount of RAM on the MDS. 
No "tables" for open files exist on the MDS, as they are only linked in a list to a given client's - export. Each client process probably has a limit of several - thousands of open files which depends on the ulimit. + export. Each client process has a limit of several + thousands of open files which depends on its ulimit.
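As a concrete sketch of the ea_inode option mentioned in the stripe count entry above, the feature can be enabled when the MDT is formatted or added later with tune2fs, and then verified with dumpe2fs. The device name below is a placeholder, and tune2fs should normally be run while the MDT is unmounted.
[mds#] mkfs.lustre --mdt --mkfsoptions="-O ea_inode" ... /dev/mdt_device
[mds#] tune2fs -O ea_inode /dev/mdt_device
[mds#] dumpe2fs -h /dev/mdt_device | grep 'Filesystem features'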
  - By default for ldiskfs MDTs the maximum stripe count for a - single file is limited to 160 OSTs. In order to - increase the maximum file stripe count, use - --mkfsoptions="-O ea_inode" when formatting the MDT, - or use tune2fs -O ea_inode to enable it after the - MDT has been formatted. - -
+
<indexterm><primary>setup</primary><secondary>memory</secondary></indexterm>Determining Memory Requirements This section describes the memory requirements for each Lustre file system component.
@@ -865,79 +881,126 @@ Load placed on server - The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The number of locks held by clients varies by load and memory availability on the server. Interactive clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is approximately 2 KB per file, including the Lustre distributed lock manager (DLM) lock and kernel data structures for the files currently in use. Having file data in cache can improve metadata performance by a factor of 10x or more compared to reading it from disk. + The amount of memory used by the MDS is a function of how many clients are on + the system, and how many files they are using in their working set. This is driven, + primarily, by the number of locks a client can hold at one time. The number of locks + held by clients varies by load and memory availability on the server. Interactive + clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is + approximately 2 KB per file, including the Lustre distributed lock manager (LDLM) + lock and kernel data structures for the files currently in use. Having file data + in cache can improve metadata performance by a factor of 10x or more compared to + reading it from storage. MDS memory requirements include: - File system metadata : A reasonable amount of RAM needs to be available for file system metadata. While no hard limit can be placed on the amount of file system metadata, if more RAM is available, then the disk I/O is needed less often to retrieve the metadata. + File system metadata: + A reasonable amount of RAM needs to be available for file system metadata. + While no hard limit can be placed on the amount of file system metadata, + if more RAM is available, then the disk I/O is needed less often to retrieve + the metadata. - Network transport : If you are using TCP or other network transport that uses system memory for send/receive buffers, this memory requirement must also be taken into consideration. + Network transport: + If you are using TCP or other network transport that uses system memory for + send/receive buffers, this memory requirement must also be taken into + consideration. - Journal size : By default, the journal size is 400 MB for each Lustre ldiskfs file system. This can pin up to an equal amount of RAM on the MDS node per file system. + Journal size: + By default, the journal size is 4096 MB for each MDT ldiskfs file system. + This can pin up to an equal amount of RAM on the MDS node per file system. - Failover configuration : If the MDS node will be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails. + Failover configuration: + If the MDS node will be used for failover from another node, then the RAM + for each journal should be doubled, so the backup server can handle the + additional load if the primary server fails.
<indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>MDS</tertiary></indexterm>Calculating MDS Memory Requirements - By default, 400 MB are used for the file system journal. Additional RAM is used for caching file data for the larger working set, which is not actively in use by clients but should be kept "hot" for improved access times. Approximately 1.5 KB per file is needed to keep a file in cache without a lock. - For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes, and a 2 million file working set (of which 400,000 files are cached on the clients): + By default, 4096 MB are used for the ldiskfs filesystem journal. Additional + RAM is used for caching file data for the larger working set, which is not + actively in use by clients but should be kept "hot" for improved + access times. Approximately 1.5 KB per file is needed to keep a file in cache + without a lock. + For example, for a single MDT on an MDS with 1,024 clients, 12 interactive + login nodes, and a 6 million file working set (of which 4M files are cached + on the clients): - Operating system overhead = 512 MB - File system journal = 400 MB - 1000 * 4-core clients * 100 files/core * 2kB = 800 MB - 16 interactive clients * 10,000 files * 2kB = 320 MB - 1,600,000 file extra working set * 1.5kB/file = 2400 MB + Operating system overhead = 1024 MB + File system journal = 4096 MB + 1024 * 4-core clients * 1024 files/core * 2kB = 4096 MB + 12 interactive clients * 100,000 files * 2kB = 2400 MB + 2M file extra working set * 1.5kB/file = 3096 MB - Thus, the minimum requirement for a system with this configuration is at least 4 GB of RAM. However, additional memory may significantly improve performance. - For directories containing 1 million or more files, more memory may provide a significant benefit. For example, in an environment where clients randomly access one of 10 million files, having extra memory for the cache significantly improves performance. + Thus, the minimum requirement for an MDT with this configuration is at least + 16 GB of RAM. Additional memory may significantly improve performance. + For directories containing 1 million or more files, more memory can provide + a significant benefit. For example, in an environment where clients randomly + access one of 10 million files, having extra memory for the cache significantly + improves performance.
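The working-set estimate above can be sanity checked on a running MDS, since the dominant term is the number of LDLM locks currently granted to clients. A minimal sketch, noting that the exact parameter and namespace names can vary between Lustre releases:
[mds#] lctl get_param ldlm.namespaces.*.lock_count | \
           awk -F= '{ sum += $2 } END { print sum, "LDLM locks granted" }'
[mds#] free -g    # overall memory headroom on the MDS node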
<indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>OSS Memory Requirements - When planning the hardware for an OSS node, consider the memory usage of several - components in the Lustre file system (i.e., journal, service threads, file system metadata, - etc.). Also, consider the effect of the OSS read cache feature, which consumes memory as it - caches data on the OSS node. - In addition to the MDS memory requirements mentioned in , the OSS requirements include: + When planning the hardware for an OSS node, consider the memory usage of + several components in the Lustre file system (i.e., journal, service threads, + file system metadata, etc.). Also, consider the effect of the OSS read cache + feature, which consumes memory as it caches data on the OSS node. + In addition to the MDS memory requirements mentioned above, + the OSS requirements also include: - Service threads : The service threads on the OSS node pre-allocate a 4 MB I/O buffer for each ost_io service thread, so these buffers do not need to be allocated and freed for each I/O request. + Service threads: + The service threads on the OSS node pre-allocate an RPC-sized MB I/O buffer + for each ost_io service thread, so these buffers do not need to be allocated + and freed for each I/O request. - OSS read cache : OSS read cache provides read-only - caching of data on an OSS, using the regular Linux page cache to store the data. Just - like caching from a regular file system in the Linux operating system, OSS read cache - uses as much physical memory as is available. + OSS read cache: + OSS read cache provides read-only caching of data on an OSS, using the regular + Linux page cache to store the data. Just like caching from a regular file + system in the Linux operating system, OSS read cache uses as much physical + memory as is available. - The same calculation applies to files accessed from the OSS as for the MDS, but the load is distributed over many more OSSs nodes, so the amount of memory required for locks, inode cache, etc. listed under MDS is spread out over the OSS nodes. - Because of these memory requirements, the following calculations should be taken as determining the absolute minimum RAM required in an OSS node. + The same calculation applies to files accessed from the OSS as for the MDS, + but the load is distributed over many more OSSs nodes, so the amount of memory + required for locks, inode cache, etc. listed under MDS is spread out over the + OSS nodes. + Because of these memory requirements, the following calculations should be + taken as determining the absolute minimum RAM required in an OSS node.
<indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>Calculating OSS Memory Requirements - The minimum recommended RAM size for an OSS with two OSTs is computed below: + The minimum recommended RAM size for an OSS with eight OSTs is: - Ethernet/TCP send/receive buffers (4 MB * 512 threads) = 2048 MB - 400 MB journal size * 2 OST devices = 800 MB - 1.5 MB read/write per OST IO thread * 512 threads = 768 MB - 600 MB file system read cache * 2 OSTs = 1200 MB - 1000 * 4-core clients * 100 files/core * 2kB = 800MB - 16 interactive clients * 10,000 files * 2kB = 320MB - 1,600,000 file extra working set * 1.5kB/file = 2400MB - DLM locks + file system metadata TOTAL = 3520MB - Per OSS DLM locks + file system metadata = 3520MB/6 OSS = 600MB (approx.) - Per OSS RAM minimum requirement = 4096MB (approx.) + Linux kernel and userspace daemon memory = 1024 MB + Network send/receive buffers (16 MB * 512 threads) = 8192 MB + 1024 MB ldiskfs journal size * 8 OST devices = 8192 MB + 16 MB read/write buffer per OST IO thread * 512 threads = 8192 MB + 2048 MB file system read cache * 8 OSTs = 16384 MB + 1024 * 4-core clients * 1024 files/core * 2kB/file = 8192 MB + 12 interactive clients * 100,000 files * 2kB/file = 2400 MB + 2M file extra working set * 2kB/file = 4096 MB + DLM locks + file cache TOTAL = 31072 MB + Per OSS DLM locks + file system metadata = 31072 MB/4 OSS = 7768 MB (approx.) + Per OSS RAM minimum requirement = 32 GB (approx.) - This consumes about 1,400 MB just for the pre-allocated buffers, and an additional 2 GB for minimal file system and kernel usage. Therefore, for a non-failover configuration, the minimum RAM would be 4 GB for an OSS node with two OSTs. Adding additional memory on the OSS will improve the performance of reading smaller, frequently-accessed files. - For a failover configuration, the minimum RAM would be at least 6 GB. For 4 OSTs on each OSS in a failover configuration 10GB of RAM is reasonable. When the OSS is not handling any failed-over OSTs the extra RAM will be used as a read cache. - As a reasonable rule of thumb, about 2 GB of base memory plus 1 GB per OST can be used. In failover configurations, about 2 GB per OST is needed. + This consumes about 16 GB just for pre-allocated buffers, and an + additional 1 GB for minimal file system and kernel usage. Therefore, for a + non-failover configuration, the minimum RAM would be about 32 GB for an OSS node + with eight OSTs. Adding additional memory on the OSS will improve the performance + of reading smaller, frequently-accessed files. + For a failover configuration, the minimum RAM would be at least 48 GB, + as some of the memory is per-node. When the OSS is not handling any failed-over + OSTs the extra RAM will be used as a read cache. + As a reasonable rule of thumb, about 8 GB of base memory plus 3 GB per OST + can be used. In failover configurations, about 6 GB per OST is needed.
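As an illustration of the journal term in the calculation above, the ldiskfs journal size can be set explicitly when an OST is formatted, and inspected afterwards. This is a sketch only: the device name is a placeholder and 1024 MB simply matches the figure used in the example.
[oss#] mkfs.lustre --ost --mkfsoptions="-J size=1024" ... /dev/ost_device
[oss#] dumpe2fs -h /dev/ost_device | grep -i journal    # journal details; size reported by recent e2fsprogs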
-
+
<indexterm> <primary>setup</primary> <secondary>network</secondary> @@ -1010,3 +1073,6 @@ </note> </section> </chapter> +<!-- + vim:expandtab:shiftwidth=2:tabstop=8: + -->