LUDOC-394 manual: Add meaningful ref names under InstallingLustre.xml

[doc/manual.git] / SettingUpLustreSystem.xml
diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml

index 9e64b6c..ffca6a8 100644
--- a/SettingUpLustreSystem.xml
+++ b/SettingUpLustreSystem.xml
@@ -1,5 +1,7 @@
  <?xml version='1.0' encoding='UTF-8'?>
-<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="settinguplustresystem">
+<chapter xmlns="http://docbook.org/ns/docbook"
+ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
+ xml:id="settinguplustresystem">
    <title xml:id="settinguplustresystem.title">Determining Hardware Configuration Requirements and
      Formatting Options</title>
    <para>This chapter describes hardware configuration requirements for a Lustre file system
@@ -7,31 +9,31 @@
    <itemizedlist>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_49017"/>
+          <xref linkend="dbdoclet.storage_hardware_considerations"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_31079"/>
+          <xref linkend="dbdoclet.space_requirements"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_84701"/>
+          <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_26456"/>
+          <xref linkend="dbdoclet.mds_oss_memory"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_78272"/>
+          <xref linkend="dbdoclet.network_considerations"/>
        </para>
      </listitem>
    </itemizedlist>
-  <section xml:id="dbdoclet.50438256_49017">
+  <section xml:id="dbdoclet.storage_hardware_considerations">
        <title><indexterm><primary>setup</primary></indexterm>
    <indexterm><primary>setup</primary><secondary>hardware</secondary></indexterm>        
    <indexterm><primary>design</primary><see>setup</see></indexterm>        
@@ -52,13 +54,15 @@
          <para>Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.</para>
        </listitem>
      </itemizedlist>
-       </warning>
-    <para>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are
-      typically used for testing to match expected customer usage and avoid limitations due to the 4
-      GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs.
-      Also, due to kernel API limitations, performing backups of Lustre software release 2.x. file
-      systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit
-      inode number.</para>
+    </warning>
+    <para>Only servers running on 64-bit CPUs are tested and supported.
+      64-bit CPU clients are typically used for testing to match expected
+      customer usage and avoid limitations due to the 4 GB limit for RAM
+      size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit
+      CPUs.  Also, due to kernel API limitations, performing backups of Lustre
+      filesystems on 32-bit clients may cause backup tools to confuse files
+      that report the same 32-bit inode number, if the backup tools depend
+      on the inode number for correct operation.</para>
      <para>The storage attached to the servers typically uses RAID to provide fault tolerance and can
        optionally be organized with logical volume management (LVM), which is then formatted as a
        Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format
@@ -68,76 +72,137 @@
        a separate device.</para>
      <para>The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores is recommended. More are advisable for file systems with many clients.</para>
      <note>
-      <para>Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size). </para>
+      <para>Lustre clients running on different CPU architectures are supported.
+      One limitation is that the PAGE_SIZE kernel macro on the client must be
+      at least as large as the PAGE_SIZE of the server. In particular, ARM or PPC
+      clients with large pages (up to 64kB pages) can run with x86 servers
+      (4kB pages).</para>
      </note>
      <section remap="h3">
          <title><indexterm>
            <primary>setup</primary>
            <secondary>MDT</secondary>
          </indexterm> MGT and MDT Storage Hardware Considerations</title>
-      <para>MGT storage requirements are small (less than 100 MB even in the largest Lustre file
-        systems), and the data on an MGT is only accessed on a server/client mount, so disk
-        performance is not a consideration.  However, this data is vital for file system access, so
+      <para>MGT storage requirements are small (less than 100 MB even in the
+      largest Lustre file systems), and the data on an MGT is only accessed
+      on a server/client mount, so disk performance is not a consideration.
+      However, this data is vital for file system access, so
          the MGT should be reliable storage, preferably mirrored RAID1.</para>
-      <para>MDS storage is accessed in a database-like access pattern with many seeks and
-        read-and-writes of small amounts of data. High throughput to MDS storage is not important.
-        Storage types that provide much lower seek times, such as high-RPM SAS or SSD drives can be
-        used for the MDT.</para>
-      <para>For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.</para>
-      <para>If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.</para>
-      <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.</para>
-      <para condition='l24'>If multiple MDTs are going to be present in the system, each MDT should be specified for the anticipated usage and load.</para>
-      <warning condition='l24'><para>MDT0 contains the root of the Lustre file system. If MDT0 is unavailable for any reason, the
-          file system cannot be used.</para></warning>
-      <note condition='l24'><para>Additional MDTs can be dedicated to sub-directories off the root file system provided by MDT0.
-          Subsequent directories may also be configured to have their own MDT. If an MDT serving a
-          subdirectory becomes unavailable this subdirectory and all directories beneath it will
-          also become unavailable. Configuring multiple levels of MDTs is an experimental feature
-          for the Lustre software release 2.4.</para></note>
+      <para>MDS storage is accessed in a database-like access pattern with
+      many seeks and read-and-writes of small amounts of data.
+      Storage types that provide much lower seek times, such as SSD or NVMe,
+      are strongly preferred for the MDT, and high-RPM SAS is acceptable.</para>
+      <para>For maximum performance, the MDT should be configured as RAID1 with
+      an internal journal and two disks from different controllers.</para>
+      <para>If you need a larger MDT, create multiple RAID1 devices from pairs
+      of disks, and then make a RAID0 array of the RAID1 devices.  For ZFS,
+      use <literal>mirror</literal> VDEVs for the MDT.  This ensures
+      maximum reliability because multiple disk failures only have a small
+      chance of hitting both disks in the same RAID1 device.</para>
+      <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50%
+      chance that even two disk failures can cause the loss of the whole MDT
+      device. The first failure disables an entire half of the mirror and the
+      second failure has a 50% chance of disabling the remaining mirror.</para>
+      <para>If multiple MDTs are going to be present in the
+      system, each MDT should be specified for the anticipated usage and load.
+      For details on how to add additional MDTs to the filesystem, see
+      <xref linkend="lustremaint.adding_new_mdt"/>.</para>
+      <warning><para>MDT0000 contains the root of the Lustre file system. If
+        MDT0000 is unavailable for any reason, the file system cannot be used.
+      </para></warning>
+      <note><para>Using the DNE feature, it is possible to dedicate additional
+      MDTs to sub-directories off the file system root directory stored on
+      MDT0000, or arbitrarily for lower-level subdirectories, using the
+      <literal>lfs mkdir -i <replaceable>mdt_index</replaceable></literal>
+      command.  If an MDT serving a subdirectory becomes unavailable, any
+      subdirectories on that MDT and all directories beneath it will also
+      become inaccessible.  This is typically useful for top-level directories
+      to assign different users or projects to separate MDTs, or to distribute
+      other large working sets of files to multiple MDTs.</para></note>
+      <note condition='l28'><para>Starting in the 2.8 release it is possible
+      to spread a single large directory across multiple MDTs using the DNE
+      striped directory feature by specifying multiple stripes (or shards)
+      at creation time using the
+      <literal>lfs mkdir -c <replaceable>stripe_count</replaceable></literal>
+      command, where <replaceable>stripe_count</replaceable> is often the
+      number of MDTs in the filesystem.  Striped directories should typically
+      not be used for all directories in the filesystem, since this incurs
+      extra overhead compared to non-striped directories, but is useful for
+      larger directories (over 50k entries) where many output files are being
+      created at one time.
+      </para></note>
      </section>
      <section remap="h3">
        <title><indexterm><primary>setup</primary><secondary>OST</secondary></indexterm>OST Storage Hardware Considerations</title>
-      <para>The data access pattern for the OSS storage is a streaming I/O pattern that is dependent on the access patterns of applications being used. Each OSS can manage multiple object storage targets (OSTs), one for each volume with I/O traffic load-balanced between servers and targets. An OSS should be configured to have a balance between the network bandwidth and the attached storage bandwidth to prevent bottlenecks in the I/O path. Depending on the server hardware, an OSS typically serves between 2 and 8 targets, with each target up to 128 terabytes (TBs) in size.</para>
-      <para>Lustre file system capacity is the sum of the capacities provided by the targets. For
-        example, 64 OSSs, each with two 8 TB targets, provide a file system with a capacity of
-        nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks in a
-        RAID 6 configuration), it may be possible to get 50 MB/sec from each drive, providing up to
-        400 MB/sec of disk bandwidth per OST. If this system is used as storage backend with a
-        system network, such as the InfiniBand network, that provides a similar bandwidth, then each
-        OSS could provide 800 MB/sec of end-to-end I/O throughput. (Although the architectural
-        constraints described here are simple, in practice it takes careful hardware selection,
-        benchmarking and integration to obtain such results.)</para>
+      <para>The data access pattern for the OSS storage is a streaming I/O
+      pattern that is dependent on the access patterns of applications being
+      used. Each OSS can manage multiple object storage targets (OSTs), one
+      for each volume with I/O traffic load-balanced between servers and
+      targets. An OSS should be configured to have a balance between the
+      network bandwidth and the attached storage bandwidth to prevent
+      bottlenecks in the I/O path. Depending on the server hardware, an OSS
+      typically serves between 2 and 8 targets, with each target typically
+      24-48 TB in size, although a target may be up to 256 TB.</para>
+      <para>Lustre file system capacity is the sum of the capacities provided
+      by the targets. For example, 64 OSSs, each with two 8 TB OSTs,
+      provide a file system with a capacity of nearly 1 PB. If each OST uses
+      ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID-6
+      configuration), it may be possible to get 50 MB/sec from each drive,
+      providing up to 400 MB/sec of disk bandwidth per OST. If this system
+      is used as storage backend with a system network, such as the InfiniBand
+      network, that provides a similar bandwidth, then each OSS could provide
+      800 MB/sec of end-to-end I/O throughput. (Although the architectural
+      constraints described here are simple, in practice it takes careful
+      hardware selection, benchmarking and integration to obtain such
+      results.)</para>
      </section>
    </section>
-  <section xml:id="dbdoclet.50438256_31079">
+  <section xml:id="dbdoclet.space_requirements">
        <title><indexterm><primary>setup</primary><secondary>space</secondary></indexterm>
            <indexterm><primary>space</primary><secondary>determining requirements</secondary></indexterm>
            Determining Space Requirements</title>
-    <para>The desired performance characteristics of the backing file systems on the MDT and OSTs
-      are independent of one another. The size of the MDT backing file system depends on the number
-      of inodes needed in the total Lustre file system, while the aggregate OST space depends on the
-      total amount of data stored on the file system. If MGS data is to be stored on the MDT device
-      (co-located MGT and MDT), add 100 MB to the required size estimate for the MDT.</para>
-    <para>Each time a file is created on a Lustre file system, it consumes one inode on the MDT and one inode for each OST object over which the file is striped. Normally, each file&apos;s stripe count is based on the system-wide default stripe count. However, this can be changed for individual files using the <literal>lfs setstripe</literal> option. For more details, see <xref linkend="managingstripingfreespace"/>.</para>
-    <para>In a Lustre ldiskfs file system, all the inodes are allocated on the MDT and OSTs when the file system is first formatted. The total number of inodes on a formatted MDT or OST cannot be easily changed, although it is possible to add OSTs with additional space and corresponding inodes. Thus, the number of inodes created at format time should be generous enough to anticipate future expansion.</para>
-    <para>When the file system is in use and a file is created, the metadata associated with that file is stored in one of the pre-allocated inodes and does not consume any of the free space used to store file data.</para>
-    <note>
-      <para>By default, the ldiskfs file system used by Lustre servers to store user-data objects
-        and system data reserves 5% of space that cannot be used by the Lustre file system.
-        Additionally, a Lustre file system reserves up to 400 MB on each OST for journal use and a
-        small amount of space outside the journal to store accounting data. This reserved space is
-        unusable for general storage. Thus, at least 400 MB of space is used on each OST before any
-        file object data is saved.</para>
-    </note>
-    <para condition="l24">With a ZFS backing filesystem for the MDT or OST,
+    <para>The desired performance characteristics of the backing file systems
+    on the MDT and OSTs are independent of one another. The size of the MDT
+    backing file system depends on the number of inodes needed in the total
+    Lustre file system, while the aggregate OST space depends on the total
+    amount of data stored on the file system. If MGS data is to be stored
+    on the MDT device (co-located MGT and MDT), add 100 MB to the required
+    size estimate for the MDT.</para>
+    <para>Each time a file is created on a Lustre file system, it consumes
+    one inode on the MDT and one OST object over which the file is striped.
+    Normally, each file&apos;s stripe count is based on the system-wide
+    default stripe count.  However, this can be changed for individual files
+    using the <literal>lfs setstripe</literal> option. For more details,
+    see <xref linkend="managingstripingfreespace"/>.</para>
+    <para>In a Lustre ldiskfs file system, all the MDT inodes and OST
+    objects are allocated when the file system is first formatted.  When
+    the file system is in use and a file is created, metadata associated
+    with that file is stored in one of the pre-allocated inodes and does
+    not consume any of the free space used to store file data.  The total
+    number of inodes on a formatted ldiskfs MDT or OST cannot be easily
+    changed. Thus, the number of inodes created at format time should be
+    generous enough to anticipate near-term expected usage, with some room
+    for growth without the effort of adding storage.</para>
+    <para>By default, the ldiskfs file system used by Lustre servers to store
+    user-data objects and system data reserves 5% of space that cannot be used
+    by the Lustre file system.  Additionally, an ldiskfs Lustre file system
+    reserves up to 400 MB on each OST, and up to 4GB on each MDT for journal
+    use and a small amount of space outside the journal to store accounting
+    data. This reserved space is unusable for general storage. Thus, at least
+    this much space will be used per OST before any file object data is saved.
+    </para>
+    <para>With a ZFS backing filesystem for the MDT or OST,
      the space allocation for inodes and file data is dynamic, and inodes are
-    allocated as needed.  A minimum of 2kB of usable space (before RAID) is
-    needed for each inode, exclusive of other overhead such as directories,
-    internal log files, extended attributes, ACLs, etc.
+    allocated as needed.  A minimum of 4kB of usable space (before mirroring)
+    is needed for each inode, exclusive of other overhead such as directories,
+    internal log files, extended attributes, ACLs, etc.  ZFS also reserves
+    approximately 3% of the total storage space for internal and redundant
+    metadata, which is not usable by Lustre.
      Since the size of extended attributes and ACLs is highly dependent on
      kernel versions and site-specific policies, it is best to over-estimate
      the amount of space needed for the desired number of inodes, and any
-    excess space will be utilized to store more inodes.</para>
+    excess space will be utilized to store more inodes.
+    </para>
      <section>
        <title><indexterm>
            <primary>setup</primary>
@@ -147,10 +212,11 @@
            <primary>space</primary>
            <secondary>determining MGT requirements</secondary>
          </indexterm> Determining MGT Space Requirements</title>
-      <para>Less than 100 MB of space is required for the MGT. The size is determined by the number
-        of servers in the Lustre file system cluster(s) that are managed by the MGS.</para>
+      <para>Less than 100 MB of space is typically required for the MGT.
+      The size is determined by the total number of servers in the Lustre
+      file system cluster(s) that are managed by the MGS.</para>
      </section>
-    <section xml:id="dbdoclet.50438256_87676">
+    <section xml:id="dbdoclet.mdt_space_requirements">
          <title><indexterm>
            <primary>setup</primary>
            <secondary>MDT</secondary>
@@ -159,24 +225,88 @@
            <primary>space</primary>
            <secondary>determining MDT requirements</secondary>
          </indexterm> Determining MDT Space Requirements</title>
-      <para>When calculating the MDT size, the important factor to consider is the number of files
-        to be stored in the file system. This determines the number of inodes needed, which drives
-        the MDT sizing. To be on the safe side, plan for 2 KB per inode on the MDT, which is the
-        default value. Attached storage required for Lustre file system metadata is typically 1-2
-        percent of the file system capacity depending upon file size.</para>
-      <para>For example, if the average file size is 5 MB and you have 100 TB of usable OST space, then you can calculate the minimum number of inodes as follows:</para>
+      <para>When calculating the MDT size, the important factor to consider
+      is the number of files to be stored in the file system, each of which
+      requires at least 2 KiB of usable space on the MDT.  Since MDTs typically
+      use RAID-1+0 mirroring, the total storage needed will be double this.
+      </para>
+      <para>Please note that the actual used space per MDT depends on the number
+      of files per directory, the number of stripes per file, whether files
+      have ACLs or user xattrs, and the number of hard links per file.  The
+      storage required for Lustre file system metadata is typically 1-2
+      percent of the total file system capacity depending upon file size.
+      If the <xref linkend="dataonmdt"/> feature is in use for Lustre
+      2.11 or later, MDT space should typically be 5 percent or more of the
+      total space, depending on the distribution of small files within the
+      filesystem and the <literal>lod.*.dom_stripesize</literal> limit on
+      the MDT and file layout used.</para>
+      <para>For ZFS-based MDT filesystems, the number of inodes created on
+      the MDT and OST is dynamic, so there is less need to determine the
+      number of inodes in advance, though there still needs to be some thought
+      given to the total MDT space compared to the total filesystem size.</para>
+      <para>For example, if the average file size is 5 MiB and you have
+      500 TiB of usable OST space, then you can calculate the
+      <emphasis>minimum</emphasis> total number of inodes for MDTs and OSTs
+      as follows:</para>
        <informalexample>
-        <para>(100 TB * 1024 GB/TB * 1024 MB/GB) / 5 MB/inode = 20 million inodes</para>
+        <para>(500 TB * 1000000 MB/TB) / 5 MB/inode = 100M inodes</para>
        </informalexample>
-      <para>We recommend that you use at least twice the minimum number of inodes to allow for future expansion and allow for an average file size smaller than expected. Thus, the required space is:</para>
+      <para>It is recommended that the MDT(s) have at least twice the minimum
+      number of inodes to allow for future expansion and allow for an average
+      file size smaller than expected. Thus, the minimum space for ldiskfs
+      MDT(s) should be approximately:
+      </para>
        <informalexample>
-        <para>2 KB/inode * 40 million inodes = 80 GB</para>
+        <para>2 KiB/inode x 100 million inodes x 2 = 400 GiB ldiskfs MDT</para>
        </informalexample>
-      <para>If the average file size is small, 4 KB for example, the Lustre file system is not very
-        efficient as the MDT uses as much space as the OSTs. However, this is not a common
-        configuration for a Lustre environment.</para>
+      <para>For details about formatting options for ldiskfs MDT and OST file
+      systems, see <xref linkend="dbdoclet.ldiskfs_mdt_mkfs"/>.</para>
+      <note>
+        <para>If the median file size is very small, 4 KB for example, the
+        MDT would use as much space for each file as the space used on the OST,
+        so the use of Data-on-MDT is strongly recommended in that case.
+        The MDT space per inode should be increased correspondingly to
+        account for the extra data space usage for each inode:
+      <informalexample>
+        <para>6 KiB/inode x 100 million inodes x 2 = 1200 GiB ldiskfs MDT</para>
+      </informalexample>
+      </para>
+      </note>
+      <note>
+        <para>If the MDT has too few inodes, this can cause the space on the
+        OSTs to be inaccessible since no new files can be created.  In this
+        case, the <literal>lfs df -i</literal> and <literal>df -i</literal>
+        commands will limit the number of available inodes reported for the
+        filesystem to match the total number of available objects on the OSTs.
+        Be sure to determine the appropriate MDT size needed to support the
+        filesystem before formatting. It is possible to increase the
+        number of inodes after the file system is formatted, depending on the
+        storage.  For ldiskfs MDT filesystems the <literal>resize2fs</literal>
+        tool can be used if the underlying block device is on an LVM logical
+        volume and the underlying logical volume size can be increased.
+        For ZFS, new (mirrored) VDEVs can be added to the MDT pool to increase
+        the total space available for inode storage.
+        Inodes will be added approximately in proportion to space added.
+        </para>
+      </note>
+      <note>
+        <para>Note that the number of total and free inodes reported by
+        <literal>lfs df -i</literal> for ZFS MDTs and OSTs is estimated based
+        on the current average space used per inode.  When a ZFS filesystem is
+        first formatted, this free inode estimate will be very conservative
+        (low) due to the high ratio of directories to regular files created for
+        internal Lustre metadata storage, but this estimate will improve as
+        more files are created by regular users and the average file size will
+        better reflect actual site usage.
+        </para>
+      </note>
        <note>
-        <para>If the MDT is too small, this can cause all the space on the OSTs to be unusable. Be sure to determine the appropriate size of the MDT needed to support the file system before formatting the file system. It is difficult to increase the number of inodes after the file system is formatted.</para>
+        <para>Using the DNE remote directory feature
+        it is possible to increase the total number of inodes of a Lustre
+        filesystem, as well as increasing the aggregate metadata performance,
+        by configuring additional MDTs into the filesystem.  See
+        <xref linkend="lustremaint.adding_new_mdt"/> for details.
+        </para>
        </note>
      </section>
      <section remap="h3">
@@ -188,41 +318,58 @@
            <primary>space</primary>
            <secondary>determining OST requirements</secondary>
          </indexterm> Determining OST Space Requirements</title>
-      <para>For the OST, the amount of space taken by each object depends on the usage pattern of
-        the users/applications running on the system. The Lustre software defaults to a conservative
-        estimate for the object size (16 KB per object). If you are confident that the average file
-        size for your applications will be larger than this, you can specify a larger average file
-        size (fewer total inodes) to reduce file system overhead and minimize file system check
-        time. See <xref linkend="dbdoclet.50438256_53886"/> for more details.</para>
+      <para>For the OST, the amount of space taken by each object depends on
+      the usage pattern of the users/applications running on the system. The
+      Lustre software defaults to a conservative estimate for the average
+      object size (between 64 KiB per object for 10 GiB OSTs, and 1 MiB per
+      object for 16 TiB and larger OSTs). If you are confident that the average
+      file size for your applications will be different from this, you can
+      specify a different average file size (and therefore a different total
+      number of inodes for a given OST size) to reduce file system overhead
+      and minimize file system check time.
+      See <xref linkend="dbdoclet.ldiskfs_ost_mkfs"/> for more details.</para>
      </section>
    </section>
-  <section xml:id="dbdoclet.50438256_84701">
-      <title>
-          <indexterm><primary>file system</primary><secondary>formatting options</secondary></indexterm>
-          <indexterm><primary>setup</primary><secondary>file system</secondary></indexterm>
-          Setting File System Formatting Options</title>
-    <para>By default, the <literal>mkfs.lustre</literal> utility applies these options to the Lustre
-      backing file system used to store data and metadata in order to enhance Lustre file system
-      performance and scalability. These options include:</para>
+  <section xml:id="dbdoclet.ldiskfs_mkfs_opts">
+    <title>
+      <indexterm>
+        <primary>ldiskfs</primary>
+        <secondary>formatting options</secondary>
+      </indexterm>
+      <indexterm>
+        <primary>setup</primary>
+        <secondary>ldiskfs</secondary>
+      </indexterm>
+      Setting ldiskfs File System Formatting Options
+    </title>
+    <para>By default, the <literal>mkfs.lustre</literal> utility applies these
+    options to the Lustre backing file system used to store data and metadata
+    in order to enhance Lustre file system performance and scalability. These
+    options include:</para>
          <itemizedlist>
              <listitem>
-              <para><literal>flex_bg</literal> - When the flag is set to enable this
-          flexible-block-groups feature, block and inode bitmaps for multiple groups are aggregated
-          to minimize seeking when bitmaps are read or written and to reduce read/modify/write
-          operations on typical RAID storage (with 1 MB RAID stripe widths). This flag is enabled on
-          both OST and MDT file systems. On MDT file systems the <literal>flex_bg</literal> factor
-          is left at the default value of 16. On OSTs, the <literal>flex_bg</literal> factor is set
-          to 256 to allow all of the block or inode bitmaps in a single <literal>flex_bg</literal>
-          to be read or written in a single I/O on typical RAID storage.</para>
+              <para><literal>flex_bg</literal> - When the flag is set to enable
+              this flexible-block-groups feature, block and inode bitmaps for
+              multiple groups are aggregated to minimize seeking when bitmaps
+              are read or written and to reduce read/modify/write operations
+              on typical RAID storage (with 1 MiB RAID stripe widths). This flag
+              is enabled on both OST and MDT file systems. On MDT file systems
+              the <literal>flex_bg</literal> factor is left at the default value
+              of 16. On OSTs, the <literal>flex_bg</literal> factor is set
+              to 256 to allow all of the block or inode bitmaps in a single
+              <literal>flex_bg</literal> to be read or written in a single
+              1 MiB I/O, which is typical for RAID storage.</para>
              </listitem>
              <listitem>
-              <para><literal>huge_file</literal> - Setting this flag allows files on OSTs to be
-          larger than 2 TB in size.</para>
+              <para><literal>huge_file</literal> - Setting this flag allows
+              files on OSTs to be larger than 2 TiB in size.</para>
              </listitem>
              <listitem>
-              <para><literal>lazy_journal_init</literal> - This extended option is enabled to
-          prevent a full overwrite of the 400 MB journal that is allocated by default in a Lustre
-          file system, which reduces the file system format time.</para>
+              <para><literal>lazy_journal_init</literal> - This extended option
+              is enabled to avoid zeroing out the entire journal that is
+              allocated by default in a Lustre file system (up to 400 MiB for
+              OSTs, up to 4 GiB for MDTs), which reduces the formatting
+              time.</para>
              </listitem>
          </itemizedlist>
      <para>To override the default formatting options, use arguments to
@@ -230,38 +377,79 @@
      <screen>--mkfsoptions=&apos;backing fs options&apos;</screen>
      <para>For other <literal>mkfs.lustre</literal> options, see the Linux man page for
          <literal>mke2fs(8)</literal>.</para>
-    <section xml:id="dbdoclet.50438256_pgfId-1293228">
+    <section xml:id="dbdoclet.ldiskfs_mdt_mkfs">
        <title><indexterm>
            <primary>inodes</primary>
            <secondary>MDS</secondary>
          </indexterm><indexterm>
            <primary>setup</primary>
            <secondary>inodes</secondary>
-        </indexterm>Setting Formatting Options for an MDT</title>
-      <para>The number of inodes on the MDT is determined at format time based on the total size of
-        the file system to be created. The default <emphasis role="italic"
-          >bytes-per-inode</emphasis> ratio ("inode ratio") for an MDT is optimized at one inode for
-        every 2048 bytes of file system space. It is recommended that this value not be changed for
-        MDTs.</para>
-      <para>This setting takes into account the space needed for additional metadata, such as the
-        journal (up to 400 MB), bitmaps and directories, and a few files that the Lustre file system
-        uses to maintain cluster consistency.</para>
+        </indexterm>Setting Formatting Options for an ldiskfs MDT</title>
+      <para>The number of inodes on the MDT is determined at format time
+      based on the total size of the file system to be created. The default
+      <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
+      for an ldiskfs MDT is optimized at one inode for every 2560 bytes of file
+      system space.</para>
+      <para>This setting takes into account the space needed for additional
+      ldiskfs filesystem-wide metadata, such as the journal (up to 4 GiB),
+      bitmaps, and directories, as well as files that Lustre uses internally
+      to maintain cluster consistency.  There is additional per-file metadata
+      such as file layout for files with a large number of stripes, Access
+      Control Lists (ACLs), and user extended attributes.</para>
+      <para condition="l2B"> Starting in Lustre 2.11, the <xref linkend=
+      "dataonmdt.title"/> (DoM) feature allows storing small files on the MDT
+      to take advantage of high-performance flash storage, as well as reduce
+      space and network overhead.  If you are planning to use the DoM feature
+      with an ldiskfs MDT, it is recommended to <emphasis>increase</emphasis>
+      the bytes-per-inode ratio to have enough space on the MDT for small files,
+      as described below.
+      </para>
+      <para>It is possible to change the recommended default of 2560 bytes
+      per inode for an ldiskfs MDT when it is first formatted by adding the
+      <literal>--mkfsoptions="-i bytes-per-inode"</literal> option to
+      <literal>mkfs.lustre</literal>.  Decreasing the inode ratio tunable
+      <literal>bytes-per-inode</literal> will create more inodes for a given
+      MDT size, but will leave less space for extra per-file metadata and is
+      not recommended.  The inode ratio must always be strictly larger than
+      the MDT inode size, which is 1024 bytes by default.  It is recommended
+      to use an inode ratio at least 1536 bytes larger than the inode size to
+      ensure the MDT does not run out of space.  Increasing the inode ratio
+      to leave enough space for the most common file size (e.g. 5632 or 66560
+      bytes if 4 KiB or 64 KiB files are widely used) is recommended for DoM.</para>
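+      <para>As a hedged sketch of the DoM case above, an MDT where 4 KiB
+      files are common could be formatted with the 5632-byte ratio (4096
+      bytes of data plus the recommended 1536-byte margin). The remaining
+      <literal>mkfs.lustre</literal> arguments are site-specific and are
+      elided here:
+      <screen>[mds#] mkfs.lustre --mdt --mkfsoptions=&quot;-i $((4096 + 1536))&quot; ...</screen>
+      </para>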
+      <para>The size of the inode may be changed at format time by adding the
+      <literal>--stripe-count-hint=N</literal> option to have
+      <literal>mkfs.lustre</literal> automatically calculate a reasonable
+      inode size based on the default stripe count that will be used by the
+      filesystem, or directly by specifying the
+      <literal>--mkfsoptions="-I inode-size"</literal> option.  Increasing
+      the inode size will provide more space in the inode for a larger Lustre
+      file layout, ACLs, user and system extended attributes, SELinux and
+      other security labels, and other internal metadata and DoM data.  However,
+      if these features or other in-inode xattrs are not needed, a larger inode
+      size may hurt metadata performance as 2x, 4x, or 8x as much data would be
+      read or written for each MDT inode access.
+      </para>
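+      <para>The two alternatives above might look as follows. This is only
+      an illustrative sketch: the stripe count of 8 and the 2048-byte inode
+      size are assumed values, not recommendations, and the remaining
+      <literal>mkfs.lustre</literal> arguments are elided:
+      <screen>[mds#] mkfs.lustre --mdt --stripe-count-hint=8 ...
+[mds#] mkfs.lustre --mdt --mkfsoptions=&quot;-I 2048&quot; ...</screen>
+      </para>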
      </section>
-    <section xml:id="dbdoclet.50438256_53886">
+    <section xml:id="dbdoclet.ldiskfs_ost_mkfs">
        <title><indexterm>
            <primary>inodes</primary>
            <secondary>OST</secondary>
-        </indexterm>Setting Formatting Options for an OST</title>
-      <para>When formatting OST file systems, it is normally advantageous to take local file system
-        usage into account. When doing so, try to minimize the number of inodes on each OST, while
-        keeping enough margin for potential variations in future usage. This helps reduce the format
-        and file system check time and makes more space available for data.</para>
-      <para>The table below shows the default <emphasis role="italic">bytes-per-inode
-        </emphasis>ratio ("inode ratio") used for OSTs of various sizes when they are formatted. </para>
+        </indexterm>Setting Formatting Options for an ldiskfs OST</title>
+      <para>When formatting an OST file system, it can be beneficial
+      to take local file system usage into account, for example by running
+      <literal>df</literal> and <literal>df -i</literal> on a current filesystem
+      to get the used bytes and used inodes respectively, then computing the
+      average bytes-per-inode value. When deciding on the ratio for a new
+      filesystem, try to avoid having too many inodes on each OST, while keeping
+      enough margin to allow for future usage of smaller files. This helps
+      reduce the format and e2fsck time and makes more space available for data.
+      </para>
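+      <para>A minimal sketch of that calculation follows. The mount point
+      <literal>/mnt/lustre</literal> and the figures are hypothetical: the
+      used space (in KiB) and used inode count are read by hand from the
+      <literal>df</literal> output and substituted into the shell arithmetic:
+      <screen>[client#] df -k /mnt/lustre    # note the "Used" column, in KiB
+[client#] df -i /mnt/lustre    # note the "IUsed" column
+[client#] echo $((7516192768 * 1024 / 7340032))    # average bytes per inode
+1048576</screen>
+      In this assumed example the average is 1 MiB per inode, which could
+      then be compared against the default ratio for the planned OST size.
+      </para>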
+      <para>The table below shows the default
+      <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
+      used for OSTs of various sizes when they are formatted.</para>
        <para>
-        <table frame="all">
-          <title xml:id="settinguplustresystem.tab1">Inode Ratios Used for Newly Formatted
-            OSTs</title>
+        <table frame="all" xml:id="settinguplustresystem.tab1">
+          <title>Default Inode Ratios Used for Newly Formatted OSTs</title>
            <tgroup cols="3">
              <colspec colname="c1" colwidth="3*"/>
              <colspec colname="c2" colwidth="2*"/>
@@ -272,7 +460,7 @@
                    <para><emphasis role="bold">LUN/OST size</emphasis></para>
                  </entry>
                  <entry>
-                  <para><emphasis role="bold">Inode ratio</emphasis></para>
+                  <para><emphasis role="bold">Default Inode ratio</emphasis></para>
                  </entry>
                  <entry>
                    <para><emphasis role="bold">Total inodes</emphasis></para>
                  </entry>
                  <entry>
                    <para><emphasis role="bold">Total inodes</emphasis></para>
@@ -282,320 +470,395 @@
              <tbody>
                <row>
                  <entry>
-                  <para> over 10GB </para>
+                  <para>under 10GiB </para>
                  </entry>
                  <entry>
-                  <para> 1 inode/16KB </para>
+                  <para>1 inode/16KiB </para>
                  </entry>
                  <entry>
-                  <para> 640 - 655k </para>
+                  <para>640 - 655k </para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para> 10GB - 1TB </para>
+                  <para>10GiB - 1TiB </para>
                  </entry>
                  <entry>
-                  <para> 1 inode/68kiB </para>
+                  <para>1 inode/68KiB </para>
                  </entry>
                  <entry>
-                  <para> 153k - 15.7M </para>
+                  <para>153k - 15.7M </para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para> 1TB - 8TB </para>
+                  <para>1TiB - 8TiB </para>
                  </entry>
                  <entry>
-                  <para> 1 inode/256kB </para>
+                  <para>1 inode/256KiB </para>
                  </entry>
                  <entry>
-                  <para> 4.2M - 33.6M </para>
+                  <para>4.2M - 33.6M </para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para> over 8TB </para>
+                  <para>over 8TiB </para>
                  </entry>
                  <entry>
-                  <para> 1 inode/1MB </para>
+                  <para>1 inode/1MiB </para>
                  </entry>
                  <entry>
-                  <para> 8.4M - 134M </para>
+                  <para>8.4M - 268M </para>
                  </entry>
                </row>
              </tbody>
            </tgroup>
          </table>
        </para>
-      <para>In environments with few small files, the default inode ratio may result in far too many
-        inodes for the average file size. In this case, performance can be improved by increasing
-        the number of <emphasis role="italic">bytes-per-inode</emphasis>.To set the inode ratio, use
-        the <literal>-i</literal> argument to <literal>mkfs.lustre</literal> to specify the
-          <emphasis role="italic">bytes-per-inode</emphasis> value. </para>
+      <para>In environments with few small files, the default inode ratio
+      may result in far too many inodes for the average file size. In this
+      case, performance can be improved by increasing the number of
+      <emphasis role="italic">bytes-per-inode</emphasis>.  To set the inode
+      ratio, use the <literal>--mkfsoptions="-i <replaceable>bytes-per-inode</replaceable>"</literal>
+      argument to <literal>mkfs.lustre</literal> to specify the expected
+      average (mean) size of OST objects.  For example, to create an OST
+      with an expected average object size of 8 MiB run:
+      <screen>[oss#] mkfs.lustre --ost --mkfsoptions=&quot;-i $((8192 * 1024))&quot; ...</screen>
+      </para>
        <note>
-        <para>File system check time on OSTs is affected by a number of  variables in addition to
-          the number of inodes, including the size of the file system, the number of allocated
-          blocks, the distribution of allocated blocks on the disk, disk speed, CPU speed, and the
-          amount of RAM on the server. Reasonable file system check times are 5-30 minutes per
-          TB.</para>
+        <para>OSTs formatted with ldiskfs should preferably have fewer than
+        320 million objects per MDT, and up to a maximum of 4 billion inodes.
+        Specifying a very small bytes-per-inode ratio for a large OST that
+        exceeds this limit can either cause premature out-of-space errors and
+        prevent the full OST space from being used, or waste space and
+        slow down e2fsck more than necessary. The default inode ratios are
+        chosen to ensure the total number of inodes remains below this limit.
+        </para>
        </note>
-      <para>For more details about formatting MDT and OST file systems, see <xref
-          linkend="dbdoclet.50438208_51921"/>.</para>
-    </section>
-    <section remap="h3">
-      <title><indexterm>
-          <primary>setup</primary>
-          <secondary>limits</secondary>
-        </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
-          <primary>wide striping</primary>
-        </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
-          <primary>xattr</primary>
-          <secondary><emphasis role="italic">See</emphasis> wide striping</secondary>
-        </indexterm><indexterm>
-          <primary>large_xattr</primary>
-          <secondary>ea_inode</secondary>
-        </indexterm><indexterm>
-          <primary>wide striping</primary>
-          <secondary>large_xattr</secondary>
-          <tertiary>ea_inode</tertiary>
-        </indexterm>File and File System Limits</title>
-      <para><xref linkend="settinguplustresystem.tab2"/> describes file and file system size limits.
-        These limits are imposed by either the Lustre architecture or the Linux virtual file system
-        (VFS) and virtual memory subsystems. In a few cases, a limit is defined within the code and
-        can be changed by re-compiling the Lustre software (see <xref
-          linkend="installinglustrefromsourcecode"/>). In these cases, the indicated limit was used
-        for testing of the Lustre software. </para>
-      <table frame="all">
-        <title xml:id="settinguplustresystem.tab2">File and file system limits</title>
-        <tgroup cols="3">
-          <colspec colname="c1" colwidth="3*"/>
-          <colspec colname="c2" colwidth="2*"/>
-          <colspec colname="c3" colwidth="4*"/>
-          <thead>
-            <row>
-              <entry>
-                <para><emphasis role="bold">Limit</emphasis></para>
-              </entry>
-              <entry>
-                <para><emphasis role="bold">Value</emphasis></para>
-              </entry>
-              <entry>
-                <para><emphasis role="bold">Description</emphasis></para>
-              </entry>
-            </row>
-          </thead>
-          <tbody>
-            <row>
-              <entry>
-                <para> Maximum number of MDTs</para>
-              </entry>
-              <entry>
-                <para> 1</para>
-                <para condition='l24'>4096</para>
-              </entry>
-              <entry>
-                <para>The Lustre software release 2.3 and earlier allows a maximum of 1 MDT per file
-                  system, but a single MDS can host multiple MDTs, each one for a separate file
-                  system.</para>
-                <para condition="l24">The Lustre software release 2.4 and later requires one MDT for
-                  the filesystem root. Up to 4095 additional MDTs can be added to the file system and attached
-                  into the namespace with remote directories.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum number of OSTs</para>
-              </entry>
-              <entry>
-                <para> 8150</para>
-              </entry>
-              <entry>
-                <para>The maximum number of OSTs is a constant that can be changed at compile time.
-                  Lustre file systems with up to 4000 OSTs have been tested.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum OST size</para>
-              </entry>
-              <entry>
-                <para> 128TB (ldiskfs), 256TB (ZFS)</para>
-              </entry>
-              <entry>
-                <para>This is not a <emphasis>hard</emphasis> limit. Larger OSTs are possible but
-                  today typical production systems do not go beyond the stated limit per OST. </para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum number of clients</para>
-              </entry>
-              <entry>
-                <para> 131072</para>
-              </entry>
-              <entry>
-                <para>The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum size of a file system</para>
-              </entry>
-              <entry>
-                <para> 512 PB (ldiskfs), 1EB (ZFS)</para>
-              </entry>
-              <entry>
-                <para>Each OST or MDT on 64-bit kernel servers can have a file system up to the above limit. On 32-bit systems, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST on 32-bit kernel servers.</para>
-                <para>You can have multiple OST file systems on a single OSS node.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum stripe count</para>
-              </entry>
-              <entry>
-                <para> 2000</para>
-              </entry>
-              <entry>
-                <para>This limit is imposed by the size of the layout that needs to be stored on disk and sent in RPC requests, but is not a hard limit of the protocol.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum stripe size</para>
-              </entry>
-              <entry>
-                <para> &lt; 4 GB</para>
-              </entry>
-              <entry>
-                <para>The amount of data written to each object before moving on to next object.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Minimum stripe size</para>
-              </entry>
-              <entry>
-                <para> 64 KB</para>
-              </entry>
-              <entry>
-                <para>Due to the 64 KB PAGE_SIZE on some 64-bit machines, the minimum stripe size is set to 64 KB.</para>
-              </entry>
-            </row>
-            <row>              <entry>
-                <para> Maximum object size</para>              </entry>
-              <entry>
-                <para> 16TB (ldiskfs), 256TB (ZFS)</para>
-              </entry>
-              <entry>
-                <para>The amount of data that can be stored in a single object. An object
-                  corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies.  
-                  For ZFS the limit is the size of the underlying OST.
-                  Files can consist of up to 2000 stripes, each stripe can contain the maximum object size. </para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum <anchor xml:id="dbdoclet.50438256_marker-1290761" xreflabel=""/>file size</para>
-              </entry>
-              <entry>
-                <para> 16 TB on 32-bit systems</para>
-                <para>&#160;</para>
-                <para> 31.25 PB on 64-bit ldiskfs systems, 8EB on 64-bit ZFS systems</para>
-              </entry>
-              <entry>
-                <para>Individual files have a hard limit of nearly 16 TB on 32-bit systems imposed
-                  by the kernel memory subsystem. On 64-bit systems this limit does not exist.
-                  Hence, files can be 2^63 bits (8EB) in size if the backing filesystem can support large enough objects.</para>
-                <para>A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum number of files or subdirectories in a single directory</para>
-              </entry>
-              <entry>
-                <para> 10 million files (ldiskfs), 2^48 (ZFS)</para>
-              </entry>
-              <entry>
-                <para>The Lustre software uses the ldiskfs hashed directory code, which has a limit
-                  of about 10 million files depending on the length of the file name. The limit on
-                  subdirectories is the same as the limit on regular files.</para>
-                <para>Lustre file systems are tested with ten million files in a single
-                  directory.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum number of files in the file system</para>
-              </entry>
-              <entry>
-                <para> 4 billion (ldiskfs), 256 trillion (ZFS)</para>
-                <para condition='l24'>4096 times the per-MDT limit</para>
-              </entry>
-              <entry>
-                <para>The ldiskfs file system imposes an upper limit of 4 billion inodes. By default, the MDS file system is formatted with 2KB of space per inode, meaning 1 billion inodes per file system of 2 TB.</para>
-                <para>This can be increased initially, at the time of MDS file system creation. For more information, see <xref linkend="settinguplustresystem"/>.</para>
-                               <para condition="l24">Each additional MDT can hold up to the above maximum number of additional files, depending
-                  on available space and the distribution directories and files in the file
-                  system.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum length of a filename</para>
-              </entry>
-              <entry>
-                <para> 255 bytes (filename)</para>
-              </entry>
-              <entry>
-                <para>This limit is 255 bytes for a single filename, the same as the limit in the underlying file systems.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum length of a pathname</para>
-              </entry>
-              <entry>
-                <para> 4096 bytes (pathname)</para>
-              </entry>
-              <entry>
-                <para>The Linux VFS imposes a full pathname length of 4096 bytes.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para> Maximum number of open files for a Lustre file system</para>
-              </entry>
-              <entry>
-                <para> No limit</para>
-              </entry>
-              <entry>
-                <para>The Lustre software does not impose a maximum for the number of open files,
-                  but the practical limit depends on the amount of RAM on the MDS. No
-                  &quot;tables&quot; for open files exist on the MDS, as they are only linked in a
-                  list to a given client&apos;s export. Each client process probably has a limit of
-                  several thousands of open files which depends on the ulimit.</para>
-              </entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </table>
-      <para>&#160;</para>
        <note>
-        <para condition="l22">In Lustre software releases prior to release 2.2, the maximum stripe
-          count for a single file was limited to 160 OSTs. In Lustre software release 2.2, the large
-            <literal>xattr</literal> feature ("wide striping") was added to support up to 2000 OSTs.
-          This feature is disabled by default at <literal>mkfs.lustre</literal> time. In order to
-          enable this feature, set the "<literal>-O large_xattr</literal>" or "<literal>-O ea_inode</literal>"
-          option on the MDT either by using <literal>--mkfsoptions</literal> at format time or by using
-            <literal>tune2fs</literal>. Using either "<literal>large_xattr</literal>" or "<literal>ea_inode</literal>"
-          results in "<literal>ea_inode</literal>" in the file system feature list.</para>
+        <para>File system check time on OSTs is affected by a number of
+        variables in addition to the number of inodes, including the size of
+        the file system, the number of allocated blocks, the distribution of
+        allocated blocks on the disk, disk speed, CPU speed, and the amount
+        of RAM on the server. Reasonable file system check times for valid
+        filesystems are 5-30 minutes per TiB, but may increase significantly
+        if substantial errors are detected and need to be repaired.</para>
        </note>
+      <para>For further details about optimizing MDT and OST file systems,
+      see <xref linkend="dbdoclet.ldiskfs_raid_opts"/>.</para>
      </section>
    </section>
-  <section xml:id="dbdoclet.50438256_26456">
+  <section remap="h3">
+    <title><indexterm>
+        <primary>setup</primary>
+        <secondary>limits</secondary>
+      </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
+        <primary>wide striping</primary>
+      </indexterm><indexterm xmlns:xi="http://www.w3.org/2001/XInclude">
+        <primary>xattr</primary>
+        <secondary><emphasis role="italic">See</emphasis> wide striping</secondary>
+      </indexterm><indexterm>
+        <primary>large_xattr</primary>
+        <secondary>ea_inode</secondary>
+      </indexterm><indexterm>
+        <primary>wide striping</primary>
+        <secondary>large_xattr</secondary>
+        <tertiary>ea_inode</tertiary>
+      </indexterm>File and File System Limits</title>
+
+      <para><xref linkend="settinguplustresystem.tab2"/> describes
+      current known limits of Lustre.  These limits may be imposed by either
+      the Lustre architecture or the Linux virtual file system (VFS) and
+      virtual memory subsystems. In a few cases, a limit is defined within
+      the Lustre code based on tested values and could be changed by editing
+      and re-compiling the Lustre software.  In these cases, the indicated
+      limit was used for testing of the Lustre software.</para>
+
+    <table frame="all" xml:id="settinguplustresystem.tab2">
+      <title>File and file system limits</title>
+      <tgroup cols="3">
+        <colspec colname="c1" colwidth="3*"/>
+        <colspec colname="c2" colwidth="2*"/>
+        <colspec colname="c3" colwidth="4*"/>
+        <thead>
+          <row>
+            <entry>
+              <para><emphasis role="bold">Limit</emphasis></para>
+            </entry>
+            <entry>
+              <para><emphasis role="bold">Value</emphasis></para>
+            </entry>
+            <entry>
+              <para><emphasis role="bold">Description</emphasis></para>
+            </entry>
+          </row>
+        </thead>
+        <tbody>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_mdt_count" xreflabel=""/>Maximum number of MDTs</para>
+            </entry>
+            <entry>
+              <para>256</para>
+            </entry>
+            <entry>
+              <para>A single MDS can host one or more MDTs, either for separate
+              filesystems, or aggregated into a single namespace. Each
+              filesystem requires a separate MDT for the filesystem root
+             directory.
+              Up to 255 more MDTs can be added to the filesystem and are
+              attached into the filesystem namespace with creation of DNE
+              remote or striped directories.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_ost_count" xreflabel=""/>Maximum number of OSTs</para>
+            </entry>
+            <entry>
+              <para>8150</para>
+            </entry>
+            <entry>
+              <para>The maximum number of OSTs is a constant that can be
+              changed at compile time.  Lustre file systems with up to 4000
+              OSTs have been configured in the past.  Multiple OST targets
+              can be configured on a single OSS node.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_ost_size" xreflabel=""/>Maximum OST size</para>
+            </entry>
+            <entry>
+              <para>1024TiB (ldiskfs), 1024TiB (ZFS)</para>
+            </entry>
+            <entry>
+              <para>This is not a <emphasis>hard</emphasis> limit. Larger
+              OSTs are possible, but most production systems do not
+              typically go beyond the stated limit per OST because Lustre
+              can add capacity and performance with additional OSTs, and
+              having more OSTs improves aggregate I/O performance,
+              minimizes contention, and allows parallel recovery (e2fsck
+              for ldiskfs OSTs, scrub for ZFS OSTs).
+              </para>
+              <para>
+              With 32-bit kernels, due to page cache limits, 16 TiB is the
+              maximum block device size, which in turn limits the
+              size of an OST.  It is <emphasis>strongly</emphasis> recommended
+              to run Lustre clients and servers with 64-bit kernels.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_client_count" xreflabel=""/>Maximum number of clients</para>
+            </entry>
+            <entry>
+              <para>131072</para>
+            </entry>
+            <entry>
+              <para>The maximum number of clients is a constant that can
+              be changed at compile time. Up to 30000 clients have been
+              used in production accessing a single filesystem.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_filesysem_size" xreflabel=""/>Maximum size of a single file system</para>
+            </entry>
+            <entry>
+              <para>2EiB or larger</para>
+            </entry>
+            <entry>
+              <para>Each OST can have a file system up to the "Maximum OST
+              size" limit, and up to the maximum number of OSTs can be
+              combined into a single filesystem.
+              </para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_stripe_count" xreflabel=""/>Maximum stripe count</para>
+            </entry>
+            <entry>
+              <para>2000</para>
+            </entry>
+            <entry>
+              <para>This limit is imposed by the size of the layout that
+              needs to be stored on disk and sent in RPC requests, but is
+              not a hard limit of the protocol. The number of OSTs in the
+              filesystem can exceed the stripe count, but this is the maximum
+              number of OSTs on which a <emphasis>single file</emphasis>
+              can be striped.</para>
+              <note condition='l2D'><para>Before 2.13, the maximum stripe
+              count for a <emphasis>single file</emphasis> was limited to
+              160 OSTs by default on ldiskfs MDTs.  In order to
+              increase the maximum file stripe count, use
+              <literal>--mkfsoptions="-O ea_inode"</literal> when formatting the MDT,
+              or use <literal>tune2fs -O ea_inode</literal> to enable it after the
+              MDT has been formatted.</para>
+              </note>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_stripe_size" xreflabel=""/>Maximum stripe size</para>
+            </entry>
+            <entry>
+              <para>&lt; 4 GiB</para>
+            </entry>
+            <entry>
+              <para>The amount of data written to each object before moving
+              on to the next object.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.min_stripe_size" xreflabel=""/>Minimum stripe size</para>
+            </entry>
+            <entry>
+              <para>64 KiB</para>
+            </entry>
+            <entry>
+              <para>Due to the use of 64 KiB PAGE_SIZE on some CPU
+              architectures such as ARM and POWER, the minimum stripe
+              size is 64 KiB so that a single page is not split over
+              multiple servers.  This is also the minimum Data-on-MDT
+             component size that can be specified.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_object_size" xreflabel=""/>Maximum single object size</para>
+            </entry>
+            <entry>
+              <para>16TiB (ldiskfs), 256TiB (ZFS)</para>
+            </entry>
+            <entry>
+              <para>The amount of data that can be stored in a single object.
+              An object corresponds to a stripe. The ldiskfs limit of 16 TiB
+              for a single object applies.  For ZFS the limit is the size of
+              the underlying OST.  Files can consist of up to 2000 stripes,
+              and each stripe can be up to the maximum object size. </para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_file_size" xreflabel=""/>Maximum file size</para>
+            </entry>
+            <entry>
+              <para>16 TiB on 32-bit systems</para>
+              <para>&#160;</para>
+              <para>31.25 PiB on 64-bit ldiskfs systems,
+              8EiB on 64-bit ZFS systems</para>
+            </entry>
+            <entry>
+              <para>Individual files have a hard limit of nearly 16 TiB on
+              32-bit systems imposed by the kernel memory subsystem. On
+              64-bit systems this limit does not exist.  Hence, files can
+              be 2^63 bytes (8EiB) in size if the backing filesystem can
+              support large enough objects and/or the files are sparse.</para>
+              <para>A single file can have a maximum of 2000 stripes, which
+              gives an upper single file data capacity of 31.25 PiB for 64-bit
+              ldiskfs systems. The actual amount of data that can be stored
+              in a file depends upon the amount of free space in each OST
+              on which the file is striped.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_directory_size" xreflabel=""/>Maximum number of files or subdirectories in a single directory</para>
+            </entry>
+            <entry>
+              <para>600M-3.8B files (ldiskfs), 16T (ZFS)</para>
+            </entry>
+            <entry>
+              <para>The Lustre software uses the ldiskfs hashed directory
+              code, which has a limit of at least 600 million files, depending
+              on the length of the file name. The limit on subdirectories
+              is the same as the limit on regular files.</para>
+              <note condition='l28'><para>Starting in the 2.8 release it is
+              possible to exceed this limit by striping a single directory
+              over multiple MDTs with the <literal>lfs mkdir -c</literal>
+              command, which increases the single directory limit by a
+              factor of the number of directory stripes used.</para></note>
+              <note condition='l2E'><para>Starting in the 2.14 release, the
+              <literal>large_dir</literal> feature of ldiskfs is enabled by
+              default to allow directories with more than 10M entries.  In
+              the 2.12 release, the <literal>large_dir</literal> feature was
+              present but not enabled by default.</para></note>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_file_count" xreflabel=""/>Maximum number of files in the file system</para>
+            </entry>
+            <entry>
+              <para>4 billion (ldiskfs), 256 trillion (ZFS) <emphasis>per MDT</emphasis></para>
+            </entry>
+            <entry>
+              <para>The ldiskfs filesystem imposes an upper limit of
+              4 billion inodes per filesystem. By default, the MDT
+              filesystem is formatted with one inode per 2KB of space,
+              meaning 512 million inodes per TiB of MDT space. This can be
+              increased initially at the time of MDT filesystem creation.
+              For more information, see
+              <xref linkend="settinguplustresystem"/>.</para>
+              <para>The ZFS filesystem dynamically allocates
+              inodes and does not have a fixed ratio of inodes per unit of MDT
+              space, but consumes approximately 4KiB of mirrored space per
+              inode, depending on the configuration.</para>
+              <para>Each additional MDT can hold up to the
+              above maximum number of additional files, depending on
+              available space and the distribution of directories and files
+              in the filesystem.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_filename_size" xreflabel=""/>Maximum length of a filename</para>
+            </entry>
+            <entry>
+              <para>255 bytes (filename)</para>
+            </entry>
+            <entry>
+              <para>This limit is 255 bytes for a single filename, the
+              same as the limit in the underlying filesystems.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_pathname_size" xreflabel=""/>Maximum length of a pathname</para>
+            </entry>
+            <entry>
+              <para>4096 bytes (pathname)</para>
+            </entry>
+            <entry>
+              <para>The Linux VFS imposes a full pathname length of 4096 bytes.</para>
+            </entry>
+          </row>
+          <row>
+            <entry>
+              <para><anchor xml:id="dbdoclet.max_open_files" xreflabel=""/>Maximum number of open files for a Lustre file system</para>
+            </entry>
+            <entry>
+              <para>No limit</para>
+            </entry>
+            <entry>
+              <para>The Lustre software does not impose a maximum for the number
+              of open files, but the practical limit depends on the amount of
+              RAM on the MDS. No &quot;tables&quot; for open files exist on the
+              MDS, as they are only linked in a list to a given client&apos;s
+              export. Each client process is limited to several
+              thousand open files, depending on its ulimit.</para>
+            </entry>
+          </row>
+        </tbody>
+      </tgroup>
+    </table>
+    <para>&#160;</para>
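    <para>Some values in the table follow directly from other entries. The
    sketch below reproduces two of them, assuming binary units throughout
    (1 TiB = 2^40 bytes). The variable names are illustrative and are not
    Lustre identifiers.</para>
    <programlisting language="python">
```python
KIB = 1024
TIB = 1024 ** 4
PIB = 1024 ** 5

max_stripes = 2000          # "Maximum stripe count" from the table
ldiskfs_object_tib = 16     # "Maximum single object size" for ldiskfs

# Maximum ldiskfs file size: every stripe filled to the object size limit.
max_file_pib = max_stripes * ldiskfs_object_tib * TIB / PIB

# Default ldiskfs MDT format: one inode per 2 KiB of MDT space.
inodes_per_tib = TIB // (2 * KIB)

print(max_file_pib)      # prints 31.25
print(inodes_per_tib)    # prints 536870912 (512 Mi inodes per TiB)
```
    </programlisting>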
+  </section>
+  <section xml:id="dbdoclet.mds_oss_memory">
      <title><indexterm><primary>setup</primary><secondary>memory</secondary></indexterm>Determining Memory Requirements</title>
      <para>This section describes the memory requirements for each Lustre file system component.</para>
      <section remap="h3">
@@ -618,79 +881,126 @@
            <para>Load placed on server</para>
          </listitem>
        </itemizedlist>
-      <para>The amount of memory used by the MDS is a function of how many clients are on the system, and how many files they are using in their working set. This is driven, primarily, by the number of locks a client can hold at one time. The number of locks held by clients varies by load and memory availability on the server. Interactive clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is approximately 2 KB per file, including the Lustre distributed lock manager (DLM) lock and kernel data structures for the files currently in use. Having file data in cache can improve metadata performance by a factor of 10x or more compared to reading it from disk.</para>
+      <para>The amount of memory used by the MDS is a function of how many clients are on
+      the system, and how many files they are using in their working set. This is driven,
+      primarily, by the number of locks a client can hold at one time. The number of locks
+      held by clients varies by load and memory availability on the server. Interactive
+      clients can hold in excess of 10,000 locks at times. On the MDS, memory usage is
+      approximately 2 KB per file, including the Lustre distributed lock manager (LDLM)
+      lock and kernel data structures for the files currently in use. Having file data
+      in cache can improve metadata performance by a factor of 10x or more compared to
+      reading it from storage.</para>
        <para>MDS memory requirements include:</para>
        <itemizedlist>
          <listitem>
-          <para><emphasis role="bold">File system metadata</emphasis> : A reasonable amount of RAM needs to be available for file system metadata. While no hard limit can be placed on the amount of file system metadata, if more RAM is available, then the disk I/O is needed less often to retrieve the metadata.</para>
+          <para><emphasis role="bold">File system metadata</emphasis>:
+         A reasonable amount of RAM needs to be available for file system metadata.
+         While no hard limit can be placed on the amount of file system metadata,
+         if more RAM is available, then disk I/O is needed less often to retrieve
+         the metadata.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">Network transport</emphasis> : If you are using TCP or other network transport that uses system memory for send/receive buffers, this memory requirement must also be taken into consideration.</para>
+          <para><emphasis role="bold">Network transport</emphasis>:
+         If you are using TCP or other network transport that uses system memory for
+         send/receive buffers, this memory requirement must also be taken into
+         consideration.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">Journal size</emphasis> : By default, the journal size is 400 MB for each Lustre ldiskfs file system. This can pin up to an equal amount of RAM on the MDS node per file system.</para>
+          <para><emphasis role="bold">Journal size</emphasis>:
+         By default, the journal size is 4096 MB for each MDT ldiskfs file system.
+         This can pin up to an equal amount of RAM on the MDS node per file system.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">Failover configuration</emphasis> : If the MDS node will be used for failover from another node, then the RAM for each journal should be doubled, so the backup server can handle the additional load if the primary server fails.</para>
+          <para><emphasis role="bold">Failover configuration</emphasis>:
+         If the MDS node will be used for failover from another node, then the RAM
+         for each journal should be doubled, so the backup server can handle the
+         additional load if the primary server fails.</para>
          </listitem>
        </itemizedlist>
        <section remap="h4">
          <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>MDS</tertiary></indexterm>Calculating MDS Memory Requirements</title>
-        <para>By default, 400 MB are used for the file system journal. Additional RAM is used for caching file data for the larger working set, which is not actively in use by clients but should be kept &quot;hot&quot; for improved access times. Approximately 1.5 KB per file is needed to keep a file in cache without a lock.</para>
-        <para>For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes, and a 2 million file working set (of which 400,000 files are cached on the clients):</para>
+        <para>By default, 4096 MB are used for the ldiskfs filesystem journal. Additional
+       RAM is used for caching file data for the larger working set, which is not
+       actively in use by clients but should be kept &quot;hot&quot; for improved
+       access times. Approximately 1.5 KB per file is needed to keep a file in cache
+       without a lock.</para>
+        <para>For example, for a single MDT on an MDS with 1,024 clients, 12 interactive
+       login nodes, and a 6 million file working set (of which 4M files are cached
+       on the clients):</para>
          <informalexample>
-          <para>Operating system overhead = 512 MB</para>
-          <para>File system journal = 400 MB</para>
-          <para>1000 * 4-core clients * 100 files/core * 2kB = 800 MB</para>
-          <para>16 interactive clients * 10,000 files * 2kB = 320 MB</para>
-          <para>1,600,000 file extra working set * 1.5kB/file = 2400 MB</para>
+          <para>Operating system overhead = 1024 MB</para>
+          <para>File system journal = 4096 MB</para>
+          <para>1024 * 4-core clients * 1024 files/core * 2kB = 8192 MB</para>
+          <para>12 interactive clients * 100,000 files * 2kB = 2400 MB</para>
+          <para>2M file extra working set * 1.5kB/file = 3072 MB</para>
          </informalexample>
-        <para>Thus, the minimum requirement for a system with this configuration is at least 4 GB of RAM. However, additional memory may significantly improve performance.</para>
-        <para>For directories containing 1 million or more files, more memory may provide a significant benefit. For example, in an environment where clients randomly access one of 10 million files, having extra memory for the cache significantly improves performance.</para>
+        <para>Thus, the minimum requirement for an MDT with this configuration is about
+       18 GB of RAM (18784 MB in total). Additional memory may significantly improve
+       performance.</para>
+        <para>For directories containing 1 million or more files, more memory can provide
+       a significant benefit. For example, in an environment where clients randomly
+       access one of 10 million files, having extra memory for the cache significantly
+       improves performance.</para>
        </section>
      </section>
      <section remap="h3">
        <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>OSS Memory Requirements</title>
-      <para>When planning the hardware for an OSS node, consider the memory usage of several
-        components in the Lustre file system (i.e., journal, service threads, file system metadata,
-        etc.). Also, consider the effect of the OSS read cache feature, which consumes memory as it
-        caches data on the OSS node.</para>
-      <para>In addition to the MDS memory requirements mentioned in <xref linkend="dbdoclet.50438256_87676"/>, the OSS requirements include:</para>
+      <para>When planning the hardware for an OSS node, consider the memory usage of
+      several components in the Lustre file system (e.g., journal, service threads,
+      and file system metadata). Also, consider the effect of the OSS read cache
+      feature, which consumes memory as it caches data on the OSS node.</para>
+      <para>In addition to the MDS memory requirements mentioned above,
+      the OSS requirements also include:</para>
        <itemizedlist>
          <listitem>
-          <para><emphasis role="bold">Service threads</emphasis> : The service threads on the OSS node pre-allocate a 4 MB I/O buffer for each ost_io service thread, so these buffers do not need to be allocated and freed for each I/O request.</para>
+          <para><emphasis role="bold">Service threads</emphasis>:
+         The service threads on the OSS node pre-allocate an RPC-sized I/O buffer
+         for each ost_io service thread, so these buffers do not need to be allocated
+         and freed for each I/O request.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">OSS read cache</emphasis> : OSS read cache provides read-only
-            caching of data on an OSS, using the regular Linux page cache to store the data. Just
-            like caching from a regular file system in the Linux operating system, OSS read cache
-            uses as much physical memory as is available.</para>
+          <para><emphasis role="bold">OSS read cache</emphasis>:
+         OSS read cache provides read-only caching of data on an OSS, using the regular
+         Linux page cache to store the data. Just like caching from a regular file
+         system in the Linux operating system, OSS read cache uses as much physical
+         memory as is available.</para>
          </listitem>
        </itemizedlist>
-      <para>The same calculation applies to files accessed from the OSS as for the MDS, but the load is distributed over many more OSSs nodes, so the amount of memory required for locks, inode cache, etc. listed under MDS is spread out over the OSS nodes.</para>
-      <para>Because of these memory requirements, the following calculations should be taken as determining the absolute minimum RAM required in an OSS node.</para>
+      <para>The same calculation applies to files accessed from the OSS as for the MDS,
+      but the load is distributed over many more OSS nodes, so the amount of memory
+      required for locks, inode cache, etc. listed under MDS is spread out over the
+      OSS nodes.</para>
+      <para>Because of these memory requirements, the following calculations should be
+      taken as determining the absolute minimum RAM required in an OSS node.</para>
        <section remap="h4">
          <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>Calculating OSS Memory Requirements</title>
-        <para>The minimum recommended RAM size for an OSS with two OSTs is computed below:</para>
+        <para>The minimum recommended RAM size for an OSS with eight OSTs is:</para>
          <informalexample>
-          <para>Ethernet/TCP send/receive buffers (4 MB * 512 threads) = 2048 MB</para>
-          <para>400 MB journal size * 2 OST devices = 800 MB</para>
-          <para>1.5 MB read/write per OST IO thread * 512 threads = 768 MB</para>
-          <para>600 MB file system read cache * 2 OSTs = 1200 MB</para>
-          <para>1000 * 4-core clients * 100 files/core * 2kB = 800MB</para>
-          <para>16 interactive clients * 10,000 files * 2kB = 320MB</para>
-          <para>1,600,000 file extra working set * 1.5kB/file = 2400MB</para>
-          <para> DLM locks + file system metadata TOTAL = 3520MB</para>
-          <para>Per OSS DLM locks + file system metadata = 3520MB/6 OSS = 600MB (approx.)</para>
-          <para>Per OSS RAM minimum requirement = 4096MB (approx.)</para>
+          <para>Linux kernel and userspace daemon memory = 1024 MB</para>
+          <para>Network send/receive buffers (16 MB * 512 threads) = 8192 MB</para>
+          <para>1024 MB ldiskfs journal size * 8 OST devices = 8192 MB</para>
+          <para>16 MB read/write buffer per OST IO thread * 512 threads = 8192 MB</para>
+          <para>2048 MB file system read cache * 8 OSTs = 16384 MB</para>
+          <para>1024 * 4-core clients * 1024 files/core * 2kB/file = 8192 MB</para>
+          <para>12 interactive clients * 100,000 files * 2kB/file = 2400 MB</para>
+          <para>2M file extra working set * 2kB/file = 4096 MB</para>
+          <para>DLM locks + file cache TOTAL = 31072 MB</para>
+          <para>Per OSS DLM locks + file system metadata = 31072 MB/4 OSS = 7768 MB (approx.)</para>
+          <para>Per OSS RAM minimum requirement = 32 GB (approx.)</para>
          </informalexample>
-        <para>This consumes about 1,400 MB just for the pre-allocated buffers, and an additional 2 GB for minimal file system and kernel usage. Therefore, for a non-failover configuration, the minimum RAM would be 4 GB for an OSS node with two OSTs. Adding additional memory on the OSS will improve the performance of reading smaller, frequently-accessed files.</para>
-        <para>For a failover configuration, the minimum RAM would be at least 6 GB. For 4 OSTs on each OSS in a failover configuration 10GB of RAM is reasonable. When the OSS is not handling any failed-over OSTs the extra RAM will be used as a read cache.</para>
-        <para>As a reasonable rule of thumb, about 2 GB of base memory plus 1 GB per OST can be used. In failover configurations, about 2 GB per OST is needed.</para>
+        <para>This consumes about 16 GB just for pre-allocated buffers, and an
+       additional 1 GB for minimal file system and kernel usage. Therefore, for a
+       non-failover configuration, the minimum RAM would be about 32 GB for an OSS node
+       with eight OSTs. Adding memory on the OSS will improve the performance
+       of reading smaller, frequently-accessed files.</para>
+        <para>For a failover configuration, the minimum RAM would be at least 48 GB,
+       as some of the memory is per-node. When the OSS is not handling any failed-over
+       OSTs the extra RAM will be used as a read cache.</para>
+        <para>As a reasonable rule of thumb, about 8 GB of base memory plus 3 GB per OST
+       can be used. In failover configurations, about 6 GB per OST is needed.</para>
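        <para>The OSS example above can be re-derived with a short script. This
       is a sketch under the example's own figures: the function and parameter
       names are hypothetical, and the lock and file cache amounts (8192, 2400
       and 4096 MB) are copied from the example rather than measured.</para>
        <programlisting language="python">
```python
def oss_min_ram_mb(osts=8, io_threads=512, rpc_mb=16, journal_mb=1024,
                   read_cache_mb=2048, oss_nodes=4):
    """Sum the line items of the OSS example; all values are in MB."""
    kernel = 1024                        # Linux kernel and userspace daemons
    net_buffers = rpc_mb * io_threads    # network send/receive buffers
    journals = journal_mb * osts         # one ldiskfs journal per OST
    io_buffers = rpc_mb * io_threads     # read/write buffer per OST IO thread
    # Items spread across the OSS nodes: read cache, client lock and file
    # state (8192 + 2400 MB) and the extra working set (4096 MB).
    shared = read_cache_mb * osts + 8192 + 2400 + 4096
    return kernel + net_buffers + journals + io_buffers + shared / oss_nodes


def rule_of_thumb_gb(osts, base_gb=8, per_ost_gb=3):
    """Non-failover rule of thumb: base memory plus a fixed amount per OST."""
    return base_gb + per_ost_gb * osts


total_mb = oss_min_ram_mb()
print(f"{total_mb:.0f} MB, about {total_mb / 1024:.1f} GB")  # 33368 MB, about 32.6 GB
print(rule_of_thumb_gb(8))                                   # 32
```
        </programlisting>
        <para>Both results land close to the 32 GB minimum quoted for an OSS
       with eight OSTs in a non-failover configuration.</para>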
        </section>
      </section>
    </section>
-  <section xml:id="dbdoclet.50438256_78272">
+  <section xml:id="dbdoclet.network_considerations">
      <title><indexterm>
          <primary>setup</primary>
          <secondary>network</secondary>
@@ -712,7 +1022,7 @@
        </listitem>
      </itemizedlist>
      <para>Lustre networks and routing are configured and managed by specifying parameters to the
-      Lustre networking (<literal>lnet</literal>) module in
+      Lustre Networking (<literal>lnet</literal>) module in
          <literal>/etc/modprobe.d/lustre.conf</literal>.</para>
      <para>To prepare to configure Lustre networking, complete the following steps:</para>
      <orderedlist>
@@ -730,21 +1040,32 @@
        <listitem>
          <para><emphasis role="bold">If routing is needed, identify the nodes to be used to route traffic between networks.</emphasis></para>
          <para>If you are using multiple network types, then you will need a router. Any node with
-          appropriate interfaces can route Lustre networking (LNET) traffic between different
+          appropriate interfaces can route Lustre networking (LNet) traffic between different
            network hardware types or topologies --the node may be a server, a client, or a standalone
-          router. LNET can route messages between different network types (such as
+          router. LNet can route messages between different network types (such as
            TCP-to-InfiniBand) or across different topologies (such as bridging two InfiniBand or
            TCP/IP networks). Routing will be configured in <xref linkend="configuringlnet"/>.</para>
        </listitem>
        <listitem>
-        <para><emphasis role="bold">Identify the network interfaces to include in or exclude from LNET. </emphasis>
-    </para>
-        <para>If not explicitly specified, LNET uses either the first available interface or a pre-defined default for a given network type. Interfaces that LNET should not use (such as an administrative network or IP-over-IB), can be excluded.</para>
-        <para>Network interfaces to be used or excluded will be specified using the lnet kernel module parameters networks and <literal>ip2netsas</literal> described in <xref linkend="configuringlnet"/>.</para>
+        <para><emphasis role="bold">Identify the network interfaces to include
+       in or exclude from LNet.</emphasis></para>
+        <para>If not explicitly specified, LNet uses either the first available
+       interface or a pre-defined default for a given network type. Interfaces
+       that LNet should not use (such as an administrative network or
+       IP-over-IB) can be excluded.</para>
+        <para>Network interfaces to be used or excluded will be specified using
+       the lnet kernel module parameters <literal>networks</literal> and
+       <literal>ip2nets</literal> as described in
+       <xref linkend="configuringlnet"/>.</para>
        </listitem>
        <listitem>
-        <para><emphasis role="bold">To ease the setup of networks with complex network configurations, determine a cluster-wide module configuration.</emphasis></para>
-        <para>For large clusters, you can configure the networking setup for all nodes by using a single, unified set of parameters in the <literal>lustre.conf</literal> file on each node. Cluster-wide configuration is described in <xref linkend="configuringlnet"/>.</para>
+        <para><emphasis role="bold">To ease the setup of networks with complex
+       network configurations, determine a cluster-wide module configuration.
+       </emphasis></para>
+        <para>For large clusters, you can configure the networking setup for
+       all nodes by using a single, unified set of parameters in the
+       <literal>lustre.conf</literal> file on each node. Cluster-wide
+       configuration is described in <xref linkend="configuringlnet"/>.</para>
        </listitem>
      </orderedlist>
      <note>
@@ -752,3 +1073,6 @@
      </note>
    </section>
  </chapter>
+<!--
+  vim:expandtab:shiftwidth=2:tabstop=8:
+  -->
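
For illustration only: the added text above refers to the lnet module parameters networks and ip2nets in /etc/modprobe.d/lustre.conf. A minimal sketch of such a file might look like the following. The interface names (eth1, ib0), the network names (tcp0, o2ib0), and the address patterns are hypothetical and not taken from this patch; the sketch also assumes that only one of the two parameters is set on a given node.

```
# /etc/modprobe.d/lustre.conf (hypothetical example)

# Option A: list the interfaces LNet should use explicitly.
# Interfaces not listed here (for example, an administrative eth0)
# are excluded from LNet.
options lnet networks=tcp0(eth1),o2ib0(ib0)

# Option B: a cluster-wide rule set. Each node selects the networks
# whose IP address pattern matches one of its local addresses, so the
# same file can be installed on every node.
# options lnet ip2nets="tcp0(eth1) 192.168.0.*; o2ib0(ib0) 10.10.0.*"
```

Option B is the form suited to the cluster-wide configuration mentioned in the last list item, since it avoids per-node edits.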