LUDOC-495 dne: add filesystem-wide default directory striping

diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml

index 03af700..ce887ef 100644
--- a/SettingUpLustreSystem.xml
+++ b/SettingUpLustreSystem.xml
@@ -9,31 +9,31 @@
    <itemizedlist>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_49017"/>
+          <xref linkend="storage_hardware_considerations"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.space_requirements"/>
+          <xref linkend="space_requirements"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>
+          <xref linkend="ldiskfs_mkfs_opts"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_26456"/>
+          <xref linkend="mds_oss_memory"/>
        </para>
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_78272"/>
+          <xref linkend="network_considerations"/>
        </para>
      </listitem>
    </itemizedlist>
-  <section xml:id="dbdoclet.50438256_49017">
+  <section xml:id="storage_hardware_considerations">
        <title><indexterm><primary>setup</primary></indexterm>
    <indexterm><primary>setup</primary><secondary>hardware</secondary></indexterm>        
    <indexterm><primary>design</primary><see>setup</see></indexterm>        
@@ -55,12 +55,14 @@
        </listitem>
      </itemizedlist>
      </warning>
-    <para>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are
-      typically used for testing to match expected customer usage and avoid limitations due to the 4
-      GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs.
-      Also, due to kernel API limitations, performing backups of Lustre software release 2.x. file
-      systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit
-      inode number.</para>
+    <para>Only servers running on 64-bit CPUs are tested and supported.
+      64-bit CPU clients are typically used for testing to match expected
+      customer usage and avoid limitations due to the 4 GB limit for RAM
+      size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit
+      CPUs.  Also, due to kernel API limitations, performing backups of Lustre
+      filesystems on 32-bit clients may cause backup tools to confuse files
+      that report the same 32-bit inode number, if the backup tools depend
+      on the inode number for correct operation.</para>
      <para>The storage attached to the servers typically uses RAID to provide fault tolerance and can
        optionally be organized with logical volume management (LVM), which is then formatted as a
        Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format
@@ -70,7 +72,11 @@
        a separate device.</para>
      <para>The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor cores are recommended. More are advisable for files systems with many clients.</para>
      <note>
-      <para>Lustre clients running on architectures with different endianness are supported. One limitation is that the PAGE_SIZE kernel macro on the client must be as large as the PAGE_SIZE of the server. In particular, ia64 or PPC clients with large pages (up to 64kB pages) can run with x86 servers (4kB pages). If you are running x86 clients with ia64 or PPC servers, you must compile the ia64 kernel with a 4kB PAGE_SIZE (so the server page size is not larger than the client page size). </para>
+      <para>Lustre clients running on different CPU architectures are
+      supported. One limitation is that the PAGE_SIZE kernel macro on the
+      client must be at least as large as the PAGE_SIZE of the server. In
+      particular, ARM or PPC clients with large pages (up to 64kB pages) can
+      run with x86 servers (4kB pages).</para>
      </note>
      <section remap="h3">
          <title><indexterm>
@@ -151,7 +157,7 @@
        results.)</para>
      </section>
    </section>
-  <section xml:id="dbdoclet.space_requirements">
+  <section xml:id="space_requirements">
        <title><indexterm><primary>setup</primary><secondary>space</secondary></indexterm>
            <indexterm><primary>space</primary><secondary>determining requirements</secondary></indexterm>
            Determining Space Requirements</title>
@@ -210,7 +216,7 @@
        The size is determined by the total number of servers in the Lustre
        file system cluster(s) that are managed by the MGS.</para>
      </section>
-    <section xml:id="dbdoclet.mdt_space_requirements">
+    <section xml:id="mdt_space_requirements">
          <title><indexterm>
            <primary>setup</primary>
            <secondary>MDT</secondary>
@@ -254,7 +260,7 @@
          <para>2 KiB/inode x 100 million inodes x 2 = 400 GiB ldiskfs MDT</para>
        </informalexample>
        <para>For details about formatting options for ldiskfs MDT and OST file
-      systems, see <xref linkend="dbdoclet.ldiskfs_mdt_mkfs"/>.</para>
+      systems, see <xref linkend="ldiskfs_mdt_mkfs"/>.</para>
        <note>
          <para>If the median file size is very small, 4 KB for example, the
          MDT would use as much space for each file as the space used on the OST,
@@ -321,10 +327,10 @@
        specify a different average file size (number of total inodes for a given
        OST size) to reduce file system overhead and minimize file system check
        time.
-      See <xref linkend="dbdoclet.ldiskfs_ost_mkfs"/> for more details.</para>
+      See <xref linkend="ldiskfs_ost_mkfs"/> for more details.</para>
      </section>
    </section>
-  <section xml:id="dbdoclet.ldiskfs_mkfs_opts">
+  <section xml:id="ldiskfs_mkfs_opts">
      <title>
        <indexterm>
          <primary>ldiskfs</primary>
@@ -371,7 +377,7 @@
      <screen>--mkfsoptions=&apos;backing fs options&apos;</screen>
      <para>For other <literal>mkfs.lustre</literal> options, see the Linux man page for
          <literal>mke2fs(8)</literal>.</para>
-    <section xml:id="dbdoclet.ldiskfs_mdt_mkfs">
+    <section xml:id="ldiskfs_mdt_mkfs">
        <title><indexterm>
            <primary>inodes</primary>
            <secondary>MDS</secondary>
@@ -382,7 +388,7 @@
        <para>The number of inodes on the MDT is determined at format time
        based on the total size of the file system to be created. The default
        <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
-      for an ldiskfs MDT is optimized at one inode for every 2048 bytes of file
+      for an ldiskfs MDT is optimized at one inode for every 2560 bytes of file
        system space.</para>
        <para>This setting takes into account the space needed for additional
        ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB),
@@ -398,7 +404,7 @@
        the bytes-per-inode ratio to have enough space on the MDT for small files,
        as described below.
        </para>
-      <para>It is possible to change the recommended 2048 bytes
+      <para>It is possible to change the recommended default of 2560 bytes
        per inode for an ldiskfs MDT when it is first formatted by adding the
        <literal>--mkfsoptions="-i bytes-per-inode"</literal> option to
        <literal>mkfs.lustre</literal>.  Decreasing the inode ratio tunable
@@ -406,9 +412,9 @@
        MDT size, but will leave less space for extra per-file metadata and is
        not recommended.  The inode ratio must always be strictly larger than
        the MDT inode size, which is 1024 bytes by default.  It is recommended
-      to use an inode ratio at least 1024 bytes larger than the inode size to
+      to use an inode ratio at least 1536 bytes larger than the inode size to
        ensure the MDT does not run out of space.  Increasing the inode ratio
-      to include enough space for the most common file data (e.g. 5120 or 65560
+      to include enough space for the most common file size (e.g. 5632 or 66560
        bytes if 4KB or 64KB files are widely used) is recommended for DoM.</para>
        <para>The size of the inode may be changed at format time by adding the
        <literal>--stripe-count-hint=N</literal> to have
@@ -424,7 +430,7 @@
        read or written for each MDT inode access.
        </para>
      </section>
-    <section xml:id="dbdoclet.ldiskfs_ost_mkfs">
+    <section xml:id="ldiskfs_ost_mkfs">
        <title><indexterm>
            <primary>inodes</primary>
            <secondary>OST</secondary>
@@ -521,13 +527,13 @@
        <screen>[oss#] mkfs.lustre --ost --mkfsoptions=&quot;-i $((8192 * 1024))&quot; ...</screen>
        </para>
        <note>
-        <para>OSTs formatted with ldiskfs can use a maximum of approximately
-        320 million objects per MDT, up to a maximum of 4 billion inodes.
-       Specifying a very small bytes-per-inode ratio for a large OST that
-       exceeds this limit can cause either premature out-of-space errors and prevent
-        the full OST space from being used, or will waste space and slow down
-        e2fsck more than necessary.  The default inode ratios are chosen to
-        ensure that the total number of inodes remain below this limit.
+        <para>OSTs formatted with ldiskfs should preferably have fewer than
+        320 million objects per MDT, and up to a maximum of 4 billion inodes.
+        Specifying a very small bytes-per-inode ratio for a large OST that
+        exceeds this limit can either cause premature out-of-space errors and
+        prevent the full OST space from being used, or waste space and
+        slow down e2fsck more than necessary. The default inode ratios are
+        chosen to ensure the total number of inodes remains below this limit.
          </para>
        </note>
        <note>
@@ -540,7 +546,7 @@
          if substantial errors are detected and need to be repaired.</para>
        </note>
        <para>For further details about optimizing MDT and OST file systems,
-      see <xref linkend="dbdoclet.ldiskfs_raid_opts"/>.</para>
+      see <xref linkend="ldiskfs_raid_opts"/>.</para>
      </section>
    </section>
    <section remap="h3">
@@ -562,13 +568,12 @@
        </indexterm>File and File System Limits</title>
  
        <para><xref linkend="settinguplustresystem.tab2"/> describes
-     current known limits of Lustre.  These limits are imposed by either
-     the Lustre architecture or the Linux virtual file system (VFS) and
-     virtual memory subsystems. In a few cases, a limit is defined within
-     the code and can be changed by re-compiling the Lustre software.
-     Instructions to install from source code are beyond the scope of this
-     document, and can be found elsewhere online. In these cases, the
-     indicated limit was used for testing of the Lustre software. </para>
+      current known limits of Lustre.  These limits may be imposed by either
+      the Lustre architecture or the Linux virtual file system (VFS) and
+      virtual memory subsystems. In a few cases, a limit is defined within
+      the Lustre code based on tested values and could be changed by editing
+      and re-compiling the Lustre software.  In these cases, the indicated
+      limit was used for testing of the Lustre software.</para>
  
      <table frame="all" xml:id="settinguplustresystem.tab2">
        <title>File and file system limits</title>
@@ -592,42 +597,45 @@
          <tbody>
            <row>
              <entry>
-              <para>Maximum number of MDTs</para>
+              <para><anchor xml:id="max_mdt_count" xreflabel=""/>Maximum number of MDTs</para>
              </entry>
              <entry>
                <para>256</para>
              </entry>
              <entry>
-              <para>A single MDS can host
-              multiple MDTs, either for separate file systems, or up to 255
-              additional MDTs can be added to the filesystem and attached into
-              the namespace with DNE remote or striped directories.</para>
+              <para>A single MDS can host one or more MDTs, either for separate
+              filesystems, or aggregated into a single namespace. Each
+              filesystem requires a separate MDT for the filesystem root
+             directory.
+              Up to 255 more MDTs can be added to the filesystem and are
+              attached into the filesystem namespace with creation of DNE
+              remote or striped directories.</para>
              </entry>
            </row>
            <row>
              <entry>
-              <para>Maximum number of OSTs</para>
+              <para><anchor xml:id="max_ost_count" xreflabel=""/>Maximum number of OSTs</para>
              </entry>
              <entry>
                <para>8150</para>
              </entry>
              <entry>
                <para>The maximum number of OSTs is a constant that can be
-              changed at compile time.  Lustre file systems with up to
-              4000 OSTs have been tested.  Multiple OST file systems can
-              be configured on a single OSS node.</para>
+              changed at compile time.  Lustre file systems with up to 4000
+              OSTs have been configured in the past.  Multiple OST targets
+              can be configured on a single OSS node.</para>
              </entry>
            </row>
            <row>
              <entry>
-              <para>Maximum OST size</para>
+              <para><anchor xml:id="max_ost_size" xreflabel=""/>Maximum OST size</para>
              </entry>
              <entry>
-              <para>512TiB (ldiskfs), 512TiB (ZFS)</para>
+              <para>1024TiB (ldiskfs), 1024TiB (ZFS)</para>
              </entry>
              <entry>
                <para>This is not a <emphasis>hard</emphasis> limit. Larger
-              OSTs are possible but most production systems do not
+              OSTs are possible, but most production systems do not
                typically go beyond the stated limit per OST because Lustre
                can add capacity and performance with additional OSTs, and
                having more OSTs improves aggregate I/O performance,
@@ -637,13 +645,13 @@
                <para>
                With 32-bit kernels, due to page cache limits, 16TB is the
                maximum block device size, which in turn applies to the
-              size of OST.  It is strongly recommended to run Lustre
-              clients and servers with 64-bit kernels.</para>
+              size of OST.  It is <emphasis>strongly</emphasis> recommended
+              to run Lustre clients and servers with 64-bit kernels.</para>
              </entry>
            </row>
            <row>
              <entry>
-              <para>Maximum number of clients</para>
+              <para><anchor xml:id="max_client_count" xreflabel=""/>Maximum number of clients</para>
              </entry>
              <entry>
                <para>131072</para>
@@ -656,21 +664,21 @@
            </row>
            <row>
              <entry>
-              <para>Maximum size of a single file system</para>
+              <para><anchor xml:id="max_filesysem_size" xreflabel=""/>Maximum size of a single file system</para>
              </entry>
              <entry>
-              <para>at least 1EiB</para>
+              <para>2EiB or larger</para>
              </entry>
              <entry>
-              <para>Each OST can have a file system up to the
-              Maximum OST size limit, and the Maximum number of OSTs
-              can be combined into a single filesystem.
+              <para>Each OST can have a file system up to the "Maximum OST
+              size" limit, and the Maximum number of OSTs can be combined
+              into a single filesystem.
                </para>
              </entry>
            </row>
            <row>
              <entry>
-              <para>Maximum stripe count</para>
+              <para><anchor xml:id="max_stripe_count" xreflabel=""/>Maximum stripe count</para>
              </entry>
              <entry>
                <para>2000</para>
@@ -679,13 +687,22 @@
                <para>This limit is imposed by the size of the layout that
                needs to be stored on disk and sent in RPC requests, but is
                not a hard limit of the protocol. The number of OSTs in the
-              filesystem can exceed the stripe count, but this limits the
-              number of OSTs across which a single file can be striped.</para>
+              filesystem can exceed the stripe count, but this is the maximum
+              number of OSTs on which a <emphasis>single file</emphasis>
+              can be striped.</para>
+              <note condition='l2D'><para>Before 2.13, by default for ldiskfs
+              MDTs, the maximum stripe count for a
+              <emphasis>single file</emphasis> is limited to 160 OSTs.  In order to
+              increase the maximum file stripe count, use
+              <literal>--mkfsoptions="-O ea_inode"</literal> when formatting the MDT,
+              or use <literal>tune2fs -O ea_inode</literal> to enable it after the
+              MDT has been formatted.</para>
+              </note>
              </entry>
            </row>
            <row>
              <entry>
-              <para>Maximum stripe size</para>
+              <para><anchor xml:id="max_stripe_size" xreflabel=""/>Maximum stripe size</para>
              </entry>
              <entry>
                <para>&lt; 4 GiB</para>
@@ -697,7 +714,7 @@
            </row>
            <row>
              <entry>
-              <para>Minimum stripe size</para>
+              <para><anchor xml:id="min_stripe_size" xreflabel=""/>Minimum stripe size</para>
              </entry>
              <entry>
                <para>64 KiB</para>
@@ -706,12 +723,13 @@
                <para>Due to the use of 64 KiB PAGE_SIZE on some CPU
                architectures such as ARM and POWER, the minimum stripe
                size is 64 KiB so that a single page is not split over
-              multiple servers.</para>
+              multiple servers.  This is also the minimum Data-on-MDT
+             component size that can be specified.</para>
              </entry>
            </row>
            <row>
              <entry>
-              <para>Maximum single object size</para>
+              <para><anchor xml:id="max_object_size" xreflabel=""/>Maximum single object size</para>
              </entry>
              <entry>
                <para>16TiB (ldiskfs), 256TiB (ZFS)</para>
@@ -726,7 +744,7 @@
            </row>
            <row>
              <entry>
-              <para>Maximum <anchor xml:id="dbdoclet.50438256_marker-1290761" xreflabel=""/>file size</para>
+              <para><anchor xml:id="max_file_size" xreflabel=""/>Maximum file size</para>
              </entry>
              <entry>
                <para>16 TiB on 32-bit systems</para>
@@ -749,14 +767,14 @@
            </row>
            <row>
              <entry>
-              <para>Maximum number of files or subdirectories in a single directory</para>
+              <para><anchor xml:id="max_directory_size" xreflabel=""/>Maximum number of files or subdirectories in a single directory</para>
              </entry>
              <entry>
-              <para>10 million files (ldiskfs), 2^48 (ZFS)</para>
+              <para>600M-3.8B files (ldiskfs), 16T (ZFS)</para>
              </entry>
              <entry>
                <para>The Lustre software uses the ldiskfs hashed directory
-              code, which has a limit of about 10 million files, depending
+              code, which has a limit of at least 600 million files, depending
                on the length of the file name. The limit on subdirectories
                is the same as the limit on regular files.</para>
                <note condition='l28'><para>Starting in the 2.8 release it is
@@ -764,16 +782,19 @@
                over multiple MDTs with the <literal>lfs mkdir -c</literal>
                command, which increases the single directory limit by a
                factor of the number of directory stripes used.</para></note>
-              <para>Lustre file systems are tested with ten million files
-              in a single directory.</para>
+              <note condition='l2E'><para>Starting in the 2.14 release, the
+              <literal>large_dir</literal> feature of ldiskfs is enabled by
+              default to allow directories with more than 10M entries.  In
+              the 2.12 release, the <literal>large_dir</literal> feature was
+              present but not enabled by default.</para></note>
              </entry>
            </row>
            <row>
              <entry>
-              <para>Maximum number of files in the file system</para>
+              <para><anchor xml:id="max_file_count" xreflabel=""/>Maximum number of files in the file system</para>
              </entry>
              <entry>
-              <para>4 billion (ldiskfs), 256 trillion (ZFS) per MDT</para>
+              <para>4 billion (ldiskfs), 256 trillion (ZFS) <emphasis>per MDT</emphasis></para>
              </entry>
              <entry>
                <para>The ldiskfs filesystem imposes an upper limit of
@@ -795,7 +816,7 @@
            </row>
            <row>
              <entry>
-              <para>Maximum length of a filename</para>
+              <para><anchor xml:id="max_filename_size" xreflabel=""/>Maximum length of a filename</para>
              </entry>
              <entry>
                <para>255 bytes (filename)</para>
@@ -807,7 +828,7 @@
            </row>
            <row>
              <entry>
-              <para>Maximum length of a pathname</para>
+              <para><anchor xml:id="max_pathname_size" xreflabel=""/>Maximum length of a pathname</para>
              </entry>
              <entry>
                <para>4096 bytes (pathname)</para>
@@ -818,7 +839,7 @@
            </row>
            <row>
              <entry>
-              <para>Maximum number of open files for a Lustre file system</para>
+              <para><anchor xml:id="max_open_files" xreflabel=""/>Maximum number of open files for a Lustre file system</para>
              </entry>
              <entry>
                <para>No limit</para>
@@ -836,15 +857,8 @@
        </tgroup>
      </table>
      <para>&#160;</para>
-    <note><para>By default for ldiskfs MDTs the maximum stripe count for a
-    <emphasis>single file</emphasis> is limited to 160 OSTs.  In order to
-    increase the maximum file stripe count, use
-    <literal>--mkfsoptions="-O ea_inode"</literal> when formatting the MDT,
-    or use <literal>tune2fs -O ea_inode</literal> to enable it after the
-    MDT has been formatted.</para>
-    </note>
    </section>
-  <section xml:id="dbdoclet.50438256_26456">
+  <section xml:id="mds_oss_memory">
      <title><indexterm><primary>setup</primary><secondary>memory</secondary></indexterm>Determining Memory Requirements</title>
      <para>This section describes the memory requirements for each Lustre file system component.</para>
      <section remap="h3">
@@ -905,88 +919,95 @@
        </itemizedlist>
        <section remap="h4">
          <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>MDS</tertiary></indexterm>Calculating MDS Memory Requirements</title>
-        <para>By default, 4096 MB are used for the ldiskfs filesystem journal. Additional
-       RAM is used for caching file data for the larger working set, which is not
-       actively in use by clients but should be kept &quot;hot&quot; for improved
-       access times. Approximately 1.5 KB per file is needed to keep a file in cache
-       without a lock.</para>
-        <para>For example, for a single MDT on an MDS with 1,024 clients, 12 interactive
-       login nodes, and a 6 million file working set (of which 4M files are cached
-       on the clients):</para>
+        <para>By default, 4096 MB are used for the ldiskfs filesystem journal.
+          Additional RAM is used for caching file data for the larger working
+          set, which is not actively in use by clients but should be kept
+          &quot;hot&quot; for improved access times. Approximately 1.5 KB per
+          file is needed to keep a file in cache without a lock.</para>
+        <para>For example, for a single MDT on an MDS with 1,024 compute nodes,
+          12 interactive login nodes, and a 20 million file working set (of
+          which 9 million files are cached on the clients at one time):</para>
          <informalexample>
-          <para>Operating system overhead = 1024 MB</para>
+          <para>Operating system overhead = 4096 MB (RHEL8)</para>
            <para>File system journal = 4096 MB</para>
-          <para>1024 * 4-core clients * 1024 files/core * 2kB = 4096 MB</para>
-          <para>12 interactive clients * 100,000 files * 2kB = 2400 MB</para>
-          <para>2M file extra working set * 1.5kB/file = 3096 MB</para>
+          <para>1024 * 32-core clients * 256 files/core * 2KB = 16384 MB</para>
+          <para>12 interactive clients * 100,000 files * 2KB = 2400 MB</para>
+          <para>20 million file working set * 1.5KB/file = 30720 MB</para>
          </informalexample>
-        <para>Thus, the minimum requirement for an MDT with this configuration is at least
-       16 GB of RAM. Additional memory may significantly improve performance.</para>
-        <para>For directories containing 1 million or more files, more memory can provide
-       a significant benefit. For example, in an environment where clients randomly
-       access one of 10 million files, having extra memory for the cache significantly
-       improves performance.</para>
+        <para>Thus, a reasonable MDS configuration for this workload is
+          at least 60 GB of RAM.  For active-active DNE MDT failover pairs,
+          each MDS should have at least 96 GB of RAM.  The additional memory
+          can be used during normal operation to allow more metadata and locks
+          to be cached and improve performance, depending on the workload.
+        </para>
+        <para>For directories containing 1 million or more files, more memory
+          can provide a significant benefit. For example, in an environment
+          where clients randomly access a single directory with 10 million
+          files, that directory can consume as much as 35GB of RAM on the MDS.</para>
        </section>
      </section>
      <section remap="h3">
        <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>OSS Memory Requirements</title>
-      <para>When planning the hardware for an OSS node, consider the memory usage of
-      several components in the Lustre file system (i.e., journal, service threads,
-      file system metadata, etc.). Also, consider the effect of the OSS read cache
-      feature, which consumes memory as it caches data on the OSS node.</para>
+      <para>When planning the hardware for an OSS node, consider the memory
+        usage of several components in the Lustre file system (i.e., journal,
+        service threads, file system metadata, etc.). Also, consider the
+        effect of the OSS read cache feature, which consumes memory as it
+        caches data on the OSS node.</para>
        <para>In addition to the MDS memory requirements mentioned above,
-      the OSS requirements also include:</para>
+        the OSS requirements also include:</para>
        <itemizedlist>
          <listitem>
            <para><emphasis role="bold">Service threads</emphasis>:
-         The service threads on the OSS node pre-allocate an RPC-sized MB I/O buffer
-         for each ost_io service thread, so these buffers do not need to be allocated
-         and freed for each I/O request.</para>
+            The service threads on the OSS node pre-allocate an RPC-sized
+            I/O buffer for each <literal>ost_io</literal> service thread, so
+            these large buffers do not need to be allocated and freed for
+            each I/O request.</para>
          </listitem>
          <listitem>
            <para><emphasis role="bold">OSS read cache</emphasis>:
-         OSS read cache provides read-only caching of data on an OSS, using the regular
-         Linux page cache to store the data. Just like caching from a regular file
-         system in the Linux operating system, OSS read cache uses as much physical
-         memory as is available.</para>
+           OSS read cache provides read-only caching of data on an HDD-based
+            OSS, using the regular Linux page cache to store the data. Just
+            like caching from a regular file system in the Linux operating
+            system, OSS read cache uses as much physical memory as is available.
+          </para>
          </listitem>
        </itemizedlist>
-      <para>The same calculation applies to files accessed from the OSS as for the MDS,
-      but the load is distributed over many more OSSs nodes, so the amount of memory
-      required for locks, inode cache, etc. listed under MDS is spread out over the
-      OSS nodes.</para>
-      <para>Because of these memory requirements, the following calculations should be
-      taken as determining the absolute minimum RAM required in an OSS node.</para>
+      <para>The same calculation applies to files accessed from the OSS as for
+        the MDS, but the load is typically distributed over more OSS nodes, so
+        the amount of memory required for locks, inode cache, etc. listed for
+        the MDS is spread out over the OSS nodes.</para>
+      <para>Because of these memory requirements, the following calculations
+        should be taken as determining the minimum RAM required in an OSS node.
+      </para>
        <section remap="h4">
          <title><indexterm><primary>setup</primary><secondary>memory</secondary><tertiary>OSS</tertiary></indexterm>Calculating OSS Memory Requirements</title>
-        <para>The minimum recommended RAM size for an OSS with eight OSTs is:</para>
+        <para>The minimum recommended RAM size for an OSS with eight OSTs,
+          handling objects for 1/4 of the active files for the MDS, is:</para>
          <informalexample>
-          <para>Linux kernel and userspace daemon memory = 1024 MB</para>
+          <para>Linux kernel and userspace daemon memory = 4096 MB</para>
            <para>Network send/receive buffers (16 MB * 512 threads) = 8192 MB</para>
            <para>1024 MB ldiskfs journal size * 8 OST devices = 8192 MB</para>
            <para>16 MB read/write buffer per OST IO thread * 512 threads = 8192 MB</para>
            <para>2048 MB file system read cache * 8 OSTs = 16384 MB</para>
-          <para>1024 * 4-core clients * 1024 files/core * 2kB/file = 8192 MB</para>
-          <para>12 interactive clients * 100,000 files * 2kB/file = 2400 MB</para>
-          <para>2M file extra working set * 2kB/file = 4096 MB</para>
-          <para>DLM locks + file cache TOTAL = 31072 MB</para>
-          <para>Per OSS DLM locks + file system metadata = 31072 MB/4 OSS = 7768 MB (approx.)</para>
-          <para>Per OSS RAM minimum requirement = 32 GB (approx.)</para>
+          <para>1024 * 32-core clients * 64 objects/core * 2KB/object = 4096 MB</para>
+          <para>12 interactive clients * 25,000 objects * 2KB/object = 600 MB</para>
+          <para>5 million object working set * 1.5KB/object = 7500 MB</para>
          </informalexample>
-        <para>This consumes about 16 GB just for pre-allocated buffers, and an
-       additional 1 GB for minimal file system and kernel usage. Therefore, for a
-       non-failover configuration, the minimum RAM would be about 32 GB for an OSS node
-       with eight OSTs. Adding additional memory on the OSS will improve the performance
-       of reading smaller, frequently-accessed files.</para>
-        <para>For a failover configuration, the minimum RAM would be at least 48 GB,
-       as some of the memory is per-node. When the OSS is not handling any failed-over
-       OSTs the extra RAM will be used as a read cache.</para>
-        <para>As a reasonable rule of thumb, about 8 GB of base memory plus 3 GB per OST
-       can be used. In failover configurations, about 6 GB per OST is needed.</para>
+        <para>For a non-failover configuration, the minimum RAM would be about
+          60 GB for an OSS node with eight OSTs. Additional memory on the OSS
+          will improve the performance of reading smaller, frequently-accessed
+          files.</para>
+        <para>For a failover configuration, the minimum RAM would be about
+          90 GB, as some of the memory is per-node. When the OSS is not handling
+          any failed-over OSTs the extra RAM will be used as a read cache.
+          </para>
+        <para>As a reasonable rule of thumb, about 24 GB of base memory plus
+          4 GB per OST can be used. In failover configurations, about 8 GB per
+          primary OST is needed.</para>
        </section>
      </section>
    </section>
-  <section xml:id="dbdoclet.50438256_78272">
+  <section xml:id="network_considerations">
      <title><indexterm>
          <primary>setup</primary>
          <secondary>network</secondary>