LUDOC-11 config: improve ZFS MDT space calculations
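
For reviewers, the MDT sizing arithmetic documented in this change can be
checked with a short script. This is only a sketch: the function names are
hypothetical, and the ZFS estimate is one possible reading of the "4 KiB per
inode, before mirroring, plus about 3% reserved" guidance, not a formula
taken from the manual.

```python
import math

TB = 10**12  # decimal units, matching the "1000000 MB/TB" factor in the example
MB = 10**6


def min_inodes(usable_ost_bytes, avg_file_bytes):
    """Minimum inode count: one inode per file of average size."""
    return usable_ost_bytes // avg_file_bytes


def ldiskfs_mdt_bytes(inodes, bytes_per_inode=2048, headroom=2):
    """ldiskfs MDT space: 2 KiB per inode, doubled for future growth."""
    return inodes * bytes_per_inode * headroom


def zfs_mdt_raw_bytes(inodes, bytes_per_inode=4096, mirror_copies=2,
                      reserved_fraction=0.03):
    """Assumed ZFS estimate: 4 KiB usable per inode, ~3% pool reserve,
    then doubled for mirroring. The way these factors are combined is an
    assumption of this sketch."""
    usable = inodes * bytes_per_inode
    pool = math.ceil(usable / (1 - reserved_fraction))
    return pool * mirror_copies


inodes = min_inodes(500 * TB, 5 * MB)
ldiskfs_bytes = ldiskfs_mdt_bytes(inodes)
zfs_bytes = zfs_mdt_raw_bytes(inodes)

print(f"minimum inodes:   {inodes}")
print(f"ldiskfs MDT size: {ldiskfs_bytes / 10**9:.1f} GB")
print(f"ZFS MDT raw size: {zfs_bytes / 10**9:.1f} GB (assumed formula)")
```

The ldiskfs figure comes out slightly above the rounded value quoted in the
manual text because 2 KiB is 2048 bytes rather than 2000.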

[doc/manual.git] / SettingUpLustreSystem.xml
diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml

index ac425aa..a90d133 100644
--- a/SettingUpLustreSystem.xml
+++ b/SettingUpLustreSystem.xml
@@ -12,7 +12,7 @@
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_31079"/>
+          <xref linkend="dbdoclet.space_requirements"/>
        </para>
      </listitem>
      <listitem>
@@ -142,7 +142,7 @@
        results.)</para>
      </section>
    </section>
-  <section xml:id="dbdoclet.50438256_31079">
+  <section xml:id="dbdoclet.space_requirements">
        <title><indexterm><primary>setup</primary><secondary>space</secondary></indexterm>
            <indexterm><primary>space</primary><secondary>determining requirements</secondary></indexterm>
            Determining Space Requirements</title>
@@ -170,20 +170,24 @@
      for growth without the effort of additional storage.</para>
      <para>By default, the ldiskfs file system used by Lustre servers to store
      user-data objects and system data reserves 5% of space that cannot be used
-    by the Lustre file system.  Additionally, a Lustre file system reserves up
-    to 400 MB on each OST, and up to 4GB on each MDT for journal use and a
-    small amount of space outside the journal to store accounting data. This
-    reserved space is unusable for general storage. Thus, at least this much
-    space will be used on each OST before any file object data is saved.</para>
+    by the Lustre file system.  Additionally, an ldiskfs Lustre file system
+    reserves up to 400 MB on each OST, and up to 4GB on each MDT for journal
+    use and a small amount of space outside the journal to store accounting
+    data. This reserved space is unusable for general storage. Thus, at least
+    this much space will be used per OST before any file object data is saved.
+    </para>
      <para condition="l24">With a ZFS backing filesystem for the MDT or OST,
      the space allocation for inodes and file data is dynamic, and inodes are
-    allocated as needed.  A minimum of 2kB of usable space (before mirroring)
+    allocated as needed.  A minimum of 4kB of usable space (before mirroring)
      is needed for each inode, exclusive of other overhead such as directories,
-    internal log files, extended attributes, ACLs, etc.
+    internal log files, extended attributes, ACLs, etc.  ZFS also reserves
+    approximately 3% of the total storage space for internal and redundant
+    metadata, which is not usable by Lustre.
      Since the size of extended attributes and ACLs is highly dependent on
      kernel versions and site-specific policies, it is best to over-estimate
      the amount of space needed for the desired number of inodes, and any
-    excess space will be utilized to store more inodes.</para>
+    excess space will be utilized to store more inodes.
+    </para>
      <section>
        <title><indexterm>
            <primary>setup</primary>
@@ -193,9 +197,9 @@
            <primary>space</primary>
            <secondary>determining MGT requirements</secondary>
          </indexterm> Determining MGT Space Requirements</title>
-      <para>Less than 100 MB of space is required for the MGT. The size
-      is determined by the number of servers in the Lustre file system
-      cluster(s) that are managed by the MGS.</para>
+      <para>Less than 100 MB of space is typically required for the MGT.
+      The size is determined by the total number of servers in the Lustre
+      file system cluster(s) that are managed by the MGS.</para>
      </section>
      <section xml:id="dbdoclet.50438256_87676">
          <title><indexterm>
@@ -207,36 +211,35 @@
            <secondary>determining MDT requirements</secondary>
          </indexterm> Determining MDT Space Requirements</title>
        <para>When calculating the MDT size, the important factor to consider
-      is the number of files to be stored in the file system. This determines
-      the number of inodes needed, which drives the MDT sizing. To be on the
-      safe side, plan for 2 KB per ldiskfs inode on the MDT, which is the
-      default value. Attached storage required for Lustre file system metadata
-      is typically 1-2 percent of the file system capacity depending upon
-      file size.</para>
-      <note condition='l24'><para>Starting in release 2.4, using the DNE
-      remote directory feature it is possible to increase the metadata
-      capacity of a single filesystem by configuting additional MDTs into
-      the filesystem, see <xref linkend="dbdoclet.addingamdt"/> for details.
-      </para></note>
-      <para>For example, if the average file size is 5 MB and you have
-      100 TB of usable OST space, then you can calculate the minimum number
-      of inodes as follows:</para>
+      is the number of files to be stored in the file system, because each file
+      needs an inode with at least 4 KiB of usable space on the MDT.  MDTs
+      typically use RAID-1+0 mirroring, so the total storage needed is double this.
+      </para>
+      <para>Please note that the actual used space per MDT depends on the number
+      of files per directory, the number of stripes per file, whether files
+      have ACLs or user xattrs, and the number of hard links per file.  The
+      storage required for Lustre file system metadata is typically 1-2
+      percent of the total file system capacity depending upon file size.</para>
+      <para>For example, if the average file size is 5 MB and you have
+      500 TB of usable OST space, then you can calculate the minimum total
+      number of inodes needed for the MDTs and for the OSTs as follows:</para>
        <informalexample>
-        <para>(100 TB * 1024 GB/TB * 1024 MB/GB) / 5 MB/inode = 20 million inodes</para>
+        <para>(500 TB * 1000000 MB/TB) / 5 MB/inode = 100M inodes</para>
        </informalexample>
-      <para>For details about formatting options for MDT and OST file systems,
-      see <xref linkend="dbdoclet.ldiskfs_mdt_mkfs"/>.</para>
+      <para>For details about formatting options for ldiskfs MDT and OST file
+      systems, see <xref linkend="dbdoclet.ldiskfs_mdt_mkfs"/>.</para>
        <para>It is recommended that the MDT have at least twice the minimum
        number of inodes to allow for future expansion and allow for an average
-      file size smaller than expected. Thus, the required space is:</para>
+      file size smaller than expected. Thus, the minimum space for an ldiskfs
+      MDT should be approximately:
+      </para>
        <informalexample>
-        <para>2 KB/inode x 20 million inodes x 2 = 80 GB</para>
+        <para>2 KiB/inode x 100 million inodes x 2 = 400 GiB ldiskfs MDT</para>
        </informalexample>
        <note>
          <para>If the average file size is very small, 4 KB for example, the
-        Lustre file system is not very efficient as the MDT will use as much
-        space for each file as the space used on the OST. However, this is not
-        a common configuration for a Lustre environment.</para>
+        MDT will use as much space for each file as the space used on the OST.
+       However, this is an uncommon usage for a Lustre filesystem.</para>
        </note>
        <note>
          <para>If the MDT has too few inodes, this can cause the space on the
@@ -246,14 +249,30 @@
          number of inodes after the file system is formatted, depending on the
          storage.  For ldiskfs MDT filesystems the <literal>resize2fs</literal>
          tool can be used if the underlying block device is on a LVM logical
-        volume.  For ZFS new (mirrored) VDEVs can be added to the MDT pool.
-        Inodes will be added approximately in proportion to space added.</para>
+        volume and the underlying logical volume size can be increased.
+       For ZFS, new (mirrored) VDEVs can be added to the MDT pool to increase
+       the total space available for inode storage.
+        Inodes will be added approximately in proportion to space added.
+       </para>
        </note>
-      <note condition='l24'><para>It is also possible to increase the number
-        of inodes available, as well as increasing the aggregate metadata
-        performance, by adding additional MDTs using the DNE remote directory
-        feature available in Lustre release 2.4 and later, see
-        <xref linkend="dbdoclet.addingamdt"/>.</para>
+      <note condition='l24'>
+        <para>Note that the number of total and free inodes reported by
+        <literal>lfs df -i</literal> for ZFS MDTs and OSTs is estimated based
+        on the current average space used per inode.  When a ZFS filesystem is
+        first formatted, this free inode estimate will be very conservative
+        (low) due to the high ratio of directories to regular files created for
+       internal Lustre metadata storage, but this estimate will improve as
+       more files are created by regular users and the average file size will
+       better reflect actual site usage.
+       </para>
+      </note>
+      <note condition='l24'>
+        <para>Starting in release 2.4, using the DNE remote directory feature
+       it is possible to increase the total number of inodes of a Lustre
+       filesystem, as well as increasing the aggregate metadata performance,
+       by configuring additional MDTs into the filesystem.  See
+        <xref linkend="dbdoclet.addingamdt"/> for details.
+        </para>
        </note>
      </section>
      <section remap="h3">
@@ -344,7 +363,7 @@
        <literal>mkfs.lustre</literal>.  Decreasing the inode ratio tunable
        <literal>bytes-per-inode</literal> will create more inodes for a given
        MDT size, but will leave less space for extra per-file metadata.  The
-      inode ratio must always be strictly larget than the MDT inode size,
+      inode ratio must always be strictly larger than the MDT inode size,
        which is 512 bytes by default.  It is recommended to use an inode ratio
        at least 512 bytes larger than the inode size to ensure the MDT does
        not run out of space.</para>
@@ -373,12 +392,11 @@
        for potential variations in future usage. This helps reduce the format
        and file system check time and makes more space available for data.</para>
        <para>The table below shows the default
-      <emphasis role="italic">bytes-per-inode</emphasis>ratio ("inode ratio")
+      <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
        used for OSTs of various sizes when they are formatted.</para>
        <para>
-        <table frame="all">
-          <title xml:id="settinguplustresystem.tab1">Default Inode Ratios
-         Used for Newly Formatted OSTs</title>
+        <table frame="all" xml:id="settinguplustresystem.tab1">
+          <title>Default Inode Ratios Used for Newly Formatted OSTs</title>
            <tgroup cols="3">
              <colspec colname="c1" colwidth="3*"/>
              <colspec colname="c2" colwidth="2*"/>
@@ -496,7 +514,7 @@
          </indexterm>File and File System Limits</title>
  
          <para><xref linkend="settinguplustresystem.tab2"/> describes
-     file and file system size limits.  These limits are imposed by either
+     current known limits of Lustre.  These limits are imposed by either
       the Lustre architecture or the Linux virtual file system (VFS) and
       virtual memory subsystems. In a few cases, a limit is defined within
       the code and can be changed by re-compiling the Lustre software.
@@ -504,8 +522,8 @@
       document, and can be found elsewhere online. In these cases, the
       indicated limit was used for testing of the Lustre software. </para>
  
-      <table frame="all">
-        <title xml:id="settinguplustresystem.tab2">File and file system limits</title>
+      <table frame="all" xml:id="settinguplustresystem.tab2">
+        <title>File and file system limits</title>
          <tgroup cols="3">
            <colspec colname="c1" colwidth="3*"/>
            <colspec colname="c2" colwidth="2*"/>
@@ -529,16 +547,16 @@
                  <para> Maximum number of MDTs</para>
                </entry>
                <entry>
-                <para> 1</para>
-                <para condition='l24'>4096</para>
+                <para condition='l24'>256</para>
                </entry>
                <entry>
-                <para>The Lustre software release 2.3 and earlier allows a maximum of 1 MDT per file
-                  system, but a single MDS can host multiple MDTs, each one for a separate file
-                  system.</para>
-                <para condition="l24">The Lustre software release 2.4 and later requires one MDT for
-                  the filesystem root. Up to 4095 additional MDTs can be added to the file system and attached
-                  into the namespace with remote directories.</para>
+                <para>The Lustre software release 2.3 and earlier allows a
+               maximum of 1 MDT per file system, but a single MDS can host
+               multiple MDTs, each one for a separate file system.</para>
+                <para condition="l24">The Lustre software release 2.4 and later
+               requires one MDT for the filesystem root. Up to 255 more
+               MDTs can be added to the filesystem and attached into
+               the namespace with DNE remote or striped directories.</para>
                </entry>
              </row>
              <row>
@@ -549,8 +567,10 @@
                  <para> 8150</para>
                </entry>
                <entry>
-                <para>The maximum number of OSTs is a constant that can be changed at compile time.
-                  Lustre file systems with up to 4000 OSTs have been tested.</para>
+                <para>The maximum number of OSTs is a constant that can be
+               changed at compile time.  Lustre file systems with up to
+               4000 OSTs have been tested.  Multiple OST file systems can
+               be configured on a single OSS node.</para>
                </entry>
              </row>
              <row>
@@ -561,8 +581,18 @@
                  <para> 128TB (ldiskfs), 256TB (ZFS)</para>
                </entry>
                <entry>
-                <para>This is not a <emphasis>hard</emphasis> limit. Larger OSTs are possible but
-                  today typical production systems do not go beyond the stated limit per OST. </para>
+                <para>This is not a <emphasis>hard</emphasis> limit. Larger
+               OSTs are possible, but production systems today do not
+               typically go beyond the stated limit per OST because Lustre
+               can add capacity and performance with additional OSTs, and
+               having more OSTs improves aggregate I/O performance and
+               minimizes contention.
+               </para>
+               <para>
+               With 32-bit kernels, due to page cache limits, 16TB is the
+               maximum block device size, which in turn applies to the
+               size of an OST.  It is strongly recommended to run Lustre
+               clients and servers with 64-bit kernels.</para>
                </entry>
              </row>
              <row>
@@ -573,7 +603,9 @@
                  <para> 131072</para>
                </entry>
                <entry>
-                <para>The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production.</para>
+                <para>The maximum number of clients is a constant that can
+               be changed at compile time. Up to 30000 clients have been
+               used in production.</para>
                </entry>
              </row>
              <row>
@@ -584,8 +616,10 @@
                  <para> 512 PB (ldiskfs), 1EB (ZFS)</para>
                </entry>
                <entry>
-                <para>Each OST or MDT on 64-bit kernel servers can have a file system up to the above limit. On 32-bit systems, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST on 32-bit kernel servers.</para>
-                <para>You can have multiple OST file systems on a single OSS node.</para>
+                <para>Each OST can have a file system up to the
+               Maximum OST size limit, and the Maximum number of OSTs
+               can be combined into a single filesystem.
+               </para>
                </entry>
              </row>
              <row>
@@ -596,7 +630,11 @@
                  <para> 2000</para>
                </entry>
                <entry>
-                <para>This limit is imposed by the size of the layout that needs to be stored on disk and sent in RPC requests, but is not a hard limit of the protocol.</para>
+                <para>This limit is imposed by the size of the layout that
+               needs to be stored on disk and sent in RPC requests, but is
+               not a hard limit of the protocol. The number of OSTs in the
+               filesystem can exceed the stripe count, but this limits the
+               number of OSTs across which a single file can be striped.</para>
                </entry>
              </row>
              <row>
@@ -607,7 +645,8 @@
                  <para> &lt; 4 GB</para>
                </entry>
                <entry>
-                <para>The amount of data written to each object before moving on to next object.</para>
+                <para>The amount of data written to each object before moving
+               on to the next object.</para>
                </entry>
              </row>
              <row>
@@ -618,19 +657,23 @@
                  <para> 64 KB</para>
                </entry>
                <entry>
-                <para>Due to the 64 KB PAGE_SIZE on some 64-bit machines, the minimum stripe size is set to 64 KB.</para>
+                <para>Due to the 64 KB PAGE_SIZE on some 64-bit machines,
+               the minimum stripe size is set to 64 KB.</para>
                </entry>
              </row>
-            <row>              <entry>
-                <para> Maximum object size</para>              </entry>
+            <row>
+             <entry>
+                <para> Maximum object size</para>
+             </entry>
                <entry>
                  <para> 16TB (ldiskfs), 256TB (ZFS)</para>
                </entry>
                <entry>
-                <para>The amount of data that can be stored in a single object. An object
-                  corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies.  
-                  For ZFS the limit is the size of the underlying OST.
-                  Files can consist of up to 2000 stripes, each stripe can contain the maximum object size. </para>
+                <para>The amount of data that can be stored in a single object.
+               An object corresponds to a stripe. The ldiskfs limit of 16 TB
+               for a single object applies.  For ZFS the limit is the size of
+               the underlying OST.  Files can consist of up to 2000 stripes,
+               and each stripe can be up to the maximum object size.</para>
                </entry>
              </row>
              <row>
@@ -643,10 +686,16 @@
                  <para> 31.25 PB on 64-bit ldiskfs systems, 8EB on 64-bit ZFS systems</para>
                </entry>
                <entry>
-                <para>Individual files have a hard limit of nearly 16 TB on 32-bit systems imposed
-                  by the kernel memory subsystem. On 64-bit systems this limit does not exist.
-                  Hence, files can be 2^63 bits (8EB) in size if the backing filesystem can support large enough objects.</para>
-                <para>A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.</para>
+                <para>Individual files have a hard limit of nearly 16 TB on
+               32-bit systems imposed by the kernel memory subsystem. On
+               64-bit systems this limit does not exist.  Hence, files can
+               be 2^63 bytes (8EB) in size if the backing filesystem can
+               support large enough objects.</para>
+                <para>A single file can have a maximum of 2000 stripes, which
+               gives an upper single file limit of 31.25 PB for 64-bit
+               ldiskfs systems. The actual amount of data that can be stored
+               in a file depends upon the amount of free space in each OST
+               on which the file is striped.</para>
                </entry>
              </row>
              <row>
@@ -742,9 +791,10 @@
          <para condition="l22">In Lustre software releases prior to version 2.2,
         the maximum stripe count for a single file was limited to 160 OSTs.
         In version 2.2, the wide striping feature was added to support files
-       striped over up to 2000 OSTs.  In order to store the layout for
-       such large files, the ldiskfs <literal>ea_inode</literal> feature must
-       be enabled on the MDT.  This feature is disabled by default at
+       striped over up to 2000 OSTs.  In order to store the large layout for
+       such files in ldiskfs, the <literal>ea_inode</literal> feature must
+       be enabled on the MDT, but no similar tunable is needed for ZFS MDTs.
+       The <literal>ea_inode</literal> feature is disabled by default at
         <literal>mkfs.lustre</literal> time. In order to enable this feature,
         specify <literal>--mkfsoptions="-O ea_inode"</literal> at MDT format
         time, or use <literal>tune2fs -O ea_inode</literal> to enable it after
@@ -872,7 +922,7 @@
        </listitem>
      </itemizedlist>
      <para>Lustre networks and routing are configured and managed by specifying parameters to the
-      Lustre networking (<literal>lnet</literal>) module in
+      Lustre Networking (<literal>lnet</literal>) module in
          <literal>/etc/modprobe.d/lustre.conf</literal>.</para>
      <para>To prepare to configure Lustre networking, complete the following steps:</para>
      <orderedlist>
@@ -890,21 +940,32 @@
        <listitem>
          <para><emphasis role="bold">If routing is needed, identify the nodes to be used to route traffic between networks.</emphasis></para>
          <para>If you are using multiple network types, then you will need a router. Any node with
-          appropriate interfaces can route Lustre networking (LNET) traffic between different
+          appropriate interfaces can route Lustre networking (LNet) traffic between different
            network hardware types or topologies --the node may be a server, a client, or a standalone
-          router. LNET can route messages between different network types (such as
+          router. LNet can route messages between different network types (such as
            TCP-to-InfiniBand) or across different topologies (such as bridging two InfiniBand or
            TCP/IP networks). Routing will be configured in <xref linkend="configuringlnet"/>.</para>
        </listitem>
        <listitem>
-        <para><emphasis role="bold">Identify the network interfaces to include in or exclude from LNET. </emphasis>
-    </para>
-        <para>If not explicitly specified, LNET uses either the first available interface or a pre-defined default for a given network type. Interfaces that LNET should not use (such as an administrative network or IP-over-IB), can be excluded.</para>
-        <para>Network interfaces to be used or excluded will be specified using the lnet kernel module parameters networks and <literal>ip2netsas</literal> described in <xref linkend="configuringlnet"/>.</para>
+        <para><emphasis role="bold">Identify the network interfaces to include
+       in or exclude from LNet.</emphasis></para>
+        <para>If not explicitly specified, LNet uses either the first available
+       interface or a pre-defined default for a given network type. Interfaces
+       that LNet should not use (such as an administrative network or
+       IP-over-IB) can be excluded.</para>
+        <para>Network interfaces to be used or excluded will be specified using
+       the lnet kernel module parameters <literal>networks</literal> and
+       <literal>ip2nets</literal> as described in
+       <xref linkend="configuringlnet"/>.</para>
        </listitem>
        <listitem>
-        <para><emphasis role="bold">To ease the setup of networks with complex network configurations, determine a cluster-wide module configuration.</emphasis></para>
-        <para>For large clusters, you can configure the networking setup for all nodes by using a single, unified set of parameters in the <literal>lustre.conf</literal> file on each node. Cluster-wide configuration is described in <xref linkend="configuringlnet"/>.</para>
+        <para><emphasis role="bold">To ease the setup of networks with complex
+       network configurations, determine a cluster-wide module configuration.
+       </emphasis></para>
+        <para>For large clusters, you can configure the networking setup for
+       all nodes by using a single, unified set of parameters in the
+       <literal>lustre.conf</literal> file on each node. Cluster-wide
+       configuration is described in <xref linkend="configuringlnet"/>.</para>
        </listitem>
      </orderedlist>
      <note>