LUDOC-355 dne: Correction to DNE config for remote sub-dirs

diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml
index 7f7fe8c..ac425aa 100644
--- a/SettingUpLustreSystem.xml
+++ b/SettingUpLustreSystem.xml
@@ -17,7 +17,7 @@
      </listitem>
      <listitem>
        <para>
-          <xref linkend="dbdoclet.50438256_84701"/>
+          <xref linkend="dbdoclet.ldiskfs_mkfs_opts"/>
        </para>
      </listitem>
      <listitem>
@@ -43,20 +43,22 @@
      <para>Since the block devices are accessed by only one or two server nodes, a storage area network (SAN) that is accessible from all the servers is not required. Expensive switches are not needed because point-to-point connections between the servers and the storage arrays normally provide the simplest and best attachments. (If failover capability is desired, the storage must be attached to multiple servers.)</para>
      <para>For a production environment, it is preferable that the MGS have separate storage to allow future expansion to multiple file systems. However, it is possible to run the MDS and MGS on the same machine and have them share the same storage device.</para>
      <para>For best performance in a production environment, dedicated clients are required. For a non-production Lustre environment or for testing, a Lustre client and server can run on the same machine. However, dedicated clients are the only supported configuration.</para>
-    <para>Performance and other issues can occur when an MDS or OSS and a client are running on the same machine:</para>
+    <warning><para>Performance and recovery issues can occur if you put a client on an MDS or OSS:</para>
      <itemizedlist>
        <listitem>
-        <para>Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.</para>
+        <para>Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.</para>
        </listitem>
        <listitem>
-        <para>Running the OSS and a client on the same machine can cause issues with low memory and memory pressure. If the client consumes all the memory and then tries to write data to the file system, the OSS will need to allocate pages to receive data from the client but will not be able to perform this operation due to low memory. This can cause the client to hang.</para>
+        <para>Running the MDS and a client on the same machine can cause recovery and deadlock issues and impact the performance of other Lustre clients.</para>
        </listitem>
      </itemizedlist>
+    </warning>
      <para>Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are
        typically used for testing to match expected customer usage and avoid limitations due to the 4
        GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size limit of 32-bit CPUs.
-      Also, due to kernel API limitations, performing backups of Lustre 2.x. file systems on 32-bit
-      clients may cause backup tools to confuse files that have the same 32-bit inode number.</para>
+      Also, due to kernel API limitations, performing backups of Lustre software release 2.x file
+      systems on 32-bit clients may cause backup tools to confuse files that have the same 32-bit
+      inode number.</para>
      <para>The storage attached to the servers typically uses RAID to provide fault tolerance and can
        optionally be organized with logical volume management (LVM), which is then formatted as a
        Lustre file system. Lustre OSS and MDS servers read, write and modify data in the format
@@ -84,49 +86,104 @@
        <para>For maximum performance, the MDT should be configured as RAID1 with an internal journal and two disks from different controllers.</para>
        <para>If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.</para>
        <para>Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure disables an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.</para>
-      <para condition='l24'>If multiple MDTs are going to be present in the system, each MDT should be specified for the anticipated usage and load.</para>
-      <warning condition='l24'><para>MDT0 contains the root of the Lustre file system. If MDT0 is unavailable for any reason, the
-          file system cannot be used.</para></warning>
-      <note condition='l24'><para>Additional MDTs can be dedicated to sub-directories off the root file system provided by MDT0.
-          Subsequent directories may also be configured to have their own MDT. If an MDT serving a
-          subdirectory becomes unavailable this subdirectory and all directories beneath it will
-          also become unavailable. Configuring multiple levels of MDTs is an experimental feature
-          for the Lustre 2.4 release.</para></note>
+      <para condition='l24'>If multiple MDTs are going to be present in the
+      system, each MDT should be specified for the anticipated usage and load.
+      For details on how to add additional MDTs to the filesystem, see
+      <xref linkend="dbdoclet.addingamdt"/>.</para>
+      <warning condition='l24'><para>MDT0 contains the root of the Lustre file
+      system. If MDT0 is unavailable for any reason, the file system cannot be
+      used.</para></warning>
+      <note condition='l24'><para>Using the DNE feature, it is possible to
+      dedicate additional MDTs to sub-directories off the file system root
+      directory stored on MDT0, or arbitrarily for lower-level subdirectories,
+      using the <literal>lfs mkdir -i <replaceable>mdt_index</replaceable></literal> command.
+      If an MDT serving a subdirectory becomes unavailable, any subdirectories
+      on that MDT and all directories beneath it will also become inaccessible.
+      Configuring multiple levels of MDTs is an experimental feature for the
+      2.4 release, and is fully functional in the 2.8 release.  This is
+      typically useful for top-level directories to assign different users
+      or projects to separate MDTs, or to distribute other large working sets
+      of files to multiple MDTs.</para></note>
+      <note condition='l28'><para>Starting in the 2.8 release it is possible
+      to spread a single large directory across multiple MDTs using the DNE
+      striped directory feature by specifying multiple stripes (or shards)
+      at creation time using the
+      <literal>lfs mkdir -c <replaceable>stripe_count</replaceable></literal>
+      command, where <replaceable>stripe_count</replaceable> is often the
+      number of MDTs in the filesystem.  Striped directories should typically
+      not be used for all directories in the filesystem, since striping incurs
+      extra overhead compared to non-striped directories, but they are useful
+      for larger directories (over 50k entries) where many output files are
+      being created at one time.
+      </para></note>
      </section>
      <section remap="h3">
        <title><indexterm><primary>setup</primary><secondary>OST</secondary></indexterm>OST Storage Hardware Considerations</title>
-      <para>The data access pattern for the OSS storage is a streaming I/O pattern that is dependent on the access patterns of applications being used. Each OSS can manage multiple object storage targets (OSTs), one for each volume with I/O traffic load-balanced between servers and targets. An OSS should be configured to have a balance between the network bandwidth and the attached storage bandwidth to prevent bottlenecks in the I/O path. Depending on the server hardware, an OSS typically serves between 2 and 8 targets, with each target up to 128 terabytes (TBs) in size.</para>
-      <para>Lustre file system capacity is the sum of the capacities provided by the targets. For
-        example, 64 OSSs, each with two 8 TB targets, provide a file system with a capacity of
-        nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks in a
-        RAID 6 configuration), it may be possible to get 50 MB/sec from each drive, providing up to
-        400 MB/sec of disk bandwidth per OST. If this system is used as storage backend with a
-        system network, such as the InfiniBand network, that provides a similar bandwidth, then each
-        OSS could provide 800 MB/sec of end-to-end I/O throughput. (Although the architectural
-        constraints described here are simple, in practice it takes careful hardware selection,
-        benchmarking and integration to obtain such results.)</para>
+      <para>The data access pattern for the OSS storage is a streaming I/O
+      pattern that is dependent on the access patterns of applications being
+      used. Each OSS can manage multiple object storage targets (OSTs), one
+      for each volume with I/O traffic load-balanced between servers and
+      targets. An OSS should be configured to have a balance between the
+      network bandwidth and the attached storage bandwidth to prevent
+      bottlenecks in the I/O path. Depending on the server hardware, an OSS
+      typically serves between 2 and 8 targets. Each target is usually
+      24 to 48 TB in size, but may be up to 256 terabytes (TB).</para>
+      <para>Lustre file system capacity is the sum of the capacities provided
+      by the targets. For example, 64 OSSs, each with two 8 TB OSTs,
+      provide a file system with a capacity of nearly 1 PB. If each OST uses
+      ten 1 TB SATA disks (8 data disks plus 2 parity disks in a RAID-6
+      configuration), it may be possible to get 50 MB/sec from each drive,
+      providing up to 400 MB/sec of disk bandwidth per OST. If this system
+      is used as storage backend with a system network, such as the InfiniBand
+      network, that provides a similar bandwidth, then each OSS could provide
+      800 MB/sec of end-to-end I/O throughput. (Although the architectural
+      constraints described here are simple, in practice it takes careful
+      hardware selection, benchmarking and integration to obtain such
+      results.)</para>
      </section>
    </section>
    <section xml:id="dbdoclet.50438256_31079">
        <title><indexterm><primary>setup</primary><secondary>space</secondary></indexterm>
            <indexterm><primary>space</primary><secondary>determining requirements</secondary></indexterm>
            Determining Space Requirements</title>
-    <para>The desired performance characteristics of the backing file systems on the MDT and OSTs
-      are independent of one another. The size of the MDT backing file system depends on the number
-      of inodes needed in the total Lustre file system, while the aggregate OST space depends on the
-      total amount of data stored on the file system. If MGS data is to be stored on the MDT device
-      (co-located MGT and MDT), add 100 MB to the required size estimate for the MDT.</para>
-    <para>Each time a file is created on a Lustre file system, it consumes one inode on the MDT and one inode for each OST object over which the file is striped. Normally, each file&apos;s stripe count is based on the system-wide default stripe count. However, this can be changed for individual files using the <literal>lfs setstripe</literal> option. For more details, see <xref linkend="managingstripingfreespace"/>.</para>
-    <para>In a Lustre ldiskfs file system, all the inodes are allocated on the MDT and OSTs when the file system is first formatted. The total number of inodes on a formatted MDT or OST cannot be easily changed, although it is possible to add OSTs with additional space and corresponding inodes. Thus, the number of inodes created at format time should be generous enough to anticipate future expansion.</para>
-    <para>When the file system is in use and a file is created, the metadata associated with that file is stored in one of the pre-allocated inodes and does not consume any of the free space used to store file data.</para>
-    <note>
-      <para>By default, the ldiskfs file system used by Lustre servers to store user-data objects
-        and system data reserves 5% of space that cannot be used by the Lustre file system.
-        Additionally, a Lustre file system reserves up to 400 MB on each OST for journal use and a
-        small amount of space outside the journal to store accounting data. This reserved space is
-        unusable for general storage. Thus, at least 400 MB of space is used on each OST before any
-        file object data is saved.</para>
-    </note>
+    <para>The desired performance characteristics of the backing file systems
+    on the MDT and OSTs are independent of one another. The size of the MDT
+    backing file system depends on the number of inodes needed in the total
+    Lustre file system, while the aggregate OST space depends on the total
+    amount of data stored on the file system. If MGS data is to be stored
+    on the MDT device (co-located MGT and MDT), add 100 MB to the required
+    size estimate for the MDT.</para>
+    <para>Each time a file is created on a Lustre file system, it consumes
+    one inode on the MDT and one object on each OST over which the file is
+    striped.  Normally, each file&apos;s stripe count is based on the system-wide
+    default stripe count.  However, this can be changed for individual files
+    using the <literal>lfs setstripe</literal> command. For more details,
+    see <xref linkend="managingstripingfreespace"/>.</para>
+    <para>In a Lustre ldiskfs file system, all the MDT inodes and OST
+    objects are allocated when the file system is first formatted.  When
+    the file system is in use and a file is created, metadata associated
+    with that file is stored in one of the pre-allocated inodes and does
+    not consume any of the free space used to store file data.  The total
+    number of inodes on a formatted ldiskfs MDT or OST cannot be easily
+    changed. Thus, the number of inodes created at format time should be
+    generous enough to cover expected near-term usage, with some room
+    for growth without the effort of adding storage.</para>
+    <para>By default, the ldiskfs file system used by Lustre servers to store
+    user-data objects and system data reserves 5% of space that cannot be used
+    by the Lustre file system.  Additionally, a Lustre file system reserves up
+    to 400 MB on each OST, and up to 4 GB on each MDT, for journal use and a
+    small amount of space outside the journal to store accounting data. This
+    reserved space is unusable for general storage. Thus, at least this much
+    space will be used on each OST before any file object data is saved.</para>
+    <para condition="l24">With a ZFS backing filesystem for the MDT or OST,
+    the space allocation for inodes and file data is dynamic, and inodes are
+    allocated as needed.  A minimum of 2 KB of usable space (before mirroring)
+    is needed for each inode, exclusive of other overhead such as directories,
+    internal log files, extended attributes, ACLs, etc.
+    Since the size of extended attributes and ACLs is highly dependent on
+    kernel versions and site-specific policies, it is best to over-estimate
+    the amount of space needed for the desired number of inodes, and any
+    excess space will be utilized to store more inodes.</para>
      <section>
        <title><indexterm>
            <primary>setup</primary>
@@ -136,8 +193,9 @@
            <primary>space</primary>
            <secondary>determining MGT requirements</secondary>
          </indexterm> Determining MGT Space Requirements</title>
-      <para>Less than 100 MB of space is required for the MGT. The size is determined by the number
-        of servers in the Lustre cluster(s) that are managed by the MGS.</para>
+      <para>Less than 100 MB of space is required for the MGT. The size
+      is determined by the number of servers in the Lustre file system
+      cluster(s) that are managed by the MGS.</para>
      </section>
      <section xml:id="dbdoclet.50438256_87676">
          <title><indexterm>
@@ -148,20 +206,54 @@
            <primary>space</primary>
            <secondary>determining MDT requirements</secondary>
          </indexterm> Determining MDT Space Requirements</title>
-      <para>When calculating the MDT size, the important factor to consider is the number of files to be stored in the file system. This determines the number of inodes needed, which drives the MDT sizing. To be on the safe side, plan for 2 KB per inode on the MDT, which is the default value. Attached storage required for Lustre metadata is typically 1-2 percent of the file system capacity depending upon file size.</para>
-      <para>For example, if the average file size is 5 MB and you have 100 TB of usable OST space, then you can calculate the minimum number of inodes as follows:</para>
+      <para>When calculating the MDT size, the important factor to consider
+      is the number of files to be stored in the file system. This determines
+      the number of inodes needed, which drives the MDT sizing. To be on the
+      safe side, plan for 2 KB per ldiskfs inode on the MDT, which is the
+      default value. Attached storage required for Lustre file system metadata
+      is typically 1-2 percent of the file system capacity depending upon
+      file size.</para>
+      <note condition='l24'><para>Starting in release 2.4, the DNE remote
+      directory feature makes it possible to increase the metadata
+      capacity of a single filesystem by configuring additional MDTs into
+      the filesystem. See <xref linkend="dbdoclet.addingamdt"/> for details.
+      </para></note>
+      <para>For example, if the average file size is 5 MB and you have
+      100 TB of usable OST space, then you can calculate the minimum number
+      of inodes as follows:</para>
        <informalexample>
          <para>(100 TB * 1024 GB/TB * 1024 MB/GB) / 5 MB/inode = 20 million inodes</para>
        </informalexample>
-      <para>We recommend that you use at least twice the minimum number of inodes to allow for future expansion and allow for an average file size smaller than expected. Thus, the required space is:</para>
+      <para>For details about formatting options for MDT and OST file systems,
+      see <xref linkend="dbdoclet.ldiskfs_mdt_mkfs"/>.</para>
+      <para>It is recommended that the MDT have at least twice the minimum
+      number of inodes to allow for future expansion and allow for an average
+      file size smaller than expected. Thus, the required space is:</para>
        <informalexample>
-        <para>2 KB/inode * 40 million inodes = 80 GB</para>
+        <para>2 KB/inode x 20 million inodes x 2 = 80 GB</para>
        </informalexample>
-      <para>If the average file size is small, 4 KB for example, the Lustre file system is not very
-        efficient as the MDT uses as much space as the OSTs. However, this is not a common
-        configuration for a Lustre environment.</para>
        <note>
-        <para>If the MDT is too small, this can cause all the space on the OSTs to be unusable. Be sure to determine the appropriate size of the MDT needed to support the file system before formatting the file system. It is difficult to increase the number of inodes after the file system is formatted.</para>
+        <para>If the average file size is very small, 4 KB for example, the
+        Lustre file system is not very efficient as the MDT will use as much
+        space for each file as the space used on the OST. However, this is not
+        a common configuration for a Lustre environment.</para>
+      </note>
+      <note>
+        <para>If the MDT has too few inodes, this can cause the space on the
+       OSTs to be inaccessible since no new files can be created. Be sure to
+        determine the appropriate size of the MDT needed to support the file
+        system before formatting the file system. It is possible to increase the
+        number of inodes after the file system is formatted, depending on the
+        storage.  For ldiskfs MDT filesystems, the <literal>resize2fs</literal>
+        tool can be used if the underlying block device is on an LVM logical
+        volume.  For ZFS, new (mirrored) VDEVs can be added to the MDT pool.
+        Inodes will be added approximately in proportion to space added.</para>
+      </note>
+      <note condition='l24'><para>It is also possible to increase the number
+        of inodes available, as well as the aggregate metadata
+        performance, by adding MDTs using the DNE remote directory
+        feature available in Lustre release 2.4 and later. See
+        <xref linkend="dbdoclet.addingamdt"/>.</para>
        </note>
      </section>
      <section remap="h3">
@@ -173,22 +265,33 @@
            <primary>space</primary>
            <secondary>determining OST requirements</secondary>
          </indexterm> Determining OST Space Requirements</title>
-      <para>For the OST, the amount of space taken by each object depends on the usage pattern of
-        the users/applications running on the system. The Lustre software defaults to a conservative
-        estimate for the object size (16 KB per object). If you are confident that the average file
-        size for your applications will be larger than this, you can specify a larger average file
-        size (fewer total inodes) to reduce file system overhead and minimize file system check
-        time. See <xref linkend="dbdoclet.50438256_53886"/> for more details.</para>
+      <para>For the OST, the amount of space taken by each object depends on
+      the usage pattern of the users/applications running on the system. The
+      Lustre software defaults to a conservative estimate for the average
+      object size (between 64 KB per object for 10 GB OSTs, and 1 MB per object
+      for 16 TB and larger OSTs). If you are confident that the average file
+      size for your applications will be larger than this, you can specify a
+      larger average file size (fewer total inodes for a given OST size) to
+      reduce file system overhead and minimize file system check time.
+      See <xref linkend="dbdoclet.ldiskfs_ost_mkfs"/> for more details.</para>
      </section>
    </section>
-  <section xml:id="dbdoclet.50438256_84701">
-      <title>
-          <indexterm><primary>file system</primary><secondary>formatting options</secondary></indexterm>
-          <indexterm><primary>setup</primary><secondary>file system</secondary></indexterm>
-          Setting File System Formatting Options</title>
-    <para>By default, the <literal>mkfs.lustre</literal> utility applies these options to the Lustre
-      backing file system used to store data and metadata in order to enhance Lustre file system
-      performance and scalability. These options include:</para>
+  <section xml:id="dbdoclet.ldiskfs_mkfs_opts">
+    <title>
+      <indexterm>
+        <primary>ldiskfs</primary>
+       <secondary>formatting options</secondary>
+      </indexterm>
+      <indexterm>
+        <primary>setup</primary>
+       <secondary>ldiskfs</secondary>
+      </indexterm>
+      Setting ldiskfs File System Formatting Options
+    </title>
+    <para>By default, the <literal>mkfs.lustre</literal> utility applies these
+    options to the Lustre backing file system used to store data and metadata
+    in order to enhance Lustre file system performance and scalability. These
+    options include:</para>
          <itemizedlist>
              <listitem>
                <para><literal>flex_bg</literal> - When the flag is set to enable this
@@ -215,38 +318,67 @@
      <screen>--mkfsoptions=&apos;backing fs options&apos;</screen>
      <para>For other <literal>mkfs.lustre</literal> options, see the Linux man page for
          <literal>mke2fs(8)</literal>.</para>
-    <section xml:id="dbdoclet.50438256_pgfId-1293228">
+    <section xml:id="dbdoclet.ldiskfs_mdt_mkfs">
        <title><indexterm>
            <primary>inodes</primary>
            <secondary>MDS</secondary>
          </indexterm><indexterm>
            <primary>setup</primary>
            <secondary>inodes</secondary>
-        </indexterm>Setting Formatting Options for an MDT</title>
-      <para>The number of inodes on the MDT is determined at format time based on the total size of
-        the file system to be created. The default <emphasis role="italic"
-          >bytes-per-inode</emphasis> ratio ("inode ratio") for an MDT is optimized at one inode for
-        every 2048 bytes of file system space. It is recommended that this value not be changed for
-        MDTs.</para>
-      <para>This setting takes into account the space needed for additional metadata, such as the
-        journal (up to 400 MB), bitmaps and directories, and a few files that the Lustre file system
-        uses to maintain cluster consistency.</para>
+        </indexterm>Setting Formatting Options for an ldiskfs MDT</title>
+      <para>The number of inodes on the MDT is determined at format time
+      based on the total size of the file system to be created. The default
+      <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
+      for an MDT is optimized at one inode for every 2048 bytes of file
+      system space. It is recommended that this value not be changed for
+      MDTs.</para>
+      <para>This setting takes into account the space needed for additional
+      ldiskfs filesystem-wide metadata, such as the journal (up to 4 GB),
+      bitmaps, and directories, as well as files that Lustre uses internally
+      to maintain cluster consistency.  There is additional per-file metadata
+      such as file layout for files with a large number of stripes, Access
+      Control Lists (ACLs), and user extended attributes.</para>
+      <para> It is possible to reserve less than the recommended 2048 bytes
+      per inode for an ldiskfs MDT when it is first formatted by adding the
+      <literal>--mkfsoptions="-i bytes-per-inode"</literal> option to
+      <literal>mkfs.lustre</literal>.  Decreasing the inode ratio tunable
+      <literal>bytes-per-inode</literal> will create more inodes for a given
+      MDT size, but will leave less space for extra per-file metadata.  The
+      inode ratio must always be strictly larger than the MDT inode size,
+      which is 512 bytes by default.  It is recommended to use an inode ratio
+      at least 512 bytes larger than the inode size to ensure the MDT does
+      not run out of space.</para>
+      <para>The size of the inode may be changed by adding the
+      <literal>--stripe-count-hint=N</literal> option to have
+      <literal>mkfs.lustre</literal> automatically calculate a reasonable
+      inode size based on the default stripe count that will be used by the
+      filesystem, or directly by specifying the
+      <literal>--mkfsoptions="-I inode-size"</literal> option.  Increasing
+      the inode size will provide more space in the inode for a larger Lustre
+      file layout, ACLs, user and system extended attributes, SELinux and
+      other security labels, and other internal metadata.  However, if these
+      features or other in-inode xattrs are not needed, the larger inode size
+      will hurt metadata performance as 2x, 4x, or 8x as much data would be
+      read or written for each MDT inode access.
+      </para>
      </section>
-    <section xml:id="dbdoclet.50438256_53886">
+    <section xml:id="dbdoclet.ldiskfs_ost_mkfs">
        <title><indexterm>
            <primary>inodes</primary>
            <secondary>OST</secondary>
-        </indexterm>Setting Formatting Options for an OST</title>
-      <para>When formatting OST file systems, it is normally advantageous to take local file system
-        usage into account. When doing so, try to minimize the number of inodes on each OST, while
-        keeping enough margin for potential variations in future usage. This helps reduce the format
-        and file system check time and makes more space available for data.</para>
-      <para>The table below shows the default <emphasis role="italic">bytes-per-inode
-        </emphasis>ratio ("inode ratio") used for OSTs of various sizes when they are formatted. </para>
+        </indexterm>Setting Formatting Options for an ldiskfs OST</title>
+      <para>When formatting an OST file system, it can be beneficial
+      to take local file system usage into account. When doing so, try to
+      reduce the number of inodes on each OST, while keeping enough margin
+      for potential variations in future usage. This helps reduce the format
+      and file system check time and makes more space available for data.</para>
+      <para>The table below shows the default
+      <emphasis role="italic">bytes-per-inode</emphasis> ratio ("inode ratio")
+      used for OSTs of various sizes when they are formatted.</para>
        <para>
          <table frame="all">
-          <title xml:id="settinguplustresystem.tab1">Inode Ratios Used for Newly Formatted
-            OSTs</title>
+          <title xml:id="settinguplustresystem.tab1">Default Inode Ratios
+         Used for Newly Formatted OSTs</title>
            <tgroup cols="3">
              <colspec colname="c1" colwidth="3*"/>
              <colspec colname="c2" colwidth="2*"/>
@@ -257,7 +389,7 @@
                    <para><emphasis role="bold">LUN/OST size</emphasis></para>
                  </entry>
                  <entry>
-                  <para><emphasis role="bold">Inode ratio</emphasis></para>
+                  <para><emphasis role="bold">Default Inode ratio</emphasis></para>
                  </entry>
                  <entry>
                    <para><emphasis role="bold">Total inodes</emphasis></para>
@@ -313,20 +445,37 @@
            </tgroup>
          </table>
        </para>
-      <para>In environments with few small files, the default inode ratio may result in far too many
-        inodes for the average file size. In this case, performance can be improved by increasing
-        the number of <emphasis role="italic">bytes-per-inode</emphasis>.To set the inode ratio, use
-        the <literal>-i</literal> argument to <literal>mkfs.lustre</literal> to specify the
-          <emphasis role="italic">bytes-per-inode</emphasis> value. </para>
+      <para>In environments with few small files, the default inode ratio
+      may result in far too many inodes for the average file size. In this
+      case, performance can be improved by increasing the number of
+      <emphasis role="italic">bytes-per-inode</emphasis>.  To set the inode
+      ratio, use the <literal>--mkfsoptions="-i <replaceable>bytes-per-inode</replaceable>"</literal>
+      argument to <literal>mkfs.lustre</literal> to specify the expected
+      average (mean) size of OST objects.  For example, to create an OST
+      with an expected average object size of 8 MB, run:
+      <screen>[oss#] mkfs.lustre --ost --mkfsoptions=&quot;-i $((8192 * 1024))&quot; ...</screen>
+      </para>
        <note>
-        <para>File system check time on OSTs is affected by a number of  variables in addition to
-          the number of inodes, including the size of the file system, the number of allocated
-          blocks, the distribution of allocated blocks on the disk, disk speed, CPU speed, and the
-          amount of RAM on the server. Reasonable file system check times are 5-30 minutes per
-          TB.</para>
+        <para>OSTs formatted with ldiskfs are limited to a maximum of
+       320 million to 1 billion objects.  Specifying a very small
+       bytes-per-inode ratio for a large OST that causes this limit to be
+       exceeded can either cause premature out-of-space errors and prevent
+       the full OST space from being used, or waste space and slow down
+       e2fsck more than necessary.  The default inode ratios are chosen to
+       ensure that the total number of inodes remains below this limit.
+       </para>
        </note>
-      <para>For more details about formatting MDT and OST file systems, see <xref
-          linkend="dbdoclet.50438208_51921"/>.</para>
+      <note>
+        <para>File system check time on OSTs is affected by a number of
+       variables in addition to the number of inodes, including the size of
+       the file system, the number of allocated blocks, the distribution of
+       allocated blocks on the disk, disk speed, CPU speed, and the amount
+       of RAM on the server. Reasonable file system check times for valid
+       filesystems are 5-30 minutes per TB, but may increase significantly
+       if substantial errors are detected and need to be repaired.</para>
+      </note>
+      <para>For more details about formatting MDT and OST file systems,
+      see <xref linkend="dbdoclet.ldiskfs_raid_opts"/>.</para>
      </section>
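The inode-ratio arithmetic described above can be sketched in shell. This is a minimal illustration, not part of the Lustre tools: the helper names `bytes_per_inode` and `estimated_inodes` are hypothetical, sizes are assumed to be binary units (1MB = 1024 * 1024 bytes, as in the `$((8192 * 1024))` example), and the inode estimate ignores file system overhead.

```shell
#!/bin/sh
# Hypothetical helper: convert an expected mean OST object size in MB
# (binary units assumed) into a bytes-per-inode value suitable for
# --mkfsoptions="-i <bytes-per-inode>".
bytes_per_inode() {
    echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical helper: rough inode count for an OST of the given size
# in TB (binary units assumed) at the given bytes-per-inode ratio.
# Ignores metadata overhead, so treat the result as an upper estimate.
estimated_inodes() {
    ost_tb=$1
    ratio=$2
    echo $(( ost_tb * 1024 * 1024 * 1024 * 1024 / ratio ))
}

ratio=$(bytes_per_inode 8)      # same value as $((8192 * 1024))
echo "bytes-per-inode: $ratio"
echo "estimated inodes on a 128TB OST: $(estimated_inodes 128 "$ratio")"
```

Comparing the estimated inode count with the ldiskfs object limit mentioned in the note is one way to sanity-check a candidate ratio before formatting.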
      <section remap="h3">
        <title><indexterm>
@@ -339,16 +488,22 @@
            <secondary><emphasis role="italic">See</emphasis> wide striping</secondary>
          </indexterm><indexterm>
            <primary>large_xattr</primary>
+          <secondary>ea_inode</secondary>
          </indexterm><indexterm>
            <primary>wide striping</primary>
            <secondary>large_xattr</secondary>
+          <tertiary>ea_inode</tertiary>
          </indexterm>File and File System Limits</title>
-      <para><xref linkend="settinguplustresystem.tab2"/> describes file and file system size limits.
-        These limits are imposed by either the Lustre architecture or the Linux virtual file system
-        (VFS) and virtual memory subsystems. In a few cases, a limit is defined within the code and
-        can be changed by re-compiling the Lustre software (see <xref
-          linkend="installinglustrefromsourcecode"/>). In these cases, the indicated limit was used
-        for testing of the Lustre software. </para>
+
+        <para><xref linkend="settinguplustresystem.tab2"/> describes
+     file and file system size limits.  These limits are imposed by either
+     the Lustre architecture or the Linux virtual file system (VFS) and
+     virtual memory subsystems. In a few cases, a limit is defined within
+     the code and can be changed by re-compiling the Lustre software. In
+     these cases, the indicated limit was used for testing of the Lustre
+     software. Instructions for installing from source code are beyond the
+     scope of this document and can be found elsewhere online. </para>
+
        <table frame="all">
          <title xml:id="settinguplustresystem.tab2">File and file system limits</title>
          <tgroup cols="3">
@@ -378,12 +533,12 @@
                  <para condition='l24'>4096</para>
                </entry>
                <entry>
-                <para>The Lustre 2.3 release and earlier allows a maximum of 1 MDT per file system,
-                  but a single MDS can host multiple MDTs, each one for a separate file
+                <para>The Lustre software release 2.3 and earlier allows a maximum of 1 MDT per file
+                  system, but a single MDS can host multiple MDTs, each one for a separate file
                    system.</para>
-                <para condition="l24">The Lustre 2.4 release and later requires one MDT for the
-                  root. Upto 4095 additional MDTs can be added to the file system and attached into
-                  the namespace with remote directories.</para>
+                <para condition="l24">The Lustre software release 2.4 and later requires one MDT for
+                  the filesystem root. Up to 4095 additional MDTs can be added to the file system and attached
+                  into the namespace with remote directories.</para>
                </entry>
              </row>
              <row>
@@ -403,11 +558,11 @@
                  <para> Maximum OST size</para>
                </entry>
                <entry>
-                <para> 128TB </para>
+                <para> 128TB (ldiskfs), 256TB (ZFS)</para>
                </entry>
                <entry>
                  <para>This is not a <emphasis>hard</emphasis> limit. Larger OSTs are possible but
-                  today typical production systems do not go beyond 128TB per OST. </para>
+                  today typical production systems do not go beyond the stated limit per OST. </para>
                </entry>
              </row>
              <row>
@@ -418,7 +573,7 @@
                  <para> 131072</para>
                </entry>
                <entry>
-                <para>The maximum number of clients is a constant that can be changed at compile time.</para>
+                <para>The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production.</para>
                </entry>
              </row>
              <row>
@@ -426,10 +581,10 @@
                  <para> Maximum size of a file system</para>
                </entry>
                <entry>
-                <para> 512 PB</para>
+                <para> 512 PB (ldiskfs), 1EB (ZFS)</para>
                </entry>
                <entry>
-                <para>Each OST or MDT on 64-bit kernel servers can have a file system up to 128 TB. On 32-bit systems, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST on 32-bit kernel servers.</para>
+                <para>Each OST or MDT on 64-bit kernel servers can have a file system up to the above limit. On 32-bit systems, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of an OST on 32-bit kernel servers.</para>
                  <para>You can have multiple OST file systems on a single OSS node.</para>
                </entry>
              </row>
@@ -469,12 +624,13 @@
              <row>              <entry>
                  <para> Maximum object size</para>              </entry>
                <entry>
-                <para> 16 TB</para>
+                <para> 16TB (ldiskfs), 256TB (ZFS)</para>
                </entry>
                <entry>
                  <para>The amount of data that can be stored in a single object. An object
-                  corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies.
-                  Files can consist of up to 2000 stripes, each 16 TB in size. </para>
+                  corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies.
+                  For ZFS the limit is the size of the underlying OST.
+                  Files can consist of up to 2000 stripes, each of which can hold up to the maximum object size. </para>
                </entry>
              </row>
              <row>
@@ -484,14 +640,13 @@
                <entry>
                  <para> 16 TB on 32-bit systems</para>
                  <para>&#160;</para>
-                <para> 31.25 PB on 64-bit systems</para>
+                <para> 31.25 PB on 64-bit ldiskfs systems, 8EB on 64-bit ZFS systems</para>
                </entry>
                <entry>
                  <para>Individual files have a hard limit of nearly 16 TB on 32-bit systems imposed
                    by the kernel memory subsystem. On 64-bit systems this limit does not exist.
-                  Hence, files can be 64-bits in size. An additional size limit of up to the number
-                  of stripes is imposed, where each stripe is 16 TB.</para>
-                <para>A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.</para>
+                  Hence, files can be 2^63 bytes (8EB) in size if the backing filesystem can support large enough objects.</para>
+                <para>A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.</para>
                </entry>
              </row>
              <row>
@@ -499,14 +654,20 @@
                  <para> Maximum number of files or subdirectories in a single directory</para>
                </entry>
                <entry>
-                <para> 10 million files</para>
+                <para> 10 million files (ldiskfs), 2^48 (ZFS)</para>
                </entry>
                <entry>
-                <para>The Lustre software uses the ldiskfs hashed directory code, which has a limit
-                  of about 10 million files depending on the length of the file name. The limit on
-                  subdirectories is the same as the limit on regular files.</para>
-                <para>Lustre file systems are tested with ten million files in a single
-                  directory.</para>
+                <para>The Lustre software uses the ldiskfs hashed directory
+                code, which has a limit of about 10 million files, depending
+                on the length of the file name. The limit on subdirectories
+                is the same as the limit on regular files.</para>
+                <note condition='l28'><para>Starting in the 2.8 release it is
+                possible to exceed this limit by striping a single directory
+                over multiple MDTs with the <literal>lfs mkdir -c</literal>
+                command, which increases the single directory limit by a
+                factor of the number of directory stripes used.</para></note>
+                <para>Lustre file systems are tested with ten million files
+                in a single directory.</para>
                </entry>
              </row>
              <row>
@@ -514,15 +675,25 @@
                  <para> Maximum number of files in the file system</para>
                </entry>
                <entry>
-                <para> 4 billion</para>
-                <para condition='l24'>4096 * 4 billion</para>
+                <para> 4 billion (ldiskfs), 256 trillion (ZFS)</para>
+                <para condition='l24'>up to 256 times the per-MDT limit</para>
                </entry>
                <entry>
-                <para>The ldiskfs file system imposes an upper limit of 4 billion inodes. By default, the MDS file system is formatted with 2KB of space per inode, meaning 1 billion inodes per file system of 2 TB.</para>
-                <para>This can be increased initially, at the time of MDS file system creation. For more information, see <xref linkend="settinguplustresystem"/>.</para>
-                               <para condition="l24">Each additional MDT can hold up to 4 billion additional files, depending
-                  on available inodes and the distribution directories and files in the file
-                  system.</para>
+                <para>The ldiskfs filesystem imposes an upper limit of
+                4 billion inodes per filesystem. By default, the MDT
+                filesystem is formatted with one inode per 2KB of space,
+                meaning 512 million inodes per TB of MDT space. This can be
+                increased initially at the time of MDT filesystem creation.
+                For more information, see
+                <xref linkend="settinguplustresystem"/>.</para>
+                <para condition="l24">The ZFS filesystem
+                dynamically allocates inodes and does not have a fixed ratio
+                of inodes per unit of MDT space, but consumes approximately
+                4KB of space per inode, depending on the configuration.</para>
+                <para condition="l24">Each additional MDT can hold up to the
+                above maximum number of additional files, depending on
+                available space and the distribution of directories and files
+                in the filesystem.</para>
                </entry>
              </row>
              <row>
@@ -533,7 +704,8 @@
                  <para> 255 bytes (filename)</para>
                </entry>
                <entry>
-                <para>This limit is 255 bytes for a single filename, the same as in an ldiskfs file system.</para>
+                <para>This limit is 255 bytes for a single filename, the
+                same as the limit in the underlying filesystems.</para>
                </entry>
              </row>
              <row>
@@ -552,7 +724,7 @@
                  <para> Maximum number of open files for a Lustre file system</para>
                </entry>
                <entry>
-                <para> None</para>
+                <para> No limit</para>
                </entry>
                <entry>
                  <para>The Lustre software does not impose a maximum for the number of open files,
@@ -567,13 +739,19 @@
        </table>
        <para>&#160;</para>
        <note>
-        <para condition="l22">In Lustre releases prior to 2.2, the maximum stripe count for a single
-          file was limited to 160 OSTs. In Lustre release 2.2, the large <literal>xattr</literal>
-          feature ("wide striping") was added to support up to 2000 OSTs. This feature is disabled
-          by default at <literal>mkfs.lustre</literal> time. In order to enable this feature, set
-          the "<literal>-O large_xattr</literal>" option on the MDT either by using
-            <literal>--mkfsoptions</literal> at format time or by using
-          <literal>tune2fs</literal>.</para>
+        <para condition="l22">In Lustre software releases prior to version 2.2,
+       the maximum stripe count for a single file was limited to 160 OSTs.
+       In version 2.2, the wide striping feature was added to support files
+       striped over up to 2000 OSTs.  In order to store the layout for
+       such large files, the ldiskfs <literal>ea_inode</literal> feature must
+       be enabled on the MDT.  This feature is disabled by default at
+       <literal>mkfs.lustre</literal> time. In order to enable this feature,
+       specify <literal>--mkfsoptions="-O ea_inode"</literal> at MDT format
+       time, or use <literal>tune2fs -O ea_inode</literal> to enable it after
+       the MDT has been formatted.  Using either the deprecated
+       <literal>large_xattr</literal> or preferred <literal>ea_inode</literal>
+       feature name results in <literal>ea_inode</literal> being shown in
+       the file system feature list.</para>
        </note>
      </section>
    </section>
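Two of the figures in the limits table follow from simple arithmetic, sketched below. The helper names `max_file_size_pb` and `mdt_inodes_per_tb` are hypothetical, and the calculation assumes binary units (1 PB = 1024 TB, 1 TB = 2^40 bytes), which is what makes 2000 stripes of 16 TB come out to 31.25 PB.

```shell
#!/bin/sh
# Hypothetical helper: maximum file size in PB, given a stripe count
# and a per-object size limit in TB (1 PB = 1024 TB assumed).
max_file_size_pb() {
    awk -v stripes="$1" -v obj_tb="$2" \
        'BEGIN { printf "%.2f\n", stripes * obj_tb / 1024 }'
}

# Hypothetical helper: inodes per TB of MDT space for a given
# bytes-per-inode ratio (1 TB = 2^40 bytes assumed).
mdt_inodes_per_tb() {
    echo $(( 1024 * 1024 * 1024 * 1024 / $1 ))
}

max_file_size_pb 2000 16     # ldiskfs: 2000 stripes of 16 TB each
mdt_inodes_per_tb 2048       # default MDT ratio of one inode per 2KB
```

The second result is 2^29 inodes, which the table rounds to 512 million per TB of MDT space.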