-<?xml version='1.0' encoding='UTF-8'?>
-<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
- xml:lang="en-US" xml:id="understandinglustre">
- <title xml:id="understandinglustre.title">Understanding Lustre Architecture</title>
- <para>This chapter describes the Lustre architecture and features of the Lustre file system. It
- includes the following sections:</para>
+<?xml version='1.0' encoding='utf-8'?>
+<chapter xmlns="http://docbook.org/ns/docbook"
+ xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
+ xml:id="understandinglustre">
+ <title xml:id="understandinglustre.title">Understanding Lustre
+ Architecture</title>
+ <para>This chapter describes the Lustre architecture and features of the
+ Lustre file system. It includes the following sections:</para>
<itemizedlist>
<listitem>
<para>
- <xref linkend="understandinglustre.whatislustre"/>
+ <xref linkend="understandinglustre.whatislustre" />
</para>
</listitem>
<listitem>
<para>
- <xref linkend="understandinglustre.components"/>
+ <xref linkend="understandinglustre.components" />
</para>
</listitem>
<listitem>
<para>
- <xref linkend="understandinglustre.storageio"/>
+ <xref linkend="understandinglustre.storageio" />
</para>
</listitem>
</itemizedlist>
<section xml:id="understandinglustre.whatislustre">
- <title><indexterm>
- <primary>Lustre</primary>
- </indexterm>What a Lustre File System Is (and What It Isn't)</title>
- <para>The Lustre architecture is a storage architecture for clusters. The central component of
- the Lustre architecture is the Lustre file system, which is supported on the Linux operating
- system and provides a POSIX<superscript>*</superscript> standard-compliant UNIX file system
- interface.</para>
- <para>The Lustre storage architecture is used for many different kinds of clusters. It is best
- known for powering many of the largest high-performance computing (HPC) clusters worldwide,
- with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes
- per second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system as a site-wide
- global file system, serving dozens of clusters.</para>
- <para>The ability of a Lustre file system to scale capacity and performance for any need reduces
- the need to deploy many separate file systems, such as one for each compute cluster. Storage
- management is simplified by avoiding the need to copy data between compute clusters. In
- addition to aggregating storage capacity of many servers, the I/O throughput is also
- aggregated and scales with additional servers. Moreover, throughput and/or capacity can be
- easily increased by adding servers dynamically.</para>
- <para>While a Lustre file system can function in many work environments, it is not necessarily
- the best choice for all applications. It is best suited for uses that exceed the capacity that
- a single server can provide, though in some use cases, a Lustre file system can perform better
- with a single server than other file systems due to its strong locking and data
- coherency.</para>
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ </indexterm>What a Lustre File System Is (and What It Isn't)</title>
+ <para>The Lustre architecture is a storage architecture for clusters. The
+ central component of the Lustre architecture is the Lustre file system,
+   which is supported on the Linux operating system and provides a
+   POSIX<superscript>*</superscript> standard-compliant UNIX file system
+ interface.</para>
+ <para>The Lustre storage architecture is used for many different kinds of
+ clusters. It is best known for powering many of the largest
+ high-performance computing (HPC) clusters worldwide, with tens of thousands
+   of client systems, pebibytes (PiB) of storage, and hundreds of gigabytes per
+ second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system
+ as a site-wide global file system, serving dozens of clusters.</para>
+ <para>The ability of a Lustre file system to scale capacity and performance
+ for any need reduces the need to deploy many separate file systems, such as
+ one for each compute cluster. Storage management is simplified by avoiding
+ the need to copy data between compute clusters. In addition to aggregating
+ storage capacity of many servers, the I/O throughput is also aggregated and
+ scales with additional servers. Moreover, throughput and/or capacity can be
+ easily increased by adding servers dynamically.</para>
+ <para>While a Lustre file system can function in many work environments, it
+ is not necessarily the best choice for all applications. It is best suited
+ for uses that exceed the capacity that a single server can provide, though
+ in some use cases, a Lustre file system can perform better with a single
+ server than other file systems due to its strong locking and data
+ coherency.</para>
<para>A Lustre file system is currently not particularly well suited for
- "peer-to-peer" usage models where clients and servers are running on the same node,
- each sharing a small amount of storage, due to the lack of data replication at the Lustre
- software level. In such uses, if one client/server fails, then the data stored on that node
- will not be accessible until the node is restarted.</para>
+ "peer-to-peer" usage models where clients and servers are running on the
+ same node, each sharing a small amount of storage, due to the lack of data
+ replication at the Lustre software level. In such uses, if one
+ client/server fails, then the data stored on that node will not be
+ accessible until the node is restarted.</para>
<section remap="h3">
- <title><indexterm>
- <primary>Lustre</primary>
- <secondary>features</secondary>
- </indexterm>Lustre Features</title>
- <para>Lustre file systems run on a variety of vendor's kernels. For more details, see the
- Lustre Test Matrix <xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="dbdoclet.50438261_99193"/>.</para>
- <para>A Lustre installation can be scaled up or down with respect to the number of client
- nodes, disk storage and bandwidth. Scalability and performance are dependent on available
- disk and network bandwidth and the processing power of the servers in the system. A Lustre
- file system can be deployed in a wide variety of configurations that can be scaled well
- beyond the size and performance observed in production systems to date.</para>
- <para><xref linkend="understandinglustre.tab1"/> shows the practical range of scalability and
- performance characteristics of a Lustre file system and some test results in production
- systems.</para>
- <table frame="all">
- <title xml:id="understandinglustre.tab1">Lustre File System Scalability and
- Performance</title>
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>features</secondary>
+ </indexterm>Lustre Features</title>
+     <para>Lustre file systems run on a variety of vendors' kernels. For
+     more details, see the Lustre Test Matrix
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="preparing_installation" />.</para>
+ <para>A Lustre installation can be scaled up or down with respect to the
+ number of client nodes, disk storage and bandwidth. Scalability and
+ performance are dependent on available disk and network bandwidth and the
+ processing power of the servers in the system. A Lustre file system can
+ be deployed in a wide variety of configurations that can be scaled well
+ beyond the size and performance observed in production systems to
+ date.</para>
+ <para>
+ <xref linkend="understandinglustre.tab1" /> shows some of the
+ scalability and performance characteristics of a Lustre file system.
+ For a full list of Lustre file and filesystem limits see
+ <xref linkend="settinguplustresystem.tab2"/>.</para>
+ <table frame="all" xml:id="understandinglustre.tab1">
+ <title>Lustre File System Scalability and Performance</title>
<tgroup cols="3">
- <colspec colname="c1" colwidth="1*"/>
- <colspec colname="c2" colwidth="2*"/>
- <colspec colname="c3" colwidth="3*"/>
+ <colspec colname="c1" colwidth="1*" />
+ <colspec colname="c2" colwidth="2*" />
+ <colspec colname="c3" colwidth="3*" />
<thead>
<row>
<entry>
- <para><emphasis role="bold">Feature</emphasis></para>
+ <para>
+ <emphasis role="bold">Feature</emphasis>
+ </para>
</entry>
<entry>
- <para><emphasis role="bold">Current Practical Range</emphasis></para>
+ <para>
+ <emphasis role="bold">Current Practical Range</emphasis>
+ </para>
</entry>
<entry>
- <para><emphasis role="bold">Tested in Production</emphasis></para>
+ <para>
+ <emphasis role="bold">Known Production Usage</emphasis>
+ </para>
</entry>
</row>
</thead>
          <tbody>
<row>
<entry>
<para>
- <emphasis role="bold">Client Scalability</emphasis></para>
+ <emphasis role="bold">Client Scalability</emphasis>
+ </para>
</entry>
<entry>
- <para> 100-100000</para>
+ <para>100-100000</para>
</entry>
<entry>
- <para> 50000+ clients, many in the 10000 to 20000 range</para>
+ <para>50000+ clients, many in the 10000 to 20000 range</para>
</entry>
</row>
<row>
<entry>
- <para><emphasis role="bold">Client Performance</emphasis></para>
+ <para>
+ <emphasis role="bold">Client Performance</emphasis>
+ </para>
</entry>
<entry>
<para>
- <emphasis>Single client: </emphasis></para>
+ <emphasis>Single client:</emphasis>
+ </para>
<para>I/O 90% of network bandwidth</para>
- <para><emphasis>Aggregate:</emphasis></para>
- <para>2.5 TB/sec I/O</para>
+ <para>
+ <emphasis>Aggregate:</emphasis>
+ </para>
+ <para>50 TB/sec I/O, 50M IOPS</para>
</entry>
<entry>
<para>
- <emphasis>Single client: </emphasis></para>
- <para>2 GB/sec I/O, 1000 metadata ops/sec</para>
- <para><emphasis>Aggregate:</emphasis></para>
- <para>240 GB/sec I/O </para>
+ <emphasis>Single client:</emphasis>
+ </para>
+ <para>15 GB/sec I/O (HDR IB), 50000 IOPS</para>
+ <para>
+ <emphasis>Aggregate:</emphasis>
+ </para>
+ <para>10 TB/sec I/O, 10M IOPS</para>
</entry>
</row>
<row>
<entry>
<para>
- <emphasis role="bold">OSS Scalability</emphasis></para>
+ <emphasis role="bold">OSS Scalability</emphasis>
+ </para>
</entry>
<entry>
<para>
- <emphasis>Single OSS:</emphasis></para>
- <para>1-32 OSTs per OSS,</para>
- <para>128TB per OST</para>
+ <emphasis>Single OSS:</emphasis>
+ </para>
+ <para>1-32 OSTs per OSS</para>
<para>
- <emphasis>OSS count:</emphasis></para>
- <para>500 OSSs, with up to 4000 OSTs</para>
+ <emphasis>Single OST:</emphasis>
+ </para>
+ <para>500M objects, 1024TiB per OST</para>
+ <para>
+ <emphasis>OSS count:</emphasis>
+ </para>
+ <para>1000 OSSs, 4000 OSTs</para>
</entry>
<entry>
<para>
- <emphasis>Single OSS:</emphasis></para>
- <para>8 OSTs per OSS,</para>
- <para>16TB per OST</para>
+ <emphasis>Single OSS:</emphasis>
+ </para>
+ <para>4 OSTs per OSS</para>
+ <para>
+ <emphasis>Single OST:</emphasis>
+ </para>
+ <para>1024TiB OSTs</para>
<para>
- <emphasis>OSS count:</emphasis></para>
- <para>450 OSSs with 1000 4TB OSTs</para>
- <para>192 OSSs with 1344 8TB OSTs</para>
+ <emphasis>OSS count:</emphasis>
+ </para>
+          <para>450 OSSs with 900 × 750TiB HDD OSTs and 450 × 25TiB NVMe
+          OSTs</para>
+          <para>1024 OSSs with 1024 × 72TiB OSTs</para>
</entry>
</row>
<row>
<entry>
<para>
- <emphasis role="bold">OSS Performance</emphasis></para>
+ <emphasis role="bold">OSS Performance</emphasis>
+ </para>
</entry>
<entry>
<para>
- <emphasis>Single OSS:</emphasis></para>
- <para> 5 GB/sec</para>
+ <emphasis>Single OSS:</emphasis>
+ </para>
+ <para>15 GB/sec, 1.5M IOPS</para>
<para>
- <emphasis>Aggregate:</emphasis></para>
- <para> 2.5 TB/sec</para>
+ <emphasis>Aggregate:</emphasis>
+ </para>
+ <para>50 TB/sec, 50M IOPS</para>
</entry>
<entry>
<para>
- <emphasis>Single OSS:</emphasis></para>
- <para> 2.0+ GB/sec</para>
+ <emphasis>Single OSS:</emphasis>
+ </para>
+ <para>10 GB/sec, 1.5M IOPS</para>
<para>
- <emphasis>Aggregate:</emphasis></para>
- <para> 240 GB/sec</para>
+ <emphasis>Aggregate:</emphasis>
+ </para>
+ <para>20 TB/sec, 20M IOPS</para>
</entry>
</row>
<row>
<entry>
<para>
- <emphasis role="bold">MDS Scalability</emphasis></para>
+ <emphasis role="bold">MDS Scalability</emphasis>
+ </para>
</entry>
<entry>
<para>
- <emphasis>Single MDS:</emphasis></para>
- <para> 4 billion files</para>
+ <emphasis>Single MDS:</emphasis>
+ </para>
+ <para>1-4 MDTs per MDS</para>
<para>
- <emphasis>MDS count:</emphasis></para>
- <para> 1 primary + 1 backup</para>
- <para condition="l24"><emphasis role="italic">Since Lustre software release 2.4:
- </emphasis></para>
- <para condition="l24">Up to 4096 MDSs and up to 4096 MDTs</para>
+ <emphasis>Single MDT:</emphasis>
+ </para>
+ <para>4 billion files, 16TiB per MDT (ldiskfs)</para>
+ <para>64 billion files, 64TiB per MDT (ZFS)</para>
+ <para>
+ <emphasis>MDS count:</emphasis>
+ </para>
+ <para>256 MDSs, up to 256 MDTs</para>
</entry>
<entry>
<para>
- <emphasis>Single MDS:</emphasis></para>
- <para> 750 million files</para>
+ <emphasis>Single MDS:</emphasis>
+ </para>
+ <para>4 billion files</para>
<para>
- <emphasis>MDS count:</emphasis></para>
- <para> 1 primary + 1 backup</para>
+ <emphasis>MDS count:</emphasis>
+ </para>
+          <para>40 MDSs with 40 × 4TiB MDTs in production</para>
+          <para>256 MDSs with 256 × 64GiB MDTs in testing</para>
</entry>
</row>
<row>
<entry>
<para>
- <emphasis role="bold">MDS Performance</emphasis></para>
+ <emphasis role="bold">MDS Performance</emphasis>
+ </para>
</entry>
<entry>
- <para> 35000/s create operations,</para>
- <para> 100000/s metadata stat operations</para>
+ <para>1M/s create operations</para>
+ <para>2M/s stat operations</para>
</entry>
<entry>
- <para> 15000/s create operations,</para>
- <para> 35000/s metadata stat operations</para>
+          <para>100k/s create operations</para>
+          <para>200k/s stat operations</para>
</entry>
</row>
<row>
<entry>
<para>
- <emphasis role="bold">File system Scalability</emphasis></para>
+ <emphasis role="bold">File system Scalability</emphasis>
+ </para>
</entry>
<entry>
<para>
- <emphasis>Single File:</emphasis></para>
- <para>2.5 PB max file size</para>
+ <emphasis>Single File:</emphasis>
+ </para>
+ <para>32 PiB max file size (ldiskfs)</para>
+          <para>2<superscript>63</superscript> bytes (ZFS)</para>
<para>
- <emphasis>Aggregate:</emphasis></para>
- <para>512 PB space, 4 billion files</para>
+ <emphasis>Aggregate:</emphasis>
+ </para>
+ <para>512 PiB space, 1 trillion files</para>
</entry>
<entry>
<para>
- <emphasis>Single File:</emphasis></para>
- <para>multi-TB max file size</para>
+ <emphasis>Single File:</emphasis>
+ </para>
+ <para>multi-TiB max file size</para>
<para>
- <emphasis>Aggregate:</emphasis></para>
- <para>10 PB space, 750 million files</para>
+ <emphasis>Aggregate:</emphasis>
+ </para>
+ <para>700 PiB space, 25 billion files</para>
</entry>
</row>
</tbody>
        </tgroup>
      </table>
<para>Other Lustre software features are:</para>
<itemizedlist>
<listitem>
- <para><emphasis role="bold">Performance-enhanced ext4 file system:</emphasis> The Lustre
- file system uses an improved version of the ext4 journaling file system to store data
- and metadata. This version, called <emphasis role="italic"
- ><literal>ldiskfs</literal></emphasis>, has been enhanced to improve performance and
- provide additional functionality needed by the Lustre file system.</para>
+        <para>
+          <emphasis role="bold">Performance-enhanced ext4 file
+          system:</emphasis> The Lustre file system uses an improved version
+          of the ext4 journaling file system to store data and metadata. This
+          version, called
+          <emphasis role="italic"><literal>ldiskfs</literal></emphasis>, has
+          been enhanced to improve performance and provide additional
+          functionality needed by the Lustre file system.</para>
+ </listitem>
+ <listitem>
+        <para>It is also possible to use ZFS as the backing file system for
+        Lustre MDT, OST, and MGS storage. This allows Lustre to leverage the
+        scalability and data integrity features of ZFS for individual
+        storage targets.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">POSIX standard compliance:</emphasis> The full POSIX test
- suite passes in an identical manner to a local ext4 file system, with limited exceptions
- on Lustre clients. In a cluster, most operations are atomic so that clients never see
- stale data or metadata. The Lustre software supports mmap() file I/O.</para>
+        <para>
+          <emphasis role="bold">POSIX standard compliance:</emphasis> The
+          full POSIX test suite passes in an identical manner to a local ext4
+          file system, with limited exceptions on Lustre clients. In a
+          cluster, most operations are atomic so that clients never see stale
+          data or metadata. The Lustre software supports mmap() file
+          I/O.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">High-performance heterogeneous networking:</emphasis> The
- Lustre software supports a variety of high performance, low latency networks and permits
- Remote Direct Memory Access (RDMA) for InfiniBand<superscript>*</superscript> (utilizing
- OpenFabrics Enterprise Distribution (OFED<superscript>*</superscript>) and other
- advanced networks for fast and efficient network transport. Multiple RDMA networks can
- be bridged using Lustre routing for maximum performance. The Lustre software also
- includes integrated network diagnostics.</para>
+        <para>
+          <emphasis role="bold">High-performance heterogeneous
+          networking:</emphasis> The Lustre software supports a variety of
+          high performance, low latency networks and permits Remote Direct
+          Memory Access (RDMA) for InfiniBand<superscript>*</superscript>
+          (utilizing OpenFabrics Enterprise Distribution
+          (OFED<superscript>*</superscript>)), Intel OmniPath®, and other
+          advanced networks for fast and efficient network transport.
+          Multiple RDMA networks can be bridged using Lustre routing for
+          maximum performance. The Lustre software also includes integrated
+          network diagnostics.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">High-availability:</emphasis> The Lustre file system supports
- active/active failover using shared storage partitions for OSS targets (OSTs). Lustre
- software release 2.3 and earlier releases offer active/passive failover using a shared
- storage partition for the MDS target (MDT).</para>
- <para condition="l24">With Lustre software release 2.4 or later servers and clients it is
- possible to configure active/active failover of multiple MDTs. This allows application
- transparent recovery. The Lustre file system can work with a variety of high
- availability (HA) managers to allow automated failover and has no single point of
- failure (NSPF). Multiple mount protection (MMP) provides integrated protection from
- errors in highly-available systems that would otherwise cause file system
- corruption.</para>
+        <para>
+          <emphasis role="bold">High-availability:</emphasis> The Lustre
+          file system supports active/active failover using shared storage
+          partitions for OSS targets (OSTs), and for MDS targets (MDTs).
+          The Lustre file system can work with a variety of high
+          availability (HA) managers to allow automated failover and has no
+          single point of failure (NSPF). This allows
+          application-transparent recovery. Multiple mount protection (MMP)
+          provides integrated protection from errors in highly-available
+          systems that would otherwise cause file system corruption.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Security:</emphasis> By default TCP connections are only
- allowed from privileged ports. UNIX group membership is verified on the MDS.</para>
+        <para>
+          <emphasis role="bold">Security:</emphasis> By default, TCP
+          connections are only allowed from privileged ports. UNIX group
+          membership is verified on the MDS.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Access control list (ACL), extended attributes:</emphasis> the
- Lustre security model follows that of a UNIX file system, enhanced with POSIX ACLs.
- Noteworthy additional features include root squash.</para>
+        <para>
+          <emphasis role="bold">Access control list (ACL), extended
+          attributes:</emphasis> The Lustre security model follows that of a
+          UNIX file system, enhanced with POSIX ACLs. Noteworthy additional
+          features include root squash.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Interoperability:</emphasis> The Lustre file system runs on a
- variety of CPU architectures and mixed-endian clusters and is interoperable between
- successive major Lustre software releases.</para>
+        <para>
+          <emphasis role="bold">Interoperability:</emphasis> The Lustre file
+          system runs on a variety of CPU architectures and mixed-endian
+          clusters and is interoperable between successive major Lustre
+          software releases.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Object-based architecture:</emphasis> Clients are isolated
- from the on-disk file structure enabling upgrading of the storage architecture without
- affecting the client.</para>
+        <para>
+          <emphasis role="bold">Object-based architecture:</emphasis>
+          Clients are isolated from the on-disk file structure, enabling
+          upgrades to the storage architecture without affecting the
+          client.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Byte-granular file and fine-grained metadata
- locking:</emphasis> Many clients can read and modify the same file or directory
- concurrently. The Lustre distributed lock manager (LDLM) ensures that files are coherent
- between all clients and servers in the file system. The MDT LDLM manages locks on inode
- permissions and pathnames. Each OST has its own LDLM for locks on file stripes stored
- thereon, which scales the locking performance as the file system grows.</para>
+        <para>
+          <emphasis role="bold">Byte-granular file and fine-grained metadata
+          locking:</emphasis> Many clients can read and modify the same file
+          or directory concurrently. The Lustre distributed lock manager
+          (LDLM) ensures that files are coherent between all clients and
+          servers in the file system. The MDT LDLM manages locks on inode
+          permissions and pathnames. Each OST has its own LDLM for locks on
+          the file stripes stored thereon, which scales locking performance
+          as the file system grows.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Quotas:</emphasis> User and group quotas are available for a
- Lustre file system.</para>
+        <para>
+          <emphasis role="bold">Quotas:</emphasis> User and group quotas are
+          available for a Lustre file system.</para>
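+          <para>For example, a user's quota usage and limits could be
+          checked from a client with the <literal>lfs quota</literal>
+          command (the mount point <literal>/mnt/lustre</literal> and user
+          name are illustrative):</para>
+          <screen>client$ lfs quota -u bob /mnt/lustre</screen>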
</listitem>
<listitem>
- <para><emphasis role="bold">Capacity growth:</emphasis> The size of a Lustre file system
- and aggregate cluster bandwidth can be increased without interruption by adding a new
- OSS with OSTs to the cluster.</para>
+        <para>
+          <emphasis role="bold">Capacity growth:</emphasis> The size of a
+          Lustre file system and aggregate cluster bandwidth can be
+          increased without interruption by adding new OSTs and MDTs to the
+          cluster.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Controlled striping:</emphasis> The layout of files across
- OSTs can be configured on a per file, per directory, or per file system basis. This
- allows file I/O to be tuned to specific application requirements within a single file
- system. The Lustre file system uses RAID-0 striping and balances space usage across
- OSTs.</para>
+        <para>
+          <emphasis role="bold">Controlled file layout:</emphasis> The
+          layout of files across OSTs can be configured on a per-file,
+          per-directory, or per-file-system basis. This allows file I/O to
+          be tuned to specific application requirements within a single file
+          system. The Lustre file system uses RAID-0 striping and balances
+          space usage across OSTs.</para>
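+          <para>For example, the default layout for new files in a
+          directory could be set and inspected with the
+          <literal>lfs</literal> utility (the mount point and directory are
+          illustrative):</para>
+          <screen>client$ lfs setstripe -c 4 -S 4M /mnt/lustre/dir
+client$ lfs getstripe /mnt/lustre/dir</screen>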
</listitem>
<listitem>
- <para><emphasis role="bold">Network data integrity protection:</emphasis> A checksum of
- all data sent from the client to the OSS protects against corruption during data
- transfer.</para>
+        <para>
+          <emphasis role="bold">Network data integrity
+          protection:</emphasis> A checksum of all data sent from the client
+          to the OSS protects against corruption during data
+          transfer.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">MPI I/O:</emphasis> The Lustre architecture has a dedicated
- MPI ADIO layer that optimizes parallel I/O to match the underlying file system
- architecture.</para>
+        <para>
+          <emphasis role="bold">MPI I/O:</emphasis> The Lustre architecture
+          has a dedicated MPI ADIO layer that optimizes parallel I/O to
+          match the underlying file system architecture.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">NFS and CIFS export:</emphasis> Lustre files can be
- re-exported using NFS (via Linux knfsd) or CIFS (via Samba) enabling them to be shared
- with non-Linux clients, such as Microsoft<superscript>*</superscript>
- Windows<superscript>*</superscript> and Apple<superscript>*</superscript> Mac OS
- X<superscript>*</superscript>.</para>
+        <para>
+          <emphasis role="bold">NFS and CIFS export:</emphasis> Lustre files
+          can be re-exported using NFS (via Linux knfsd or Ganesha) or CIFS
+          (via Samba), enabling them to be shared with non-Linux clients
+          such as Microsoft<superscript>*</superscript>
+          Windows<superscript>*</superscript>,
+          Apple<superscript>*</superscript> Mac OS
+          X<superscript>*</superscript>, and others.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Disaster recovery tool:</emphasis> The Lustre file system
- provides an online distributed file system check (LFSCK) that can restore consistency between
- storage components in case of a major file system error. A Lustre file system can
- operate even in the presence of file system inconsistencies, and LFSCK can run while the filesystem is in use, so LFSCK is not required to complete
- before returning the file system to production.</para>
+        <para>
+          <emphasis role="bold">Disaster recovery tool:</emphasis> The
+          Lustre file system provides an online distributed file system
+          check (LFSCK) that can restore consistency between storage
+          components in case of a major file system error. A Lustre file
+          system can operate even in the presence of file system
+          inconsistencies, and LFSCK can run while the file system is in
+          use, so LFSCK is not required to complete before returning the
+          file system to production.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Performance monitoring:</emphasis> The Lustre file system
- offers a variety of mechanisms to examine performance and tuning.</para>
+        <para>
+          <emphasis role="bold">Performance monitoring:</emphasis> The
+          Lustre file system offers a variety of mechanisms to examine
+          performance and tuning.</para>
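+          <para>For example, many statistics are exported through
+          <literal>lctl get_param</literal>; on a client, per-OST I/O
+          statistics could be examined as follows (the parameter pattern is
+          illustrative):</para>
+          <screen>client$ lctl get_param osc.*.stats</screen>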
</listitem>
<listitem>
- <para><emphasis role="bold">Open source:</emphasis> The Lustre software is licensed under
- the GPL 2.0 license for use with the Linux operating system.</para>
+        <para>
+          <emphasis role="bold">Open source:</emphasis> The Lustre software
+          is licensed under the GPL 2.0 license for use with the Linux
+          operating system.</para>
</listitem>
</itemizedlist>
</section>
</section>
<section xml:id="understandinglustre.components">
- <title><indexterm>
- <primary>Lustre</primary>
- <secondary>components</secondary>
- </indexterm>Lustre Components</title>
- <para>An installation of the Lustre software includes a management server (MGS) and one or more
- Lustre file systems interconnected with Lustre networking (LNET).</para>
- <para>A basic configuration of Lustre file system components is shown in <xref
- linkend="understandinglustre.fig.cluster"/>.</para>
- <figure>
- <title xml:id="understandinglustre.fig.cluster">Lustre file system components in a basic
- cluster </title>
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>components</secondary>
+ </indexterm>Lustre Components</title>
+ <para>An installation of the Lustre software includes a management server
+ (MGS) and one or more Lustre file systems interconnected with Lustre
+ networking (LNet).</para>
+ <para>A basic configuration of Lustre file system components is shown in
+ <xref linkend="understandinglustre.fig.cluster" />.</para>
+ <figure xml:id="understandinglustre.fig.cluster">
+ <title>Lustre file system components in a basic cluster</title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="100%" fileref="./figures/Basic_Cluster.png"/>
+ <imagedata scalefit="1" width="100%"
+ fileref="./figures/Basic_Cluster.png" />
</imageobject>
<textobject>
- <phrase> Lustre file system components in a basic cluster </phrase>
+ <phrase>Lustre file system components in a basic cluster</phrase>
</textobject>
</mediaobject>
</figure>
<section remap="h3">
- <title><indexterm>
- <primary>Lustre</primary>
- <secondary>MGS</secondary>
- </indexterm>Management Server (MGS)</title>
- <para>The MGS stores configuration information for all the Lustre file systems in a cluster
- and provides this information to other Lustre components. Each Lustre target contacts the
- MGS to provide information, and Lustre clients contact the MGS to retrieve
- information.</para>
- <para>It is preferable that the MGS have its own storage space so that it can be managed
- independently. However, the MGS can be co-located and share storage space with an MDS as
- shown in <xref linkend="understandinglustre.fig.cluster"/>.</para>
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>MGS</secondary>
+ </indexterm>Management Server (MGS)</title>
+ <para>The MGS stores configuration information for all the Lustre file
+ systems in a cluster and provides this information to other Lustre
+ components. Each Lustre target contacts the MGS to provide information,
+ and Lustre clients contact the MGS to retrieve information.</para>
+ <para>It is preferable that the MGS have its own storage space so that it
+ can be managed independently. However, the MGS can be co-located and
+ share storage space with an MDS as shown in
+ <xref linkend="understandinglustre.fig.cluster" />.</para>
</section>
<section remap="h3">
<title>Lustre File System Components</title>
- <para>Each Lustre file system consists of the following components:</para>
+ <para>Each Lustre file system consists of the following
+ components:</para>
<itemizedlist>
<listitem>
- <para><emphasis role="bold">Metadata Server (MDS)</emphasis> - The MDS makes metadata
- stored in one or more MDTs available to Lustre clients. Each MDS manages the names and
- directories in the Lustre file system(s) and provides network request handling for one
- or more local MDTs.</para>
+        <para>
+          <emphasis role="bold">Metadata Servers (MDS)</emphasis> - The MDS
+          makes metadata stored in one or more MDTs available to Lustre
+          clients. Each MDS manages the names and directories in the Lustre
+          file system(s) and provides network request handling for one or
+          more local MDTs.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Metadata Target (MDT</emphasis> ) - For Lustre software
- release 2.3 and earlier, each file system has one MDT. The MDT stores metadata (such as
- filenames, directories, permissions and file layout) on storage attached to an MDS. Each
- file system has one MDT. An MDT on a shared storage target can be available to multiple
- MDSs, although only one can access it at a time. If an active MDS fails, a standby MDS
- can serve the MDT and make it available to clients. This is referred to as MDS
- failover.</para>
- <para condition="l24">Since Lustre software release 2.4, multiple MDTs are supported. Each
- file system has at least one MDT. An MDT on a shared storage target can be available via
- multiple MDSs, although only one MDS can export the MDT to the clients at one time. Two
- MDS machines share storage for two or more MDTs. After the failure of one MDS, the
- remaining MDS begins serving the MDT(s) of the failed MDS.</para>
+        <para>
+          <emphasis role="bold">Metadata Targets (MDT)</emphasis> - Each
+          file system has at least one MDT, which holds the root directory.
+          The MDT stores metadata (such as filenames, directories,
+          permissions and file layout) on storage attached to an MDS. An MDT
+          on a shared storage target can be available to multiple MDSs,
+          although only one can access it at a time. If an active MDS fails,
+          a second MDS node can serve the MDT and make it available to
+          clients. This is referred to as MDS failover.</para>
+ <para>Multiple MDTs are supported with the Distributed Namespace
+ Environment (<xref linkend="DNE"/>).
+ In addition to the primary MDT that holds the filesystem root, it
+ is possible to add additional MDS nodes, each with their own MDTs,
+ to hold sub-directory trees of the filesystem.</para>
+ <para condition="l28">Since Lustre software release 2.8, DNE also
+ allows the filesystem to distribute files of a single directory over
+ multiple MDT nodes. A directory which is distributed across multiple
+ MDTs is known as a <emphasis><xref linkend="stripeddirectory"/></emphasis>.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Object Storage Servers (OSS)</emphasis> : The OSS provides
- file I/O service and network request handling for one or more local OSTs. Typically, an
- OSS serves between two and eight OSTs, up to 16 TB each. A typical configuration is an
- MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a
- large number of compute nodes.</para>
+ <para>
+ <emphasis role="bold">Object Storage Servers (OSS)</emphasis>: The
+ OSS provides file I/O service and network request handling for one or
+ more local OSTs. Typically, an OSS serves between two and eight OSTs,
+ up to 16 TiB each. A typical configuration is an MDT on a dedicated
+ node, two or more OSTs on each OSS node, and a client on each of a
+ large number of compute nodes.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Object Storage Target (OST)</emphasis> : User file data is
- stored in one or more objects, each object on a separate OST in a Lustre file system.
- The number of objects per file is configurable by the user and can be tuned to optimize
- performance for a given workload.</para>
+ <para>
+ <emphasis role="bold">Object Storage Target (OST)</emphasis>: User
+ file data is stored in one or more objects, each object on a separate
+ OST in a Lustre file system. The number of objects per file is
+ configurable by the user and can be tuned to optimize performance for
+ a given workload.</para>
</listitem>
<listitem>
- <para><emphasis role="bold">Lustre clients</emphasis> : Lustre clients are computational,
- visualization or desktop nodes that are running Lustre client software, allowing them to
- mount the Lustre file system.</para>
+ <para>
+ <emphasis role="bold">Lustre clients</emphasis>: Lustre clients are
+ computational, visualization or desktop nodes that are running Lustre
+ client software, allowing them to mount the Lustre file
+ system.</para>
</listitem>
</itemizedlist>
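+    <para>As a sketch of how the DNE features described above are used in
+    practice (the mount point <literal>/mnt/testfs</literal> is a
+    hypothetical example), a sub-directory tree can be placed on a specific
+    MDT with <literal>lfs mkdir -i</literal>, or a single directory can be
+    striped across several MDTs with <literal>lfs mkdir -c</literal>:</para>
+    <screen>client# lfs mkdir -i 1 /mnt/testfs/remote_dir
+client# lfs mkdir -c 2 /mnt/testfs/striped_dir</screen>
+    <para>The first command creates a directory whose entries are stored on
+    MDT0001 rather than the root MDT; the second creates a striped directory
+    whose entries are distributed over two MDTs.</para>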
- <para>The Lustre client software provides an interface between the Linux virtual file system
- and the Lustre servers. The client software includes a management client (MGC), a metadata
- client (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in
- the file system.</para>
- <para>A logical object volume (LOV) aggregates the OSCs to provide transparent access across
- all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent,
- synchronized namespace. Several clients can write to different parts of the same file
- simultaneously, while, at the same time, other clients can read from the file.</para>
- <para><xref linkend="understandinglustre.tab.storagerequire"/> provides the requirements for
- attached storage for each Lustre file system component and describes desirable
- characteristics of the hardware used.</para>
- <table frame="all">
- <title xml:id="understandinglustre.tab.storagerequire"><indexterm>
- <primary>Lustre</primary>
- <secondary>requirements</secondary>
- </indexterm>Storage and hardware requirements for Lustre file system components</title>
+ <para>The Lustre client software provides an interface between the Linux
+ virtual file system and the Lustre servers. The client software includes
+ a management client (MGC), a metadata client (MDC), and multiple object
+ storage clients (OSCs), one corresponding to each OST in the file
+ system.</para>
+ <para>A logical object volume (LOV) aggregates the OSCs to provide
+ transparent access across all the OSTs. Thus, a client with the Lustre
+ file system mounted sees a single, coherent, synchronized namespace.
+ Several clients can write to different parts of the same file
+ simultaneously, while, at the same time, other clients can read from the
+ file.</para>
+ <para>A logical metadata volume (LMV) aggregates the MDCs to provide
+ transparent access across all the MDTs in a similar manner as the LOV
+ does for file access. This allows the client to see the directory tree
+ on multiple MDTs as a single coherent namespace, and striped directories
+ are merged on the clients to form a single visible directory to users
+ and applications.
+ </para>
+ <para>
+      <xref linkend="understandinglustre.tab.storagerequire" /> provides the
+ requirements for attached storage for each Lustre file system component
+ and describes desirable characteristics of the hardware used.</para>
+ <table frame="all" xml:id="understandinglustre.tab.storagerequire">
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>requirements</secondary>
+ </indexterm>Storage and hardware requirements for Lustre file system
+ components</title>
<tgroup cols="3">
- <colspec colname="c1" colwidth="1*"/>
- <colspec colname="c2" colwidth="3*"/>
- <colspec colname="c3" colwidth="3*"/>
+ <colspec colname="c1" colwidth="1*" />
+ <colspec colname="c2" colwidth="3*" />
+ <colspec colname="c3" colwidth="3*" />
<thead>
<row>
<entry>
- <para><emphasis role="bold"/></para>
+ <para>
+ <emphasis role="bold" />
+ </para>
</entry>
<entry>
- <para><emphasis role="bold">Required attached storage</emphasis></para>
+ <para>
+ <emphasis role="bold">Required attached storage</emphasis>
+ </para>
</entry>
<entry>
- <para><emphasis role="bold">Desirable hardware characteristics</emphasis></para>
+ <para>
+ <emphasis role="bold">Desirable hardware
+ characteristics</emphasis>
+ </para>
</entry>
</row>
</thead>
        <tbody>
<row>
<entry>
<para>
- <emphasis role="bold">MDSs</emphasis></para>
+ <emphasis role="bold">MDSs</emphasis>
+ </para>
</entry>
<entry>
- <para> 1-2% of file system capacity</para>
+ <para>1-2% of file system capacity</para>
</entry>
<entry>
- <para> Adequate CPU power, plenty of memory, fast disk storage.</para>
+ <para>Adequate CPU power, plenty of memory, fast disk
+ storage.</para>
</entry>
</row>
<row>
<entry>
<para>
- <emphasis role="bold">OSSs</emphasis></para>
+ <emphasis role="bold">OSSs</emphasis>
+ </para>
</entry>
<entry>
- <para> 1-16 TB per OST, 1-8 OSTs per OSS</para>
+ <para>1-128 TiB per OST, 1-8 OSTs per OSS</para>
</entry>
<entry>
- <para> Good bus bandwidth. Recommended that storage be balanced evenly across
- OSSs.</para>
+ <para>Good bus bandwidth. Recommended that storage be balanced
+ evenly across OSSs and matched to network bandwidth.</para>
</entry>
</row>
<row>
<entry>
<para>
- <emphasis role="bold">Clients</emphasis></para>
+ <emphasis role="bold">Clients</emphasis>
+ </para>
</entry>
<entry>
- <para> None</para>
+ <para>No local storage needed</para>
</entry>
<entry>
- <para> Low latency, high bandwidth network.</para>
+ <para>Low latency, high bandwidth network.</para>
</entry>
</row>
</tbody>
</tgroup>
</table>
- <para>For additional hardware requirements and considerations, see <xref
- linkend="settinguplustresystem"/>.</para>
+ <para>For additional hardware requirements and considerations, see
+ <xref linkend="settinguplustresystem" />.</para>
</section>
<section remap="h3">
- <title><indexterm>
- <primary>Lustre</primary>
- <secondary>LNET</secondary>
- </indexterm>Lustre Networking (LNET)</title>
- <para>Lustre Networking (LNET) is a custom networking API that provides the communication
- infrastructure that handles metadata and file I/O data for the Lustre file system servers
- and clients. For more information about LNET, see <xref
- linkend="understandinglustrenetworking"/>.</para>
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>LNet</secondary>
+ </indexterm>Lustre Networking (LNet)</title>
+ <para>Lustre Networking (LNet) is a custom networking API that provides
+ the communication infrastructure that handles metadata and file I/O data
+ for the Lustre file system servers and clients. For more information
+ about LNet, see
+ <xref linkend="understandinglustrenetworking" />.</para>
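+    <para>As an illustration, the LNet network identifiers (NIDs) configured
+    on a node can be listed with <literal>lctl</literal>; the NID shown below
+    is a typical example for a TCP network, not a fixed value:</para>
+    <screen># lctl list_nids
+192.168.0.100@tcp</screen>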
</section>
<section remap="h3">
- <title><indexterm>
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>cluster</secondary>
+ </indexterm>Lustre Cluster</title>
+ <para>At scale, a Lustre file system cluster can include hundreds of OSSs
+ and thousands of clients (see
+ <xref linkend="understandinglustre.fig.lustrescale" />). More than one
+ type of network can be used in a Lustre cluster. Shared storage between
+ OSSs enables failover capability. For more details about OSS failover,
+ see
+ <xref linkend="understandingfailover" />.</para>
+ <figure xml:id="understandinglustre.fig.lustrescale">
+ <title>
+ <indexterm>
<primary>Lustre</primary>
- <secondary>cluster</secondary>
- </indexterm>Lustre Cluster</title>
- <para>At scale, a Lustre file system cluster can include hundreds of OSSs and thousands of
- clients (see <xref linkend="understandinglustre.fig.lustrescale"/>). More than one type of
- network can be used in a Lustre cluster. Shared storage between OSSs enables failover
- capability. For more details about OSS failover, see <xref linkend="understandingfailover"
- />.</para>
- <figure>
- <title xml:id="understandinglustre.fig.lustrescale"><indexterm>
- <primary>Lustre</primary>
- <secondary>at scale</secondary>
- </indexterm>Lustre cluster at scale</title>
+ <secondary>at scale</secondary>
+ </indexterm>Lustre cluster at scale</title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="100%" fileref="./figures/Scaled_Cluster.png"/>
+ <imagedata scalefit="1" width="100%"
+ fileref="./figures/Scaled_Cluster.png" />
</imageobject>
<textobject>
- <phrase> Lustre file system cluster at scale </phrase>
+ <phrase>Lustre file system cluster at scale</phrase>
</textobject>
</mediaobject>
</figure>
</section>
</section>
<section xml:id="understandinglustre.storageio">
- <title><indexterm>
- <primary>Lustre</primary>
- <secondary>storage</secondary>
- </indexterm>
- <indexterm>
- <primary>Lustre</primary>
- <secondary>I/O</secondary>
- </indexterm> Lustre File System Storage and I/O</title>
- <para>In Lustre software release 2.0, Lustre file identifiers (FIDs) were introduced to replace
- UNIX inode numbers for identifying files or objects. A FID is a 128-bit identifier that
- contains a unique 64-bit sequence number, a 32-bit object ID (OID), and a 32-bit version
- number. The sequence number is unique across all Lustre targets in a file system (OSTs and
- MDTs). This change enabled future support for multiple MDTs (introduced in Lustre software
- release 2.3) and ZFS (introduced in Lustre software release 2.4).</para>
- <para>Also introduced in release 2.0 is a feature call <emphasis role="italic"
- >FID-in-dirent</emphasis> (also known as <emphasis role="italic">dirdata</emphasis>) in
- which the FID is stored as part of the name of the file in the parent directory. This feature
- significantly improves performance for <literal>ls</literal> command executions by reducing
- disk I/O. The FID-in-dirent is generated at the time the file is created.</para>
- <note>
- <para>The FID-in-dirent feature is not compatible with the Lustre software release 1.8 format.
- Therefore, when an upgrade from Lustre software release 1.8 to a Lustre software release 2.x
- is performed, the FID-in-dirent feature is not automatically enabled. For upgrades from
- Lustre software release 1.8 to Lustre software releases 2.0 through 2.3, FID-in-dirent can
- be enabled manually but only takes effect for new files. </para>
- <para>For more information about upgrading from Lustre software release 1.8 and enabling
- FID-in-dirent for existing files, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
- linkend="upgradinglustre"/>Chapter 16 “Upgrading a Lustre File System”.</para>
- </note>
- <para condition="l24">The LFSCK 1.5 file system administration tool released with Lustre
- software release 2.4 provides functionality that enables FID-in-dirent for existing files. It
- includes the following functionality:<itemizedlist>
- <listitem>
- <para>Generates IGIF mode FIDs for existing release 1.8 files.</para>
- </listitem>
- <listitem>
- <para>Verifies the FID-in-dirent for each file to determine when it doesn’t exist or is
- invalid and then regenerates the FID-in-dirent if needed.</para>
- </listitem>
- <listitem>
- <para>Verifies the linkEA entry for each file to determine when it is missing or invalid
- and then regenerates the linkEA if needed. The <emphasis role="italic">linkEA</emphasis>
- consists of the file name plus its parent FID and is stored as an extended attribute in
- the file itself. Thus, the linkEA can be used to parse out the full path name of a file
- from root.</para>
- </listitem>
- </itemizedlist></para>
- <para>Information about where file data is located on the OST(s) is stored as an extended
- attribute called layout EA in an MDT object identified by the FID for the file (see <xref
- xmlns:xlink="http://www.w3.org/1999/xlink" linkend="Fig1.3_LayoutEAonMDT"/>). If the file is
- a data file (not a directory or symbol link), the MDT object points to 1-to-N OST object(s) on
- the OST(s) that contain the file data. If the MDT file points to one object, all the file data
- is stored in that object. If the MDT file points to more than one object, the file data is
- <emphasis role="italic">striped</emphasis> across the objects using RAID 0, and each object
- is stored on a different OST. (For more information about how striping is implemented in a
- Lustre file system, see <xref linkend="dbdoclet.50438250_89922"/>.</para>
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>storage</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>I/O</secondary>
+ </indexterm>Lustre File System Storage and I/O</title>
+ <para>Lustre File IDentifiers (FIDs) are used internally for identifying
+ files or objects, similar to inode numbers in local filesystems. A FID
+ is a 128-bit identifier, which contains a unique 64-bit sequence number
+ (SEQ), a 32-bit object ID (OID), and a 32-bit version number. The sequence
+ number is unique across all Lustre targets in a file system (OSTs and
+ MDTs). This allows multiple MDTs and OSTs to uniquely identify objects
+ without depending on identifiers in the underlying filesystem (e.g. inode
+ numbers) that are likely to be duplicated between targets. The FID SEQ
+ number also allows mapping a FID to a particular MDT or OST.</para>
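+    <para>For example, the FID of a file, and the path corresponding to a
+    given FID, can be queried with <literal>lfs</literal>; the path and FID
+    shown below are illustrative only:</para>
+    <screen>client# lfs path2fid /mnt/testfs/file
+[0x200000400:0x1:0x0]
+client# lfs fid2path /mnt/testfs [0x200000400:0x1:0x0]
+/mnt/testfs/file</screen>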
+    <para>The LFSCK file system consistency checking tool provides
+    functionality that enables the FID-in-dirent feature (the FID stored as
+    part of the file name in the parent directory entry, which reduces disk
+    I/O for <literal>ls</literal> and path lookups) for existing files. It
+    includes the following functionality:
+ <itemizedlist>
+ <listitem>
+ <para>Verifies the FID stored with each directory entry and regenerates
+ it from the inode if it is invalid or missing.</para>
+ </listitem>
+ <listitem>
+ <para>Verifies the linkEA entry for each inode and regenerates it if
+ invalid or missing. The <emphasis role="italic">linkEA</emphasis>
+ stores the file name and parent FID. It is stored as an extended
+ attribute in each inode. Thus, the linkEA can be used to
+ reconstruct the full path name of a file from only the FID.</para>
+ </listitem>
+ </itemizedlist></para>
+ <para>Information about where file data is located on the OST(s) is stored
+ as an extended attribute called layout EA in an MDT object identified by
+ the FID for the file (see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="Fig1.3_LayoutEAonMDT" />). If the file is a regular file (not a
+ directory or symbol link), the MDT object points to 1-to-N OST object(s) on
+ the OST(s) that contain the file data. If the MDT file points to one
+ object, all the file data is stored in that object. If the MDT file points
+ to more than one object, the file data is
+ <emphasis role="italic">striped</emphasis> across the objects using RAID 0,
+ and each object is stored on a different OST. (For more information about
+ how striping is implemented in a Lustre file system, see
+ <xref linkend="lustre_striping" />.</para>
<figure xml:id="Fig1.3_LayoutEAonMDT">
<title>Layout EA on MDT pointing to file data on OSTs</title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="80%" fileref="./figures/Metadata_File.png"/>
+ <imagedata scalefit="1" width="80%"
+ fileref="./figures/Metadata_File.png" />
</imageobject>
<textobject>
- <phrase> Layout EA on MDT pointing to file data on OSTs </phrase>
+ <phrase>Layout EA on MDT pointing to file data on OSTs</phrase>
</textobject>
</mediaobject>
</figure>
- <para>When a client wants to read from or write to a file, it first fetches the layout EA from
- the MDT object for the file. The client then uses this information to perform I/O on the file,
- directly interacting with the OSS nodes where the objects are stored.
- <?oxy_custom_start type="oxy_content_highlight" color="255,255,0"?>This process is illustrated
- in <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="Fig1.4_ClientReqstgData"
- /><?oxy_custom_end?>.</para>
+    <para>When a client wants to read from or write to a file, it first fetches
+    the layout EA from the MDT object for the file. The client then uses this
+    information to perform I/O on the file, directly interacting with the OSS
+    nodes where the objects are stored. This process is illustrated in
+    <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+    linkend="Fig1.4_ClientReqstgData" />.</para>
<figure xml:id="Fig1.4_ClientReqstgData">
<title>Lustre client requesting file data</title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="75%" fileref="./figures/File_Write.png"/>
+ <imagedata scalefit="1" width="75%"
+ fileref="./figures/File_Write.png" />
</imageobject>
<textobject>
- <phrase> Lustre client requesting file data </phrase>
+ <phrase>Lustre client requesting file data</phrase>
</textobject>
</mediaobject>
</figure>
- <para>The available bandwidth of a Lustre file system is determined as follows:</para>
+ <para>The available bandwidth of a Lustre file system is determined as
+ follows:</para>
<itemizedlist>
<listitem>
- <para>The <emphasis>network bandwidth</emphasis> equals the aggregated bandwidth of the OSSs
- to the targets.</para>
+ <para>The
+ <emphasis>network bandwidth</emphasis> equals the aggregated bandwidth
+ of the OSSs to the targets.</para>
</listitem>
<listitem>
- <para>The <emphasis>disk bandwidth</emphasis> equals the sum of the disk bandwidths of the
- storage targets (OSTs) up to the limit of the network bandwidth.</para>
+ <para>The
+ <emphasis>disk bandwidth</emphasis> equals the sum of the disk
+ bandwidths of the storage targets (OSTs) up to the limit of the network
+ bandwidth.</para>
</listitem>
<listitem>
- <para>The <emphasis>aggregate bandwidth</emphasis> equals the minimum of the disk bandwidth
- and the network bandwidth.</para>
+ <para>The
+ <emphasis>aggregate bandwidth</emphasis> equals the minimum of the disk
+ bandwidth and the network bandwidth.</para>
</listitem>
<listitem>
- <para>The <emphasis>available file system space</emphasis> equals the sum of the available
- space of all the OSTs.</para>
+ <para>The
+ <emphasis>available file system space</emphasis> equals the sum of the
+ available space of all the OSTs.</para>
</listitem>
</itemizedlist>
- <section xml:id="dbdoclet.50438250_89922">
+ <section xml:id="lustre_striping">
<title>
- <indexterm>
- <primary>Lustre</primary>
- <secondary>striping</secondary>
- </indexterm>
- <indexterm>
- <primary>striping</primary>
- <secondary>overview</secondary>
- </indexterm> Lustre File System and Striping</title>
- <para>One of the main factors leading to the high performance of Lustre file systems is the
- ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally
- configure for each file the number of stripes, stripe size, and OSTs that are used.</para>
- <para>Striping can be used to improve performance when the aggregate bandwidth to a single
- file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a
- single OST does not have enough free space to hold an entire file. For more information
- about benefits and drawbacks of file striping, see <xref linkend="dbdoclet.50438209_48033"
- />.</para>
- <para>Striping allows segments or 'chunks' of data in a file to be stored on
- different OSTs, as shown in <xref linkend="understandinglustre.fig.filestripe"/>. In the
- Lustre file system, a RAID 0 pattern is used in which data is "striped" across a
- certain number of objects. The number of objects in a single file is called the
- <literal>stripe_count</literal>.</para>
- <para>Each object contains a chunk of data from the file. When the chunk of data being written
- to a particular object exceeds the <literal>stripe_size</literal>, the next chunk of data in
- the file is stored on the next object.</para>
- <para>Default values for <literal>stripe_count</literal> and <literal>stripe_size</literal>
- are set for the file system. The default value for <literal>stripe_count</literal> is 1
- stripe for file and the default value for <literal>stripe_size</literal> is 1MB. The user
- may change these values on a per directory or per file basis. For more details, see <xref
- linkend="dbdoclet.50438209_78664"/>.</para>
- <para><xref linkend="understandinglustre.fig.filestripe"/>, the <literal>stripe_size</literal>
- for File C is larger than the <literal>stripe_size</literal> for File A, allowing more data
- to be stored in a single stripe for File C. The <literal>stripe_count</literal> for File A
- is 3, resulting in data striped across three objects, while the
- <literal>stripe_count</literal> for File B and File C is 1.</para>
- <para>No space is reserved on the OST for unwritten data. File A in <xref
- linkend="understandinglustre.fig.filestripe"/>.</para>
- <figure>
- <title xml:id="understandinglustre.fig.filestripe">File striping on a Lustre file
- system</title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>striping</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>striping</primary>
+ <secondary>overview</secondary>
+ </indexterm>Lustre File System and Striping</title>
+ <para>One of the main factors leading to the high performance of Lustre
+ file systems is the ability to stripe data across multiple OSTs in a
+ round-robin fashion. Users can optionally configure for each file the
+ number of stripes, stripe size, and OSTs that are used.</para>
+ <para>Striping can be used to improve performance when the aggregate
+ bandwidth to a single file exceeds the bandwidth of a single OST. The
+ ability to stripe is also useful when a single OST does not have enough
+ free space to hold an entire file. For more information about benefits
+ and drawbacks of file striping, see
+ <xref linkend="file_striping.considerations" />.</para>
+ <para>Striping allows segments or 'chunks' of data in a file to be stored
+ on different OSTs, as shown in
+ <xref linkend="understandinglustre.fig.filestripe" />. In the Lustre file
+ system, a RAID 0 pattern is used in which data is "striped" across a
+ certain number of objects. The number of objects in a single file is
+ called the
+ <literal>stripe_count</literal>.</para>
+ <para>Each object contains a chunk of data from the file. When the chunk
+ of data being written to a particular object exceeds the
+ <literal>stripe_size</literal>, the next chunk of data in the file is
+ stored on the next object.</para>
+ <para>Default values for
+ <literal>stripe_count</literal> and
+ <literal>stripe_size</literal> are set for the file system. The default
+ value for
+      <literal>stripe_count</literal> is 1 stripe per file and the default value
+      for
+      <literal>stripe_size</literal> is 1 MB. The user may change these values on
+ a per directory or per file basis. For more details, see
+ <xref linkend="file_striping.lfs_setstripe" />.</para>
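+      <para>For instance, a user could set a default layout of 4 stripes with
+      a 4 MB stripe size on a directory, then inspect the layout inherited by
+      a file created there (the paths below are hypothetical):</para>
+      <screen>client# lfs setstripe -c 4 -S 4M /mnt/testfs/dir
+client# lfs getstripe /mnt/testfs/dir/newfile</screen>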
+      <para>In
+      <xref linkend="understandinglustre.fig.filestripe" />, the
+ <literal>stripe_size</literal> for File C is larger than the
+ <literal>stripe_size</literal> for File A, allowing more data to be stored
+ in a single stripe for File C. The
+ <literal>stripe_count</literal> for File A is 3, resulting in data striped
+ across three objects, while the
+ <literal>stripe_count</literal> for File B and File C is 1.</para>
+      <para>No space is reserved on the OST for unwritten data. File A in
+      <xref linkend="understandinglustre.fig.filestripe" /> is a sparse file,
+      missing chunk 6.</para>
+ <figure xml:id="understandinglustre.fig.filestripe">
+ <title>File striping on a
+ Lustre file system</title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="100%" fileref="./figures/File_Striping.png"/>
+ <imagedata scalefit="1" width="100%"
+ fileref="./figures/File_Striping.png" />
</imageobject>
<textobject>
- <phrase>File striping pattern across three OSTs for three different data files. The file
- is sparse and missing chunk 6. </phrase>
+ <phrase>File striping pattern across three OSTs for three different
+ data files. The file is sparse and missing chunk 6.</phrase>
</textobject>
</mediaobject>
</figure>
- <para>The maximum file size is not limited by the size of a single target. In a Lustre file
- system, files can be striped across multiple objects (up to 2000), and each object can be
- up to 16 TB in size with ldiskfs. This leads to a maximum file size of 31.25 PB. (Note that
- a Lustre file system can support files up to 2^64 bytes depending on the backing storage
- used by OSTs.)</para>
+ <para>The maximum file size is not limited by the size of a single
+ target. In a Lustre file system, files can be striped across multiple
+ objects (up to 2000), and each object can be up to 16 TiB in size with
+      ldiskfs, or up to 256 PiB with ZFS. This leads to a maximum file size of
+      31.25 PiB for ldiskfs, or 8 EiB with ZFS. Note that a Lustre file system
+      can support files up to 2^63 bytes (8 EiB), limited only by the space
+      available on the OSTs.</para>
<note>
- <para>Versions of the Lustre software prior to Release 2.2 limited the maximum stripe count
- for a single file to 160 OSTs.</para>
+ <para>ldiskfs filesystems without the <literal>ea_inode</literal>
+ feature limit the maximum stripe count for a single file to 160 OSTs.
+ </para>
</note>
- <para>Although a single file can only be striped over 2000 objects, Lustre file systems can
- have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O
- bandwidth to the objects in a file, which can be as much as a bandwidth of up to 2000
- servers. On systems with more than 2000 OSTs, clients can do I/O using multiple files to
- utilize the full file system bandwidth.</para>
- <para>For more information about striping, see <xref linkend="managingstripingfreespace"
- />.</para>
+ <para>Although a single file can only be striped over 2000 objects,
+ Lustre file systems can have thousands of OSTs. The I/O bandwidth to
+ access a single file is the aggregated I/O bandwidth to the objects in a
+      file, which can be as much as the bandwidth of up to 2000 servers. On
+ systems with more than 2000 OSTs, clients can do I/O using multiple files
+ to utilize the full file system bandwidth.</para>
+ <para>For more information about striping, see
+ <xref linkend="managingstripingfreespace" />.</para>
+      <para>
+      <emphasis role="bold">Extended Attributes (xattrs)</emphasis></para>
+      <para>Lustre uses the <literal>lov_user_md_v1</literal>/
+      <literal>lov_user_md_v3</literal> data structures to maintain its file
+      striping information in xattrs. The extended attributes are created
+      when files and directories are created. Lustre uses
+      <literal>trusted</literal> extended attributes to store its parameters,
+      which are accessible only by root. The parameters are:</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis role="bold"><literal>trusted.lov</literal>:</emphasis>
+ Holds layout for a regular file, or default file layout stored
+ on a directory (also accessible as <literal>lustre.lov</literal>
+ for non-root users).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis role="bold"><literal>trusted.lma</literal>:</emphasis>
+ Holds FID and extra state flags for current file</para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis role="bold"><literal>trusted.lmv</literal>:</emphasis>
+ Holds layout for a striped directory (DNE 2), not present otherwise
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis role="bold"><literal>trusted.link</literal>:</emphasis>
+ Holds parent directory FID + filename for each link to a file
+ (for <literal>lfs fid2path</literal>)</para>
+ </listitem>
+ </itemizedlist>
+      <para>The xattrs stored with a file can be displayed using:</para>
+      <para><screen># getfattr -d -m - /mnt/testfs/file</screen></para>
</section>
</section>
</chapter>
+<!--
+ vim:expandtab:shiftwidth=2:tabstop=8:textwidth=80:
+ -->