X-Git-Url: https://git.whamcloud.com/?a=blobdiff_plain;f=UnderstandingLustre.xml;h=ffde1f780da0e3a3955aa76b1675f88ea81fdb03;hb=fcafa5bebef80213b4a1822edd83edc26894eccb;hp=dd95bc5789d3385a1e7ddcf9c7b05d4493f0fb19;hpb=09da8e9464945525cc66087da4446e4ba9958564;p=doc%2Fmanual.git diff --git a/UnderstandingLustre.xml b/UnderstandingLustre.xml index dd95bc5..ffde1f7 100644 --- a/UnderstandingLustre.xml +++ b/UnderstandingLustre.xml @@ -1,87 +1,107 @@ - - - Understanding Lustre Architecture - - This chapter describes the Lustre architecture and features of Lustre. It includes the - following sections: + + + Understanding Lustre + Architecture + This chapter describes the Lustre architecture and features of the + Lustre file system. It includes the following sections: - + - + - +
- <indexterm> - <primary>Lustre</primary> - </indexterm>What a Lustre File System Is (and What It Isn't) - The Lustre architecture is a storage architecture for clusters. The central component of - the Lustre architecture is the Lustre file system, which is supported on the Linux operating - system and provides a POSIX-compliant UNIX file system interface. - The Lustre storage architecture is used for many different kinds of clusters. It is best - known for powering many of the largest high-performance computing (HPC) clusters worldwide, - with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes - per second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system as a site-wide - global file system, serving dozens of clusters. - The ability of a Lustre file system to scale capacity and performance for any need reduces - the need to deploy many separate file systems, such as one for each compute cluster. Storage - management is simplified by avoiding the need to copy data between compute clusters. In - addition to aggregating storage capacity of many servers, the I/O throughput is also - aggregated and scales with additional servers. Moreover, throughput and/or capacity can be - easily increased by adding servers dynamically. - While a Lustre file system can function in many work environments, it is not necessarily - the best choice for all applications. It is best suited for uses that exceed the capacity that - a single server can provide, though in some use cases, a Lustre file system can perform better - with a single server than other file systems due to its strong locking and data - coherency. + + <indexterm> + <primary>Lustre</primary> + </indexterm>What a Lustre File System Is (and What It Isn't) + The Lustre architecture is a storage architecture for clusters. The + central component of the Lustre architecture is the Lustre file system, + which is supported on the Linux operating system and provides a POSIX + *standard-compliant UNIX file system + interface. + The Lustre storage architecture is used for many different kinds of + clusters. It is best known for powering many of the largest + high-performance computing (HPC) clusters worldwide, with tens of thousands + of client systems, petabytes (PiB) of storage and hundreds of gigabytes per + second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system + as a site-wide global file system, serving dozens of clusters. + The ability of a Lustre file system to scale capacity and performance + for any need reduces the need to deploy many separate file systems, such as + one for each compute cluster. Storage management is simplified by avoiding + the need to copy data between compute clusters. In addition to aggregating + storage capacity of many servers, the I/O throughput is also aggregated and + scales with additional servers. Moreover, throughput and/or capacity can be + easily increased by adding servers dynamically. + While a Lustre file system can function in many work environments, it + is not necessarily the best choice for all applications. It is best suited + for uses that exceed the capacity that a single server can provide, though + in some use cases, a Lustre file system can perform better with a single + server than other file systems due to its strong locking and data + coherency. 
A Lustre file system is currently not particularly well suited for - "peer-to-peer" usage models where clients and servers are running on the same node, - each sharing a small amount of storage, due to the lack of Lustre-level data replication. In - such uses, if one client/server fails, then the data stored on that node will not be - accessible until the node is restarted. + "peer-to-peer" usage models where clients and servers are running on the + same node, each sharing a small amount of storage, due to the lack of data + replication at the Lustre software level. In such uses, if one + client/server fails, then the data stored on that node will not be + accessible until the node is restarted.
- <indexterm> - <primary>Lustre</primary> - <secondary>features</secondary> - </indexterm>Lustre Features - Lustre file systems run on a variety of vendor's kernels. For more details, see the - Lustre Support - Matrix on the Intel Lustre community wiki. - A Lustre installation can be scaled up or down with respect to the number of client - nodes, disk storage and bandwidth. Scalability and performance are dependent on available - disk and network bandwidth and the processing power of the servers in the system. A Lustre - file system can be deployed in a wide variety of configurations that can be scaled well - beyond the size and performance observed in production systems to date. - shows the practical range of scalability and - performance characteristics of a Lustre file system and some test results in production - systems. - - Lustre Scalability and Performance + + <indexterm> + <primary>Lustre</primary> + <secondary>features</secondary> + </indexterm>Lustre Features + Lustre file systems run on a variety of vendor's kernels. For more + details, see the Lustre Test Matrix + . + A Lustre installation can be scaled up or down with respect to the + number of client nodes, disk storage and bandwidth. Scalability and + performance are dependent on available disk and network bandwidth and the + processing power of the servers in the system. A Lustre file system can + be deployed in a wide variety of configurations that can be scaled well + beyond the size and performance observed in production systems to + date. + + shows some of the + scalability and performance characteristics of a Lustre file system. + For a full list of Lustre file and filesystem limits see + . +
+ Lustre File System Scalability and Performance - - - + + + - Feature + + Feature + - Current Practical Range + + Current Practical Range + - Tested in Production + + Known Production Usage + @@ -89,360 +109,498 @@ - Client Scalability + Client Scalability + - 100-100000 + 100-100000 - 50000+ clients, many in the 10000 to 20000 range + 50000+ clients, many in the 10000 to 20000 range - Client Performance + + Client Performance + - Single client: + Single client: + I/O 90% of network bandwidth - Aggregate: - 2.5 TB/sec I/O + + Aggregate: + + 10 TB/sec I/O - Single client: - 2 GB/sec I/O, 1000 metadata ops/sec - Aggregate: - 240 GB/sec I/O + Single client: + + 4.5 GB/sec I/O (FDR IB, OPA1), + 1000 metadata ops/sec + + Aggregate: + + 2.5 TB/sec I/O - OSS Scalability + OSS Scalability + - Single OSS: - 1-32 OSTs per OSS, - 128TB per OST + Single OSS: + + 1-32 OSTs per OSS - OSS count: - 500 OSSs, with up to 4000 OSTs + Single OST: + + 300M objects, 256TiB per OST (ldiskfs) + 500M objects, 256TiB per OST (ZFS) + + OSS count: + + 1000 OSSs, with up to 4000 OSTs - Single OSS: - 8 OSTs per OSS, - 16TB per OST + Single OSS: + + 32x 8TiB OSTs per OSS (ldiskfs), + 8x 32TiB OSTs per OSS (ldiskfs) + 1x 72TiB OST per OSS (ZFS) - OSS count: - 450 OSSs with 1000 4TB OSTs - 192 OSSs with 1344 8TB OSTs + OSS count: + + 450 OSSs with 1000 4TiB OSTs + 192 OSSs with 1344 8TiB OSTs + 768 OSSs with 768 72TiB OSTs - OSS Performance + OSS Performance + - Single OSS: - 5 GB/sec + Single OSS: + + 15 GB/sec - Aggregate: - 2.5 TB/sec + Aggregate: + + 10 TB/sec - Single OSS: - 2.0+ GB/sec + Single OSS: + + 10 GB/sec - Aggregate: - 240 GB/sec + Aggregate: + + 2.5 TB/sec - MDS Scalability + MDS Scalability + - Single MDS: - 4 billion files + Single MDS: + + 1-4 MDTs per MDS + + Single MDT: + + 4 billion files, 8TiB per MDT (ldiskfs) + 64 billion files, 64TiB per MDT (ZFS) - MDS count: - 1 primary + 1 backup - Since Lustre* Release 2.4: up to 4096 MDSs and up to 4096 - MDTs. + MDS count: + + 1 primary + 1 standby + 256 MDSs, with up to 256 MDTs - Single MDS: - 750 million files + Single MDS: + + 3 billion files - MDS count: - 1 primary + 1 backup + MDS count: + + 7 MDS with 7 2TiB MDTs in production + 256 MDS with 256 64GiB MDTs in testing - MDS Performance + MDS Performance + - 35000/s create operations, - 100000/s metadata stat operations + 50000/s create operations, + 200000/s metadata stat operations - 15000/s create operations, - 35000/s metadata stat operations + 15000/s create operations, + 50000/s metadata stat operations - File system Scalability + File system Scalability + - Single File: - 2.5 PB max file size + Single File: + + 32 PiB max file size (ldiskfs) + 2^63 bytes (ZFS) - Aggregate: - 512 PB space, 4 billion files + Aggregate: + + 512 PiB space, 1 trillion files - Single File: - multi-TB max file size + Single File: + + multi-TiB max file size - Aggregate: - 10 PB space, 750 million files + Aggregate: + + 55 PiB space, 8 billion files
- Other Lustre features are:
+ Other Lustre software features are:
- Performance-enhanced ext4 file system: The Lustre
- file system uses an improved version of the ext4 journaling file system to store data
- and metadata. This version, called ldiskfs, has been enhanced to improve performance and
- provide additional functionality needed by the Lustre file system.
+ Performance-enhanced ext4 file system: The Lustre file system uses
+ an improved version of the ext4 journaling file system to store data
+ and metadata. This version, called ldiskfs, has been enhanced to
+ improve performance and provide additional functionality needed by the
+ Lustre file system.
- POSIX* compliance: The full POSIX test suite passes
- in an identical manner to a local ext4 filesystem, with limited exceptions on Lustre
- clients. In a cluster, most operations are atomic so that clients never see stale data
- or metadata. The Lustre software supports mmap() file I/O.
+ With the Lustre software release 2.4 and later, it is also possible
+ to use ZFS as the backing filesystem for Lustre for the MDT, OST, and
+ MGS storage. This allows Lustre to leverage the scalability and data
+ integrity features of ZFS for individual storage targets.
- High-performance heterogeneous networking: The
- Lustre software supports a variety of high performance, low latency networks and permits
- Remote Direct Memory Access (RDMA) for Infiniband* (OFED) and other advanced networks
- for fast and efficient network transport. Multiple RDMA networks can be bridged using
- Lustre routing for maximum performance. The Lustre software also includes integrated
- network diagnostics.
+ POSIX standard compliance: The full POSIX test suite passes in an
+ identical manner to a local ext4 file system, with limited exceptions
+ on Lustre clients. In a cluster, most operations are atomic so that
+ clients never see stale data or metadata. The Lustre software supports
+ mmap() file I/O.
- High-availability: The Lustre file system supports
- active/active failover using shared storage partitions for OSS targets (OSTs). Lustre
- Release 2.3 and earlier releases offer active/passive failover using a shared storage
- partition for the MDS target (MDT).
- With Lustre Release 2.4 or later servers and clients it is possible
- to configure active/active failover of multiple MDTs. This allows application
- transparent recovery. The Lustre file system can work with a variety of high
- availability (HA) managers to allow automated failover and has no single point of
- failure (NSPF). Multiple mount protection (MMP) provides integrated protection from
- errors in highly-available systems that would otherwise cause file system
- corruption.
+ High-performance heterogeneous networking: The Lustre software
+ supports a variety of high performance, low latency networks and
+ permits Remote Direct Memory Access (RDMA) for InfiniBand* (utilizing
+ OpenFabrics Enterprise Distribution (OFED*)), Intel OmniPath®, and
+ other advanced networks for fast and efficient network transport.
+ Multiple RDMA networks can be bridged using Lustre routing for maximum
+ performance. The Lustre software also includes integrated network
+ diagnostics.
- Security: By default TCP connections are only
- allowed from privileged ports. UNIX group membership is verified on the MDS.
+ High-availability: The Lustre file system supports active/active
+ failover using shared storage partitions for OSS targets (OSTs).
Lustre software release 2.3 and + earlier releases offer active/passive failover using a shared storage + partition for the MDS target (MDT). The Lustre file system can work + with a variety of high availability (HA) managers to allow automated + failover and has no single point of failure (NSPF). This allows + application transparent recovery. Multiple mount protection (MMP) + provides integrated protection from errors in highly-available + systems that would otherwise cause file system corruption. - Access control list (ACL), extended attributes: the - Lustre security model follows that of a UNIX file system, enhanced with POSIX ACLs. - Noteworthy additional features include root squash. + With Lustre software release 2.4 or later + servers and clients it is possible to configure active/active + failover of multiple MDTs. This allows scaling the metadata + performance of Lustre filesystems with the addition of MDT storage + devices and MDS nodes. - Interoperability: The Lustre file system runs on a - variety of CPU architectures and mixed-endian clusters and is interoperable between - successive major Lustre software releases. + + Security:By default TCP connections + are only allowed from privileged ports. UNIX group membership is + verified on the MDS. - Object-based architecture: Clients are isolated - from the on-disk file structure enabling upgrading of the storage architecture without - affecting the client. + + Access control list (ACL), extended + attributes:the Lustre security model follows that of a + UNIX file system, enhanced with POSIX ACLs. Noteworthy additional + features include root squash. - Byte-granular file and fine-grained metadata - locking: Many clients can read and modify the same file or directory - concurrently. The Lustre distributed lock manager (LDLM) ensures that files are coherent - between all clients and servers in the file system. The MDT LDLM manages locks on inode - permissions and pathnames. Each OST has its own LDLM for locks on file stripes stored - thereon, which scales the locking performance as the file system grows. + + Interoperability:The Lustre file + system runs on a variety of CPU architectures and mixed-endian + clusters and is interoperable between successive major Lustre + software releases. - Quotas: User and group quotas are available for a - Lustre file system. + + Object-based architecture:Clients + are isolated from the on-disk file structure enabling upgrading of + the storage architecture without affecting the client. - Capacity growth: The size of a Lustre file system - and aggregate cluster bandwidth can be increased without interruption by adding a new - OSS with OSTs to the cluster. + + Byte-granular file and fine-grained metadata + locking:Many clients can read and modify the same file or + directory concurrently. The Lustre distributed lock manager (LDLM) + ensures that files are coherent between all clients and servers in + the file system. The MDT LDLM manages locks on inode permissions and + pathnames. Each OST has its own LDLM for locks on file stripes stored + thereon, which scales the locking performance as the file system + grows. - Controlled striping: The layout of files across - OSTs can be configured on a per file, per directory, or per file system basis. This - allows file I/O to be tuned to specific application requirements within a single file - system. The Lustre file system uses RAID-0 striping and balances space usage across - OSTs. + + Quotas:User and group quotas are + available for a Lustre file system. 
- Network data integrity protection: A checksum of - all data sent from the client to the OSS protects against corruption during data - transfer. + + Capacity growth:The size of a Lustre + file system and aggregate cluster bandwidth can be increased without + interruption by adding new OSTs and MDTs to the cluster. - MPI I/O: The Lustre architecture has a dedicated - MPI ADIO layer that optimizes parallel I/O to match the underlying file system - architecture. + + Controlled file layout:The layout of + files across OSTs can be configured on a per file, per directory, or + per file system basis. This allows file I/O to be tuned to specific + application requirements within a single file system. The Lustre file + system uses RAID-0 striping and balances space usage across + OSTs. - NFS and CIFS export: Lustre files can be re-exported using NFS (via Linux knfsd) or CIFS (via Samba) enabling them to be shared with non-Linux clients, such as Microsoft* Windows* and Apple* Mac OS X*. + + Network data integrity protection:A + checksum of all data sent from the client to the OSS protects against + corruption during data transfer. - Disaster recovery tool: The Lustre file system - provides a distributed file system check (lfsck) that can restore consistency between - storage components in case of a major file system error. A Lustre file system can - operate even in the presence of file system inconsistencies, so lfsck is not required - before returning the file system to production. + + MPI I/O:The Lustre architecture has + a dedicated MPI ADIO layer that optimizes parallel I/O to match the + underlying file system architecture. - Performance monitoring: The Lustre file system - offers a variety of mechanisms to examine performance and tuning. + + NFS and CIFS export:Lustre files can + be re-exported using NFS (via Linux knfsd or Ganesha) or CIFS (via + Samba), enabling them to be shared with non-Linux clients such as + Microsoft*Windows, + *Apple + *Mac OS X + *, and others. - Open source: The Lustre software is licensed under - the GPL 2.0 license for use with Linux. + + Disaster recovery tool:The Lustre + file system provides an online distributed file system check (LFSCK) + that can restore consistency between storage components in case of a + major file system error. A Lustre file system can operate even in the + presence of file system inconsistencies, and LFSCK can run while the + filesystem is in use, so LFSCK is not required to complete before + returning the file system to production. + + + + Performance monitoring:The Lustre + file system offers a variety of mechanisms to examine performance and + tuning. + + + + Open source:The Lustre software is + licensed under the GPL 2.0 license for use with the Linux operating + system.
- <indexterm> - <primary>Lustre</primary> - <secondary>components</secondary> - </indexterm>Lustre Components - An installation of the Lustre software includes a management server (MGS) and one or more - Lustre file systems interconnected with Lustre networking (LNET). - A basic configuration of Lustre components is shown in . -
- Lustre* components in a basic cluster + + <indexterm> + <primary>Lustre</primary> + <secondary>components</secondary> + </indexterm>Lustre Components + An installation of the Lustre software includes a management server + (MGS) and one or more Lustre file systems interconnected with Lustre + networking (LNet). + A basic configuration of Lustre file system components is shown in + . +
+ Lustre file system components in a basic cluster - + - Lustre* components in a basic cluster + Lustre file system components in a basic cluster
- <indexterm> - <primary>Lustre</primary> - <secondary>MGS</secondary> - </indexterm>Management Server (MGS) - The MGS stores configuration information for all the Lustre file systems in a cluster - and provides this information to other Lustre components. Each Lustre target contacts the - MGS to provide information, and Lustre clients contact the MGS to retrieve - information. - It is preferable that the MGS have its own storage space so that it can be managed - independently. However, the MGS can be co-located and share storage space with an MDS as - shown in . + + <indexterm> + <primary>Lustre</primary> + <secondary>MGS</secondary> + </indexterm>Management Server (MGS) + The MGS stores configuration information for all the Lustre file + systems in a cluster and provides this information to other Lustre + components. Each Lustre target contacts the MGS to provide information, + and Lustre clients contact the MGS to retrieve information. + It is preferable that the MGS have its own storage space so that it + can be managed independently. However, the MGS can be co-located and + share storage space with an MDS as shown in + .
Lustre File System Components - Each Lustre file system consists of the following components: + Each Lustre file system consists of the following + components: - Metadata Server (MDS) - The MDS makes metadata - stored in one or more MDTs available to Lustre clients. Each MDS manages the names and - directories in the Lustre file system(s) and provides network request handling for one - or more local MDTs. + + Metadata Servers (MDS)- The MDS makes + metadata stored in one or more MDTs available to Lustre clients. Each + MDS manages the names and directories in the Lustre file system(s) + and provides network request handling for one or more local + MDTs. - Metadata Target (MDT ) - For Lustre Release 2.3 and - earlier, each file system has one MDT. The MDT stores metadata (such as filenames, - directories, permissions and file layout) on storage attached to an MDS. Each file - system has one MDT. An MDT on a shared storage target can be available to multiple MDSs, - although only one can access it at a time. If an active MDS fails, a standby MDS can - serve the MDT and make it available to clients. This is referred to as MDS - failover. - Since Lustre Release 2.4, multiple MDTs are supported. Each file - system has at least one MDT. An MDT on a shared storage target can be available via - multiple MDSs, although only one MDS can export the MDT to the clients at one time. Two - MDS machines share storage for two or more MDTs. After the failure of one MDS, the - remaining MDS begins serving the MDT(s) of the failed MDS. + + Metadata Targets (MDT) - For Lustre + software release 2.3 and earlier, each file system has one MDT. The + MDT stores metadata (such as filenames, directories, permissions and + file layout) on storage attached to an MDS. Each file system has one + MDT. An MDT on a shared storage target can be available to multiple + MDSs, although only one can access it at a time. If an active MDS + fails, a standby MDS can serve the MDT and make it available to + clients. This is referred to as MDS failover. + Since Lustre software release 2.4, multiple + MDTs are supported in the Distributed Namespace Environment (DNE). + In addition to the primary MDT that holds the filesystem root, it + is possible to add additional MDS nodes, each with their own MDTs, + to hold sub-directory trees of the filesystem. + Since Lustre software release 2.8, DNE also + allows the filesystem to distribute files of a single directory over + multiple MDT nodes. A directory which is distributed across multiple + MDTs is known as a striped directory. - Object Storage Servers (OSS) : The OSS provides - file I/O service and network request handling for one or more local OSTs. Typically, an - OSS serves between two and eight OSTs, up to 16 TB each. A typical configuration is an - MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a - large number of compute nodes. + + Object Storage Servers (OSS): The + OSS provides file I/O service and network request handling for one or + more local OSTs. Typically, an OSS serves between two and eight OSTs, + up to 16 TiB each. A typical configuration is an MDT on a dedicated + node, two or more OSTs on each OSS node, and a client on each of a + large number of compute nodes. - Object Storage Target (OST) : User file data is - stored in one or more objects, each object on a separate OST in a Lustre file system. - The number of objects per file is configurable by the user and can be tuned to optimize - performance for a given workload. 
+ + Object Storage Target (OST): User + file data is stored in one or more objects, each object on a separate + OST in a Lustre file system. The number of objects per file is + configurable by the user and can be tuned to optimize performance for + a given workload. - Lustre clients : Lustre clients are computational, - visualization or desktop nodes that are running Lustre client software, allowing them to - mount the Lustre file system. + + Lustre clients: Lustre clients are + computational, visualization or desktop nodes that are running Lustre + client software, allowing them to mount the Lustre file + system. - The Lustre client software provides an interface between the Linux virtual file system - and the Lustre servers. The client software includes a management client (MGC), a metadata - client (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in - the file system. - A logical object volume (LOV) aggregates the OSCs to provide transparent access across - all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, - synchronized namespace. Several clients can write to different parts of the same file - simultaneously, while, at the same time, other clients can read from the file. - provides the requirements for - attached storage for each Lustre file system component and describes desirable - characteristics of the hardware used. - - <indexterm> - <primary>Lustre</primary> - <secondary>requirements</secondary> - </indexterm>Storage and hardware requirements for Lustre* components + The Lustre client software provides an interface between the Linux + virtual file system and the Lustre servers. The client software includes + a management client (MGC), a metadata client (MDC), and multiple object + storage clients (OSCs), one corresponding to each OST in the file + system. + A logical object volume (LOV) aggregates the OSCs to provide + transparent access across all the OSTs. Thus, a client with the Lustre + file system mounted sees a single, coherent, synchronized namespace. + Several clients can write to different parts of the same file + simultaneously, while, at the same time, other clients can read from the + file. + A logical metadata volume (LMV) aggregates the MDCs to provide + transparent access across all the MDTs in a similar manner as the LOV + does for file access. This allows the client to see the directory tree + on multiple MDTs as a single coherent namespace, and striped directories + are merged on the clients to form a single visible directory to users + and applications. + + + provides the + requirements for attached storage for each Lustre file system component + and describes desirable characteristics of the hardware used. +
+ + <indexterm> + <primary>Lustre</primary> + <secondary>requirements</secondary> + </indexterm>Storage and hardware requirements for Lustre file system + components - - - + + + - + + + - Required attached storage + + Required attached storage + - Desirable hardware characteristics + + Desirable hardware + characteristics + @@ -450,217 +608,307 @@ - MDSs + MDSs + - 1-2% of file system capacity + 1-2% of file system capacity - Adequate CPU power, plenty of memory, fast disk storage. + Adequate CPU power, plenty of memory, fast disk + storage. - OSSs + OSSs + - 1-16 TB per OST, 1-8 OSTs per OSS + 1-128 TiB per OST, 1-8 OSTs per OSS - Good bus bandwidth. Recommended that storage be balanced evenly across - OSSs. + Good bus bandwidth. Recommended that storage be balanced + evenly across OSSs and matched to network bandwidth. - Clients + Clients + - None + No local storage needed - Low latency, high bandwidth network. + Low latency, high bandwidth network.
- For additional hardware requirements and considerations, see . + For additional hardware requirements and considerations, see + .
- <indexterm> - <primary>Lustre</primary> - <secondary>LNET</secondary> - </indexterm>Lustre Networking (LNET) - Lustre Networking (LNET) is a custom networking API that provides the communication - infrastructure that handles metadata and file I/O data for the Lustre file system servers - and clients. For more information about LNET, see . + + <indexterm> + <primary>Lustre</primary> + <secondary>LNet</secondary> + </indexterm>Lustre Networking (LNet) + Lustre Networking (LNet) is a custom networking API that provides + the communication infrastructure that handles metadata and file I/O data + for the Lustre file system servers and clients. For more information + about LNet, see + .
- <indexterm> + <title> + <indexterm> + <primary>Lustre</primary> + <secondary>cluster</secondary> + </indexterm>Lustre Cluster + At scale, a Lustre file system cluster can include hundreds of OSSs + and thousands of clients (see + ). More than one + type of network can be used in a Lustre cluster. Shared storage between + OSSs enables failover capability. For more details about OSS failover, + see + . +
+ + <indexterm> <primary>Lustre</primary> - <secondary>cluster</secondary> - </indexterm>Lustre Cluster - At scale, the Lustre cluster can include hundreds of OSSs and thousands of clients (see - ). More than one type of network can - be used in a Lustre cluster. Shared storage between OSSs enables failover capability. For - more details about OSS failover, see . -
- <indexterm> - <primary>Lustre</primary> - <secondary>at scale</secondary> - </indexterm>Lustre* cluster at scale + at scale + Lustre cluster at scale - + - Lustre* clustre at scale + Lustre file system cluster at scale
- <indexterm> - <primary>Lustre</primary> - <secondary>storage</secondary> - </indexterm> - <indexterm> - <primary>Lustre</primary> - <secondary>I/O</secondary> - </indexterm> Lustre Storage and I/O - In a Lustre file system, a file stored on the MDT points to one or more objects associated - with a data file, as shown in . Each object - contains data and is stored on an OST. If the MDT file points to one object, all the file data - is stored in that object. If the file points to more than one object, the file data is - 'striped' across the objects (using RAID 0) and each object is stored on a different - OST. (For more information about how striping is implemented in a Lustre file system, see - ) - In , each filename points to an inode. The - inode contains all of the file attributes, such as owner, access permissions, Lustre striping - layout, access time, and access control. Multiple filenames may point to the same - inode. -
- MDT file points to objects on OSTs containing
- file data
+ <indexterm>
+   <primary>Lustre</primary>
+   <secondary>storage</secondary>
+ </indexterm>
+ <indexterm>
+   <primary>Lustre</primary>
+   <secondary>I/O</secondary>
+ </indexterm>Lustre File System Storage and I/O
+ In Lustre software release 2.0, Lustre file identifiers (FIDs) were
+ introduced to replace UNIX inode numbers for identifying files or objects.
+ A FID is a 128-bit identifier that contains a unique 64-bit sequence
+ number, a 32-bit object ID (OID), and a 32-bit version number. The sequence
+ number is unique across all Lustre targets in a file system (OSTs and
+ MDTs). This change enabled future support for multiple MDTs (introduced in
+ Lustre software release 2.4) and ZFS (introduced in Lustre software release
+ 2.4).
+ Also introduced in release 2.0 is an ldiskfs feature named
+ FID-in-dirent (also known as dirdata) in which the FID is stored as
+ part of the name of the file in the parent directory. This feature
+ significantly improves performance for ls command executions by reducing
+ disk I/O. The FID-in-dirent is generated at the time the file is created.
+ The FID-in-dirent feature is not backward compatible with the
+ release 1.8 ldiskfs disk format. Therefore, when an upgrade from
+ release 1.8 to release 2.x is performed, the FID-in-dirent feature is
+ not automatically enabled. For upgrades from release 1.8 to releases
+ 2.0 through 2.3, FID-in-dirent can be enabled manually but only takes
+ effect for new files.
+ For more information about upgrading from Lustre software release
+ 1.8 and enabling FID-in-dirent for existing files, see
+ Chapter 16 “Upgrading a Lustre File System”.
+ The LFSCK file system consistency checking tool
+ released with Lustre software release 2.4 provides functionality that
+ enables FID-in-dirent for existing files. It includes the following
+ functionality:
+ Generates IGIF mode FIDs for existing files from a release 1.8
+ file system.
+ Verifies the FID-in-dirent for each file and regenerates the
+ FID-in-dirent if it is invalid or missing.
+ Verifies the linkEA entry for each file and regenerates the linkEA
+ if it is invalid or missing. The linkEA consists of the file name and
+ parent FID. It is stored as an extended attribute in the file
+ itself. Thus, the linkEA can be used to reconstruct the full path name
+ of a file.
+ Information about where file data is located on the OST(s) is stored
+ as an extended attribute called layout EA in an MDT object identified by
+ the FID for the file (see
+ ). If the file is a regular file (not a
+ directory or symbolic link), the MDT object points to 1-to-N OST object(s)
+ on the OST(s) that contain the file data. If the MDT file points to one
+ object, all the file data is stored in that object. If the MDT file points
+ to more than one object, the file data is
+ striped across the objects using RAID 0,
+ and each object is stored on a different OST. (For more information about
+ how striping is implemented in a Lustre file system, see
+ .)
+ Layout EA on MDT pointing to file data on OSTs - + - MDT file points to objects on OSTs containing file data + Layout EA on MDT pointing to file data on OSTs
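To make the FID layout described above concrete, here is a minimal Python sketch that packs and unpacks the three fields of a FID (64-bit sequence number, 32-bit object ID, 32-bit version). It illustrates the bit layout only; the class and method names are hypothetical and are not taken from the Lustre sources.

from dataclasses import dataclass

@dataclass(frozen=True)
class Fid:
    # Hypothetical helper: models the 128-bit FID described in this section.
    seq: int  # 64-bit sequence number, unique across all targets (MDTs and OSTs)
    oid: int  # 32-bit object ID within that sequence
    ver: int  # 32-bit version number

    def pack(self) -> int:
        # Pack the three fields into a single 128-bit integer.
        return (self.seq << 64) | (self.oid << 32) | self.ver

    @staticmethod
    def unpack(value: int) -> "Fid":
        # Reverse of pack(): split a 128-bit integer back into its fields.
        return Fid(seq=value >> 64,
                   oid=(value >> 32) & 0xFFFFFFFF,
                   ver=value & 0xFFFFFFFF)

fid = Fid(seq=0x200000401, oid=0x1, ver=0x0)
assert Fid.unpack(fid.pack()) == fid
print(f"[{fid.seq:#x}:{fid.oid:#x}:{fid.ver:#x}]")   # [0x200000401:0x1:0x0]

The three-part hexadecimal form printed at the end mirrors the way FIDs are conventionally written.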
- When a client opens a file, the fileopen operation transfers the file - layout from the MDS to the client. The client then uses this information to perform I/O on the - file, directly interacting with the OSS nodes where the objects are stored. This process is - illustrated in . -
- File open and file I/O in Lustre* + When a client wants to read from or write to a file, it first fetches + the layout EA from the MDT object for the file. The client then uses this + information to perform I/O on the file, directly interacting with the OSS + nodes where the objects are stored. + + This process is illustrated in + + . +
+ Lustre client requesting file data - + - File open and file I/O in Lustre* + Lustre client requesting file data
- Each file on the MDT contains the layout of the associated data file, including the OST - number and object identifier. Clients request the file layout from the MDS and then perform - file I/O operations by communicating directly with the OSSs that manage that file data. - The available bandwidth of a Lustre file system is determined as follows: + The available bandwidth of a Lustre file system is determined as + follows: - The network bandwidth equals the aggregated bandwidth of the OSSs - to the targets. + The + network bandwidth equals the aggregated bandwidth + of the OSSs to the targets. - The disk bandwidth equals the sum of the disk bandwidths of the - storage targets (OSTs) up to the limit of the network bandwidth. + The + disk bandwidth equals the sum of the disk + bandwidths of the storage targets (OSTs) up to the limit of the network + bandwidth. - The aggregate bandwidth equals the minimum of the disk bandwidth - and the network bandwidth. + The + aggregate bandwidth equals the minimum of the disk + bandwidth and the network bandwidth. - The available file system space equals the sum of the available - space of all the OSTs. + The + available file system space equals the sum of the + available space of all the OSTs.
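As a worked example of the rules above, the following short Python sketch computes the aggregate bandwidth and available space for a small, entirely hypothetical configuration (the hardware numbers are invented for illustration only):

# Invented numbers for illustration: 4 OSSs, each with a 12.5 GB/s network
# link and 8 OSTs; every OST sustains 2 GB/s and holds 100 TiB.
num_oss = 4
oss_network_gbps = 12.5
osts_per_oss = 8
ost_disk_gbps = 2.0
ost_space_tib = 100

network_bandwidth = num_oss * oss_network_gbps                # 50 GB/s
disk_bandwidth = num_oss * osts_per_oss * ost_disk_gbps       # 64 GB/s
aggregate_bandwidth = min(network_bandwidth, disk_bandwidth)  # 50 GB/s
available_space = num_oss * osts_per_oss * ost_space_tib      # 3200 TiB

print(aggregate_bandwidth, "GB/s,", available_space, "TiB")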
- <indexterm> - <primary>Lustre</primary> - <secondary>striping</secondary> - </indexterm> - <indexterm> - <primary>striping</primary> - <secondary>overview</secondary> - </indexterm> Lustre File System and Striping - One of the main factors leading to the high performance of Lustre file systems is the - ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally - configure for each file the number of stripes, stripe size, and OSTs that are used. - Striping can be used to improve performance when the aggregate bandwidth to a single - file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a - single OST does not have enough free space to hold an entire file. For more information - about benefits and drawbacks of file striping, see . - Striping allows segments or 'chunks' of data in a file to be stored on - different OSTs, as shown in . In the - Lustre file system, a RAID 0 pattern is used in which data is "striped" across a - certain number of objects. The number of objects in a single file is called the - stripe_count. - Each object contains a chunk of data from the file. When the chunk of data being written - to a particular object exceeds the stripe_size, the next chunk of data in - the file is stored on the next object. - Default values for stripe_count and stripe_size - are set for the file system. The default value for stripe_count is 1 - stripe for file and the default value for stripe_size is 1MB. The user - may change these values on a per directory or per file basis. For more details, see . - , the stripe_size - for File C is larger than the stripe_size for File A, allowing more data - to be stored in a single stripe for File C. The stripe_count for File A - is 3, resulting in data striped across three objects, while the - stripe_count for File B and File C is 1. - No space is reserved on the OST for unwritten data. File A in . -
- File striping on a Lustre* file - system + + Lustre + striping + + + striping + overview + Lustre File System and Striping + One of the main factors leading to the high performance of Lustre + file systems is the ability to stripe data across multiple OSTs in a + round-robin fashion. Users can optionally configure for each file the + number of stripes, stripe size, and OSTs that are used. + Striping can be used to improve performance when the aggregate + bandwidth to a single file exceeds the bandwidth of a single OST. The + ability to stripe is also useful when a single OST does not have enough + free space to hold an entire file. For more information about benefits + and drawbacks of file striping, see + . + Striping allows segments or 'chunks' of data in a file to be stored + on different OSTs, as shown in + . In the Lustre file + system, a RAID 0 pattern is used in which data is "striped" across a + certain number of objects. The number of objects in a single file is + called the + stripe_count. + Each object contains a chunk of data from the file. When the chunk + of data being written to a particular object exceeds the + stripe_size, the next chunk of data in the file is + stored on the next object. + Default values for + stripe_count and + stripe_size are set for the file system. The default + value for + stripe_count is 1 stripe for file and the default value + for + stripe_size is 1MB. The user may change these values on + a per directory or per file basis. For more details, see + . + + , the + stripe_size for File C is larger than the + stripe_size for File A, allowing more data to be stored + in a single stripe for File C. The + stripe_count for File A is 3, resulting in data striped + across three objects, while the + stripe_count for File B and File C is 1. + No space is reserved on the OST for unwritten data. File A in + . +
+ File striping on a + Lustre file system - + - File striping pattern across three OSTs for three different data files. The file - is sparse and missing chunk 6. + File striping pattern across three OSTs for three different + data files. The file is sparse and missing chunk 6.
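The RAID 0 mapping described in this section can be sketched in a few lines of Python. This is only an illustration of the striping arithmetic, not Lustre code; the function name and the assumption of a 1 MiB stripe_size (the stated default) are for the example only.

MiB = 1 << 20

def chunk_location(offset, stripe_count, stripe_size=MiB):
    # Hypothetical helper: RAID 0 mapping of a file offset to the object
    # (stripe) holding it and the offset inside that object.
    chunk_index = offset // stripe_size           # which chunk of the file
    object_index = chunk_index % stripe_count     # which object stores it
    stripe_round = chunk_index // stripe_count    # completed round-robin passes
    object_offset = stripe_round * stripe_size + offset % stripe_size
    return object_index, object_offset

# Six consecutive 1 MiB chunks of a file with stripe_count=3 (like File A):
for chunk in range(6):
    obj, off = chunk_location(chunk * MiB, stripe_count=3)
    print(f"chunk {chunk + 1} -> object {obj}, object offset {off // MiB} MiB")

For a stripe_count of 3, the chunks rotate across objects 0, 1 and 2 and then wrap back to object 0 at the next stripe_size offset within each object, which matches the round-robin pattern described for File A above.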
- The maximum file size is not limited by the size of a single target. In a Lustre file
- system, files can be striped across multiple objects (up to 2000), and each object can be
- up to 16 TB in size with ldiskfs. This leads to a maximum file size of 31.25 PB. (Note that
- a Lustre file system can support files up to 2^64 bytes depending on the backing storage
- used by OSTs.)
+ The maximum file size is not limited by the size of a single
+ target. In a Lustre file system, files can be striped across multiple
+ objects (up to 2000), and each object can be up to 16 TiB in size with
+ ldiskfs, or up to 256 PiB with ZFS. This leads to a maximum file size of
+ 31.25 PiB for ldiskfs or 8 EiB with ZFS. Note that a Lustre file system can
+ support files up to 2^63 bytes (8 EiB), limited only by the space available
+ on the OSTs.
- Versions of the Lustre software prior to Release 2.2 limited the maximum stripe count
- for a single file to 160 OSTs.
+ Versions of the Lustre software prior to Release 2.2 limited the
+ maximum stripe count for a single file to 160 OSTs.
- Although a single file can only be striped over 2000 objects, Lustre file systems can
- have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O
- bandwidth to the objects in a file, which can be as much as a bandwidth of up to 2000
- servers. On systems with more than 2000 OSTs, clients can do I/O using multiple files to
- utilize the full file system bandwidth.
- For more information about striping, see .
+ Although a single file can only be striped over 2000 objects,
+ Lustre file systems can have thousands of OSTs. The I/O bandwidth to
+ access a single file is the aggregated I/O bandwidth to the objects in a
+ file, which can be as much as a bandwidth of up to 2000 servers. On
+ systems with more than 2000 OSTs, clients can do I/O using multiple files
+ to utilize the full file system bandwidth.
+ For more information about striping, see
+ .
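The maximum-file-size figures quoted above follow from simple arithmetic on the per-file object limit and the per-object size limit; a quick Python check, assuming the 2000-object and 16 TiB ldiskfs limits stated in this section:

TiB = 1 << 40
PiB = 1 << 50
EiB = 1 << 60

max_objects_per_file = 2000        # maximum stripe count for one file
max_object_size = 16 * TiB         # per-object limit with ldiskfs

print(max_objects_per_file * max_object_size / PiB)  # 31.25 (PiB, ldiskfs)
print(2 ** 63 / EiB)                                  # 8.0 (EiB upper bound)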