1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
4 xml:id="understandinglustre">
5 <title xml:id="understandinglustre.title">Understanding Lustre
7 <para>This chapter describes the Lustre architecture and features of the
8 Lustre file system. It includes the following sections:</para>
12 <xref linkend="understandinglustre.whatislustre" />
17 <xref linkend="understandinglustre.components" />
22 <xref linkend="understandinglustre.storageio" />
26 <section xml:id="understandinglustre.whatislustre">
29 <primary>Lustre</primary>
30 </indexterm>What a Lustre File System Is (and What It Isn't)</title>
31 <para>The Lustre architecture is a storage architecture for clusters. The
32 central component of the Lustre architecture is the Lustre file system,
33 which is supported on the Linux operating system and provides a POSIX
<superscript>*</superscript> standard-compliant UNIX file system
36 <para>The Lustre storage architecture is used for many different kinds of
37 clusters. It is best known for powering many of the largest
38 high-performance computing (HPC) clusters worldwide, with tens of thousands
39 of client systems, petabytes (PB) of storage and hundreds of gigabytes per
40 second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system
41 as a site-wide global file system, serving dozens of clusters.</para>
42 <para>The ability of a Lustre file system to scale capacity and performance
43 for any need reduces the need to deploy many separate file systems, such as
44 one for each compute cluster. Storage management is simplified by avoiding
45 the need to copy data between compute clusters. In addition to aggregating
46 storage capacity of many servers, the I/O throughput is also aggregated and
47 scales with additional servers. Moreover, throughput and/or capacity can be
48 easily increased by adding servers dynamically.</para>
49 <para>While a Lustre file system can function in many work environments, it
50 is not necessarily the best choice for all applications. It is best suited
51 for uses that exceed the capacity that a single server can provide, though
52 in some use cases, a Lustre file system can perform better with a single
53 server than other file systems due to its strong locking and data
55 <para>A Lustre file system is currently not particularly well suited for
56 "peer-to-peer" usage models where clients and servers are running on the
57 same node, each sharing a small amount of storage, due to the lack of data
58 replication at the Lustre software level. In such uses, if one
59 client/server fails, then the data stored on that node will not be
60 accessible until the node is restarted.</para>
64 <primary>Lustre</primary>
65 <secondary>features</secondary>
66 </indexterm>Lustre Features</title>
<para>Lustre file systems run on a variety of vendors' kernels. For more
68 details, see the Lustre Test Matrix
69 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
70 linkend="dbdoclet.50438261_99193" />.</para>
71 <para>A Lustre installation can be scaled up or down with respect to the
72 number of client nodes, disk storage and bandwidth. Scalability and
73 performance are dependent on available disk and network bandwidth and the
74 processing power of the servers in the system. A Lustre file system can
75 be deployed in a wide variety of configurations that can be scaled well
76 beyond the size and performance observed in production systems to
79 <xref linkend="understandinglustre.tab1" /> shows some of the
80 scalability and performance characteristics of a Lustre file system.
For a full list of Lustre file and file system limits, see
82 <xref linkend="settinguplustresystem.tab2"/>.</para>
83 <table frame="all" xml:id="understandinglustre.tab1">
84 <title>Lustre File System Scalability and Performance</title>
86 <colspec colname="c1" colwidth="1*" />
87 <colspec colname="c2" colwidth="2*" />
88 <colspec colname="c3" colwidth="3*" />
93 <emphasis role="bold">Feature</emphasis>
98 <emphasis role="bold">Current Practical Range</emphasis>
103 <emphasis role="bold">Known Production Usage</emphasis>
112 <emphasis role="bold">Client Scalability</emphasis>
116 <para>100-100000</para>
119 <para>50000+ clients, many in the 10000 to 20000 range</para>
125 <emphasis role="bold">Client Performance</emphasis>
130 <emphasis>Single client:</emphasis>
132 <para>I/O 90% of network bandwidth</para>
134 <emphasis>Aggregate:</emphasis>
136 <para>10 TB/sec I/O</para>
140 <emphasis>Single client:</emphasis>
142 <para>4.5 GB/sec I/O (FDR IB, OPA1),
143 1000 metadata ops/sec</para>
145 <emphasis>Aggregate:</emphasis>
147 <para>2.5 TB/sec I/O </para>
153 <emphasis role="bold">OSS Scalability</emphasis>
158 <emphasis>Single OSS:</emphasis>
160 <para>1-32 OSTs per OSS</para>
162 <emphasis>Single OST:</emphasis>
164 <para>300M objects, 128TB per OST (ldiskfs)</para>
165 <para>500M objects, 256TB per OST (ZFS)</para>
167 <emphasis>OSS count:</emphasis>
169 <para>1000 OSSs, with up to 4000 OSTs</para>
173 <emphasis>Single OSS:</emphasis>
175 <para>32x 8TB OSTs per OSS (ldiskfs),</para>
176 <para>8x 32TB OSTs per OSS (ldiskfs)</para>
177 <para>1x 72TB OST per OSS (ZFS)</para>
179 <emphasis>OSS count:</emphasis>
181 <para>450 OSSs with 1000 4TB OSTs</para>
182 <para>192 OSSs with 1344 8TB OSTs</para>
183 <para>768 OSSs with 768 72TB OSTs</para>
189 <emphasis role="bold">OSS Performance</emphasis>
194 <emphasis>Single OSS:</emphasis>
196 <para>15 GB/sec</para>
198 <emphasis>Aggregate:</emphasis>
200 <para>10 TB/sec</para>
204 <emphasis>Single OSS:</emphasis>
206 <para>10 GB/sec</para>
208 <emphasis>Aggregate:</emphasis>
210 <para>2.5 TB/sec</para>
216 <emphasis role="bold">MDS Scalability</emphasis>
221 <emphasis>Single MDS:</emphasis>
223 <para>1-4 MDTs per MDS</para>
225 <emphasis>Single MDT:</emphasis>
227 <para>4 billion files, 8TB per MDT (ldiskfs)</para>
228 <para>64 billion files, 64TB per MDT (ZFS)</para>
230 <emphasis>MDS count:</emphasis>
232 <para>1 primary + 1 standby</para>
233 <para condition="l24">256 MDSs, with up to 256 MDTs</para>
237 <emphasis>Single MDS:</emphasis>
239 <para>3 billion files</para>
241 <emphasis>MDS count:</emphasis>
<para>7 MDSs with 7 2TB MDTs in production</para>
<para>256 MDSs with 256 64GB MDTs in testing</para>
250 <emphasis role="bold">MDS Performance</emphasis>
254 <para>50000/s create operations,</para>
255 <para>200000/s metadata stat operations</para>
258 <para>15000/s create operations,</para>
259 <para>50000/s metadata stat operations</para>
265 <emphasis role="bold">File system Scalability</emphasis>
270 <emphasis>Single File:</emphasis>
272 <para>32 PB max file size (ldiskfs)</para>
273 <para>2^63 bytes (ZFS)</para>
275 <emphasis>Aggregate:</emphasis>
277 <para>512 PB space, 1 trillion files</para>
281 <emphasis>Single File:</emphasis>
283 <para>multi-TB max file size</para>
285 <emphasis>Aggregate:</emphasis>
287 <para>55 PB space, 8 billion files</para>
293 <para>Other Lustre software features are:</para>
297 <emphasis role="bold">Performance-enhanced ext4 file
298 system:</emphasis>The Lustre file system uses an improved version of
299 the ext4 journaling file system to store data and metadata. This
301 <emphasis role="italic">
302 <literal>ldiskfs</literal>
303 </emphasis>, has been enhanced to improve performance and provide
304 additional functionality needed by the Lustre file system.</para>
<para condition="l24">With Lustre software release 2.4 and later,
308 it is also possible to use ZFS as the backing filesystem for Lustre
309 for the MDT, OST, and MGS storage. This allows Lustre to leverage the
310 scalability and data integrity features of ZFS for individual storage
315 <emphasis role="bold">POSIX standard compliance:</emphasis>The full
316 POSIX test suite passes in an identical manner to a local ext4 file
317 system, with limited exceptions on Lustre clients. In a cluster, most
318 operations are atomic so that clients never see stale data or
319 metadata. The Lustre software supports mmap() file I/O.</para>
323 <emphasis role="bold">High-performance heterogeneous
networking:</emphasis> The Lustre software supports a variety of
high-performance, low-latency networks and permits Remote Direct Memory
Access (RDMA) for InfiniBand
<superscript>*</superscript> (utilizing OpenFabrics Enterprise
Distribution (OFED<superscript>*</superscript>)), Intel OmniPath®,
329 and other advanced networks for fast
330 and efficient network transport. Multiple RDMA networks can be
331 bridged using Lustre routing for maximum performance. The Lustre
332 software also includes integrated network diagnostics.</para>
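<para>For example, a node with both an InfiniBand and an Ethernet
interface might select its LNet networks with module options similar to
the following sketch; the interface names are placeholders and depend on
the actual hardware:</para>
<screen># /etc/modprobe.d/lustre.conf (illustrative only)
options lnet networks="o2ib0(ib0),tcp0(eth0)"</screen>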
336 <emphasis role="bold">High-availability:</emphasis>The Lustre file
337 system supports active/active failover using shared storage
338 partitions for OSS targets (OSTs). Lustre software release 2.3 and
339 earlier releases offer active/passive failover using a shared storage
340 partition for the MDS target (MDT). The Lustre file system can work
341 with a variety of high availability (HA) managers to allow automated
342 failover and has no single point of failure (NSPF). This allows
343 application transparent recovery. Multiple mount protection (MMP)
344 provides integrated protection from errors in highly-available
345 systems that would otherwise cause file system corruption.</para>
348 <para condition="l24">With Lustre software release 2.4 or later
servers and clients, it is possible to configure active/active
350 failover of multiple MDTs. This allows scaling the metadata
351 performance of Lustre filesystems with the addition of MDT storage
352 devices and MDS nodes.</para>
<emphasis role="bold">Security:</emphasis> By default, TCP connections
are allowed only from privileged ports. UNIX group membership is
358 verified on the MDS.</para>
362 <emphasis role="bold">Access control list (ACL), extended
attributes:</emphasis> The Lustre security model follows that of a
364 UNIX file system, enhanced with POSIX ACLs. Noteworthy additional
365 features include root squash.</para>
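<para>As a brief illustration, POSIX ACLs are managed on a Lustre client
mount with the standard Linux tools; the mount point, directory, and
user name below are hypothetical:</para>
<screen>client$ setfacl -m u:alice:rwx /mnt/lustre/projdir
client$ getfacl /mnt/lustre/projdir</screen>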
369 <emphasis role="bold">Interoperability:</emphasis>The Lustre file
370 system runs on a variety of CPU architectures and mixed-endian
371 clusters and is interoperable between successive major Lustre
372 software releases.</para>
376 <emphasis role="bold">Object-based architecture:</emphasis>Clients
377 are isolated from the on-disk file structure enabling upgrading of
378 the storage architecture without affecting the client.</para>
382 <emphasis role="bold">Byte-granular file and fine-grained metadata
383 locking:</emphasis>Many clients can read and modify the same file or
384 directory concurrently. The Lustre distributed lock manager (LDLM)
385 ensures that files are coherent between all clients and servers in
386 the file system. The MDT LDLM manages locks on inode permissions and
387 pathnames. Each OST has its own LDLM for locks on file stripes stored
388 thereon, which scales the locking performance as the file system
393 <emphasis role="bold">Quotas:</emphasis>User and group quotas are
394 available for a Lustre file system.</para>
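<para>A minimal sketch of setting and checking a user quota from a
client; the user name, limits (in KB and inodes), and mount point are
examples only:</para>
<screen>client# lfs setquota -u bob -b 10000000 -B 11000000 -i 100000 -I 110000 /mnt/lustre
client$ lfs quota -u bob /mnt/lustre</screen>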
398 <emphasis role="bold">Capacity growth:</emphasis>The size of a Lustre
399 file system and aggregate cluster bandwidth can be increased without
400 interruption by adding new OSTs and MDTs to the cluster.</para>
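<para>For example, capacity might be grown online by formatting and
mounting one additional OST; the device name, index, and MGS NID are
hypothetical:</para>
<screen>oss# mkfs.lustre --ost --fsname=lustre --index=12 --mgsnode=mgs@tcp0 /dev/sdd
oss# mkdir -p /mnt/lustre-ost12
oss# mount -t lustre /dev/sdd /mnt/lustre-ost12</screen>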
404 <emphasis role="bold">Controlled file layout:</emphasis>The layout of
405 files across OSTs can be configured on a per file, per directory, or
406 per file system basis. This allows file I/O to be tuned to specific
407 application requirements within a single file system. The Lustre file
408 system uses RAID-0 striping and balances space usage across
413 <emphasis role="bold">Network data integrity protection:</emphasis>A
414 checksum of all data sent from the client to the OSS protects against
415 corruption during data transfer.</para>
419 <emphasis role="bold">MPI I/O:</emphasis>The Lustre architecture has
420 a dedicated MPI ADIO layer that optimizes parallel I/O to match the
421 underlying file system architecture.</para>
425 <emphasis role="bold">NFS and CIFS export:</emphasis>Lustre files can
426 be re-exported using NFS (via Linux knfsd or Ganesha) or CIFS (via
427 Samba), enabling them to be shared with non-Linux clients such as
Microsoft<superscript>*</superscript> Windows,
Apple<superscript>*</superscript> Mac OS X<superscript>*</superscript>,
and others.</para>
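<para>As one illustration, a Lustre client can act as an NFS server by
exporting its Lustre mount point with the usual Linux tools; the network
address and export options are placeholders:</para>
<screen>client# echo '/mnt/lustre 192.168.0.0/24(rw,async,no_root_squash)' >> /etc/exports
client# exportfs -ra</screen>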
435 <emphasis role="bold">Disaster recovery tool:</emphasis>The Lustre
436 file system provides an online distributed file system check (LFSCK)
437 that can restore consistency between storage components in case of a
438 major file system error. A Lustre file system can operate even in the
439 presence of file system inconsistencies, and LFSCK can run while the
440 filesystem is in use, so LFSCK is not required to complete before
441 returning the file system to production.</para>
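<para>For example, an administrator might start a check on a running
file system directly from the MDS and then monitor its progress; the
file system name <literal>lustre</literal> is hypothetical:</para>
<screen>mds# lctl lfsck_start -M lustre-MDT0000
mds# lctl get_param mdd.lustre-MDT0000.lfsck_namespace</screen>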
445 <emphasis role="bold">Performance monitoring:</emphasis>The Lustre
446 file system offers a variety of mechanisms to examine performance and
451 <emphasis role="bold">Open source:</emphasis>The Lustre software is
452 licensed under the GPL 2.0 license for use with the Linux operating
458 <section xml:id="understandinglustre.components">
461 <primary>Lustre</primary>
462 <secondary>components</secondary>
463 </indexterm>Lustre Components</title>
464 <para>An installation of the Lustre software includes a management server
465 (MGS) and one or more Lustre file systems interconnected with Lustre
466 networking (LNet).</para>
467 <para>A basic configuration of Lustre file system components is shown in
468 <xref linkend="understandinglustre.fig.cluster" />.</para>
469 <figure xml:id="understandinglustre.fig.cluster">
470 <title>Lustre file system components in a basic cluster</title>
473 <imagedata scalefit="1" width="100%"
474 fileref="./figures/Basic_Cluster.png" />
477 <phrase>Lustre file system components in a basic cluster</phrase>
484 <primary>Lustre</primary>
485 <secondary>MGS</secondary>
486 </indexterm>Management Server (MGS)</title>
487 <para>The MGS stores configuration information for all the Lustre file
488 systems in a cluster and provides this information to other Lustre
489 components. Each Lustre target contacts the MGS to provide information,
490 and Lustre clients contact the MGS to retrieve information.</para>
491 <para>It is preferable that the MGS have its own storage space so that it
492 can be managed independently. However, the MGS can be co-located and
493 share storage space with an MDS as shown in
494 <xref linkend="understandinglustre.fig.cluster" />.</para>
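<para>A configuration like the one shown in the figure might be
formatted and mounted roughly as follows; all device names, node names,
the file system name, and the NIDs are placeholders:</para>
<screen>mgs# mkfs.lustre --mgs /dev/sda
mgs# mount -t lustre /dev/sda /mnt/mgt
mds# mkfs.lustre --mdt --fsname=lustre --index=0 --mgsnode=mgs@tcp0 /dev/sdb
mds# mount -t lustre /dev/sdb /mnt/mdt0
oss# mkfs.lustre --ost --fsname=lustre --index=0 --mgsnode=mgs@tcp0 /dev/sdc
oss# mount -t lustre /dev/sdc /mnt/ost0
client# mount -t lustre mgs@tcp0:/lustre /mnt/lustre</screen>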
497 <title>Lustre File System Components</title>
498 <para>Each Lustre file system consists of the following
<emphasis role="bold">Metadata Servers (MDS)</emphasis> - The MDS makes
504 metadata stored in one or more MDTs available to Lustre clients. Each
505 MDS manages the names and directories in the Lustre file system(s)
506 and provides network request handling for one or more local
<emphasis role="bold">Metadata Targets (MDT)</emphasis> - For Lustre
software release 2.3 and earlier, each file system has one MDT. The
MDT stores metadata (such as filenames, directories, permissions and
file layout) on storage attached to an MDS. An MDT on a shared
storage target can be available to multiple MDSs, although only one
can access it at a time. If an active MDS
517 fails, a standby MDS can serve the MDT and make it available to
518 clients. This is referred to as MDS failover.</para>
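<para>A minimal sketch of declaring two MDS nodes that may serve the
same shared MDT when it is formatted; the NIDs and device name are
placeholders, and the HA manager itself is configured separately:</para>
<screen>mds# mkfs.lustre --mdt --fsname=lustre --index=0 --mgsnode=mgs@tcp0 \
     --servicenode=mds1@tcp0 --servicenode=mds2@tcp0 /dev/mapper/mdt0</screen>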
519 <para condition="l24">Since Lustre software release 2.4, multiple
520 MDTs are supported in the Distributed Namespace Environment (DNE).
521 In addition to the primary MDT that holds the filesystem root, it
522 is possible to add additional MDS nodes, each with their own MDTs,
523 to hold sub-directory trees of the filesystem.</para>
524 <para condition="l28">Since Lustre software release 2.8, DNE also
525 allows the filesystem to distribute files of a single directory over
526 multiple MDT nodes. A directory which is distributed across multiple
527 MDTs is known as a <emphasis>striped directory</emphasis>.</para>
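<para>For illustration, on a file system with more than one MDT a client
could place a new directory on the second MDT, or stripe a directory
across two MDTs; the paths are hypothetical:</para>
<screen>client# lfs mkdir -i 1 /mnt/lustre/remote_dir   # created on MDT0001
client# lfs mkdir -c 2 /mnt/lustre/striped_dir  # striped over two MDTs</screen>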
531 <emphasis role="bold">Object Storage Servers (OSS)</emphasis>: The
532 OSS provides file I/O service and network request handling for one or
533 more local OSTs. Typically, an OSS serves between two and eight OSTs,
534 up to 16 TB each. A typical configuration is an MDT on a dedicated
535 node, two or more OSTs on each OSS node, and a client on each of a
536 large number of compute nodes.</para>
540 <emphasis role="bold">Object Storage Target (OST)</emphasis>: User
541 file data is stored in one or more objects, each object on a separate
542 OST in a Lustre file system. The number of objects per file is
543 configurable by the user and can be tuned to optimize performance for
544 a given workload.</para>
548 <emphasis role="bold">Lustre clients</emphasis>: Lustre clients are
549 computational, visualization or desktop nodes that are running Lustre
550 client software, allowing them to mount the Lustre file
554 <para>The Lustre client software provides an interface between the Linux
555 virtual file system and the Lustre servers. The client software includes
556 a management client (MGC), a metadata client (MDC), and multiple object
557 storage clients (OSCs), one corresponding to each OST in the file
559 <para>A logical object volume (LOV) aggregates the OSCs to provide
560 transparent access across all the OSTs. Thus, a client with the Lustre
561 file system mounted sees a single, coherent, synchronized namespace.
562 Several clients can write to different parts of the same file
563 simultaneously, while, at the same time, other clients can read from the
565 <para>A logical metadata volume (LMV) aggregates the MDCs to provide
566 transparent access across all the MDTs in a similar manner as the LOV
567 does for file access. This allows the client to see the directory tree
568 on multiple MDTs as a single coherent namespace, and striped directories
569 are merged on the clients to form a single visible directory to users
<xref linkend="understandinglustre.tab.storagerequire" /> provides the
574 requirements for attached storage for each Lustre file system component
575 and describes desirable characteristics of the hardware used.</para>
576 <table frame="all" xml:id="understandinglustre.tab.storagerequire">
579 <primary>Lustre</primary>
580 <secondary>requirements</secondary>
581 </indexterm>Storage and hardware requirements for Lustre file system
584 <colspec colname="c1" colwidth="1*" />
585 <colspec colname="c2" colwidth="3*" />
586 <colspec colname="c3" colwidth="3*" />
591 <emphasis role="bold" />
596 <emphasis role="bold">Required attached storage</emphasis>
601 <emphasis role="bold">Desirable hardware
602 characteristics</emphasis>
611 <emphasis role="bold">MDSs</emphasis>
615 <para>1-2% of file system capacity</para>
618 <para>Adequate CPU power, plenty of memory, fast disk
625 <emphasis role="bold">OSSs</emphasis>
629 <para>1-128 TB per OST, 1-8 OSTs per OSS</para>
632 <para>Good bus bandwidth. Recommended that storage be balanced
633 evenly across OSSs and matched to network bandwidth.</para>
639 <emphasis role="bold">Clients</emphasis>
643 <para>No local storage needed</para>
646 <para>Low latency, high bandwidth network.</para>
652 <para>For additional hardware requirements and considerations, see
653 <xref linkend="settinguplustresystem" />.</para>
658 <primary>Lustre</primary>
659 <secondary>LNet</secondary>
660 </indexterm>Lustre Networking (LNet)</title>
661 <para>Lustre Networking (LNet) is a custom networking API that provides
662 the communication infrastructure that handles metadata and file I/O data
663 for the Lustre file system servers and clients. For more information
665 <xref linkend="understandinglustrenetworking" />.</para>
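<para>As a quick illustration, the LNet identifiers (NIDs) of a node and
basic connectivity to a peer can be inspected with
<literal>lctl</literal>; the NID shown is a placeholder:</para>
<screen>node# lctl list_nids
node# lctl ping 10.2.0.21@tcp0</screen>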
670 <primary>Lustre</primary>
671 <secondary>cluster</secondary>
672 </indexterm>Lustre Cluster</title>
673 <para>At scale, a Lustre file system cluster can include hundreds of OSSs
674 and thousands of clients (see
675 <xref linkend="understandinglustre.fig.lustrescale" />). More than one
676 type of network can be used in a Lustre cluster. Shared storage between
677 OSSs enables failover capability. For more details about OSS failover,
679 <xref linkend="understandingfailover" />.</para>
680 <figure xml:id="understandinglustre.fig.lustrescale">
683 <primary>Lustre</primary>
684 <secondary>at scale</secondary>
685 </indexterm>Lustre cluster at scale</title>
688 <imagedata scalefit="1" width="100%"
689 fileref="./figures/Scaled_Cluster.png" />
692 <phrase>Lustre file system cluster at scale</phrase>
698 <section xml:id="understandinglustre.storageio">
701 <primary>Lustre</primary>
702 <secondary>storage</secondary>
705 <primary>Lustre</primary>
706 <secondary>I/O</secondary>
707 </indexterm>Lustre File System Storage and I/O</title>
708 <para>In Lustre software release 2.0, Lustre file identifiers (FIDs) were
709 introduced to replace UNIX inode numbers for identifying files or objects.
710 A FID is a 128-bit identifier that contains a unique 64-bit sequence
711 number, a 32-bit object ID (OID), and a 32-bit version number. The sequence
712 number is unique across all Lustre targets in a file system (OSTs and
713 MDTs). This change enabled future support for multiple MDTs (introduced in
714 Lustre software release 2.4) and ZFS (introduced in Lustre software release
716 <para>Also introduced in release 2.0 is an ldiskfs feature named
<emphasis role="italic">FID-in-dirent</emphasis> (also known as
718 <emphasis role="italic">dirdata</emphasis>) in which the FID is stored as
719 part of the name of the file in the parent directory. This feature
720 significantly improves performance for
721 <literal>ls</literal> command executions by reducing disk I/O. The
722 FID-in-dirent is generated at the time the file is created.</para>
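<para>For example, the FID assigned to a file, and the reverse mapping
from a FID back to a path, can be queried from a client; the path and
FID value below are illustrative:</para>
<screen>client$ lfs path2fid /mnt/lustre/dir/file
client$ lfs fid2path /mnt/lustre "[0x200000401:0x1:0x0]"</screen>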
724 <para>The FID-in-dirent feature is not backward compatible with the
725 release 1.8 ldiskfs disk format. Therefore, when an upgrade from
726 release 1.8 to release 2.x is performed, the FID-in-dirent feature is
727 not automatically enabled. For upgrades from release 1.8 to releases
728 2.0 through 2.3, FID-in-dirent can be enabled manually but only takes
729 effect for new files.</para>
730 <para>For more information about upgrading from Lustre software release
731 1.8 and enabling FID-in-dirent for existing files, see
732 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
linkend="upgradinglustre" />.
736 <para condition="l24">The LFSCK file system consistency checking tool
737 released with Lustre software release 2.4 provides functionality that
738 enables FID-in-dirent for existing files. It includes the following
<para>Generates IGIF mode FIDs for existing files created on a
release 1.8 file system.</para>
746 <para>Verifies the FID-in-dirent for each file and regenerates the
747 FID-in-dirent if it is invalid or missing.</para>
<para>Verifies the linkEA entry for each file and regenerates the linkEA
751 if it is invalid or missing. The
752 <emphasis role="italic">linkEA</emphasis> consists of the file name and
753 parent FID. It is stored as an extended attribute in the file
754 itself. Thus, the linkEA can be used to reconstruct the full path name
757 </itemizedlist></para>
758 <para>Information about where file data is located on the OST(s) is stored
759 as an extended attribute called layout EA in an MDT object identified by
760 the FID for the file (see
761 <xref xmlns:xlink="http://www.w3.org/1999/xlink"
762 linkend="Fig1.3_LayoutEAonMDT" />). If the file is a regular file (not a
directory or symbolic link), the MDT object points to 1-to-N OST object(s) on
764 the OST(s) that contain the file data. If the MDT file points to one
765 object, all the file data is stored in that object. If the MDT file points
766 to more than one object, the file data is
767 <emphasis role="italic">striped</emphasis> across the objects using RAID 0,
and each object is stored on a different OST. For more information about
769 how striping is implemented in a Lustre file system, see
770 <xref linkend="dbdoclet.50438250_89922" />.</para>
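<para>For example, the layout stored for a particular file can be viewed
from a client with <literal>lfs getstripe</literal>, which reports the
stripe count, stripe size, and the OST objects holding the data; the
path is illustrative:</para>
<screen>client$ lfs getstripe /mnt/lustre/dir/file</screen>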
771 <figure xml:id="Fig1.3_LayoutEAonMDT">
772 <title>Layout EA on MDT pointing to file data on OSTs</title>
775 <imagedata scalefit="1" width="80%"
776 fileref="./figures/Metadata_File.png" />
779 <phrase>Layout EA on MDT pointing to file data on OSTs</phrase>
783 <para>When a client wants to read from or write to a file, it first fetches
784 the layout EA from the MDT object for the file. The client then uses this
785 information to perform I/O on the file, directly interacting with the OSS
786 nodes where the objects are stored.
This process is illustrated in
<xref xmlns:xlink="http://www.w3.org/1999/xlink"
linkend="Fig1.4_ClientReqstgData" />
792 <figure xml:id="Fig1.4_ClientReqstgData">
793 <title>Lustre client requesting file data</title>
796 <imagedata scalefit="1" width="75%"
797 fileref="./figures/File_Write.png" />
800 <phrase>Lustre client requesting file data</phrase>
804 <para>The available bandwidth of a Lustre file system is determined as
809 <emphasis>network bandwidth</emphasis> equals the aggregated bandwidth
810 of the OSSs to the targets.</para>
814 <emphasis>disk bandwidth</emphasis> equals the sum of the disk
815 bandwidths of the storage targets (OSTs) up to the limit of the network
820 <emphasis>aggregate bandwidth</emphasis> equals the minimum of the disk
821 bandwidth and the network bandwidth.</para>
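<para>As a hypothetical example, a file system with 8 OSS nodes, each
attached to the network at 5 GB/sec, has a network bandwidth of
40 GB/sec; if the OSTs behind those OSSs can sustain a total of
50 GB/sec of disk bandwidth, the aggregate bandwidth of the file system
is limited to 40 GB/sec.</para>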
825 <emphasis>available file system space</emphasis> equals the sum of the
826 available space of all the OSTs.</para>
829 <section xml:id="dbdoclet.50438250_89922">
832 <primary>Lustre</primary>
833 <secondary>striping</secondary>
836 <primary>striping</primary>
837 <secondary>overview</secondary>
838 </indexterm>Lustre File System and Striping</title>
839 <para>One of the main factors leading to the high performance of Lustre
840 file systems is the ability to stripe data across multiple OSTs in a
841 round-robin fashion. Users can optionally configure for each file the
842 number of stripes, stripe size, and OSTs that are used.</para>
843 <para>Striping can be used to improve performance when the aggregate
844 bandwidth to a single file exceeds the bandwidth of a single OST. The
845 ability to stripe is also useful when a single OST does not have enough
846 free space to hold an entire file. For more information about benefits
847 and drawbacks of file striping, see
848 <xref linkend="dbdoclet.50438209_48033" />.</para>
849 <para>Striping allows segments or 'chunks' of data in a file to be stored
850 on different OSTs, as shown in
851 <xref linkend="understandinglustre.fig.filestripe" />. In the Lustre file
852 system, a RAID 0 pattern is used in which data is "striped" across a
853 certain number of objects. The number of objects in a single file is
855 <literal>stripe_count</literal>.</para>
856 <para>Each object contains a chunk of data from the file. When the chunk
857 of data being written to a particular object exceeds the
858 <literal>stripe_size</literal>, the next chunk of data in the file is
859 stored on the next object.</para>
860 <para>Default values for
861 <literal>stripe_count</literal> and
862 <literal>stripe_size</literal> are set for the file system. The default
<literal>stripe_count</literal> is 1 stripe per file and the default value
<literal>stripe_size</literal> is 1 MB. The user may change these values on
867 a per directory or per file basis. For more details, see
868 <xref linkend="dbdoclet.50438209_78664" />.</para>
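<para>A minimal sketch of changing these values on a per-directory basis
from a client; the directory, stripe count, and stripe size are examples
only:</para>
<screen>client$ lfs setstripe -c 4 -S 4M /mnt/lustre/output_dir
client$ lfs getstripe /mnt/lustre/output_dir</screen>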
870 <xref linkend="understandinglustre.fig.filestripe" />, the
871 <literal>stripe_size</literal> for File C is larger than the
872 <literal>stripe_size</literal> for File A, allowing more data to be stored
873 in a single stripe for File C. The
874 <literal>stripe_count</literal> for File A is 3, resulting in data striped
875 across three objects, while the
876 <literal>stripe_count</literal> for File B and File C is 1.</para>
<para>No space is reserved on the OST for unwritten data. For example,
File A in
<xref linkend="understandinglustre.fig.filestripe" /> is a sparse file
that is missing chunk 6.</para>
879 <figure xml:id="understandinglustre.fig.filestripe">
880 <title>File striping on a
881 Lustre file system</title>
884 <imagedata scalefit="1" width="100%"
885 fileref="./figures/File_Striping.png" />
888 <phrase>File striping pattern across three OSTs for three different
889 data files. The file is sparse and missing chunk 6.</phrase>
893 <para>The maximum file size is not limited by the size of a single
894 target. In a Lustre file system, files can be striped across multiple
895 objects (up to 2000), and each object can be up to 16 TB in size with
ldiskfs, or up to 256 PB with ZFS. This leads to a maximum file size of
31.25 PB for ldiskfs or 8 EB with ZFS. Note that a Lustre file system can
support files up to 2<superscript>63</superscript> bytes (8 EB), limited only by the space available
901 <para>Versions of the Lustre software prior to Release 2.2 limited the
902 maximum stripe count for a single file to 160 OSTs.</para>
904 <para>Although a single file can only be striped over 2000 objects,
905 Lustre file systems can have thousands of OSTs. The I/O bandwidth to
906 access a single file is the aggregated I/O bandwidth to the objects in a
file, which can be as much as the aggregate bandwidth of up to 2000 servers. On
908 systems with more than 2000 OSTs, clients can do I/O using multiple files
909 to utilize the full file system bandwidth.</para>
910 <para>For more information about striping, see
911 <xref linkend="managingstripingfreespace" />.</para>