UnderstandingLustre.xml

   1 <?xml version='1.0' encoding='UTF-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
   3   xml:lang="en-US" xml:id="understandinglustre">
   4   <title xml:id="understandinglustre.title">Understanding Lustre Architecture</title>
   5   <para>This chapter describes the Lustre architecture and features of the Lustre file system. It
   6     includes the following sections:</para>
   7   <itemizedlist>
   8     <listitem>
   9       <para>
  10         <xref linkend="understandinglustre.whatislustre"/>
  11       </para>
  12     </listitem>
  13     <listitem>
  14       <para>
  15         <xref linkend="understandinglustre.components"/>
  16       </para>
  17     </listitem>
  18     <listitem>
  19       <para>
  20         <xref linkend="understandinglustre.storageio"/>
  21       </para>
  22     </listitem>
  23   </itemizedlist>
  24   <section xml:id="understandinglustre.whatislustre">
  25     <title><indexterm>
  26         <primary>Lustre</primary>
  27       </indexterm>What a Lustre File System Is (and What It Isn&apos;t)</title>
  28     <para>The Lustre architecture is a storage architecture for clusters. The central component of
  29       the Lustre architecture is the Lustre file system, which is supported on the Linux operating
  30       system and provides a POSIX<superscript>*</superscript> standard-compliant UNIX file system
  31       interface.</para>
  32     <para>The Lustre storage architecture is used for many different kinds of clusters. It is best
  33       known for powering many of the largest high-performance computing (HPC) clusters worldwide,
  34       with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes
  35       per second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system as a site-wide
  36       global file system, serving dozens of clusters.</para>
  37     <para>The ability of a Lustre file system to scale capacity and performance for any need reduces
  38       the need to deploy many separate file systems, such as one for each compute cluster. Storage
  39       management is simplified by avoiding the need to copy data between compute clusters. In
  40       addition to aggregating storage capacity of many servers, the I/O throughput is also
  41       aggregated and scales with additional servers. Moreover, throughput and/or capacity can be
  42       easily increased by adding servers dynamically.</para>
  43     <para>While a Lustre file system can function in many work environments, it is not necessarily
  44       the best choice for all applications. It is best suited for uses that exceed the capacity that
  45       a single server can provide, though in some use cases, a Lustre file system can perform better
  46       with a single server than other file systems due to its strong locking and data
  47       coherency.</para>
  48     <para>A Lustre file system is currently not particularly well suited for
  49       &quot;peer-to-peer&quot; usage models where clients and servers are running on the same node,
  50       each sharing a small amount of storage, due to the lack of data replication at the Lustre
  51       software level. In such uses, if one client/server fails, then the data stored on that node
  52       will not be accessible until the node is restarted.</para>
  53     <section remap="h3">
  54       <title><indexterm>
  55           <primary>Lustre</primary>
  56           <secondary>features</secondary>
  57         </indexterm>Lustre Features</title>
  58       <para>Lustre file systems run on a variety of vendors&apos; kernels. For more details, see the
  59         Lustre Test Matrix <xref xmlns:xlink="http://www.w3.org/1999/xlink"
  60           linkend="dbdoclet.50438261_99193"/>.</para>
  61       <para>A Lustre installation can be scaled up or down with respect to the number of client
  62         nodes, disk storage and bandwidth. Scalability and performance are dependent on available
  63         disk and network bandwidth and the processing power of the servers in the system. A Lustre
  64         file system can be deployed in a wide variety of configurations that can be scaled well
  65         beyond the size and performance observed in production systems to date.</para>
  66       <para><xref linkend="understandinglustre.tab1"/> shows the practical range of scalability and
  67         performance characteristics of a Lustre file system and some test results in production
  68         systems.</para>
  69       <table frame="all">
  70         <title xml:id="understandinglustre.tab1">Lustre File System Scalability and
  71           Performance</title>
  72         <tgroup cols="3">
  73           <colspec colname="c1" colwidth="1*"/>
  74           <colspec colname="c2" colwidth="2*"/>
  75           <colspec colname="c3" colwidth="3*"/>
  76           <thead>
  77             <row>
  78               <entry>
  79                 <para><emphasis role="bold">Feature</emphasis></para>
  80               </entry>
  81               <entry>
  82                 <para><emphasis role="bold">Current Practical Range</emphasis></para>
  83               </entry>
  84               <entry>
  85                 <para><emphasis role="bold">Tested in Production</emphasis></para>
  86               </entry>
  87             </row>
  88           </thead>
  89           <tbody>
  90             <row>
  91               <entry>
  92                 <para>
  93                   <emphasis role="bold">Client Scalability</emphasis></para>
  94               </entry>
  95               <entry>
  96                 <para> 100-100000</para>
  97               </entry>
  98               <entry>
  99                 <para> 50000+ clients, many in the 10000 to 20000 range</para>
 100               </entry>
 101             </row>
 102             <row>
 103               <entry>
 104                 <para><emphasis role="bold">Client Performance</emphasis></para>
 105               </entry>
 106               <entry>
 107                 <para>
 108                   <emphasis>Single client: </emphasis></para>
 109                 <para>I/O 90% of network bandwidth</para>
 110                 <para><emphasis>Aggregate:</emphasis></para>
 111                 <para>2.5 TB/sec I/O</para>
 112               </entry>
 113               <entry>
 114                 <para>
 115                   <emphasis>Single client: </emphasis></para>
 116                 <para>2 GB/sec I/O, 1000 metadata ops/sec</para>
 117                 <para><emphasis>Aggregate:</emphasis></para>
 118                 <para>240 GB/sec I/O </para>
 119               </entry>
 120             </row>
 121             <row>
 122               <entry>
 123                 <para>
 124                   <emphasis role="bold">OSS Scalability</emphasis></para>
 125               </entry>
 126               <entry>
 127                 <para>
 128                   <emphasis>Single OSS:</emphasis></para>
 129                 <para>1-32 OSTs per OSS,</para>
 130                 <para>128TB per OST</para>
 131                 <para>
 132                   <emphasis>OSS count:</emphasis></para>
 133                 <para>500 OSSs, with up to 4000 OSTs</para>
 134               </entry>
 135               <entry>
 136                 <para>
 137                   <emphasis>Single OSS:</emphasis></para>
 138                 <para>8 OSTs per OSS,</para>
 139                 <para>16TB per OST</para>
 140                 <para>
 141                   <emphasis>OSS count:</emphasis></para>
 142                 <para>450 OSSs with 1000 4TB OSTs</para>
 143                 <para>192 OSSs with 1344 8TB OSTs</para>
 144               </entry>
 145             </row>
 146             <row>
 147               <entry>
 148                 <para>
 149                   <emphasis role="bold">OSS Performance</emphasis></para>
 150               </entry>
 151               <entry>
 152                 <para>
 153                   <emphasis>Single OSS:</emphasis></para>
 154                 <para> 5 GB/sec</para>
 155                 <para>
 156                   <emphasis>Aggregate:</emphasis></para>
 157                 <para> 2.5 TB/sec</para>
 158               </entry>
 159               <entry>
 160                 <para>
 161                   <emphasis>Single OSS:</emphasis></para>
 162                 <para> 2.0+ GB/sec</para>
 163                 <para>
 164                   <emphasis>Aggregate:</emphasis></para>
 165                 <para> 240 GB/sec</para>
 166               </entry>
 167             </row>
 168             <row>
 169               <entry>
 170                 <para>
 171                   <emphasis role="bold">MDS Scalability</emphasis></para>
 172               </entry>
 173               <entry>
 174                 <para>
 175                   <emphasis>Single MDT:</emphasis></para>
 176                 <para> 4 billion files (ldiskfs), 256 trillion files (ZFS)</para>
 177                 <para>
 178                   <emphasis>MDS count:</emphasis></para>
 179                 <para> 1 primary + 1 backup</para>
 180                 <para condition="l24">Up to 4096 MDTs and up to 4096 MDSs</para>
 181               </entry>
 182               <entry>
 183                 <para>
 184                   <emphasis>Single MDT:</emphasis></para>
 185                 <para> 1 billion files</para>
 186                 <para>
 187                   <emphasis>MDS count:</emphasis></para>
 188                 <para> 1 primary + 1 backup</para>
 189               </entry>
 190             </row>
 191             <row>
 192               <entry>
 193                 <para>
 194                   <emphasis role="bold">MDS Performance</emphasis></para>
 195               </entry>
 196               <entry>
 197                 <para> 35000/s create operations,</para>
 198                 <para> 100000/s metadata stat operations</para>
 199               </entry>
 200               <entry>
 201                 <para> 15000/s create operations,</para>
 202                 <para> 35000/s metadata stat operations</para>
 203               </entry>
 204             </row>
 205             <row>
 206               <entry>
 207                 <para>
 208                   <emphasis role="bold">File System Scalability</emphasis></para>
 209               </entry>
 210               <entry>
 211                 <para>
 212                   <emphasis>Single File:</emphasis></para>
 213                 <para>2.5 PB max file size</para>
 214                 <para>
 215                   <emphasis>Aggregate:</emphasis></para>
 216                 <para>512 PB space, 4 billion files</para>
 217               </entry>
 218               <entry>
 219                 <para>
 220                   <emphasis>Single File:</emphasis></para>
 221                 <para>multi-TB max file size</para>
 222                 <para>
 223                   <emphasis>Aggregate:</emphasis></para>
 224                 <para>55 PB space, 1 billion files</para>
 225               </entry>
 226             </row>
 227           </tbody>
 228         </tgroup>
 229       </table>
 230       <para>Other Lustre software features are:</para>
 231       <itemizedlist>
 232         <listitem>
 233           <para><emphasis role="bold">Performance-enhanced ext4 file system:</emphasis> The Lustre
 234             file system uses an improved version of the ext4 journaling file system to store data
 235             and metadata. This version, called <emphasis role="italic">
 236               <literal>ldiskfs</literal></emphasis>, has been enhanced to improve performance and
 237             provide additional functionality needed by the Lustre file system.</para>
 238         </listitem>
 239         <listitem>
 240           <para condition="l24">With Lustre software release 2.4 and later, ZFS can also be used as the backing file system for MDT, OST, and MGS storage. This allows a Lustre file system to leverage the scalability and data integrity features of ZFS for individual storage targets.</para>
 241         </listitem>
 242         <listitem>
 243           <para><emphasis role="bold">POSIX standard compliance:</emphasis> The full POSIX test
 244             suite passes in an identical manner to a local ext4 file system, with limited exceptions
 245             on Lustre clients. In a cluster, most operations are atomic so that clients never see
 246             stale data or metadata. The Lustre software supports mmap() file I/O.</para>
 247         </listitem>
 248         <listitem>
 249           <para><emphasis role="bold">High-performance heterogeneous networking:</emphasis> The
 250             Lustre software supports a variety of high-performance, low-latency networks and permits
 251             Remote Direct Memory Access (RDMA) for InfiniBand<superscript>*</superscript> (utilizing
 252             OpenFabrics Enterprise Distribution (OFED<superscript>*</superscript>)) and other
 253             advanced networks for fast and efficient network transport. Multiple RDMA networks can
 254             be bridged using Lustre routing for maximum performance. The Lustre software also
 255             includes integrated network diagnostics.</para>
 256         </listitem>
 257         <listitem>
 258           <para><emphasis role="bold">High-availability:</emphasis> The Lustre file system supports
 259             active/active failover using shared storage partitions for OSS targets (OSTs). Lustre
 260             software release 2.3 and earlier releases offer active/passive failover using a shared
 261             storage partition for the MDS target (MDT). The Lustre file system can work with a variety of
 262             high availability (HA) managers to allow automated failover and has no single point of failure
 263             (NSPF), which allows application-transparent recovery. Multiple mount protection (MMP) provides
 264             integrated protection from errors in highly available systems that would otherwise cause file system
 265             corruption.</para>
 266         </listitem>
 267         <listitem>
 268           <para condition="l24">With Lustre software release 2.4 or later
 269             servers and clients, it is possible to configure active/active
 270             failover of multiple MDTs. This allows the metadata
 271             performance of Lustre file systems to scale with the addition of MDT storage
 272             devices and MDS nodes.</para>
 273         </listitem>
 274         <listitem>
 275           <para><emphasis role="bold">Security:</emphasis> By default TCP connections are only
 276             allowed from privileged ports. UNIX group membership is verified on the MDS.</para>
 277         </listitem>
 278         <listitem>
 279           <para><emphasis role="bold">Access control list (ACL), extended attributes:</emphasis> The
 280             Lustre security model follows that of a UNIX file system, enhanced with POSIX ACLs.
 281             Noteworthy additional features include root squash.</para>
 282         </listitem>
 283         <listitem>
 284           <para><emphasis role="bold">Interoperability:</emphasis> The Lustre file system runs on a
 285             variety of CPU architectures and mixed-endian clusters and is interoperable between
 286             successive major Lustre software releases.</para>
 287         </listitem>
 288         <listitem>
 289           <para><emphasis role="bold">Object-based architecture:</emphasis> Clients are isolated
 290             from the on-disk file structure enabling upgrading of the storage architecture without
 291             affecting the client.</para>
 292         </listitem>
 293         <listitem>
 294           <para><emphasis role="bold">Byte-granular file and fine-grained metadata
 295               locking:</emphasis> Many clients can read and modify the same file or directory
 296             concurrently. The Lustre distributed lock manager (LDLM) ensures that files are coherent
 297             between all clients and servers in the file system. The MDT LDLM manages locks on inode
 298             permissions and pathnames. Each OST has its own LDLM for locks on file stripes stored
 299             thereon, which scales the locking performance as the file system grows.</para>
 300         </listitem>
 301         <listitem>
 302           <para><emphasis role="bold">Quotas:</emphasis> User and group quotas are available for a
 303             Lustre file system.</para>
 304         </listitem>
 305         <listitem>
 306           <para><emphasis role="bold">Capacity growth:</emphasis> The size of a Lustre file system
 307             and aggregate cluster bandwidth can be increased without interruption by adding a new
 308             OSS with OSTs to the cluster.</para>
 309         </listitem>
 310         <listitem>
 311           <para><emphasis role="bold">Controlled striping:</emphasis> The layout of files across
 312             OSTs can be configured on a per file, per directory, or per file system basis. This
 313             allows file I/O to be tuned to specific application requirements within a single file
 314             system. The Lustre file system uses RAID-0 striping and balances space usage across
 315             OSTs.</para>
 316         </listitem>
 317         <listitem>
 318           <para><emphasis role="bold">Network data integrity protection:</emphasis> A checksum of
 319             all data sent from the client to the OSS protects against corruption during data
 320             transfer.</para>
 321         </listitem>
 322         <listitem>
 323           <para><emphasis role="bold">MPI I/O:</emphasis> The Lustre architecture has a dedicated
 324             MPI ADIO layer that optimizes parallel I/O to match the underlying file system
 325             architecture.</para>
 326         </listitem>
 327         <listitem>
 328           <para><emphasis role="bold">NFS and CIFS export:</emphasis> Lustre files can be
 329             re-exported using NFS (via Linux knfsd) or CIFS (via Samba) enabling them to be shared
 330             with non-Linux clients, such as Microsoft<superscript>*</superscript>
 331               Windows<superscript>*</superscript> and Apple<superscript>*</superscript> Mac OS
 332               X<superscript>*</superscript>.</para>
 333         </listitem>
 334         <listitem>
 335           <para><emphasis role="bold">Disaster recovery tool:</emphasis> The Lustre file system
 336             provides an online distributed file system check (LFSCK) that can restore consistency between
 337             storage components in case of a major file system error. A Lustre file system can
 338             operate even in the presence of file system inconsistencies, and LFSCK can run while the file
 339             system is in use, so LFSCK is not required to complete before returning the file system to production.</para>
 340         </listitem>
 341         <listitem>
 342           <para><emphasis role="bold">Performance monitoring:</emphasis> The Lustre file system
 343             offers a variety of mechanisms to examine performance and tuning.</para>
 344         </listitem>
 345         <listitem>
 346           <para><emphasis role="bold">Open source:</emphasis> The Lustre software is licensed under
 347             the GPL 2.0 license for use with the Linux operating system.</para>
 348         </listitem>
 349       </itemizedlist>
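           <para>The RAID-0 striping mentioned under <emphasis>Controlled striping</emphasis> can be
             sketched in a few lines of Python. This is an illustration only, not Lustre code: the
             function name <literal>locate_stripe</literal> and its parameters are hypothetical, and
             the layout assumed here (fixed-size stripes assigned round-robin to the objects of a
             file) is the generic RAID-0 scheme.</para>
           <programlisting language="python"><![CDATA[
```python
def locate_stripe(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (object index, offset within that object).

    Assumes plain RAID-0: stripe number k is stored in object k % stripe_count.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; size and count must be positive")
    stripe_number, offset_in_stripe = divmod(offset, stripe_size)
    object_index = stripe_number % stripe_count
    # Each object holds every stripe_count-th stripe, packed back to back.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset_in_stripe
    return object_index, object_offset


MIB = 1024 * 1024
# Hypothetical file striped over 4 objects with a 1 MiB stripe size.
print(locate_stripe(0, MIB, 4))             # first byte of the file
print(locate_stripe(5 * MIB + 10, MIB, 4))  # byte 10 of the sixth stripe
```
]]></programlisting>
           <para>With four objects and a 1 MiB stripe size, consecutive 1 MiB chunks of the file land
             on objects 0, 1, 2, 3, 0, 1 and so on, which is why a larger stripe count lets a single
             large file draw on the bandwidth of several OSTs.</para>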
 350     </section>
 351   </section>
 352   <section xml:id="understandinglustre.components">
 353     <title><indexterm>
 354         <primary>Lustre</primary>
 355         <secondary>components</secondary>
 356       </indexterm>Lustre Components</title>
 357     <para>An installation of the Lustre software includes a management server (MGS) and one or more
 358       Lustre file systems interconnected with Lustre networking (LNET).</para>
 359     <para>A basic configuration of Lustre file system components is shown in <xref
 360         linkend="understandinglustre.fig.cluster"/>.</para>
 361     <figure>
 362       <title xml:id="understandinglustre.fig.cluster">Lustre file system components in a basic
 363         cluster </title>
 364       <mediaobject>
 365         <imageobject>
 366           <imagedata scalefit="1" width="100%" fileref="./figures/Basic_Cluster.png"/>
 367         </imageobject>
 368         <textobject>
 369           <phrase> Lustre file system components in a basic cluster </phrase>
 370         </textobject>
 371       </mediaobject>
 372     </figure>
 373     <section remap="h3">
 374       <title><indexterm>
 375           <primary>Lustre</primary>
 376           <secondary>MGS</secondary>
 377         </indexterm>Management Server (MGS)</title>
 378       <para>The MGS stores configuration information for all the Lustre file systems in a cluster
 379         and provides this information to other Lustre components. Each Lustre target contacts the
 380         MGS to provide information, and Lustre clients contact the MGS to retrieve
 381         information.</para>
 382       <para>It is preferable that the MGS have its own storage space so that it can be managed
 383         independently. However, the MGS can be co-located and share storage space with an MDS as
 384         shown in <xref linkend="understandinglustre.fig.cluster"/>.</para>
 385     </section>
 386     <section remap="h3">
 387       <title>Lustre File System Components</title>
 388       <para>Each Lustre file system consists of the following components:</para>
 389       <itemizedlist>
 390         <listitem>
 391           <para><emphasis role="bold">Metadata Server (MDS)</emphasis> - The MDS makes metadata
 392             stored in one or more MDTs available to Lustre clients. Each MDS manages the names and
 393             directories in the Lustre file system(s) and provides network request handling for one
 394             or more local MDTs.</para>
 395         </listitem>
 396         <listitem>
 397           <para><emphasis role="bold">Metadata Target (MDT)</emphasis> - For Lustre software
 398             release 2.3 and earlier, each file system has one MDT. The MDT stores metadata (such as
 399             filenames, directories, permissions and file layout) on storage attached to an MDS.
 400             An MDT on a shared storage target can be available to multiple
 401             MDSs, although only one MDS can access it at a time. If an active MDS fails, a standby MDS
 402             can serve the MDT and make it available to clients. This is referred to as MDS
 403             failover.</para>
 404           <para condition="l24">Since Lustre software release 2.4, multiple MDTs are supported. Each
 405             file system has at least one MDT. An MDT on a shared storage target can be available via
 406             multiple MDSs, although only one MDS can export the MDT to the clients at one time. In an
 407             active/active configuration, two MDS machines share storage for two or more MDTs. After the failure of one MDS, the
 408             remaining MDS begins serving the MDT(s) of the failed MDS.</para>
 409         </listitem>
 410         <listitem>
 411           <para><emphasis role="bold">Object Storage Servers (OSS)</emphasis> - The OSS provides
 412             file I/O service and network request handling for one or more local OSTs. Typically, an
 413             OSS serves between two and eight OSTs, up to 16 TB each. A typical configuration is an
 414             MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a
 415             large number of compute nodes.</para>
 416         </listitem>
 417         <listitem>
 418           <para><emphasis role="bold">Object Storage Target (OST)</emphasis> - User file data is
 419             stored in one or more objects, each object on a separate OST in a Lustre file system.
 420             The number of objects per file is configurable by the user and can be tuned to optimize
 421             performance for a given workload.</para>
 422         </listitem>
 423         <listitem>
 424           <para><emphasis role="bold">Lustre clients</emphasis> - Lustre clients are computational,
 425             visualization or desktop nodes that are running Lustre client software, allowing them to
 426             mount the Lustre file system.</para>
 427         </listitem>
 428       </itemizedlist>
 429       <para>The Lustre client software provides an interface between the Linux virtual file system
 430         and the Lustre servers. The client software includes a management client (MGC), a metadata
 431         client (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in
 432         the file system.</para>
 433       <para>A logical object volume (LOV) aggregates the OSCs to provide transparent access across
 434         all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent,
 435         synchronized namespace. Several clients can write to different parts of the same file
 436         simultaneously, while, at the same time, other clients can read from the file.</para>
 437       <para><xref linkend="understandinglustre.tab.storagerequire"/> provides the requirements for
 438         attached storage for each Lustre file system component and describes desirable
 439         characteristics of the hardware used.</para>
 440       <table frame="all">
 441         <title xml:id="understandinglustre.tab.storagerequire"><indexterm>
 442             <primary>Lustre</primary>
 443             <secondary>requirements</secondary>
 444           </indexterm>Storage and hardware requirements for Lustre file system components</title>
 445         <tgroup cols="3">
 446           <colspec colname="c1" colwidth="1*"/>
 447           <colspec colname="c2" colwidth="3*"/>
 448           <colspec colname="c3" colwidth="3*"/>
 449           <thead>
 450             <row>
 451               <entry>
 452                 <para><emphasis role="bold"/></para>
 453               </entry>
 454               <entry>
 455                 <para><emphasis role="bold">Required attached storage</emphasis></para>
 456               </entry>
 457               <entry>
 458                 <para><emphasis role="bold">Desirable hardware characteristics</emphasis></para>
 459               </entry>
 460             </row>
 461           </thead>
 462           <tbody>
 463             <row>
 464               <entry>
 465                 <para>
 466                   <emphasis role="bold">MDSs</emphasis></para>
 467               </entry>
 468               <entry>
 469                 <para> 1-2% of file system capacity</para>
 470               </entry>
 471               <entry>
 472                 <para> Adequate CPU power, plenty of memory, fast disk storage.</para>
 473               </entry>
 474             </row>
 475             <row>
 476               <entry>
 477                 <para>
 478                   <emphasis role="bold">OSSs</emphasis></para>
 479               </entry>
 480               <entry>
 481                 <para> 1-16 TB per OST, 1-8 OSTs per OSS</para>
 482               </entry>
 483               <entry>
 484                 <para> Good bus bandwidth. Recommended that storage be balanced evenly across
 485                   OSSs.</para>
 486               </entry>
 487             </row>
 488             <row>
 489               <entry>
 490                 <para>
 491                   <emphasis role="bold">Clients</emphasis></para>
 492               </entry>
 493               <entry>
 494                 <para> None</para>
 495               </entry>
 496               <entry>
 497                 <para> Low latency, high bandwidth network.</para>
 498               </entry>
 499             </row>
 500           </tbody>
 501         </tgroup>
 502       </table>
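           <para>The guideline figures in the table above lend themselves to a quick
             back-of-the-envelope estimate. The sketch below assumes that the 1-2% MDT guideline
             applies to the total OST capacity; the helper names and the example cluster (4 OSSs,
             8 OSTs per OSS, 16 TB per OST) are made up for illustration.</para>
           <programlisting language="python"><![CDATA[
```python
def total_ost_capacity_tb(oss_count, osts_per_oss, ost_size_tb):
    """Total object storage capacity in TB for a uniform set of OSSs."""
    return oss_count * osts_per_oss * ost_size_tb


def mdt_capacity_range_tb(filesystem_capacity_tb):
    """Return (low, high) MDT storage in TB using the 1-2% guideline."""
    if filesystem_capacity_tb < 0:
        raise ValueError("capacity must be non-negative")
    return filesystem_capacity_tb * 0.01, filesystem_capacity_tb * 0.02


# Hypothetical small cluster: 4 OSSs, 8 OSTs each, 16 TB per OST.
capacity = total_ost_capacity_tb(oss_count=4, osts_per_oss=8, ost_size_tb=16)
low, high = mdt_capacity_range_tb(capacity)
print(f"{capacity} TB of OST space suggests {low:.2f}-{high:.2f} TB of MDT space")
```
]]></programlisting>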
 503       <para>For additional hardware requirements and considerations, see <xref
 504           linkend="settinguplustresystem"/>.</para>
 505     </section>
 506     <section remap="h3">
 507       <title><indexterm>
 508           <primary>Lustre</primary>
 509           <secondary>LNET</secondary>
 510         </indexterm>Lustre Networking (LNET)</title>
 511       <para>Lustre Networking (LNET) is a custom networking API that provides the communication
 512         infrastructure that handles metadata and file I/O data for the Lustre file system servers
 513         and clients. For more information about LNET, see <xref
 514           linkend="understandinglustrenetworking"/>.</para>
 515     </section>
 516     <section remap="h3">
 517       <title><indexterm>
 518           <primary>Lustre</primary>
 519           <secondary>cluster</secondary>
 520         </indexterm>Lustre Cluster</title>
 521       <para>At scale, a Lustre file system cluster can include hundreds of OSSs and thousands of
 522         clients (see <xref linkend="understandinglustre.fig.lustrescale"/>). More than one type of
 523         network can be used in a Lustre cluster. Shared storage between OSSs enables failover
 524         capability. For more details about OSS failover, see <xref linkend="understandingfailover"
 525         />.</para>
 526       <figure>
 527         <title xml:id="understandinglustre.fig.lustrescale"><indexterm>
 528             <primary>Lustre</primary>
 529             <secondary>at scale</secondary>
 530           </indexterm>Lustre cluster at scale</title>
 531         <mediaobject>
 532           <imageobject>
 533             <imagedata scalefit="1" width="100%" fileref="./figures/Scaled_Cluster.png"/>
 534           </imageobject>
 535           <textobject>
 536             <phrase> Lustre file system cluster at scale </phrase>
 537           </textobject>
 538         </mediaobject>
 539       </figure>
 540     </section>
 541   </section>
 542   <section xml:id="understandinglustre.storageio">
 543     <title><indexterm>
 544         <primary>Lustre</primary>
 545         <secondary>storage</secondary>
 546       </indexterm>
 547       <indexterm>
 548         <primary>Lustre</primary>
 549         <secondary>I/O</secondary>
 550       </indexterm> Lustre File System Storage and I/O</title>
 551     <para>In Lustre software release 2.0, Lustre file identifiers (FIDs) were introduced to replace
 552       UNIX inode numbers for identifying files or objects. A FID is a 128-bit identifier that
 553       contains a unique 64-bit sequence number, a 32-bit object ID (OID), and a 32-bit version
 554       number. The sequence number is unique across all Lustre targets in a file system (OSTs and
 555       MDTs). This change enabled future support for multiple MDTs and ZFS (both introduced in
 556       Lustre software release 2.4).</para>
 557     <para>Also introduced in release 2.0 is a feature called <emphasis role="italic"
 558         >FID-in-dirent</emphasis> (also known as <emphasis role="italic">dirdata</emphasis>) in
 559       which the FID is stored as part of the name of the file in the parent directory. This feature
 560       significantly improves performance for <literal>ls</literal> command executions by reducing
 561       disk I/O. The FID-in-dirent is generated at the time the file is created.</para>
 562     <note>
 563       <para>The FID-in-dirent feature is not compatible with the Lustre software release 1.8 format.
 564         Therefore, when an upgrade from Lustre software release 1.8 to a Lustre software release 2.x
 565         is performed, the FID-in-dirent feature is not automatically enabled. For upgrades from
 566         Lustre software release 1.8 to Lustre software releases 2.0 through 2.3, FID-in-dirent can
 567         be enabled manually but only takes effect for new files. </para>
 568       <para>For more information about upgrading from Lustre software release 1.8 and enabling
 569         FID-in-dirent for existing files, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 570           linkend="upgradinglustre"/>.</para>
 571     </note>
 572     <para condition="l24">The LFSCK 1.5 file system administration tool released with Lustre
 573       software release 2.4 provides functionality that enables FID-in-dirent for existing files. It
 574       includes the following functionality:<itemizedlist>
 575         <listitem>
 576           <para>Generates IGIF mode FIDs for existing release 1.8 files.</para>
 577         </listitem>
 578         <listitem>
 579           <para>Verifies the FID-in-dirent for each file to determine when it does not exist or is
 580             invalid and then regenerates the FID-in-dirent if needed.</para>
 581         </listitem>
 582         <listitem>
 583           <para>Verifies the linkEA entry for each file to determine when it is missing or invalid
 584             and then regenerates the linkEA if needed. The <emphasis role="italic">linkEA</emphasis>
 585             consists of the file name plus its parent FID and is stored as an extended attribute in
 586             the file itself. Thus, the linkEA can be used to parse out the full path name of a file
 587             from root.</para>
 588         </listitem>
 589       </itemizedlist></para>
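         <para>The way a linkEA allows a path name to be recovered can be sketched as follows. This
           is a simplified model, not LFSCK code: the dictionary stands in for the extended
           attributes, every FID value is a made-up sample, and each file is assumed to have a
           single name (no hard links).</para>
         <programlisting language="python"><![CDATA[
```python
# Hypothetical stand-in for linkEA data: FID -> (file name, parent FID).
ROOT_FID = "[0x200000007:0x1:0x0]"

LINK_EA = {
    "[0x200000400:0x1:0x0]": ("projects", ROOT_FID),
    "[0x200000400:0x2:0x0]": ("results", "[0x200000400:0x1:0x0]"),
    "[0x200000400:0x3:0x0]": ("run1.dat", "[0x200000400:0x2:0x0]"),
}


def fid_to_path(fid, link_ea=LINK_EA, root_fid=ROOT_FID):
    """Follow (name, parent FID) links upward until the root is reached."""
    names = []
    seen = set()
    while fid != root_fid:
        if fid in seen:
            raise ValueError("cycle in linkEA data")
        seen.add(fid)
        name, fid = link_ea[fid]  # KeyError means the linkEA is missing
        names.append(name)
    return "/" + "/".join(reversed(names))


print(fid_to_path("[0x200000400:0x3:0x0]"))  # /projects/results/run1.dat
```
]]></programlisting>
         <para>A missing dictionary entry in this model corresponds to the missing or invalid linkEA
           case that LFSCK detects and repairs.</para>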
 590     <para>Information about where file data is located on the OST(s) is stored as an extended
 591       attribute called layout EA in an MDT object identified by the FID for the file (see <xref
 592         xmlns:xlink="http://www.w3.org/1999/xlink" linkend="Fig1.3_LayoutEAonMDT"/>). If the file is
 593       a data file (not a directory or symbolic link), the MDT object points to 1-to-N OST object(s) on
 594       the OST(s) that contain the file data. If the MDT object points to one object, all the file data
 595       is stored in that object. If the MDT object points to more than one object, the file data is
 596         <emphasis role="italic">striped</emphasis> across the objects using RAID 0, and each object
 597       is stored on a different OST. (For more information about how striping is implemented in a
 598       Lustre file system, see <xref linkend="dbdoclet.50438250_89922"/>.)</para>
 599     <figure xml:id="Fig1.3_LayoutEAonMDT">
 600       <title>Layout EA on MDT pointing to file data on OSTs</title>
 601       <mediaobject>
 602         <imageobject>
 603           <imagedata scalefit="1" width="80%" fileref="./figures/Metadata_File.png"/>
 604         </imageobject>
 605         <textobject>
 606           <phrase> Layout EA on MDT pointing to file data on OSTs </phrase>
 607         </textobject>
 608       </mediaobject>
 609     </figure>
 610     <para>When a client wants to read from or write to a file, it first fetches the layout EA from
 611       the MDT object for the file. The client then uses this information to perform I/O on the file,
 612       directly interacting with the OSS nodes where the objects are stored.
 613       This process is illustrated
 614       in <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="Fig1.4_ClientReqstgData"
 615       />.</para>
 616     <figure xml:id="Fig1.4_ClientReqstgData">
 617       <title>Lustre client requesting file data</title>
 618       <mediaobject>
 619         <imageobject>
 620           <imagedata scalefit="1" width="75%" fileref="./figures/File_Write.png"/>
 621         </imageobject>
 622         <textobject>
 623           <phrase> Lustre client requesting file data </phrase>
 624         </textobject>
 625       </mediaobject>
 626     </figure>
 627     <para>The available bandwidth of a Lustre file system is determined as follows:</para>
 628     <itemizedlist>
 629       <listitem>
 630         <para>The <emphasis>network bandwidth</emphasis> equals the aggregated bandwidth of the OSSs
 631           to the targets.</para>
 632       </listitem>
 633       <listitem>
 634         <para>The <emphasis>disk bandwidth</emphasis> equals the sum of the disk bandwidths of the
 635           storage targets (OSTs) up to the limit of the network bandwidth.</para>
 636       </listitem>
 637       <listitem>
 638         <para>The <emphasis>aggregate bandwidth</emphasis> equals the minimum of the disk bandwidth
 639           and the network bandwidth.</para>
 640       </listitem>
 641       <listitem>
 642         <para>The <emphasis>available file system space</emphasis> equals the sum of the available
 643           space of all the OSTs.</para>
 644       </listitem>
 645     </itemizedlist>
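         <para>The rules in the list above amount to simple arithmetic, sketched below. The function
           name and the hardware figures are hypothetical and serve only to show how the minimum of
           the two bandwidths limits the result.</para>
         <programlisting language="python"><![CDATA[
```python
def aggregate_bandwidth(ost_disk_gbps, oss_network_gbps):
    """Return (disk, network, aggregate) bandwidth in GB/sec."""
    disk = sum(ost_disk_gbps)        # sum over all OSTs
    network = sum(oss_network_gbps)  # sum over all OSSs
    return disk, network, min(disk, network)


# Hypothetical system: 8 OSTs at 1.5 GB/sec each, 2 OSSs at 5 GB/sec each.
disk, network, aggregate = aggregate_bandwidth([1.5] * 8, [5.0] * 2)
print(disk, network, aggregate)  # 12.0 10.0 10.0
```
]]></programlisting>
         <para>In this made-up configuration the network is the limiting factor, so adding OSTs
           without adding OSS network capacity would not raise the aggregate bandwidth.</para>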
 646     <section xml:id="dbdoclet.50438250_89922">
 647       <title>
 648         <indexterm>
 649           <primary>Lustre</primary>
 650           <secondary>striping</secondary>
 651         </indexterm>
 652         <indexterm>
 653           <primary>striping</primary>
 654           <secondary>overview</secondary>
 655         </indexterm> Lustre File System and Striping</title>
 656       <para>One of the main factors leading to the high performance of Lustre file systems is the
 657         ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally
 658         configure for each file the number of stripes, stripe size, and OSTs that are used.</para>
 659       <para>Striping can be used to improve performance when the aggregate bandwidth to a single
 660         file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a
 661         single OST does not have enough free space to hold an entire file. For more information
 662         about benefits and drawbacks of file striping, see <xref linkend="dbdoclet.50438209_48033"
 663         />.</para>
 664       <para>Striping allows segments or &apos;chunks&apos; of data in a file to be stored on
 665         different OSTs, as shown in <xref linkend="understandinglustre.fig.filestripe"/>. In the
 666         Lustre file system, a RAID 0 pattern is used in which data is &quot;striped&quot; across a
 667         certain number of objects. The number of objects in a single file is called the
 668           <literal>stripe_count</literal>.</para>
 669       <para>Each object contains a chunk of data from the file. When the chunk of data being written
 670         to a particular object exceeds the <literal>stripe_size</literal>, the next chunk of data in
 671         the file is stored on the next object.</para>
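           <para>The round-robin placement described above can be expressed as a short calculation.
             The sketch below assumes plain RAID 0 arithmetic with a fixed
             <literal>stripe_size</literal> and <literal>stripe_count</literal>; the function name
             and the sample offsets are illustrative.</para>
           <programlisting language="python"><![CDATA[
```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (object index, byte offset in that object)."""
    chunk = offset // stripe_size       # which stripe_size chunk of the file
    obj_index = chunk % stripe_count    # chunks rotate round-robin over objects
    stripe_row = chunk // stripe_count  # full rounds completed before this chunk
    obj_offset = stripe_row * stripe_size + offset % stripe_size
    return obj_index, obj_offset


MIB = 1024 * 1024
for off in (0, 1 * MIB, 2 * MIB, 3 * MIB + 4096):
    print(off, locate(off, stripe_size=MIB, stripe_count=3))
# 0       -> (0, 0)
# 1048576 -> (1, 0)
# 2097152 -> (2, 0)
# 3149824 -> (0, 1052672)
```
]]></programlisting>
           <para>With three objects, the fourth chunk of the file wraps around to the first object
             and is stored directly after that object's first chunk.</para>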
 672       <para>Default values for <literal>stripe_count</literal> and <literal>stripe_size</literal>
 673         are set for the file system. The default value for <literal>stripe_count</literal> is 1
 674         stripe per file and the default value for <literal>stripe_size</literal> is 1 MB. The user
 675         may change these values on a per directory or per file basis. For more details, see <xref
 676           linkend="dbdoclet.50438209_78664"/>.</para>
 677       <para>In <xref linkend="understandinglustre.fig.filestripe"/>, the <literal>stripe_size</literal>
 678         for File C is larger than the <literal>stripe_size</literal> for File A, allowing more data
 679         to be stored in a single stripe for File C. The <literal>stripe_count</literal> for File A
 680         is 3, resulting in data striped across three objects, while the
 681           <literal>stripe_count</literal> for File B and File C is 1.</para>
 682       <para>No space is reserved on the OST for unwritten data. For example, File A in <xref
 683           linkend="understandinglustre.fig.filestripe"/> is sparse and is missing chunk 6.</para>
 684       <figure>
 685         <title xml:id="understandinglustre.fig.filestripe">File striping on a Lustre file
 686           system</title>
 687         <mediaobject>
 688           <imageobject>
 689             <imagedata scalefit="1" width="100%" fileref="./figures/File_Striping.png"/>
 690           </imageobject>
 691           <textobject>
 692             <phrase>File striping pattern across three OSTs for three different data files. The file
 693               is sparse and missing chunk 6. </phrase>
 694           </textobject>
 695         </mediaobject>
 696       </figure>
 697       <para>The maximum file size is not limited by the size of a single target. In a Lustre file
 698         system, files can be striped across multiple objects (up to 2000), and each object can be
 699         up to 16 TB in size with ldiskfs, or up to 256 PB with ZFS. This leads to a maximum file
 700         size of 31.25 PB for ldiskfs or 8 EB with ZFS. Note that a Lustre file system can support
 701         files up to 2^63 bytes (8 EB), limited only by the space available on the OSTs.</para>
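           <para>The maximum file sizes quoted above follow from multiplying the stripe limit by the
             per-object limit and capping the result at 2^63 bytes. The check below assumes binary
             units (1 PB = 1024 TB), which is the reading under which 2000 objects of 16 TB give
             31.25 PB; the constant and function names are illustrative.</para>
           <programlisting language="python"><![CDATA[
```python
TIB = 2**40
PIB = 2**50
EIB = 2**60

MAX_STRIPES = 2000         # maximum objects per file
FILE_SIZE_LIMIT = 2**63    # largest supported file size in bytes


def max_file_size(max_object_size):
    """Stripe limit times object limit, capped at the 2^63-byte file limit."""
    return min(MAX_STRIPES * max_object_size, FILE_SIZE_LIMIT)


ldiskfs = max_file_size(16 * TIB)
zfs = max_file_size(256 * PIB)
print(ldiskfs / PIB)  # 31.25
print(zfs / EIB)      # 8.0
```
]]></programlisting>
           <para>For ZFS the product of the two limits exceeds 2^63 bytes, so the file size limit,
             not the object size, determines the maximum.</para>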
 702       <note>
 703         <para>Versions of the Lustre software prior to release 2.2 limited the maximum stripe count
 704           for a single file to 160 OSTs.</para>
 705       </note>
 706       <para>Although a single file can only be striped over 2000 objects, Lustre file systems can
 707         have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O
 708         bandwidth to the objects in the file, which can be as much as the bandwidth of 2000
 709         servers. On systems with more than 2000 OSTs, clients can do I/O using multiple files to
 710         utilize the full file system bandwidth.</para>
 711       <para>For more information about striping, see <xref linkend="managingstripingfreespace"
 712         />.</para>
 713     </section>
 714   </section>
 715 </chapter>