UnderstandingLustre.xml

   1 <?xml version='1.0' encoding='utf-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3  xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4  xml:id="understandinglustre">
   5   <title xml:id="understandinglustre.title">Understanding Lustre
   6   Architecture</title>
   7   <para>This chapter describes the Lustre architecture and features of the
   8   Lustre file system. It includes the following sections:</para>
   9   <itemizedlist>
  10     <listitem>
  11       <para>
  12         <xref linkend="understandinglustre.whatislustre" />
  13       </para>
  14     </listitem>
  15     <listitem>
  16       <para>
  17         <xref linkend="understandinglustre.components" />
  18       </para>
  19     </listitem>
  20     <listitem>
  21       <para>
  22         <xref linkend="understandinglustre.storageio" />
  23       </para>
  24     </listitem>
  25   </itemizedlist>
  26   <section xml:id="understandinglustre.whatislustre">
  27     <title>
  28     <indexterm>
  29       <primary>Lustre</primary>
  30     </indexterm>What a Lustre File System Is (and What It Isn't)</title>
  31     <para>The Lustre architecture is a storage architecture for clusters. The
  32     central component of the Lustre architecture is the Lustre file system,
  33     which is supported on the Linux operating system and provides a POSIX
  34     <superscript>*</superscript>standard-compliant UNIX file system
  35     interface.</para>
  36     <para>The Lustre storage architecture is used for many different kinds of
  37     clusters. It is best known for powering many of the largest
  38     high-performance computing (HPC) clusters worldwide, with tens of thousands
  39     of client systems, petabytes (PiB) of storage and hundreds of gigabytes per
  40     second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system
  41     as a site-wide global file system, serving dozens of clusters.</para>
  42     <para>The ability of a Lustre file system to scale capacity and performance
  43     for any need reduces the need to deploy many separate file systems, such as
  44     one for each compute cluster. Storage management is simplified by avoiding
  45     the need to copy data between compute clusters. In addition to aggregating
  46     storage capacity of many servers, the I/O throughput is also aggregated and
  47     scales with additional servers. Moreover, throughput and/or capacity can be
  48     easily increased by adding servers dynamically.</para>
  49     <para>While a Lustre file system can function in many work environments, it
  50     is not necessarily the best choice for all applications. It is best suited
  51     for uses that exceed the capacity that a single server can provide, though
  52     in some use cases, a Lustre file system can perform better with a single
  53     server than other file systems due to its strong locking and data
  54     coherency.</para>
  55     <para>A Lustre file system is currently not particularly well suited for
  56     "peer-to-peer" usage models where clients and servers are running on the
  57     same node, each sharing a small amount of storage, due to the lack of data
  58     replication at the Lustre software level. In such uses, if one
  59     client/server fails, then the data stored on that node will not be
  60     accessible until the node is restarted.</para>
  61     <section remap="h3">
  62       <title>
  63       <indexterm>
  64         <primary>Lustre</primary>
  65         <secondary>features</secondary>
  66       </indexterm>Lustre Features</title>
  67       <para>Lustre file systems run on a variety of vendor's kernels. For more
  68       details, see the Lustre Test Matrix
  69       <xref xmlns:xlink="http://www.w3.org/1999/xlink"
  70        linkend="preparing_installation" />.</para>
  71       <para>A Lustre installation can be scaled up or down with respect to the
  72       number of client nodes, disk storage and bandwidth. Scalability and
  73       performance are dependent on available disk and network bandwidth and the
  74       processing power of the servers in the system. A Lustre file system can
  75       be deployed in a wide variety of configurations that can be scaled well
  76       beyond the size and performance observed in production systems to
  77       date.</para>
  78       <para>
  79       <xref linkend="understandinglustre.tab1" /> shows some of the
  80       scalability and performance characteristics of a Lustre file system.
  81       For a full list of Lustre file and filesystem limits see
  82       <xref linkend="settinguplustresystem.tab2"/>.</para>
  83       <table frame="all" xml:id="understandinglustre.tab1">
  84         <title>Lustre File System Scalability and Performance</title>
  85         <tgroup cols="3">
  86           <colspec colname="c1" colwidth="1*" />
  87           <colspec colname="c2" colwidth="2*" />
  88           <colspec colname="c3" colwidth="3*" />
  89           <thead>
  90             <row>
  91               <entry>
  92                 <para>
  93                   <emphasis role="bold">Feature</emphasis>
  94                 </para>
  95               </entry>
  96               <entry>
  97                 <para>
  98                   <emphasis role="bold">Current Practical Range</emphasis>
  99                 </para>
 100               </entry>
 101               <entry>
 102                 <para>
 103                   <emphasis role="bold">Known Production Usage</emphasis>
 104                 </para>
 105               </entry>
 106             </row>
 107           </thead>
 108           <tbody>
 109             <row>
 110               <entry>
 111                 <para>
 112                   <emphasis role="bold">Client Scalability</emphasis>
 113                 </para>
 114               </entry>
 115               <entry>
 116                 <para>100-100000</para>
 117               </entry>
 118               <entry>
 119                 <para>50000+ clients, many in the 10000 to 20000 range</para>
 120               </entry>
 121             </row>
 122             <row>
 123               <entry>
 124                 <para>
 125                   <emphasis role="bold">Client Performance</emphasis>
 126                 </para>
 127               </entry>
 128               <entry>
 129                 <para>
 130                   <emphasis>Single client:</emphasis>
 131                 </para>
 132                 <para>I/O 90% of network bandwidth</para>
 133                 <para>
 134                   <emphasis>Aggregate:</emphasis>
 135                 </para>
 136                 <para>50 TB/sec I/O, 50M IOPS</para>
 137               </entry>
 138               <entry>
 139                 <para>
 140                   <emphasis>Single client:</emphasis>
 141                 </para>
 142                 <para>15 GB/sec I/O (HDR IB), 50000 IOPS</para>
 143                 <para>
 144                   <emphasis>Aggregate:</emphasis>
 145                 </para>
 146                 <para>10 TB/sec I/O, 10M IOPS</para>
 147               </entry>
 148             </row>
 149             <row>
 150               <entry>
 151                 <para>
 152                   <emphasis role="bold">OSS Scalability</emphasis>
 153                 </para>
 154               </entry>
 155               <entry>
 156                 <para>
 157                   <emphasis>Single OSS:</emphasis>
 158                 </para>
 159                 <para>1-32 OSTs per OSS</para>
 160                 <para>
 161                   <emphasis>Single OST:</emphasis>
 162                 </para>
 163                 <para>500M objects, 1024TiB per OST</para>
 164                 <para>
 165                   <emphasis>OSS count:</emphasis>
 166                 </para>
 167                 <para>1000 OSSs, 4000 OSTs</para>
 168               </entry>
 169               <entry>
 170                 <para>
 171                   <emphasis>Single OSS:</emphasis>
 172                 </para>
 173                 <para>4 OSTs per OSS</para>
 174                 <para>
 175                   <emphasis>Single OST:</emphasis>
 176                 </para>
 177                 <para>1024TiB OSTs</para>
 178                 <para>
 179                   <emphasis>OSS count:</emphasis>
 180                 </para>
 181                 <para>450 OSSs with 900 750TiB HDD OSTs + 450 25TiB NVMe OSTs</para>
 182                 <para>1024 OSSs with 1024 72TiB OSTs</para>
 183               </entry>
 184             </row>
 185             <row>
 186               <entry>
 187                 <para>
 188                   <emphasis role="bold">OSS Performance</emphasis>
 189                 </para>
 190               </entry>
 191               <entry>
 192                 <para>
 193                   <emphasis>Single OSS:</emphasis>
 194                 </para>
 195                 <para>15 GB/sec, 1.5M IOPS</para>
 196                 <para>
 197                   <emphasis>Aggregate:</emphasis>
 198                 </para>
 199                 <para>50 TB/sec, 50M IOPS</para>
 200               </entry>
 201               <entry>
 202                 <para>
 203                   <emphasis>Single OSS:</emphasis>
 204                 </para>
 205                 <para>10 GB/sec, 1.5M IOPS</para>
 206                 <para>
 207                   <emphasis>Aggregate:</emphasis>
 208                 </para>
 209                 <para>20 TB/sec, 20M IOPS</para>
 210               </entry>
 211             </row>
 212             <row>
 213               <entry>
 214                 <para>
 215                   <emphasis role="bold">MDS Scalability</emphasis>
 216                 </para>
 217               </entry>
 218               <entry>
 219                 <para>
 220                   <emphasis>Single MDS:</emphasis>
 221                 </para>
 222                 <para>1-4 MDTs per MDS</para>
 223                 <para>
 224                   <emphasis>Single MDT:</emphasis>
 225                 </para>
 226                 <para>4 billion files, 16TiB per MDT (ldiskfs)</para>
 227                 <para>64 billion files, 64TiB per MDT (ZFS)</para>
 228                 <para>
 229                   <emphasis>MDS count:</emphasis>
 230                 </para>
 231                 <para>256 MDSs, up to 256 MDTs</para>
 232               </entry>
 233               <entry>
 234                 <para>
 235                   <emphasis>Single MDS:</emphasis>
 236                 </para>
 237                 <para>4 billion files</para>
 238                 <para>
 239                   <emphasis>MDS count:</emphasis>
 240                 </para>
 241                 <para>40 MDS with 40 4TiB MDTs in production</para>
 242                 <para>256 MDS with 256 64GiB MDTs in testing</para>
 243               </entry>
 244             </row>
 245             <row>
 246               <entry>
 247                 <para>
 248                   <emphasis role="bold">MDS Performance</emphasis>
 249                 </para>
 250               </entry>
 251               <entry>
 252                 <para>1M/s create operations</para>
 253                 <para>2M/s stat operations</para>
 254               </entry>
 255               <entry>
 256                 <para>100k/s create operations,</para>
 257                 <para>200k/s metadata stat operations</para>
 258               </entry>
 259             </row>
 260             <row>
 261               <entry>
 262                 <para>
 263                   <emphasis role="bold">File system Scalability</emphasis>
 264                 </para>
 265               </entry>
 266               <entry>
 267                 <para>
 268                   <emphasis>Single File:</emphasis>
 269                 </para>
 270                 <para>32 PiB max file size (ldiskfs)</para>
 271                 <para>2^63 bytes (ZFS)</para>
 272                 <para>
 273                   <emphasis>Aggregate:</emphasis>
 274                 </para>
 275                 <para>512 PiB space, 1 trillion files</para>
 276               </entry>
 277               <entry>
 278                 <para>
 279                   <emphasis>Single File:</emphasis>
 280                 </para>
 281                 <para>multi-TiB max file size</para>
 282                 <para>
 283                   <emphasis>Aggregate:</emphasis>
 284                 </para>
 285                 <para>700 PiB space, 25 billion files</para>
 286               </entry>
 287             </row>
 288           </tbody>
 289         </tgroup>
 290       </table>
 291       <para>Other Lustre software features are:</para>
 292       <itemizedlist>
 293         <listitem>
 294           <para>
 295           <emphasis role="bold">Performance-enhanced ext4 file
 296           system:</emphasis>The Lustre file system uses an improved version of
 297           the ext4 journaling file system to store data and metadata. This
 298           version, called
 299           <emphasis role="italic">
 300             <literal>ldiskfs</literal>
 301           </emphasis>, has been enhanced to improve performance and provide
 302           additional functionality needed by the Lustre file system.</para>
 303         </listitem>
 304         <listitem>
 305           <para>It is also possible to use ZFS as the backing filesystem for
 306           Lustre for the MDT, OST, and MGS storage. This allows Lustre to
 307           leverage the scalability and data integrity features of ZFS for
 308           individual storage targets.</para>
 309         </listitem>
 310         <listitem>
 311           <para>
 312           <emphasis role="bold">POSIX standard compliance:</emphasis>The full
 313           POSIX test suite passes in an identical manner to a local ext4 file
 314           system, with limited exceptions on Lustre clients. In a cluster, most
 315           operations are atomic so that clients never see stale data or
 316           metadata. The Lustre software supports mmap() file I/O.</para>
 317         </listitem>
 318         <listitem>
 319           <para>
 320           <emphasis role="bold">High-performance heterogeneous
 321           networking:</emphasis>The Lustre software supports a variety of high
 322           performance, low latency networks and permits Remote Direct Memory
 323           Access (RDMA) for InfiniBand
 324           <superscript>*</superscript>(utilizing OpenFabrics Enterprise
 325           Distribution (OFED<superscript>*</superscript>), Intel OmniPath®,
 326           and other advanced networks for fast
 327           and efficient network transport. Multiple RDMA networks can be
 328           bridged using Lustre routing for maximum performance. The Lustre
 329           software also includes integrated network diagnostics.</para>
 330         </listitem>
 331         <listitem>
 332           <para>
 333           <emphasis role="bold">High-availability:</emphasis>The Lustre file
 334           system supports active/active failover using shared storage
 335           partitions for OSS targets (OSTs), and for MDS targets (MDTs).
 336           The Lustre file system can work
 337           with a variety of high availability (HA) managers to allow automated
 338           failover and has no single point of failure (NSPF). This allows
 339           application transparent recovery. Multiple mount protection (MMP)
 340           provides integrated protection from errors in highly-available
 341           systems that would otherwise cause file system corruption.</para>
 342         </listitem>
 343         <listitem>
 344           <para>
 345           <emphasis role="bold">Security:</emphasis>By default TCP connections
 346           are only allowed from privileged ports. UNIX group membership is
 347           verified on the MDS.</para>
 348         </listitem>
 349         <listitem>
 350           <para>
 351           <emphasis role="bold">Access control list (ACL), extended
 352           attributes:</emphasis>the Lustre security model follows that of a
 353           UNIX file system, enhanced with POSIX ACLs. Noteworthy additional
 354           features include root squash.</para>
 355         </listitem>
 356         <listitem>
 357           <para>
 358           <emphasis role="bold">Interoperability:</emphasis>The Lustre file
 359           system runs on a variety of CPU architectures and mixed-endian
 360           clusters and is interoperable between successive major Lustre
 361           software releases.</para>
 362         </listitem>
 363         <listitem>
 364           <para>
 365           <emphasis role="bold">Object-based architecture:</emphasis>Clients
 366           are isolated from the on-disk file structure enabling upgrading of
 367           the storage architecture without affecting the client.</para>
 368         </listitem>
 369         <listitem>
 370           <para>
 371           <emphasis role="bold">Byte-granular file and fine-grained metadata
 372           locking:</emphasis>Many clients can read and modify the same file or
 373           directory concurrently. The Lustre distributed lock manager (LDLM)
 374           ensures that files are coherent between all clients and servers in
 375           the file system. The MDT LDLM manages locks on inode permissions and
 376           pathnames. Each OST has its own LDLM for locks on file stripes stored
 377           thereon, which scales the locking performance as the file system
 378           grows.</para>
 379         </listitem>
 380         <listitem>
 381           <para>
 382           <emphasis role="bold">Quotas:</emphasis>User and group quotas are
 383           available for a Lustre file system.</para>
 384         </listitem>
 385         <listitem>
 386           <para>
 387           <emphasis role="bold">Capacity growth:</emphasis>The size of a Lustre
 388           file system and aggregate cluster bandwidth can be increased without
 389           interruption by adding new OSTs and MDTs to the cluster.</para>
 390         </listitem>
 391         <listitem>
 392           <para>
 393           <emphasis role="bold">Controlled file layout:</emphasis>The layout of
 394           files across OSTs can be configured on a per file, per directory, or
 395           per file system basis. This allows file I/O to be tuned to specific
 396           application requirements within a single file system. The Lustre file
 397           system uses RAID-0 striping and balances space usage across
 398           OSTs.</para>
 399         </listitem>
 400         <listitem>
 401           <para>
 402           <emphasis role="bold">Network data integrity protection:</emphasis>A
 403           checksum of all data sent from the client to the OSS protects against
 404           corruption during data transfer.</para>
 405         </listitem>
 406         <listitem>
 407           <para>
 408           <emphasis role="bold">MPI I/O:</emphasis>The Lustre architecture has
 409           a dedicated MPI ADIO layer that optimizes parallel I/O to match the
 410           underlying file system architecture.</para>
 411         </listitem>
 412         <listitem>
 413           <para>
 414           <emphasis role="bold">NFS and CIFS export:</emphasis>Lustre files can
 415           be re-exported using NFS (via Linux knfsd or Ganesha) or CIFS (via
 416           Samba), enabling them to be shared with non-Linux clients such as
 417           Microsoft<superscript>*</superscript>Windows,
 418           <superscript>*</superscript>Apple
 419           <superscript>*</superscript>Mac OS X
 420           <superscript>*</superscript>, and others.</para>
 421         </listitem>
 422         <listitem>
 423           <para>
 424           <emphasis role="bold">Disaster recovery tool:</emphasis>The Lustre
 425           file system provides an online distributed file system check (LFSCK)
 426           that can restore consistency between storage components in case of a
 427           major file system error. A Lustre file system can operate even in the
 428           presence of file system inconsistencies, and LFSCK can run while the
 429           filesystem is in use, so LFSCK is not required to complete before
 430           returning the file system to production.</para>
 431         </listitem>
 432         <listitem>
 433           <para>
 434           <emphasis role="bold">Performance monitoring:</emphasis>The Lustre
 435           file system offers a variety of mechanisms to examine performance and
 436           tuning.</para>
 437         </listitem>
 438         <listitem>
 439           <para>
 440           <emphasis role="bold">Open source:</emphasis>The Lustre software is
 441           licensed under the GPL 2.0 license for use with the Linux operating
 442           system.</para>
 443         </listitem>
 444       </itemizedlist>
 445     </section>
 446   </section>
 447   <section xml:id="understandinglustre.components">
 448     <title>
 449     <indexterm>
 450       <primary>Lustre</primary>
 451       <secondary>components</secondary>
 452     </indexterm>Lustre Components</title>
 453     <para>An installation of the Lustre software includes a management server
 454     (MGS) and one or more Lustre file systems interconnected with Lustre
 455     networking (LNet).</para>
 456     <para>A basic configuration of Lustre file system components is shown in
 457     <xref linkend="understandinglustre.fig.cluster" />.</para>
 458     <figure xml:id="understandinglustre.fig.cluster">
 459       <title>Lustre file system components in a basic cluster</title>
 460       <mediaobject>
 461         <imageobject>
 462           <imagedata scalefit="1" width="100%"
 463           fileref="./figures/Basic_Cluster.png" />
 464         </imageobject>
 465         <textobject>
 466           <phrase>Lustre file system components in a basic cluster</phrase>
 467         </textobject>
 468       </mediaobject>
 469     </figure>
 470     <section remap="h3">
 471       <title>
 472       <indexterm>
 473         <primary>Lustre</primary>
 474         <secondary>MGS</secondary>
 475       </indexterm>Management Server (MGS)</title>
 476       <para>The MGS stores configuration information for all the Lustre file
 477       systems in a cluster and provides this information to other Lustre
 478       components. Each Lustre target contacts the MGS to provide information,
 479       and Lustre clients contact the MGS to retrieve information.</para>
 480       <para>It is preferable that the MGS have its own storage space so that it
 481       can be managed independently. However, the MGS can be co-located and
 482       share storage space with an MDS as shown in
 483       <xref linkend="understandinglustre.fig.cluster" />.</para>
 484     </section>
 485     <section remap="h3">
 486       <title>Lustre File System Components</title>
 487       <para>Each Lustre file system consists of the following
 488       components:</para>
 489       <itemizedlist>
 490         <listitem>
 491           <para>
 492           <emphasis role="bold">Metadata Servers (MDS)</emphasis>- The MDS makes
 493           metadata stored in one or more MDTs available to Lustre clients. Each
 494           MDS manages the names and directories in the Lustre file system(s)
 495           and provides network request handling for one or more local
 496           MDTs.</para>
 497         </listitem>
 498         <listitem>
 499           <para>
 500           <emphasis role="bold">Metadata Targets (MDT</emphasis>) - Each
 501           filesystem has at least one MDT, which holds the root directory. The
 502           MDT stores metadata (such as filenames, directories, permissions and
 503           file layout) on storage attached to an MDS. Each file system has one
 504           MDT. An MDT on a shared storage target can be available to multiple
 505           MDSs, although only one can access it at a time. If an active MDS
 506           fails, a second MDS node can serve the MDT and make it available to
 507           clients. This is referred to as MDS failover.</para>
 508           <para>Multiple MDTs are supported with the Distributed Namespace
 509           Environment (<xref linkend="DNE"/>).
 510           In addition to the primary MDT that holds the filesystem root, it
 511           is possible to add additional MDS nodes, each with their own MDTs,
 512           to hold sub-directory trees of the filesystem.</para>
 513           <para condition="l28">Since Lustre software release 2.8, DNE also
 514           allows the filesystem to distribute files of a single directory over
 515           multiple MDT nodes. A directory which is distributed across multiple
 516           MDTs is known as a <emphasis><xref linkend="stripeddirectory"/></emphasis>.</para>
 517         </listitem>
 518         <listitem>
 519           <para>
 520           <emphasis role="bold">Object Storage Servers (OSS)</emphasis>: The
 521           OSS provides file I/O service and network request handling for one or
 522           more local OSTs. Typically, an OSS serves between two and eight OSTs,
 523           up to 16 TiB each. A typical configuration is an MDT on a dedicated
 524           node, two or more OSTs on each OSS node, and a client on each of a
 525           large number of compute nodes.</para>
 526         </listitem>
 527         <listitem>
 528           <para>
 529           <emphasis role="bold">Object Storage Target (OST)</emphasis>: User
 530           file data is stored in one or more objects, each object on a separate
 531           OST in a Lustre file system. The number of objects per file is
 532           configurable by the user and can be tuned to optimize performance for
 533           a given workload.</para>
 534         </listitem>
 535         <listitem>
 536           <para>
 537           <emphasis role="bold">Lustre clients</emphasis>: Lustre clients are
 538           computational, visualization or desktop nodes that are running Lustre
 539           client software, allowing them to mount the Lustre file
 540           system.</para>
 541         </listitem>
 542       </itemizedlist>
 543       <para>The Lustre client software provides an interface between the Linux
 544       virtual file system and the Lustre servers. The client software includes
 545       a management client (MGC), a metadata client (MDC), and multiple object
 546       storage clients (OSCs), one corresponding to each OST in the file
 547       system.</para>
 548       <para>A logical object volume (LOV) aggregates the OSCs to provide
 549       transparent access across all the OSTs. Thus, a client with the Lustre
 550       file system mounted sees a single, coherent, synchronized namespace.
 551       Several clients can write to different parts of the same file
 552       simultaneously, while, at the same time, other clients can read from the
 553       file.</para>
 554       <para>A logical metadata volume (LMV) aggregates the MDCs to provide
 555       transparent access across all the MDTs in a similar manner as the LOV
 556       does for file access.  This allows the client to see the directory tree
 557       on multiple MDTs as a single coherent namespace, and striped directories
 558       are merged on the clients to form a single visible directory to users
 559       and applications.
 560       </para>
 561       <para>
 562       <xref linkend="understandinglustre.tab.storagerequire" />provides the
 563       requirements for attached storage for each Lustre file system component
 564       and describes desirable characteristics of the hardware used.</para>
 565       <table frame="all" xml:id="understandinglustre.tab.storagerequire">
 566         <title>
 567         <indexterm>
 568           <primary>Lustre</primary>
 569           <secondary>requirements</secondary>
 570         </indexterm>Storage and hardware requirements for Lustre file system
 571         components</title>
 572         <tgroup cols="3">
 573           <colspec colname="c1" colwidth="1*" />
 574           <colspec colname="c2" colwidth="3*" />
 575           <colspec colname="c3" colwidth="3*" />
 576           <thead>
 577             <row>
 578               <entry>
 579                 <para>
 580                   <emphasis role="bold" />
 581                 </para>
 582               </entry>
 583               <entry>
 584                 <para>
 585                   <emphasis role="bold">Required attached storage</emphasis>
 586                 </para>
 587               </entry>
 588               <entry>
 589                 <para>
 590                   <emphasis role="bold">Desirable hardware
 591                   characteristics</emphasis>
 592                 </para>
 593               </entry>
 594             </row>
 595           </thead>
 596           <tbody>
 597             <row>
 598               <entry>
 599                 <para>
 600                   <emphasis role="bold">MDSs</emphasis>
 601                 </para>
 602               </entry>
 603               <entry>
 604                 <para>1-2% of file system capacity</para>
 605               </entry>
 606               <entry>
 607                 <para>Adequate CPU power, plenty of memory, fast disk
 608                 storage.</para>
 609               </entry>
 610             </row>
 611             <row>
 612               <entry>
 613                 <para>
 614                   <emphasis role="bold">OSSs</emphasis>
 615                 </para>
 616               </entry>
 617               <entry>
 618                 <para>1-128 TiB per OST, 1-8 OSTs per OSS</para>
 619               </entry>
 620               <entry>
 621                 <para>Good bus bandwidth. Recommended that storage be balanced
 622                 evenly across OSSs and matched to network bandwidth.</para>
 623               </entry>
 624             </row>
 625             <row>
 626               <entry>
 627                 <para>
 628                   <emphasis role="bold">Clients</emphasis>
 629                 </para>
 630               </entry>
 631               <entry>
 632                 <para>No local storage needed</para>
 633               </entry>
 634               <entry>
 635                 <para>Low latency, high bandwidth network.</para>
 636               </entry>
 637             </row>
 638           </tbody>
 639         </tgroup>
 640       </table>
 641       <para>For additional hardware requirements and considerations, see
 642       <xref linkend="settinguplustresystem" />.</para>
 643     </section>
 644     <section remap="h3">
 645       <title>
 646       <indexterm>
 647         <primary>Lustre</primary>
 648         <secondary>LNet</secondary>
 649       </indexterm>Lustre Networking (LNet)</title>
 650       <para>Lustre Networking (LNet) is a custom networking API that provides
 651       the communication infrastructure that handles metadata and file I/O data
 652       for the Lustre file system servers and clients. For more information
 653       about LNet, see
 654       <xref linkend="understandinglustrenetworking" />.</para>
 655     </section>
 656     <section remap="h3">
 657       <title>
 658       <indexterm>
 659         <primary>Lustre</primary>
 660         <secondary>cluster</secondary>
 661       </indexterm>Lustre Cluster</title>
 662       <para>At scale, a Lustre file system cluster can include hundreds of OSSs
 663       and thousands of clients (see
 664       <xref linkend="understandinglustre.fig.lustrescale" />). More than one
 665       type of network can be used in a Lustre cluster. Shared storage between
 666       OSSs enables failover capability. For more details about OSS failover,
 667       see
 668       <xref linkend="understandingfailover" />.</para>
 669       <figure xml:id="understandinglustre.fig.lustrescale">
 670         <title>
 671         <indexterm>
 672           <primary>Lustre</primary>
 673           <secondary>at scale</secondary>
 674         </indexterm>Lustre cluster at scale</title>
 675         <mediaobject>
 676           <imageobject>
 677             <imagedata scalefit="1" width="100%"
 678             fileref="./figures/Scaled_Cluster.png" />
 679           </imageobject>
 680           <textobject>
 681             <phrase>Lustre file system cluster at scale</phrase>
 682           </textobject>
 683         </mediaobject>
 684       </figure>
 685     </section>
 686   </section>
 687   <section xml:id="understandinglustre.storageio">
 688     <title>
 689     <indexterm>
 690       <primary>Lustre</primary>
 691       <secondary>storage</secondary>
 692     </indexterm>
 693     <indexterm>
 694       <primary>Lustre</primary>
 695       <secondary>I/O</secondary>
 696     </indexterm>Lustre File System Storage and I/O</title>
 697     <para>Lustre File IDentifiers (FIDs) are used internally for identifying
 698     files or objects, similar to inode numbers in local filesystems.  A FID
 699     is a 128-bit identifier, which contains a unique 64-bit sequence number
 700     (SEQ), a 32-bit object ID (OID), and a 32-bit version number. The sequence
 701     number is unique across all Lustre targets in a file system (OSTs and
 702     MDTs). This allows multiple MDTs and OSTs to uniquely identify objects
 703     without depending on identifiers in the underlying filesystem (e.g. inode
 704     numbers) that are likely to be duplicated between targets.  The FID SEQ
 705     number also allows mapping a FID to a particular MDT or OST.</para>
 706     <para>The LFSCK file system consistency checking tool provides
 707     functionality that enables FID-in-dirent for existing files. It
 708     includes the following functionality:
 709     <itemizedlist>
 710       <listitem>
 711         <para>Verifies the FID stored with each directory entry and regenerates
 712         it from the inode if it is invalid or missing.</para>
 713       </listitem>
 714       <listitem>
 715         <para>Verifies the linkEA entry for each inode and regenerates it if
 716         invalid or missing. The <emphasis role="italic">linkEA</emphasis>
 717         stores of the file name and parent FID. It is stored as an extended
 718         attribute in each inode. Thus, the linkEA can be used to
 719         reconstruct the full path name of a file from only the FID.</para>
 720       </listitem>
 721     </itemizedlist></para>
 722     <para>Information about where file data is located on the OST(s) is stored
 723     as an extended attribute called layout EA in an MDT object identified by
 724     the FID for the file (see
 725     <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 726     linkend="Fig1.3_LayoutEAonMDT" />). If the file is a regular file (not a
 727     directory or symbol link), the MDT object points to 1-to-N OST object(s) on
 728     the OST(s) that contain the file data. If the MDT file points to one
 729     object, all the file data is stored in that object. If the MDT file points
 730     to more than one object, the file data is
 731     <emphasis role="italic">striped</emphasis> across the objects using RAID 0,
 732     and each object is stored on a different OST. (For more information about
 733     how striping is implemented in a Lustre file system, see
 734     <xref linkend="lustre_striping" />.</para>
 735     <figure xml:id="Fig1.3_LayoutEAonMDT">
 736       <title>Layout EA on MDT pointing to file data on OSTs</title>
 737       <mediaobject>
 738         <imageobject>
 739           <imagedata scalefit="1" width="80%"
 740           fileref="./figures/Metadata_File.png" />
 741         </imageobject>
 742         <textobject>
 743           <phrase>Layout EA on MDT pointing to file data on OSTs</phrase>
 744         </textobject>
 745       </mediaobject>
 746     </figure>
 747     <para>When a client wants to read from or write to a file, it first fetches
 748     the layout EA from the MDT object for the file. The client then uses this
 749     information to perform I/O on the file, directly interacting with the OSS
 750     nodes where the objects are stored.
 751     <?oxy_custom_start type="oxy_content_highlight" color="255,255,0"?>
 752     This process is illustrated in
 753     <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 754     linkend="Fig1.4_ClientReqstgData" /><?oxy_custom_end?>
 755     .</para>
 756     <figure xml:id="Fig1.4_ClientReqstgData">
 757       <title>Lustre client requesting file data</title>
 758       <mediaobject>
 759         <imageobject>
 760           <imagedata scalefit="1" width="75%"
 761           fileref="./figures/File_Write.png" />
 762         </imageobject>
 763         <textobject>
 764           <phrase>Lustre client requesting file data</phrase>
 765         </textobject>
 766       </mediaobject>
 767     </figure>
 768     <para>The available bandwidth of a Lustre file system is determined as
 769     follows:</para>
 770     <itemizedlist>
 771       <listitem>
 772         <para>The
 773         <emphasis>network bandwidth</emphasis> equals the aggregated bandwidth
 774         of the OSSs to the targets.</para>
 775       </listitem>
 776       <listitem>
 777         <para>The
 778         <emphasis>disk bandwidth</emphasis> equals the sum of the disk
 779         bandwidths of the storage targets (OSTs) up to the limit of the network
 780         bandwidth.</para>
 781       </listitem>
 782       <listitem>
 783         <para>The
 784         <emphasis>aggregate bandwidth</emphasis> equals the minimum of the disk
 785         bandwidth and the network bandwidth.</para>
 786       </listitem>
 787       <listitem>
 788         <para>The
 789         <emphasis>available file system space</emphasis> equals the sum of the
 790         available space of all the OSTs.</para>
 791       </listitem>
 792     </itemizedlist>
 793     <section xml:id="lustre_striping">
 794       <title>
 795       <indexterm>
 796         <primary>Lustre</primary>
 797         <secondary>striping</secondary>
 798       </indexterm>
 799       <indexterm>
 800         <primary>striping</primary>
 801         <secondary>overview</secondary>
 802       </indexterm>Lustre File System and Striping</title>
 803       <para>One of the main factors leading to the high performance of Lustre
 804       file systems is the ability to stripe data across multiple OSTs in a
 805       round-robin fashion. Users can optionally configure for each file the
 806       number of stripes, stripe size, and OSTs that are used.</para>
 807       <para>Striping can be used to improve performance when the aggregate
 808       bandwidth to a single file exceeds the bandwidth of a single OST. The
 809       ability to stripe is also useful when a single OST does not have enough
 810       free space to hold an entire file. For more information about benefits
 811       and drawbacks of file striping, see
 812       <xref linkend="file_striping.considerations" />.</para>
 813       <para>Striping allows segments or 'chunks' of data in a file to be stored
 814       on different OSTs, as shown in
 815       <xref linkend="understandinglustre.fig.filestripe" />. In the Lustre file
 816       system, a RAID 0 pattern is used in which data is "striped" across a
 817       certain number of objects. The number of objects in a single file is
 818       called the
 819       <literal>stripe_count</literal>.</para>
 820       <para>Each object contains a chunk of data from the file. When the chunk
 821       of data being written to a particular object exceeds the
 822       <literal>stripe_size</literal>, the next chunk of data in the file is
 823       stored on the next object.</para>
 824       <para>Default values for
 825       <literal>stripe_count</literal> and
 826       <literal>stripe_size</literal> are set for the file system. The default
 827       value for
 828       <literal>stripe_count</literal> is 1 stripe for file and the default value
 829       for
 830       <literal>stripe_size</literal> is 1MB. The user may change these values on
 831       a per directory or per file basis. For more details, see
 832       <xref linkend="file_striping.lfs_setstripe" />.</para>
 833       <para>
 834       <xref linkend="understandinglustre.fig.filestripe" />, the
 835       <literal>stripe_size</literal> for File C is larger than the
 836       <literal>stripe_size</literal> for File A, allowing more data to be stored
 837       in a single stripe for File C. The
 838       <literal>stripe_count</literal> for File A is 3, resulting in data striped
 839       across three objects, while the
 840       <literal>stripe_count</literal> for File B and File C is 1.</para>
 841       <para>No space is reserved on the OST for unwritten data. File A in
 842       <xref linkend="understandinglustre.fig.filestripe" />.</para>
 843       <figure xml:id="understandinglustre.fig.filestripe">
 844         <title>File striping on a
 845         Lustre file system</title>
 846         <mediaobject>
 847           <imageobject>
 848             <imagedata scalefit="1" width="100%"
 849             fileref="./figures/File_Striping.png" />
 850           </imageobject>
 851           <textobject>
 852             <phrase>File striping pattern across three OSTs for three different
 853             data files. The file is sparse and missing chunk 6.</phrase>
 854           </textobject>
 855         </mediaobject>
 856       </figure>
 857       <para>The maximum file size is not limited by the size of a single
 858       target. In a Lustre file system, files can be striped across multiple
 859       objects (up to 2000), and each object can be up to 16 TiB in size with
 860       ldiskfs, or up to 256PiB with ZFS. This leads to a maximum file size of
 861       31.25 PiB for ldiskfs or 8EiB with ZFS. Note that a Lustre file system can
 862       support files up to 2^63 bytes (8EiB), limited only by the space available
 863       on the OSTs.</para>
 864       <note>
 865         <para>ldiskfs filesystems without the <literal>ea_inode</literal>
 866         feature limit the maximum stripe count for a single file to 160 OSTs.
 867         </para>
 868       </note>
 869       <para>Although a single file can only be striped over 2000 objects,
 870       Lustre file systems can have thousands of OSTs. The I/O bandwidth to
 871       access a single file is the aggregated I/O bandwidth to the objects in a
 872       file, which can be as much as a bandwidth of up to 2000 servers. On
 873       systems with more than 2000 OSTs, clients can do I/O using multiple files
 874       to utilize the full file system bandwidth.</para>
 875       <para>For more information about striping, see
 876       <xref linkend="managingstripingfreespace" />.</para>
 877       <para>
 878         <emphasis role="bold">Extended Attributes(xattrs)</emphasis></para>
 879          <para>Lustre uses lov_user_md_v1/lov_user_md_v3 data-structures to
 880          maintain its file striping information under xattrs. Extended
 881          attributes are created when files and directory are created. Lustre
 882          uses <literal>trusted</literal> extended attributes to store its
 883          parameters which are root-only accessible. The parameters are:</para>
 884       <itemizedlist>
 885         <listitem>
 886           <para>
 887             <emphasis role="bold"><literal>trusted.lov</literal>:</emphasis>
 888             Holds layout for a regular file, or default file layout stored
 889             on a directory (also accessible as <literal>lustre.lov</literal>
 890             for non-root users).
 891           </para>
 892         </listitem>
 893         <listitem>
 894           <para>
 895             <emphasis role="bold"><literal>trusted.lma</literal>:</emphasis>
 896             Holds FID and extra state flags for current file</para>
 897         </listitem>
 898         <listitem>
 899           <para>
 900             <emphasis role="bold"><literal>trusted.lmv</literal>:</emphasis>
 901             Holds layout for a striped directory (DNE 2), not present otherwise
 902           </para>
 903         </listitem>
 904         <listitem>
 905           <para>
 906             <emphasis role="bold"><literal>trusted.link</literal>:</emphasis>
 907             Holds parent directory FID + filename for each link to a file
 908             (for <literal>lfs fid2path</literal>)</para>
 909         </listitem>
 910       </itemizedlist>
 911       <para>xattr which are stored and present in the file could be verify
 912         using:</para>
 913       <para><screen># getfattr -d -m - /mnt/testfs/file></screen></para>
 914     </section>
 915   </section>
 916 </chapter>
 917 <!--
 918   vim:expandtab:shiftwidth=2:tabstop=8:textwidth=80:
 919   -->