UnderstandingLustre.xml

   1 <?xml version='1.0' encoding='utf-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4 xml:id="understandinglustre">
   5   <title xml:id="understandinglustre.title">Understanding Lustre
   6   Architecture</title>
   7   <para>This chapter describes the Lustre architecture and features of the
   8   Lustre file system. It includes the following sections:</para>
   9   <itemizedlist>
  10     <listitem>
  11       <para>
  12         <xref linkend="understandinglustre.whatislustre" />
  13       </para>
  14     </listitem>
  15     <listitem>
  16       <para>
  17         <xref linkend="understandinglustre.components" />
  18       </para>
  19     </listitem>
  20     <listitem>
  21       <para>
  22         <xref linkend="understandinglustre.storageio" />
  23       </para>
  24     </listitem>
  25   </itemizedlist>
  26   <section xml:id="understandinglustre.whatislustre">
  27     <title>
  28     <indexterm>
  29       <primary>Lustre</primary>
  30     </indexterm>What a Lustre File System Is (and What It Isn't)</title>
  31     <para>The Lustre architecture is a storage architecture for clusters. The
  32     central component of the Lustre architecture is the Lustre file system,
  33     which is supported on the Linux operating system and provides a
  34     POSIX<superscript>*</superscript> standard-compliant UNIX file system
  35     interface.</para>
  36     <para>The Lustre storage architecture is used for many different kinds of
  37     clusters. It is best known for powering many of the largest
  38     high-performance computing (HPC) clusters worldwide, with tens of thousands
  39     of client systems, petabytes (PB) of storage and hundreds of gigabytes per
  40     second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system
  41     as a site-wide global file system, serving dozens of clusters.</para>
  42     <para>The ability of a Lustre file system to scale capacity and performance
  43     reduces the need to deploy many separate file systems, such as one for
  44     each compute cluster. Storage management is simplified by avoiding the
  45     need to copy data between compute clusters. A Lustre file system
  46     aggregates both the storage capacity and the I/O throughput of many
  47     servers, and both scale as servers are added. Moreover, throughput and/or
  48     capacity can be easily increased by adding servers dynamically.</para>
  49     <para>While a Lustre file system can function in many work environments, it
  50     is not necessarily the best choice for all applications. It is best suited
  51     for uses that exceed the capacity that a single server can provide, though
  52     in some use cases, a Lustre file system can perform better with a single
  53     server than other file systems due to its strong locking and data
  54     coherency.</para>
  55     <para>A Lustre file system is currently not particularly well suited for
  56     "peer-to-peer" usage models where clients and servers are running on the
  57     same node, each sharing a small amount of storage, due to the lack of data
  58     replication at the Lustre software level. In such uses, if one
  59     client/server fails, then the data stored on that node will not be
  60     accessible until the node is restarted.</para>
  61     <section remap="h3">
  62       <title>
  63       <indexterm>
  64         <primary>Lustre</primary>
  65         <secondary>features</secondary>
  66       </indexterm>Lustre Features</title>
  67       <para>Lustre file systems run on a variety of vendors' kernels. For more
  68       details, see the Lustre Test Matrix
  69       <xref xmlns:xlink="http://www.w3.org/1999/xlink"
  70       linkend="dbdoclet.50438261_99193" />.</para>
  71       <para>A Lustre installation can be scaled up or down with respect to the
  72       number of client nodes, disk storage and bandwidth. Scalability and
  73       performance are dependent on available disk and network bandwidth and the
  74       processing power of the servers in the system. A Lustre file system can
  75       be deployed in a wide variety of configurations that can be scaled well
  76       beyond the size and performance observed in production systems to
  77       date.</para>
  78       <para>
  79       <xref linkend="understandinglustre.tab1" /> shows the practical range of
  80       scalability and performance characteristics of a Lustre file system and
  81       some test results in production systems.</para>
  82       <table frame="all">
  83         <title xml:id="understandinglustre.tab1">Lustre File System Scalability
  84         and Performance</title>
  85         <tgroup cols="3">
  86           <colspec colname="c1" colwidth="1*" />
  87           <colspec colname="c2" colwidth="2*" />
  88           <colspec colname="c3" colwidth="3*" />
  89           <thead>
  90             <row>
  91               <entry>
  92                 <para>
  93                   <emphasis role="bold">Feature</emphasis>
  94                 </para>
  95               </entry>
  96               <entry>
  97                 <para>
  98                   <emphasis role="bold">Current Practical Range</emphasis>
  99                 </para>
 100               </entry>
 101               <entry>
 102                 <para>
 103                   <emphasis role="bold">Known Production Usage</emphasis>
 104                 </para>
 105               </entry>
 106             </row>
 107           </thead>
 108           <tbody>
 109             <row>
 110               <entry>
 111                 <para>
 112                   <emphasis role="bold">Client Scalability</emphasis>
 113                 </para>
 114               </entry>
 115               <entry>
 116                 <para>100-100000</para>
 117               </entry>
 118               <entry>
 119                 <para>50000+ clients, many in the 10000 to 20000 range</para>
 120               </entry>
 121             </row>
 122             <row>
 123               <entry>
 124                 <para>
 125                   <emphasis role="bold">Client Performance</emphasis>
 126                 </para>
 127               </entry>
 128               <entry>
 129                 <para>
 130                   <emphasis>Single client:</emphasis>
 131                 </para>
 132                 <para>I/O 90% of network bandwidth</para>
 133                 <para>
 134                   <emphasis>Aggregate:</emphasis>
 135                 </para>
 136                 <para>2.5 TB/sec I/O</para>
 137               </entry>
 138               <entry>
 139                 <para>
 140                   <emphasis>Single client:</emphasis>
 141                 </para>
 142                 <para>2 GB/sec I/O, 1000 metadata ops/sec</para>
 143                 <para>
 144                   <emphasis>Aggregate:</emphasis>
 145                 </para>
 146                 <para>2.5 TB/sec I/O </para>
 147               </entry>
 148             </row>
 149             <row>
 150               <entry>
 151                 <para>
 152                   <emphasis role="bold">OSS Scalability</emphasis>
 153                 </para>
 154               </entry>
 155               <entry>
 156                 <para>
 157                   <emphasis>Single OSS:</emphasis>
 158                 </para>
 159                 <para>1-32 OSTs per OSS,</para>
 160                 <para>128TB per OST</para>
 161                 <para>
 162                   <emphasis>OSS count:</emphasis>
 163                 </para>
 164                 <para>1000 OSSs, with up to 4000 OSTs</para>
 165               </entry>
 166               <entry>
 167                 <para>
 168                   <emphasis>Single OSS:</emphasis>
 169                 </para>
 170                 <para>32x 8TB OSTs per OSS,</para>
 171                 <para>8x 32TB OSTs per OSS</para>
 172                 <para>
 173                   <emphasis>OSS count:</emphasis>
 174                 </para>
 175                 <para>450 OSSs with 1000 4TB OSTs</para>
 176                 <para>192 OSSs with 1344 8TB OSTs</para>
 177                 <para>768 OSSs with 768 72TB OSTs</para>
 178               </entry>
 179             </row>
 180             <row>
 181               <entry>
 182                 <para>
 183                   <emphasis role="bold">OSS Performance</emphasis>
 184                 </para>
 185               </entry>
 186               <entry>
 187                 <para>
 188                   <emphasis>Single OSS:</emphasis>
 189                 </para>
 190                 <para>5 GB/sec</para>
 191                 <para>
 192                   <emphasis>Aggregate:</emphasis>
 193                 </para>
 194                 <para>10 TB/sec</para>
 195               </entry>
 196               <entry>
 197                 <para>
 198                   <emphasis>Single OSS:</emphasis>
 199                 </para>
 200                 <para>2.0+ GB/sec</para>
 201                 <para>
 202                   <emphasis>Aggregate:</emphasis>
 203                 </para>
 204                 <para>2.5 TB/sec</para>
 205               </entry>
 206             </row>
 207             <row>
 208               <entry>
 209                 <para>
 210                   <emphasis role="bold">MDS Scalability</emphasis>
 211                 </para>
 212               </entry>
 213               <entry>
 214                 <para>
 215                   <emphasis>Single MDT:</emphasis>
 216                 </para>
 217                 <para>4 billion files (ldiskfs), 256 trillion files
 218                 (ZFS)</para>
 219                 <para>
 220                   <emphasis>MDS count:</emphasis>
 221                 </para>
 222                 <para>1 primary + 1 backup</para>
 223                 <para condition="l24">Up to 256 MDTs and up to 256 MDSs</para>
 224               </entry>
 225               <entry>
 226                 <para>
 227                   <emphasis>Single MDT:</emphasis>
 228                 </para>
 229                 <para>2 billion files</para>
 230                 <para>
 231                   <emphasis>MDS count:</emphasis>
 232                 </para>
 233                 <para>1 primary + 1 backup</para>
 234               </entry>
 235             </row>
 236             <row>
 237               <entry>
 238                 <para>
 239                   <emphasis role="bold">MDS Performance</emphasis>
 240                 </para>
 241               </entry>
 242               <entry>
 243                 <para>50000/s create operations,</para>
 244                 <para>200000/s metadata stat operations</para>
 245               </entry>
 246               <entry>
 247                 <para>15000/s create operations,</para>
 248                 <para>50000/s metadata stat operations</para>
 249               </entry>
 250             </row>
 251             <row>
 252               <entry>
 253                 <para>
 254                   <emphasis role="bold">File System Scalability</emphasis>
 255                 </para>
 256               </entry>
 257               <entry>
 258                 <para>
 259                   <emphasis>Single File:</emphasis>
 260                 </para>
 261                 <para>32 PB max file size (ldiskfs), 2^63 bytes (ZFS)</para>
 262                 <para>
 263                   <emphasis>Aggregate:</emphasis>
 264                 </para>
 265                 <para>512 PB space, 32 billion files</para>
 266               </entry>
 267               <entry>
 268                 <para>
 269                   <emphasis>Single File:</emphasis>
 270                 </para>
 271                 <para>multi-TB max file size</para>
 272                 <para>
 273                   <emphasis>Aggregate:</emphasis>
 274                 </para>
 275                 <para>55 PB space, 2 billion files</para>
 276               </entry>
 277             </row>
 278           </tbody>
 279         </tgroup>
 280       </table>
 281       <para>Other Lustre software features are:</para>
 282       <itemizedlist>
 283         <listitem>
 284           <para>
 285           <emphasis role="bold">Performance-enhanced ext4 file
 286           system:</emphasis> The Lustre file system uses an improved version of
 287           the ext4 journaling file system to store data and metadata. This
 288           version, called
 289           <emphasis role="italic">
 290             <literal>ldiskfs</literal>
 291           </emphasis>, has been enhanced to improve performance and provide
 292           additional functionality needed by the Lustre file system.</para>
 293         </listitem>
 294         <listitem>
 295           <para condition="l24">With Lustre software release 2.4 and later,
 296           it is also possible to use ZFS as the backing file system for Lustre
 297           for the MDT, OST, and MGS storage. This allows Lustre to leverage the
 298           scalability and data integrity features of ZFS for individual storage
 299           targets.</para>
 300         </listitem>
 301         <listitem>
 302           <para>
 303           <emphasis role="bold">POSIX standard compliance:</emphasis> The full
 304           POSIX test suite passes in an identical manner to a local ext4 file
 305           system, with limited exceptions on Lustre clients. In a cluster, most
 306           operations are atomic so that clients never see stale data or
 307           metadata. The Lustre software supports mmap() file I/O.</para>
 308         </listitem>
 309         <listitem>
 310           <para>
 311           <emphasis role="bold">High-performance heterogeneous
 312           networking:</emphasis> The Lustre software supports a variety of
 313           high-performance, low-latency networks and permits Remote Direct
 314           Memory Access (RDMA) for
 315           InfiniBand<superscript>*</superscript> (utilizing OpenFabrics
 316           Enterprise Distribution (OFED<superscript>*</superscript>)) and
 317           other advanced networks for fast and efficient network transport.
 318           Multiple RDMA networks can be
 319           bridged using Lustre routing for maximum performance. The Lustre
 320           software also includes integrated network diagnostics.</para>
 321         </listitem>
 322         <listitem>
 323           <para>
 324           <emphasis role="bold">High-availability:</emphasis> The Lustre file
 325           system supports active/active failover using shared storage
 326           partitions for OSS targets (OSTs). Lustre software release 2.3 and
 327           earlier releases offer active/passive failover using a shared storage
 328           partition for the MDS target (MDT). The Lustre file system can work
 329           with a variety of high availability (HA) managers to allow automated
 330           failover and has no single point of failure (NSPF). This allows
 331           application-transparent recovery. Multiple mount protection (MMP)
 332           provides integrated protection from errors in highly available
 333           systems that would otherwise cause file system corruption.</para>
 334         </listitem>
 335         <listitem>
 336           <para condition="l24">With Lustre software release 2.4 or later
 337           servers and clients, it is possible to configure active/active
 338           failover of multiple MDTs. This allows scaling the metadata
 339           performance of Lustre file systems with the addition of MDT storage
 340           devices and MDS nodes.</para>
 341         </listitem>
 342         <listitem>
 343           <para>
 344           <emphasis role="bold">Security:</emphasis> By default, TCP connections
 345           are only allowed from privileged ports. UNIX group membership is
 346           verified on the MDS.</para>
 347         </listitem>
 348         <listitem>
 349           <para>
 350           <emphasis role="bold">Access control list (ACL), extended
 351           attributes:</emphasis> The Lustre security model follows that of a
 352           UNIX file system, enhanced with POSIX ACLs. Noteworthy additional
 353           features include root squash.</para>
 354         </listitem>
 355         <listitem>
 356           <para>
 357           <emphasis role="bold">Interoperability:</emphasis> The Lustre file
 358           system runs on a variety of CPU architectures and mixed-endian
 359           clusters and is interoperable between successive major Lustre
 360           software releases.</para>
 361         </listitem>
 362         <listitem>
 363           <para>
 364           <emphasis role="bold">Object-based architecture:</emphasis> Clients
 365           are isolated from the on-disk file structure, which allows the
 366           storage architecture to be upgraded without affecting the client.</para>
 367         </listitem>
 368         <listitem>
 369           <para>
 370           <emphasis role="bold">Byte-granular file and fine-grained metadata
 371           locking:</emphasis> Many clients can read and modify the same file or
 372           directory concurrently. The Lustre distributed lock manager (LDLM)
 373           ensures that files are coherent between all clients and servers in
 374           the file system. The MDT LDLM manages locks on inode permissions and
 375           pathnames. Each OST has its own LDLM for locks on file stripes stored
 376           thereon, which scales the locking performance as the file system
 377           grows.</para>
 378         </listitem>
 379         <listitem>
 380           <para>
 381           <emphasis role="bold">Quotas:</emphasis> User and group quotas are
 382           available for a Lustre file system.</para>
 383         </listitem>
 384         <listitem>
 385           <para>
 386           <emphasis role="bold">Capacity growth:</emphasis> The size of a Lustre
 387           file system and aggregate cluster bandwidth can be increased without
 388           interruption by adding a new OSS with OSTs to the cluster.</para>
 389         </listitem>
 390         <listitem>
 391           <para>
 392           <emphasis role="bold">Controlled striping:</emphasis> The layout of
 393           files across OSTs can be configured on a per file, per directory, or
 394           per file system basis. This allows file I/O to be tuned to specific
 395           application requirements within a single file system. The Lustre file
 396           system uses RAID-0 striping and balances space usage across
 397           OSTs.</para>
 398         </listitem>
 399         <listitem>
 400           <para>
 401           <emphasis role="bold">Network data integrity protection:</emphasis> A
 402           checksum of all data sent from the client to the OSS protects against
 403           corruption during data transfer.</para>
 404         </listitem>
 405         <listitem>
 406           <para>
 407           <emphasis role="bold">MPI I/O:</emphasis> The Lustre architecture has
 408           a dedicated MPI ADIO layer that optimizes parallel I/O to match the
 409           underlying file system architecture.</para>
 410         </listitem>
 411         <listitem>
 412           <para>
 413           <emphasis role="bold">NFS and CIFS export:</emphasis> Lustre files can
 414           be re-exported using NFS (via Linux knfsd) or CIFS (via Samba),
 415           enabling them to be shared with non-Linux clients, such as
 416           Microsoft<superscript>*</superscript>
 417           Windows<superscript>*</superscript> and
 418           Apple<superscript>*</superscript>
 419           Mac OS X<superscript>*</superscript>.</para>
 420         </listitem>
 421         <listitem>
 422           <para>
 423           <emphasis role="bold">Disaster recovery tool:</emphasis> The Lustre
 424           file system provides an online distributed file system check (LFSCK)
 425           that can restore consistency between storage components in case of a
 426           major file system error. A Lustre file system can operate even in the
 427           presence of file system inconsistencies, and LFSCK can run while the
 428           file system is in use, so LFSCK is not required to complete before
 429           returning the file system to production.</para>
 430         </listitem>
 431         <listitem>
 432           <para>
 433           <emphasis role="bold">Performance monitoring:</emphasis> The Lustre
 434           file system offers a variety of mechanisms to examine performance and
 435           tuning.</para>
 436         </listitem>
 437         <listitem>
 438           <para>
 439           <emphasis role="bold">Open source:</emphasis> The Lustre software is
 440           licensed under the GPL version 2.0 for use with the Linux operating
 441           system.</para>
 442         </listitem>
 443       </itemizedlist>
 444     </section>
 445   </section>
 446   <section xml:id="understandinglustre.components">
 447     <title>
 448     <indexterm>
 449       <primary>Lustre</primary>
 450       <secondary>components</secondary>
 451     </indexterm>Lustre Components</title>
 452     <para>An installation of the Lustre software includes a management server
 453     (MGS) and one or more Lustre file systems interconnected with Lustre
 454     networking (LNET).</para>
 455     <para>A basic configuration of Lustre file system components is shown in
 456     <xref linkend="understandinglustre.fig.cluster" />.</para>
 457     <figure>
 458       <title xml:id="understandinglustre.fig.cluster">Lustre file system
 459       components in a basic cluster</title>
 460       <mediaobject>
 461         <imageobject>
 462           <imagedata scalefit="1" width="100%"
 463           fileref="./figures/Basic_Cluster.png" />
 464         </imageobject>
 465         <textobject>
 466           <phrase>Lustre file system components in a basic cluster</phrase>
 467         </textobject>
 468       </mediaobject>
 469     </figure>
 470     <section remap="h3">
 471       <title>
 472       <indexterm>
 473         <primary>Lustre</primary>
 474         <secondary>MGS</secondary>
 475       </indexterm>Management Server (MGS)</title>
 476       <para>The MGS stores configuration information for all the Lustre file
 477       systems in a cluster and provides this information to other Lustre
 478       components. Each Lustre target contacts the MGS to provide information,
 479       and Lustre clients contact the MGS to retrieve information.</para>
 480       <para>It is preferable that the MGS have its own storage space so that it
 481       can be managed independently. However, the MGS can be co-located and
 482       share storage space with an MDS as shown in
 483       <xref linkend="understandinglustre.fig.cluster" />.</para>
 484     </section>
 485     <section remap="h3">
 486       <title>Lustre File System Components</title>
 487       <para>Each Lustre file system consists of the following
 488       components:</para>
 489       <itemizedlist>
 490         <listitem>
 491           <para>
 492           <emphasis role="bold">Metadata Server (MDS)</emphasis>: The MDS makes
 493           metadata stored in one or more MDTs available to Lustre clients. Each
 494           MDS manages the names and directories in the Lustre file system(s)
 495           and provides network request handling for one or more local
 496           MDTs.</para>
 497         </listitem>
 498         <listitem>
 499           <para>
 500           <emphasis role="bold">Metadata Target (MDT)</emphasis>: For Lustre
 501           software release 2.3 and earlier, each file system has one MDT. The
 502           MDT stores metadata (such as filenames, directories, permissions and
 503           file layout) on storage attached to an MDS. An MDT on a shared
 504           storage target can be available to multiple
 505           MDSs, although only one can access it at a time. If an active MDS
 506           fails, a standby MDS can serve the MDT and make it available to
 507           clients. This is referred to as MDS failover.</para>
 508           <para condition="l24">Since Lustre software release 2.4, multiple
 509           MDTs are supported. Each file system has at least one MDT. An MDT on
 510           a shared storage target can be available via multiple MDSs, although
 511           only one MDS can export the MDT to the clients at one time. Two MDS
 512           machines share storage for two or more MDTs. After the failure of one
 513           MDS, the remaining MDS begins serving the MDT(s) of the failed
 514           MDS.</para>
 515           <para condition="l28">Since Lustre software release 2.8,
 516           multiple MDTs can be employed to share the inode records for files
 517           contained in a single directory. A directory for which inode records
 518           are distributed across multiple MDTs is known as a <emphasis>striped
 519           directory</emphasis>. In a Lustre file system, the inode
 520           records may also be referred to as the 'metadata' portion of the
 521           file record.</para>
 522         </listitem>
 523         <listitem>
 524           <para>
 525           <emphasis role="bold">Object Storage Servers (OSS)</emphasis>: The
 526           OSS provides file I/O service and network request handling for one or
 527           more local OSTs. Typically, an OSS serves between two and eight OSTs,
 528           up to 16 TB each. A typical configuration is an MDT on a dedicated
 529           node, two or more OSTs on each OSS node, and a client on each of a
 530           large number of compute nodes.</para>
 531         </listitem>
 532         <listitem>
 533           <para>
 534           <emphasis role="bold">Object Storage Target (OST)</emphasis>: User
 535           file data is stored in one or more objects, each object on a separate
 536           OST in a Lustre file system. The number of objects per file is
 537           configurable by the user and can be tuned to optimize performance for
 538           a given workload.</para>
 539         </listitem>
 540         <listitem>
 541           <para>
 542           <emphasis role="bold">Lustre clients</emphasis>: Lustre clients are
 543           computational, visualization or desktop nodes that are running Lustre
 544           client software, allowing them to mount the Lustre file
 545           system.</para>
 546         </listitem>
 547       </itemizedlist>
 548       <para>The Lustre client software provides an interface between the Linux
 549       virtual file system and the Lustre servers. The client software includes
 550       a management client (MGC), a metadata client (MDC), and multiple object
 551       storage clients (OSCs), one corresponding to each OST in the file
 552       system.</para>
 553       <para>A logical object volume (LOV) aggregates the OSCs to provide
 554       transparent access across all the OSTs. Thus, a client with the Lustre
 555       file system mounted sees a single, coherent, synchronized namespace.
 556       Several clients can write to different parts of the same file
 557       simultaneously, while, at the same time, other clients can read from the
 558       file.</para>
 559       <para>
 560       <xref linkend="understandinglustre.tab.storagerequire" /> provides the
 561       requirements for attached storage for each Lustre file system component
 562       and describes desirable characteristics of the hardware used.</para>
 563       <table frame="all">
 564         <title xml:id="understandinglustre.tab.storagerequire">
 565         <indexterm>
 566           <primary>Lustre</primary>
 567           <secondary>requirements</secondary>
 568         </indexterm>Storage and hardware requirements for Lustre file system
 569         components</title>
 570         <tgroup cols="3">
 571           <colspec colname="c1" colwidth="1*" />
 572           <colspec colname="c2" colwidth="3*" />
 573           <colspec colname="c3" colwidth="3*" />
 574           <thead>
 575             <row>
 576               <entry>
 577                 <para>
 578                   <emphasis role="bold" />
 579                 </para>
 580               </entry>
 581               <entry>
 582                 <para>
 583                   <emphasis role="bold">Required attached storage</emphasis>
 584                 </para>
 585               </entry>
 586               <entry>
 587                 <para>
 588                   <emphasis role="bold">Desirable hardware
 589                   characteristics</emphasis>
 590                 </para>
 591               </entry>
 592             </row>
 593           </thead>
 594           <tbody>
 595             <row>
 596               <entry>
 597                 <para>
 598                   <emphasis role="bold">MDSs</emphasis>
 599                 </para>
 600               </entry>
 601               <entry>
 602                 <para>1-2% of file system capacity</para>
 603               </entry>
 604               <entry>
 605                 <para>Adequate CPU power, plenty of memory, fast disk
 606                 storage.</para>
 607               </entry>
 608             </row>
 609             <row>
 610               <entry>
 611                 <para>
 612                   <emphasis role="bold">OSSs</emphasis>
 613                 </para>
 614               </entry>
 615               <entry>
 616                 <para>1-16 TB per OST, 1-8 OSTs per OSS</para>
 617               </entry>
 618               <entry>
 619                 <para>Good bus bandwidth. Recommended that storage be balanced
 620                 evenly across OSSs.</para>
 621               </entry>
 622             </row>
 623             <row>
 624               <entry>
 625                 <para>
 626                   <emphasis role="bold">Clients</emphasis>
 627                 </para>
 628               </entry>
 629               <entry>
 630                 <para>None</para>
 631               </entry>
 632               <entry>
 633                 <para>Low latency, high bandwidth network.</para>
 634               </entry>
 635             </row>
 636           </tbody>
 637         </tgroup>
 638       </table>
 639       <para>For additional hardware requirements and considerations, see
 640       <xref linkend="settinguplustresystem" />.</para>
 641     </section>
 642     <section remap="h3">
 643       <title>
 644       <indexterm>
 645         <primary>Lustre</primary>
 646         <secondary>LNET</secondary>
 647       </indexterm>Lustre Networking (LNET)</title>
 648       <para>Lustre Networking (LNET) is a custom networking API that provides
 649       the communication infrastructure that handles metadata and file I/O data
 650       for the Lustre file system servers and clients. For more information
 651       about LNET, see
 652       <xref linkend="understandinglustrenetworking" />.</para>
 653     </section>
 654     <section remap="h3">
 655       <title>
 656       <indexterm>
 657         <primary>Lustre</primary>
 658         <secondary>cluster</secondary>
 659       </indexterm>Lustre Cluster</title>
 660       <para>At scale, a Lustre file system cluster can include hundreds of OSSs
 661       and thousands of clients (see
 662       <xref linkend="understandinglustre.fig.lustrescale" />). More than one
 663       type of network can be used in a Lustre cluster. Shared storage between
 664       OSSs enables failover capability. For more details about OSS failover,
 665       see
 666       <xref linkend="understandingfailover" />.</para>
 667       <figure>
 668         <title xml:id="understandinglustre.fig.lustrescale">
 669         <indexterm>
 670           <primary>Lustre</primary>
 671           <secondary>at scale</secondary>
 672         </indexterm>Lustre cluster at scale</title>
 673         <mediaobject>
 674           <imageobject>
 675             <imagedata scalefit="1" width="100%"
 676             fileref="./figures/Scaled_Cluster.png" />
 677           </imageobject>
 678           <textobject>
 679             <phrase>Lustre file system cluster at scale</phrase>
 680           </textobject>
 681         </mediaobject>
 682       </figure>
 683     </section>
 684   </section>
 685   <section xml:id="understandinglustre.storageio">
 686     <title>
 687     <indexterm>
 688       <primary>Lustre</primary>
 689       <secondary>storage</secondary>
 690     </indexterm>
 691     <indexterm>
 692       <primary>Lustre</primary>
 693       <secondary>I/O</secondary>
 694     </indexterm>Lustre File System Storage and I/O</title>
 695     <para>In Lustre software release 2.0, Lustre file identifiers (FIDs) were
 696     introduced to replace UNIX inode numbers for identifying files or objects.
 697     A FID is a 128-bit identifier that contains a unique 64-bit sequence
 698     number, a 32-bit object ID (OID), and a 32-bit version number. The sequence
 699     number is unique across all Lustre targets in a file system (OSTs and
 700     MDTs). This change enabled future support for multiple MDTs (introduced in
 701     Lustre software release 2.4) and ZFS (introduced in Lustre software release
 702     2.4).</para>
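    <para>As a rough illustration of the field widths described above, the
    following Python sketch packs a sequence number, OID, and version into a
    single 128-bit value and prints them in a bracketed hexadecimal
    <literal>[sequence:OID:version]</literal> form. The names
    <literal>Fid</literal>, <literal>pack_fid</literal>,
    <literal>unpack_fid</literal>, and <literal>format_fid</literal> are
    hypothetical, and the ordering of the fields within the 128 bits is an
    assumption made for the example, not a description of the on-disk or
    on-wire format.</para>
    <programlisting language="python"><![CDATA[from typing import NamedTuple


class Fid(NamedTuple):
    """Illustrative FID: field widths follow the text, layout is assumed."""
    seq: int  # 64-bit sequence number
    oid: int  # 32-bit object ID
    ver: int  # 32-bit version number


def pack_fid(fid: Fid) -> int:
    """Combine the three fields into one 128-bit integer (assumed ordering)."""
    if not (0 <= fid.seq < 2**64 and 0 <= fid.oid < 2**32
            and 0 <= fid.ver < 2**32):
        raise ValueError("FID field out of range")
    return (fid.seq << 64) | (fid.oid << 32) | fid.ver


def unpack_fid(value: int) -> Fid:
    """Split a 128-bit integer back into sequence, OID, and version."""
    return Fid(value >> 64, (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF)


def format_fid(fid: Fid) -> str:
    """Render the FID as [seq:oid:ver] in hexadecimal."""
    return f"[{fid.seq:#x}:{fid.oid:#x}:{fid.ver:#x}]"


fid = Fid(seq=0x200000400, oid=0x1, ver=0x0)  # example values only
assert unpack_fid(pack_fid(fid)) == fid
print(format_fid(fid))  # [0x200000400:0x1:0x0]
]]></programlisting>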
 703     <para>Also introduced in release 2.0 is a feature called
 704     <emphasis role="italic">FID-in-dirent</emphasis> (also known as
 705     <emphasis role="italic">dirdata</emphasis>) in which the FID is stored as
 706     part of the name of the file in the parent directory. This feature
 707     significantly improves performance for
 708     <literal>ls</literal> command executions by reducing disk I/O. The
 709     FID-in-dirent is generated at the time the file is created.</para>
 710     <note>
 711       <para>The FID-in-dirent feature is not compatible with the Lustre
 712       software release 1.8 format. Therefore, when an upgrade from Lustre
 713       software release 1.8 to a Lustre software release 2.x is performed, the
 714       FID-in-dirent feature is not automatically enabled. For upgrades from
 715       Lustre software release 1.8 to Lustre software releases 2.0 through 2.3,
 716       FID-in-dirent can be enabled manually but only takes effect for new
 717       files.</para>
 718       <para>For more information about upgrading from Lustre software release
 719       1.8 and enabling FID-in-dirent for existing files, see
 720       <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 721       linkend="upgradinglustre" />.</para>
 723     </note>
 724     <para condition="l24">The LFSCK file system consistency checking tool
 725     released with Lustre software release 2.4 provides functionality that
 726     enables FID-in-dirent for existing files. It includes the following
 727     functionality:
 728     <itemizedlist>
 729       <listitem>
 730         <para>Generates IGIF mode FIDs for existing files from a release
 731         1.8 file system.</para>
 732       </listitem>
 733       <listitem>
 734         <para>Verifies the FID-in-dirent for each file and regenerates the
 735         FID-in-dirent if it is invalid or missing.</para>
 736       </listitem>
 737       <listitem>
 738         <para>Verifies the linkEA entry for each file and regenerates the
 739         linkEA if it is invalid or missing. The
 740         <emphasis role="italic">linkEA</emphasis> consists of the file name and
 741         parent FID. It is stored as an extended attribute in the file
 742         itself. Thus, the linkEA can be used to reconstruct the full path name of
 743         a file.</para>
 744       </listitem>
 745     </itemizedlist></para>
 746     <para>Information about where file data is located on the OST(s) is stored
 747     as an extended attribute called layout EA in an MDT object identified by
 748     the FID for the file (see
 749     <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 750     linkend="Fig1.3_LayoutEAonMDT" />). If the file is a regular file (not a
 751     directory or symbolic link), the MDT object points to one or more OST
 752     objects on the OST(s) that contain the file data. If the MDT object points
 753     to one object, all the file data is stored in that object. If the MDT
 754     object points to more than one object, the file data is
 755     <emphasis role="italic">striped</emphasis> across the objects using RAID 0,
 756     and each object is stored on a different OST. (For more information about
 757     how striping is implemented in a Lustre file system, see
 758     <xref linkend="dbdoclet.50438250_89922" />.)</para>
 759     <figure xml:id="Fig1.3_LayoutEAonMDT">
 760       <title>Layout EA on MDT pointing to file data on OSTs</title>
 761       <mediaobject>
 762         <imageobject>
 763           <imagedata scalefit="1" width="80%"
 764           fileref="./figures/Metadata_File.png" />
 765         </imageobject>
 766         <textobject>
 767           <phrase>Layout EA on MDT pointing to file data on OSTs</phrase>
 768         </textobject>
 769       </mediaobject>
 770     </figure>
 771     <para>When a client wants to read from or write to a file, it first fetches
 772     the layout EA from the MDT object for the file. The client then uses this
 773     information to perform I/O on the file, directly interacting with the OSS
 774     nodes where the objects are stored.
 775     This process is illustrated in
 776     <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 777     linkend="Fig1.4_ClientReqstgData" />.</para>
 780     <figure xml:id="Fig1.4_ClientReqstgData">
 781       <title>Lustre client requesting file data</title>
 782       <mediaobject>
 783         <imageobject>
 784           <imagedata scalefit="1" width="75%"
 785           fileref="./figures/File_Write.png" />
 786         </imageobject>
 787         <textobject>
 788           <phrase>Lustre client requesting file data</phrase>
 789         </textobject>
 790       </mediaobject>
 791     </figure>
 792     <para>The available bandwidth and capacity of a Lustre file system are
 793     determined as follows:</para>
 794     <itemizedlist>
 795       <listitem>
 796         <para>The
 797         <emphasis>network bandwidth</emphasis> equals the aggregated bandwidth
 798         of the OSSs to the targets.</para>
 799       </listitem>
 800       <listitem>
 801         <para>The
 802         <emphasis>disk bandwidth</emphasis> equals the sum of the disk
 803         bandwidths of the storage targets (OSTs) up to the limit of the network
 804         bandwidth.</para>
 805       </listitem>
 806       <listitem>
 807         <para>The
 808         <emphasis>aggregate bandwidth</emphasis> equals the minimum of the disk
 809         bandwidth and the network bandwidth.</para>
 810       </listitem>
 811       <listitem>
 812         <para>The
 813         <emphasis>available file system space</emphasis> equals the sum of the
 814         available space of all the OSTs.</para>
 815       </listitem>
 816     </itemizedlist>
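    <para>The relationships in the list above can be sketched in a few lines
    of Python. This is only an illustration: the function names
    <literal>aggregate_bandwidth</literal> and
    <literal>available_space</literal> are hypothetical, and the figures in
    the example are made-up values in MB/sec, not measurements.</para>
    <programlisting language="python"><![CDATA[def aggregate_bandwidth(ost_disk_bw, network_bw):
    """Return min(sum of OST disk bandwidths, network bandwidth).

    ost_disk_bw: per-OST disk bandwidths, all in the same unit
    network_bw:  aggregated OSS network bandwidth, same unit
    """
    disk_bw = sum(ost_disk_bw)
    return min(disk_bw, network_bw)


def available_space(ost_free):
    """Available file system space is the sum of the space free on each OST."""
    return sum(ost_free)


# Hypothetical system: 8 OSTs at 1500 MB/sec each, 10000 MB/sec of network.
print(aggregate_bandwidth([1500] * 8, 10000))  # 10000 (network-limited)
# With only 4 such OSTs the disks become the limit instead.
print(aggregate_bandwidth([1500] * 4, 10000))  # 6000 (disk-limited)
]]></programlisting>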
 817     <section xml:id="dbdoclet.50438250_89922">
 818       <title>
 819       <indexterm>
 820         <primary>Lustre</primary>
 821         <secondary>striping</secondary>
 822       </indexterm>
 823       <indexterm>
 824         <primary>striping</primary>
 825         <secondary>overview</secondary>
 826       </indexterm>Lustre File System and Striping</title>
 827       <para>One of the main factors leading to the high performance of Lustre
 828       file systems is the ability to stripe data across multiple OSTs in a
 829       round-robin fashion. Users can optionally configure for each file the
 830       number of stripes, stripe size, and OSTs that are used.</para>
 831       <para>Striping can be used to improve performance when the aggregate
 832       bandwidth to a single file exceeds the bandwidth of a single OST. The
 833       ability to stripe is also useful when a single OST does not have enough
 834       free space to hold an entire file. For more information about benefits
 835       and drawbacks of file striping, see
 836       <xref linkend="dbdoclet.50438209_48033" />.</para>
 837       <para>Striping allows segments or 'chunks' of data in a file to be stored
 838       on different OSTs, as shown in
 839       <xref linkend="understandinglustre.fig.filestripe" />. In the Lustre file
 840       system, a RAID 0 pattern is used in which data is "striped" across a
 841       certain number of objects. The number of objects in a single file is
 842       called the
 843       <literal>stripe_count</literal>.</para>
 844       <para>Each object contains a chunk of data from the file. When the chunk
 845       of data being written to a particular object exceeds the
 846       <literal>stripe_size</literal>, the next chunk of data in the file is
 847       stored on the next object.</para>
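      <para>The round-robin RAID 0 placement described above can be expressed
      as simple integer arithmetic. The following Python sketch maps a byte
      offset in a file to the index of the object that holds it and the offset
      within that object. The function name <literal>locate</literal> is
      hypothetical, and the sketch assumes a plain RAID 0 layout with a fixed
      <literal>stripe_size</literal> and <literal>stripe_count</literal>; it is
      not taken from the Lustre source code.</para>
      <programlisting language="python"><![CDATA[def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (object index, offset inside that object)."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; size and count must be > 0")
    chunk = offset // stripe_size        # which stripe-sized chunk of the file
    obj_index = chunk % stripe_count     # chunks rotate across the objects
    stripe_row = chunk // stripe_count   # full rounds completed before it
    obj_offset = stripe_row * stripe_size + offset % stripe_size
    return obj_index, obj_offset


MIB = 1 << 20
# With a 1 MiB stripe_size and a stripe_count of 3, byte 4 MiB + 10 falls in
# chunk 4, which is the second chunk stored on object 1.
print(locate(4 * MIB + 10, MIB, 3))  # (1, 1048586)
]]></programlisting>
      <para>With a <literal>stripe_count</literal> of 1, the same arithmetic
      returns object 0 and the unchanged file offset, matching the case in
      which all file data is stored in a single object.</para>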
 848       <para>Default values for
 849       <literal>stripe_count</literal> and
 850       <literal>stripe_size</literal> are set for the file system. The default
 851       value for
 852       <literal>stripe_count</literal> is 1 stripe per file and the default value
 853       for
 854       <literal>stripe_size</literal> is 1 MB. The user may change these values on
 855       a per-directory or per-file basis. For more details, see
 856       <xref linkend="dbdoclet.50438209_78664" />.</para>
 857       <para>In
 858       <xref linkend="understandinglustre.fig.filestripe" />, the
 859       <literal>stripe_size</literal> for File C is larger than the
 860       <literal>stripe_size</literal> for File A, allowing more data to be stored
 861       in a single stripe for File C. The
 862       <literal>stripe_count</literal> for File A is 3, resulting in data striped
 863       across three objects, while the
 864       <literal>stripe_count</literal> for File B and File C is 1.</para>
 865       <para>No space is reserved on the OST for unwritten data. File A in
 866       <xref linkend="understandinglustre.fig.filestripe" /> is sparse and is missing chunk 6.</para>
 867       <figure>
 868         <title xml:id="understandinglustre.fig.filestripe">File striping on a
 869         Lustre file system</title>
 870         <mediaobject>
 871           <imageobject>
 872             <imagedata scalefit="1" width="100%"
 873             fileref="./figures/File_Striping.png" />
 874           </imageobject>
 875           <textobject>
 876             <phrase>File striping pattern across three OSTs for three different
 877             data files. The file is sparse and missing chunk 6.</phrase>
 878           </textobject>
 879         </mediaobject>
 880       </figure>
 881       <para>The maximum file size is not limited by the size of a single
 882       target. In a Lustre file system, files can be striped across multiple
 883       objects (up to 2000), and each object can be up to 16 TB in size with
 884       ldiskfs, or up to 256 PB with ZFS. This leads to a maximum file size of
 885       31.25 PB for ldiskfs or 8 EB with ZFS. Note that a Lustre file system can
 886       support files up to 2^63 bytes (8 EB), limited only by the space available
 887       on the OSTs.</para>
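      <para>The arithmetic behind these limits can be checked with a short
      Python sketch. It assumes binary units (1 TB = 2^40 bytes, 1 PB = 2^50
      bytes), which is the interpretation under which 2000 objects of 16 TB
      give 31.25 PB. The constant and function names are hypothetical.</para>
      <programlisting language="python"><![CDATA[TIB = 2**40  # bytes per TB, assuming binary units
PIB = 2**50  # bytes per PB, assuming binary units
EIB = 2**60  # bytes per EB, assuming binary units

MAX_STRIPES = 2000         # maximum objects per file, from the text
LUSTRE_FILE_LIMIT = 2**63  # maximum file size in bytes, from the text


def max_file_size(max_object_size):
    """Stripe limit times object limit, capped at the 2^63-byte file limit."""
    return min(MAX_STRIPES * max_object_size, LUSTRE_FILE_LIMIT)


print(max_file_size(16 * TIB) / PIB)   # 31.25 (ldiskfs, in PB)
print(max_file_size(256 * PIB) / EIB)  # 8.0 (ZFS, capped at 2^63 bytes, in EB)
]]></programlisting>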
 888       <note>
 889         <para>Versions of the Lustre software prior to Release 2.2 limited the
 890         maximum stripe count for a single file to 160 OSTs.</para>
 891       </note>
 892       <para>Although a single file can only be striped over 2000 objects,
 893       Lustre file systems can have thousands of OSTs. The I/O bandwidth to
 894       access a single file is the aggregated I/O bandwidth to the objects in a
 895       file, which can be as much as the bandwidth of up to 2000 servers. On
 896       systems with more than 2000 OSTs, clients can do I/O using multiple files
 897       to utilize the full file system bandwidth.</para>
 898       <para>For more information about striping, see
 899       <xref linkend="managingstripingfreespace" />.</para>
 900     </section>
 901   </section>
 902 </chapter>