+ </itemizedlist>
+ <para>The Lustre client software provides an interface between the Linux
+ virtual file system and the Lustre servers. The client software includes
+ a management client (MGC), a metadata client (MDC), and multiple object
+ storage clients (OSCs), one corresponding to each OST in the file
+ system.</para>
+ <para>A logical object volume (LOV) aggregates the OSCs to provide
+ transparent access across all the OSTs. Thus, a client with the Lustre
+ file system mounted sees a single, coherent, synchronized namespace.
+ Several clients can write to different parts of the same file
+ simultaneously, while, at the same time, other clients can read from the
+ file.</para>
+ <para>A logical metadata volume (LMV) aggregates the MDCs to provide
+ transparent access across all the MDTs in a similar manner as the LOV
+ does for file access. This allows the client to see the directory tree
+ on multiple MDTs as a single coherent namespace, and striped directories
+ are merged on the clients to form a single visible directory to users
+ and applications.
+ </para>
+ <para>
+ <xref linkend="understandinglustre.tab.storagerequire" /> provides the
+ requirements for attached storage for each Lustre file system component
+ and describes desirable characteristics of the hardware used.</para>
+ <table frame="all">
+ <title xml:id="understandinglustre.tab.storagerequire">
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>requirements</secondary>
+ </indexterm>Storage and hardware requirements for Lustre file system
+ components</title>
+ <tgroup cols="3">
+ <colspec colname="c1" colwidth="1*" />
+ <colspec colname="c2" colwidth="3*" />
+ <colspec colname="c3" colwidth="3*" />
+ <thead>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold" />
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Required attached storage</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>
+ <emphasis role="bold">Desirable hardware
+ characteristics</emphasis>
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">MDSs</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>1-2% of file system capacity</para>
+ </entry>
+ <entry>
+ <para>Adequate CPU power, plenty of memory, fast disk
+ storage.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">OSSs</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>1-128 TB per OST, 1-8 OSTs per OSS</para>
+ </entry>
+ <entry>
+ <para>Good bus bandwidth. Recommended that storage be balanced
+ evenly across OSSs and matched to network bandwidth.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para>
+ <emphasis role="bold">Clients</emphasis>
+ </para>
+ </entry>
+ <entry>
+ <para>No local storage needed</para>
+ </entry>
+ <entry>
+ <para>Low latency, high bandwidth network.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ <para>For additional hardware requirements and considerations, see
+ <xref linkend="settinguplustresystem" />.</para>
+ </section>
+ <section remap="h3">
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>LNET</secondary>
+ </indexterm>Lustre Networking (LNET)</title>
+ <para>Lustre Networking (LNET) is a custom networking API that provides
+ the communication infrastructure that handles metadata and file I/O data
+ for the Lustre file system servers and clients. For more information
+ about LNET, see
+ <xref linkend="understandinglustrenetworking" />.</para>
+ </section>
+ <section remap="h3">
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>cluster</secondary>
+ </indexterm>Lustre Cluster</title>
+ <para>At scale, a Lustre file system cluster can include hundreds of OSSs
+ and thousands of clients (see
+ <xref linkend="understandinglustre.fig.lustrescale" />). More than one
+ type of network can be used in a Lustre cluster. Shared storage between
+ OSSs enables failover capability. For more details about OSS failover,
+ see
+ <xref linkend="understandingfailover" />.</para>
+ <figure>
+ <title xml:id="understandinglustre.fig.lustrescale">
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>at scale</secondary>
+ </indexterm>Lustre cluster at scale</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata scalefit="1" width="100%"
+ fileref="./figures/Scaled_Cluster.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Lustre file system cluster at scale</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ </section>
+ </section>
+ <section xml:id="understandinglustre.storageio">
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>storage</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>I/O</secondary>
+ </indexterm>Lustre File System Storage and I/O</title>
+ <para>In Lustre software release 2.0, Lustre file identifiers (FIDs) were
+ introduced to replace UNIX inode numbers for identifying files or objects.
+ A FID is a 128-bit identifier that contains a unique 64-bit sequence
+ number, a 32-bit object ID (OID), and a 32-bit version number. The sequence
+ number is unique across all Lustre targets in a file system (OSTs and
+ MDTs). This change enabled future support for multiple MDTs (introduced in
+ Lustre software release 2.4) and ZFS (introduced in Lustre software release
+ 2.4).</para>
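+ <para>On a running file system, the mapping between a path name and its
+ FID can be inspected from any client with the
+ <literal>lfs path2fid</literal> and
+ <literal>lfs fid2path</literal> commands. For example (a sketch with
+ illustrative output, assuming a file system named
+ <literal>testfs</literal> mounted at
+ <literal>/mnt/testfs</literal>):</para>
+ <screen>client$ lfs path2fid /mnt/testfs/somefile
+ [0x200000400:0x1:0x0]
+ client$ lfs fid2path testfs [0x200000400:0x1:0x0]
+ /mnt/testfs/somefile</screen>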
+ <para>Also introduced in release 2.0 is an ldiskfs feature named
+ <emphasis role="italic">FID-in-dirent</emphasis> (also known as
+ <emphasis role="italic">dirdata</emphasis>) in which the FID is stored as
+ part of the name of the file in the parent directory. This feature
+ significantly improves performance for
+ <literal>ls</literal> command executions by reducing disk I/O. The
+ FID-in-dirent is generated at the time the file is created.</para>
+ <note>
+ <para>The FID-in-dirent feature is not backward compatible with the
+ release 1.8 ldiskfs disk format. Therefore, when an upgrade from
+ release 1.8 to release 2.x is performed, the FID-in-dirent feature is
+ not automatically enabled. For upgrades from release 1.8 to releases
+ 2.0 through 2.3, FID-in-dirent can be enabled manually but only takes
+ effect for new files.</para>
+ <para>For more information about upgrading from Lustre software release
+ 1.8 and enabling FID-in-dirent for existing files, see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="upgradinglustre" />.</para>
+ </note>
+ <para condition="l24">The LFSCK file system consistency checking tool,
+ released with Lustre software release 2.4, can enable FID-in-dirent for
+ existing files. It provides the following functionality:
+ <itemizedlist>
+ <listitem>
+ <para>Generates IGIF mode FIDs for existing files from a release 1.8
+ file system.</para>
+ </listitem>
+ <listitem>
+ <para>Verifies the FID-in-dirent for each file and regenerates the
+ FID-in-dirent if it is invalid or missing.</para>
+ </listitem>
+ <listitem>
+ <para>Verifies the linkEA entry for each file and regenerates the
+ linkEA if it is invalid or missing. The
+ <emphasis role="italic">linkEA</emphasis> consists of the file name
+ and parent FID. It is stored as an extended attribute in the file
+ itself. Thus, the linkEA can be used to reconstruct the full path
+ name of a file.</para>
+ </listitem>
+ </itemizedlist></para>
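+ <para condition="l24">For example, the namespace component of LFSCK,
+ which verifies and repairs FID-in-dirent and linkEA entries, can be
+ started and monitored on the MDS (a sketch, assuming a file system
+ named <literal>testfs</literal>):</para>
+ <screen>mds# lctl lfsck_start -M testfs-MDT0000 -t namespace
+ mds# lctl get_param mdd.testfs-MDT0000.lfsck_namespace</screen>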
+ <para>Information about where file data is located on the OST(s) is stored
+ as an extended attribute called layout EA in an MDT object identified by
+ the FID for the file (see
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="Fig1.3_LayoutEAonMDT" />). If the file is a regular file (not
+ a directory or symbolic link), the MDT object points to 1 to N OST
+ object(s) on the OST(s) that contain the file data. If the MDT file
+ points to one object, all the file data is stored in that object. If
+ the MDT file points to more than one object, the file data is
+ <emphasis role="italic">striped</emphasis> across the objects using
+ RAID 0, and each object is stored on a different OST. (For more
+ information about how striping is implemented in a Lustre file system,
+ see
+ <xref linkend="dbdoclet.50438250_89922" />.)</para>
+ <figure xml:id="Fig1.3_LayoutEAonMDT">
+ <title>Layout EA on MDT pointing to file data on OSTs</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata scalefit="1" width="80%"
+ fileref="./figures/Metadata_File.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Layout EA on MDT pointing to file data on OSTs</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>When a client wants to read from or write to a file, it first fetches
+ the layout EA from the MDT object for the file. The client then uses this
+ information to perform I/O on the file, directly interacting with the OSS
+ nodes where the objects are stored.
+ This process is illustrated in
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="Fig1.4_ClientReqstgData" />.</para>
+ <figure xml:id="Fig1.4_ClientReqstgData">
+ <title>Lustre client requesting file data</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata scalefit="1" width="75%"
+ fileref="./figures/File_Write.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Lustre client requesting file data</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>The available bandwidth of a Lustre file system is determined as
+ follows:</para>
+ <itemizedlist>
+ <listitem>
+ <para>The
+ <emphasis>network bandwidth</emphasis> equals the aggregated bandwidth
+ of the OSSs to the targets.</para>
+ </listitem>
+ <listitem>
+ <para>The
+ <emphasis>disk bandwidth</emphasis> equals the sum of the disk
+ bandwidths of the storage targets (OSTs) up to the limit of the
+ network bandwidth.</para>
+ </listitem>
+ <listitem>
+ <para>The
+ <emphasis>aggregate bandwidth</emphasis> equals the minimum of the
+ disk bandwidth and the network bandwidth.</para>
+ </listitem>
+ <listitem>
+ <para>The
+ <emphasis>available file system space</emphasis> equals the sum of the
+ available space of all the OSTs.</para>
+ </listitem>
+ </itemizedlist>
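+ <para>The available space and its distribution across the MDTs and OSTs
+ can be checked from any client with the
+ <literal>lfs df</literal> command. For example (a sketch with
+ illustrative, abbreviated output, assuming a file system named
+ <literal>testfs</literal> mounted at
+ <literal>/mnt/testfs</literal>):</para>
+ <screen>client$ lfs df -h /mnt/testfs
+ UUID                   bytes    Used  Available Use% Mounted on
+ testfs-MDT0000_UUID     2.0T   10.1G       1.9T   1% /mnt/testfs[MDT:0]
+ testfs-OST0000_UUID    16.0T    4.2T      11.8T  26% /mnt/testfs[OST:0]
+ testfs-OST0001_UUID    16.0T    3.9T      12.1T  24% /mnt/testfs[OST:1]</screen>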
+ <section xml:id="dbdoclet.50438250_89922">
+ <title>
+ <indexterm>
+ <primary>Lustre</primary>
+ <secondary>striping</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>striping</primary>
+ <secondary>overview</secondary>
+ </indexterm>Lustre File System and Striping</title>
+ <para>One of the main factors leading to the high performance of Lustre
+ file systems is the ability to stripe data across multiple OSTs in a
+ round-robin fashion. Users can optionally configure for each file the
+ number of stripes, stripe size, and OSTs that are used.</para>
+ <para>Striping can be used to improve performance when the aggregate
+ bandwidth to a single file exceeds the bandwidth of a single OST. The
+ ability to stripe is also useful when a single OST does not have enough
+ free space to hold an entire file. For more information about benefits
+ and drawbacks of file striping, see
+ <xref linkend="dbdoclet.50438209_48033" />.</para>
+ <para>Striping allows segments or 'chunks' of data in a file to be stored
+ on different OSTs, as shown in
+ <xref linkend="understandinglustre.fig.filestripe" />. In the Lustre file
+ system, a RAID 0 pattern is used in which data is "striped" across a
+ certain number of objects. The number of objects in a single file is
+ called the
+ <literal>stripe_count</literal>.</para>
+ <para>Each object contains a chunk of data from the file. When the chunk
+ of data being written to a particular object exceeds the
+ <literal>stripe_size</literal>, the next chunk of data in the file is
+ stored on the next object.</para>
+ <para>Default values for
+ <literal>stripe_count</literal> and
+ <literal>stripe_size</literal> are set for the file system. The default
+ value for
+ <literal>stripe_count</literal> is 1 stripe per file and the default
+ value for
+ <literal>stripe_size</literal> is 1 MB. The user may change these values
+ on a per-directory or per-file basis. For more details, see
+ <xref linkend="dbdoclet.50438209_78664" />.</para>
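+ <para>Striping parameters can be set and queried with the
+ <literal>lfs setstripe</literal> and
+ <literal>lfs getstripe</literal> commands. For example, to create a
+ directory in which new files are striped across four OSTs with a 4 MB
+ stripe size (a sketch, assuming the file system is mounted at
+ <literal>/mnt/testfs</literal>):</para>
+ <screen>client$ mkdir /mnt/testfs/striped_dir
+ client$ lfs setstripe -c 4 -S 4M /mnt/testfs/striped_dir
+ client$ lfs getstripe /mnt/testfs/striped_dir</screen>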
+ <para>As shown in
+ <xref linkend="understandinglustre.fig.filestripe" />, the
+ <literal>stripe_size</literal> for File C is larger than the
+ <literal>stripe_size</literal> for File A, allowing more data to be stored
+ in a single stripe for File C. The
+ <literal>stripe_count</literal> for File A is 3, resulting in data striped
+ across three objects, while the
+ <literal>stripe_count</literal> for File B and File C is 1.</para>
+ <para>No space is reserved on the OST for unwritten data. File A in
+ <xref linkend="understandinglustre.fig.filestripe" /> is a sparse file
+ that is missing chunk 6.</para>
+ <figure>
+ <title xml:id="understandinglustre.fig.filestripe">File striping on a
+ Lustre file system</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata scalefit="1" width="100%"
+ fileref="./figures/File_Striping.png" />
+ </imageobject>
+ <textobject>
+ <phrase>File striping pattern across three OSTs for three different
+ data files. The file is sparse and missing chunk 6.</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>The maximum file size is not limited by the size of a single
+ target. In a Lustre file system, files can be striped across multiple
+ objects (up to 2000), and each object can be up to 16 TB in size with
+ ldiskfs, or up to 256 PB with ZFS. This leads to a maximum file size of
+ 31.25 PB for ldiskfs or 8 EB with ZFS. Note that a Lustre file system
+ can support files up to 2^63 bytes (8 EB), limited only by the space
+ available on the OSTs.</para>
+ <note>
+ <para>Versions of the Lustre software prior to Release 2.2 limited the
+ maximum stripe count for a single file to 160 OSTs.</para>
+ </note>
+ <para>Although a single file can only be striped over 2000 objects,
+ Lustre file systems can have thousands of OSTs. The I/O bandwidth to
+ access a single file is the aggregated I/O bandwidth to the objects in
+ a file, which can be as much as the aggregate bandwidth of up to 2000
+ servers. On systems with more than 2000 OSTs, clients can do I/O using
+ multiple files to utilize the full file system bandwidth.</para>
+ <para>For more information about striping, see
+ <xref linkend="managingstripingfreespace" />.</para>