<?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="configuringfailover">
- <title xml:id="configuringfailover.title">Configuring Lustre Failover</title>
- <para>This chapter describes how to configure Lustre failover using the Heartbeat cluster infrastructure daemon. It includes:</para>
+ <title xml:id="configuringfailover.title">Configuring Failover in a Lustre File System</title>
+ <para>This chapter describes how to configure failover in a Lustre file system. It
+ includes:</para>
<itemizedlist>
<listitem>
- <para><xref linkend="dbdoclet.50438188_82389"/></para>
+ <para>
+ <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_82389"/></para>
</listitem>
<listitem>
- <para><xref linkend="dbdoclet.50438188_92688"/></para>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_92688"
+ /></para>
+ </listitem>
+ <listitem>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_tnq_kbr_xl"/></para>
</listitem>
</itemizedlist>
- <note>
- <para>Using Lustre Failover is optional
- </para>
- </note>
+ <para>For an overview of failover functionality in a Lustre file system, see <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="understandingfailover"/>.</para>
<section xml:id="dbdoclet.50438188_82389">
- <title><indexterm><primary>High availability</primary><see>failover</see></indexterm><indexterm><primary>failover</primary></indexterm>Creating a Failover Environment</title>
- <para>Lustre provides failover mechanisms only at the file system level. No failover functionality is provided for system-level components, such as node failure detection or power control, as would typically be provided in a complete failover solution. Additional tools are also needed to provide resource fencing, control and monitoring.</para>
+ <title><indexterm>
+ <primary>High availability</primary>
+ <see>failover</see>
+ </indexterm><indexterm>
+ <primary>failover</primary>
+ </indexterm>Setting Up a Failover Environment</title>
+ <para>The Lustre software provides failover mechanisms only at the layer of the Lustre file
+ system. No failover functionality is provided for system-level components such as failing
+ hardware or applications, or even for the entire failure of a node, as would typically be
+ provided in a complete failover solution. Failover functionality such as node monitoring,
+ failure detection, and resource fencing must be provided by external HA software, such as
+ PowerMan or the open source Corosync and Pacemaker packages provided by Linux operating system
+ vendors. Corosync provides support for detecting failures, and Pacemaker provides the actions
+ to take once a failure has been detected.</para>
<section remap="h3">
- <title><indexterm><primary>failover</primary><secondary>power management</secondary></indexterm>Power Management Software</title>
- <para>Lustre failover requires power control and management capability to verify that a failed node is shut down before I/O is directed to the failover node. This avoids double-mounting the two nodes, and the risk of unrecoverable data corruption. A variety of power management tools will work, but two packages that are commonly used with Lustre are STONITH and PowerMan.</para>
- <para>Shoot The Other Node In The HEAD (STONITH), is a set of power management tools provided with the Linux-HA package. STONITH has native support for many power control devices and is extensible. It uses expect scripts to automate control.</para>
- <para>PowerMan, available from the Lawrence Livermore National Laboratory (LLNL), is used to control remote power control (RPC) devices from a central location. PowerMan provides native support for several RPC varieties and expect-like configuration simplifies the addition of new devices.</para>
- <para>The latest versions of PowerMan are available at:</para>
- <para><link xl:href="http://sourceforge.net/projects/powerman">http://sourceforge.net/projects/powerman</link></para>
- <para>For more information about PowerMan, go to:</para>
- <para><link xl:href="https://computing.llnl.gov/linux/powerman.html">https://computing.llnl.gov/linux/powerman.html</link></para>
+ <title><indexterm>
+ <primary>failover</primary>
+ <secondary>power control device</secondary>
+ </indexterm>Selecting Power Equipment</title>
+ <para>Failover in a Lustre file system requires the use of a remote power control (RPC)
+ mechanism, which comes in different configurations. For example, Lustre server nodes may be
+ equipped with IPMI/BMC devices that allow remote power control. In the past, software or
+ even “sneakerware” has been used, but these are not recommended. For recommended devices,
+ refer to the list of supported RPC devices on the website for the PowerMan cluster power
+ management utility:</para>
+ <para><link xmlns:xlink="http://www.w3.org/1999/xlink"
+ xlink:href="http://code.google.com/p/powerman/wiki/SupportedDevs"
+ >http://code.google.com/p/powerman/wiki/SupportedDevs</link></para>
</section>
<section remap="h3">
- <title><indexterm><primary>failover</primary><secondary>power equipment</secondary></indexterm>Power Equipment</title>
- <para>Lustre failover also requires the use of RPC devices, which come in different configurations. Lustre server nodes may be equipped with some kind of service processor that allows remote power control. If a Lustre server node is not equipped with a service processor, then a multi-port, Ethernet-addressable RPC may be used as an alternative. For recommended products, refer to the list of supported RPC devices on the PowerMan website.</para>
- <para><link xl:href="https://computing.llnl.gov/linux/powerman.html">https://computing.llnl.gov/linux/powerman.html</link></para>
+ <title><indexterm>
+ <primary>failover</primary>
+ <secondary>power management software</secondary>
+ </indexterm>Selecting Power Management Software</title>
+ <para>Lustre failover requires RPC and management capability to verify that a failed node is
+ shut down before I/O is directed to the failover node. This avoids double-mounting the two
+ nodes and the risk of unrecoverable data corruption. A variety of power management tools
+ will work. Two packages that have been commonly used with the Lustre software are PowerMan
+        and Linux-HA (also known as STONITH).</para>
+ <para>The PowerMan cluster power management utility is used to control RPC devices from a
+ central location. PowerMan provides native support for several RPC varieties and Expect-like
+ configuration simplifies the addition of new devices. The latest versions of PowerMan are
+ available at: </para>
+ <para><link xmlns:xlink="http://www.w3.org/1999/xlink"
+ xlink:href="http://code.google.com/p/powerman/"
+ >http://code.google.com/p/powerman/</link></para>
+ <para>STONITH, or “Shoot The Other Node In The Head”, is a set of power management tools
+ provided with the Linux-HA package prior to Red Hat Enterprise Linux 6. Linux-HA has native
+ support for many power control devices, is extensible (uses Expect scripts to automate
+ control), and provides the software to detect and respond to failures. With Red Hat
+ Enterprise Linux 6, Linux-HA is being replaced in the open source community by the
+ combination of Corosync and Pacemaker. For Red Hat Enterprise Linux subscribers, cluster
+ management using CMAN is available from Red Hat.</para>
+ </section>
+ <section>
+ <title><indexterm>
+ <primary>failover</primary>
+ <secondary>high-availability (HA) software</secondary>
+ </indexterm>Selecting High-Availability (HA) Software</title>
+ <para>The Lustre file system must be set up with high-availability (HA) software to enable a
+ complete Lustre failover solution. Except for PowerMan, the HA software packages mentioned
+ above provide both power management and cluster management. For information about setting
+ up failover with Pacemaker, see:</para>
+ <itemizedlist>
+ <listitem>
+          <para>Pacemaker Project website: <link xmlns:xlink="http://www.w3.org/1999/xlink"
+              xlink:href="http://clusterlabs.org/"
+              >http://clusterlabs.org/</link></para>
+ </listitem>
+ <listitem>
+          <para>Article <emphasis role="italic">Using Pacemaker with a Lustre File
+              System</emphasis>: <link xmlns:xlink="http://www.w3.org/1999/xlink"
+              xlink:href="https://wiki.hpdd.intel.com/display/PUB/Using+Pacemaker+with+a+Lustre+File+System"
+              >https://wiki.hpdd.intel.com/display/PUB/Using+Pacemaker+with+a+Lustre+File+System</link></para>
+ </listitem>
+ </itemizedlist>
</section>
</section>
<section xml:id="dbdoclet.50438188_92688">
- <title><indexterm><primary>failover</primary><secondary>setup</secondary></indexterm>Setting up High-Availability (HA) Software with Lustre</title>
- <para>Lustre must be combined with high-availability (HA) software to enable a complete Lustre failover solution. Lustre can be used with several HA packages including:</para>
- <itemizedlist>
- <listitem>
- <para><emphasis>Red Hat<superscript>*</superscript>Cluster Manager</emphasis> - For more
- information about Red Hat Cluster Manager, see <link
- xlink:href="http://www.redhat.com/software/rha/cluster/manager/"
- xmlns:xlink="http://www.w3.org/1999/xlink"
- >http://www.redhat.com/software/rha/cluster/manager/</link>.</para>
- </listitem>
- <listitem>
- <para><emphasis>Pacemaker</emphasis> - For more information about Pacemaker, see <link
- xlink:href="http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html"
- xmlns:xlink="http://www.w3.org/1999/xlink"
- >http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html</link>.</para>
- </listitem>
- </itemizedlist>
+ <title><indexterm>
+ <primary>failover</primary>
+ <secondary>setup</secondary>
+ </indexterm>Preparing a Lustre File System for Failover</title>
+ <para>To prepare a Lustre file system to be configured and managed as an HA system by a
+      third-party HA application, each storage target (MGT, MDT, OST) must be associated with a
+ second node to create a failover pair. This configuration information is then communicated by
+ the MGS to a client when the client mounts the file system.</para>
+ <para>The per-target configuration is relayed to the MGS at mount time. Some rules related to
+ this are:<itemizedlist>
+ <listitem>
+ <para> When a target is <emphasis role="underline"><emphasis role="italic"
+ >initially</emphasis></emphasis> mounted, the MGS reads the configuration
+ information from the target (such as mgt vs. ost, failnode, fsname) to configure the
+            target into a Lustre file system. When the MGS reads the initial mount configuration,
+            the mounting node becomes that target's “primary” node.</para>
+ </listitem>
+ <listitem>
+ <para>When a target is <emphasis role="underline"><emphasis role="italic"
+ >subsequently</emphasis></emphasis> mounted, the MGS reads the current configuration
+            from the target and, as needed, reconfigures its database of target information
+            (the stored configuration can be inspected as shown in the example following this
+            list).</para>
+ </listitem>
+ </itemizedlist></para>
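+    <para>The configuration stored on a target can be inspected with the
+      <literal>tunefs.lustre</literal> utility. In the sketch below, the
+      <literal>--dryrun</literal> option ensures that the stored parameters are only printed
+      and nothing on the device is modified (the device name is
+      illustrative):<screen>oss0# tunefs.lustre --dryrun /dev/sdb</screen></para>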
+    <para>When the target is formatted using the <literal>mkfs.lustre</literal> command, the failover
+ service node(s) for the target are designated using the <literal>--servicenode</literal>
+ option. In the example below, an OST with index <literal>0</literal> in the file system
+ <literal>testfs</literal> is formatted with two service nodes designated to serve as a
+ failover
+      pair:<screen>mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o2ib \
+ --index=0 --servicenode=192.168.10.7@o2ib \
+ --servicenode=192.168.10.8@o2ib \
+ /dev/sdb</screen></para>
+    <para>More than two potential service nodes can be designated for a target, as shown in
+      the sketch below. The target can then be mounted on any of the designated service
+      nodes.</para>
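+    <para>For example, the command below (a sketch; the NIDs and device are illustrative)
+      designates three service nodes for an
+      OST:<screen>mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o2ib \
+    --index=1 --servicenode=192.168.10.7@o2ib \
+    --servicenode=192.168.10.8@o2ib \
+    --servicenode=192.168.10.9@o2ib \
+    /dev/sdc</screen></para>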
+ <para>When HA is configured on a storage target, the Lustre software enables multi-mount
+ protection (MMP) on that storage target. MMP prevents multiple nodes from simultaneously
+ mounting and thus corrupting the data on the target. For more about MMP, see <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="managingfailover"/>.</para>
+ <para>If the MGT has been formatted with multiple service nodes designated, this information
+ must be conveyed to the Lustre client in the mount command used to mount the file system. In
+ the example below, NIDs for two MGSs that have been designated as service nodes for the MGT
+ are specified in the mount command executed on the
+ client:<screen>mount -t lustre 10.10.120.1@tcp1:10.10.120.2@tcp1:/testfs /lustre/testfs</screen></para>
+ <para>When a client mounts the file system, the MGS provides configuration information to the
+ client for the MDT(s) and OST(s) in the file system along with the NIDs for all service nodes
+ associated with each target and the service node on which the target is mounted. Later, when
+ the client attempts to access data on a target, it will try the NID for each specified service
+ node until it connects to the target.</para>
+    <para>Prior to Lustre software release 2.0, the <literal>--failnode</literal> option to
+      <literal>mkfs.lustre</literal> was used to designate a failover service node for the
+      primary server of a target (a usage sketch follows the list below). When the
+      <literal>--failnode</literal> option is used, certain
+      restrictions apply:<itemizedlist>
+ <listitem>
+ <para>The target must be initially mounted on the primary service node, not the failover
+ node designated by the <literal>--failnode</literal> option.</para>
+ </listitem>
+ <listitem>
+          <para>If the <literal>tunefs.lustre --writeconf</literal> option is used to erase and
+ regenerate the configuration log for the file system, a target cannot be initially
+ mounted on a designated failnode.</para>
+ </listitem>
+ <listitem>
+ <para>If a <literal>--failnode</literal> option is added to a target to designate a
+ failover server for the target, the target must be re-mounted on the primary node before
+            the <literal>--failnode</literal> option takes effect.</para>
+ </listitem>
+ </itemizedlist></para>
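+    <para>For example, the command below (a sketch; the NIDs and device are illustrative)
+      formats an OST with the <literal>--failnode</literal> option designating a failover
+      partner for the primary server on which the target is first
+      mounted:<screen>oss0# mkfs.lustre --fsname=testfs --mgsnode=192.168.10.1@tcp0 \
+    --failnode=192.168.10.8@tcp0 --ost --index=1 /dev/sdc</screen></para>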
+ </section>
+ <section xml:id="section_tnq_kbr_xl">
+ <title>Administering Failover in a Lustre File System</title>
+ <para>For additional information about administering failover features in a Lustre file system, see:<itemizedlist>
+ <listitem>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438194_57420"
+ /></para>
+ </listitem>
+ <listitem>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438194_41817"
+ /></para>
+ </listitem>
+ <listitem>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438199_62333"
+ /></para>
+ </listitem>
+ <listitem>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438219_75432"
+ /></para>
+ </listitem>
+ </itemizedlist></para>
</section>
</chapter>
<section xml:id="dbdoclet.50438199_62333">
<title><indexterm><primary>maintenance</primary><secondary>changing failover node address</secondary></indexterm>
Changing the Address of a Failover Node</title>
- <para>To change the address of a failover node (e.g, to use node X instead of node Y), run this command on the OSS/OST partition:
- <screen>oss# tunefs.lustre --erase-params --failnode=<replaceable>NID</replaceable> <replaceable>/dev/ost_device</replaceable></screen>
- or
- <screen>oss# tunefs.lustre --erase-params --servicenode=<replaceable>NID</replaceable> <replaceable>/dev/ost_device</replaceable></screen>
- </para>
+    <para>To change the address of a failover node (e.g., to use node X instead of node Y), run
+ this command on the OSS/OST partition (depending on which option was used to originally
+ identify the NID):
+ <screen>oss# tunefs.lustre --erase-params --servicenode=<replaceable>NID</replaceable> <replaceable>/dev/ost_device</replaceable></screen>
+ or
+ <screen>oss# tunefs.lustre --erase-params --failnode=<replaceable>NID</replaceable> <replaceable>/dev/ost_device</replaceable></screen>
+ For more information about the <literal>--servicenode</literal> and
+ <literal>--failnode</literal> options, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="configuringfailover"/>.</para>
</section>
<section xml:id="dbdoclet.50438199_62545">
<title><indexterm><primary>maintenance</primary><secondary>separate a combined MGS/MDT</secondary></indexterm>
</section>
<section xml:id="dbdoclet.50438194_57420">
<title><indexterm><primary>operations</primary><secondary>failover</secondary></indexterm>Specifying Failout/Failover Mode for OSTs</title>
- <para>Lustre uses two modes, failout and failover, to handle an OST that has become unreachable because it fails, is taken off the network, is unmounted, etc.</para>
+ <para>In a Lustre file system, an OST that has become unreachable because it fails, is taken off
+ the network, or is unmounted can be handled in one of two ways:</para>
<itemizedlist>
<listitem>
- <para> In <emphasis>failout</emphasis> mode, Lustre clients immediately receive errors (EIOs) after a timeout, instead of waiting for the OST to recover.</para>
+ <para> In <literal>failout</literal> mode, Lustre clients immediately receive errors (EIOs)
+ after a timeout, instead of waiting for the OST to recover.</para>
</listitem>
<listitem>
- <para> In <emphasis>failover</emphasis> mode, Lustre clients wait for the OST to recover.</para>
+ <para> In <literal>failover</literal> mode, Lustre clients wait for the OST to
+ recover.</para>
</listitem>
</itemizedlist>
- <para>By default, the Lustre file system uses failover mode for OSTs. To specify failout mode instead, use the <literal>--param="failover.mode=failout"</literal> option:</para>
- <screen>oss# mkfs.lustre --fsname=<replaceable>fsname</replaceable> --mgsnode=<replaceable>mgs_NID</replaceable> --param=failover.mode=failout --ost --index=<replaceable>ost_index</replaceable> <replaceable>/dev/ost_block_device</replaceable></screen>
- <para>In this example, failout mode is specified for the OSTs on MGS <literal>mds0</literal>, file system <literal>testfs</literal>.</para>
- <screen>oss# mkfs.lustre --fsname=testfs --mgsnode=mds0 --param=failover.mode=failout --ost --index=3 /dev/sdb </screen>
+ <para>By default, the Lustre file system uses <literal>failover</literal> mode for OSTs. To
+ specify <literal>failout</literal> mode instead, use the
+ <literal>--param="failover.mode=failout"</literal> option as shown below (entered
+ on one line):</para>
+ <screen>oss# mkfs.lustre --fsname=<replaceable>fsname</replaceable> --mgsnode=<replaceable>mgs_NID</replaceable> --param=failover.mode=failout
+ --ost --index=<replaceable>ost_index</replaceable> <replaceable>/dev/ost_block_device</replaceable></screen>
+ <para>In the example below, <literal>failout</literal> mode is specified for the OSTs on the MGS
+ <literal>mds0</literal> in the file system <literal>testfs</literal> (entered on one
+ line).</para>
+ <screen>oss# mkfs.lustre --fsname=testfs --mgsnode=mds0 --param=failover.mode=failout
+ --ost --index=3 /dev/sdb </screen>
<caution>
- <para>Before running this command, unmount all OSTs that will be affected by the change in the failover/failout mode.</para>
+ <para>Before running this command, unmount all OSTs that will be affected by a change in
+ <literal>failover</literal> / <literal>failout</literal> mode.</para>
</caution>
<note>
- <para>After initial file system configuration, use the tunefs.lustre utility to change the failover/failout mode. For example, to set the failout mode, run:</para>
+ <para>After initial file system configuration, use the <literal>tunefs.lustre</literal>
+ utility to change the mode. For example, to set the <literal>failout</literal> mode,
+ run:</para>
<para><screen>$ tunefs.lustre --param failover.mode=failout <replaceable>/dev/ost_device</replaceable></screen></para>
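+      <para>Similarly, the default <literal>failover</literal> mode can be restored by setting
+        the parameter back:</para>
+      <para><screen>$ tunefs.lustre --param failover.mode=failover <replaceable>/dev/ost_device</replaceable></screen></para>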
</note>
</section>
</section>
<section xml:id="dbdoclet.50438194_41817">
<title><indexterm><primary>operations</primary><secondary>failover</secondary></indexterm>Specifying NIDs and Failover</title>
- <para>If a node has multiple network interfaces, it may have multiple NIDs. When a node is specified, all of its NIDs must be listed, delimited by commas (<literal>,</literal>) so other nodes can choose the NID that is appropriate for their network interfaces. When failover nodes are specified, they are delimited by a colon (<literal>:</literal>) or by repeating a keyword (<literal>--mgsnode=</literal> or <literal>--failnode=</literal> or <literal>--servicenode=</literal>). To obtain all NIDs from a node (while LNET is running), run:</para>
+ <para>If a node has multiple network interfaces, it may have multiple NIDs, which must all be
+ identified so other nodes can choose the NID that is appropriate for their network interfaces.
+ Typically, NIDs are specified in a list delimited by commas (<literal>,</literal>). However,
+ when failover nodes are specified, the NIDs are delimited by a colon (<literal>:</literal>) or
+ by repeating a keyword such as <literal>--mgsnode=</literal> or
+      <literal>--servicenode=</literal>.</para>
+ <para>To display the NIDs of all servers in networks configured to work with the Lustre file
+ system, run (while LNET is running):</para>
<screen>lctl list_nids</screen>
- <para>This displays the server's NIDs (networks configured to work with Lustre).</para>
- <para>This example has a combined MGS/MDT failover pair on mds0 and mds1, and a OST failover pair on oss0 and oss1. There are corresponding Elan addresses on mds0 and mds1.</para>
- <screen>mds0# mkfs.lustre --fsname=testfs --mdt --mgs --failnode=mds1,2@elan /dev/sda1
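+  <para>For example, on a server with one Ethernet interface and one InfiniBand interface,
+    the output might look like the following (the addresses are illustrative):</para>
+  <screen>192.168.10.1@tcp
+10.10.10.1@o2ib</screen>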
+ <para>In the example below, <literal>mds0</literal> and <literal>mds1</literal> are configured
+ as a combined MGS/MDT failover pair and <literal>oss0</literal> and <literal>oss1</literal>
+ are configured as an OST failover pair. The Ethernet address for <literal>mds0</literal> is
+ 192.168.10.1, and for <literal>mds1</literal> is 192.168.10.2. The Ethernet addresses for
+ <literal>oss0</literal> and <literal>oss1</literal> are 192.168.10.20 and 192.168.10.21
+ respectively.</para>
+ <screen>mds0# mkfs.lustre --fsname=testfs --mdt --mgs \
+ --servicenode=192.168.10.2@tcp0 \
+    --servicenode=192.168.10.1@tcp0 /dev/sda1
mds0# mount -t lustre /dev/sda1 /mnt/test/mdt
-oss0# mkfs.lustre --fsname=testfs --failnode=oss1 --ost --index=0 \
- --mgsnode=mds0,1@elan --mgsnode=mds1,2@elan /dev/sdb
+oss0# mkfs.lustre --fsname=testfs --servicenode=192.168.10.20@tcp0 \
+    --servicenode=192.168.10.21@tcp0 --ost --index=0 \
+ --mgsnode=192.168.10.1@tcp0 --mgsnode=192.168.10.2@tcp0 \
+ /dev/sdb
oss0# mount -t lustre /dev/sdb /mnt/test/ost0
-client# mount -t lustre mds0,1@elan:mds1,2@elan:/testfs /mnt/testfs
+client# mount -t lustre 192.168.10.1@tcp0:192.168.10.2@tcp0:/testfs \
+ /mnt/testfs
mds0# umount /mnt/mdt
mds1# mount -t lustre /dev/sda1 /mnt/test/mdt
mds1# cat /proc/fs/lustre/mds/testfs-MDT0000/recovery_status</screen>
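+  <para>The <literal>recovery_status</literal> file in the last step shows the progress of
+    recovery on the newly mounted MDT; once the previously connected clients have reconnected
+    and replayed their transactions, it reports <literal>status: COMPLETE</literal>.</para>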
- <para>Where multiple NIDs are specified, comma-separation (for example, <literal>mds1,2@elan</literal>) means that the two NIDs refer to the same host, and that Lustre needs to choose the <emphasis>best</emphasis> one for communication. Colon-separation (for example, <literal>mds0:mds1</literal>) means that the two NIDs refer to two different hosts, and should be treated as failover locations (Lustre tries the first one, and if that fails, it tries the second one.)</para>
- <para>Two options exist to specify failover nodes. <literal>--failnode</literal> and <literal>--servicenode</literal>. <literal>--failnode</literal> specifies the NIDs of failover nodes. <literal>--servicenode</literal> specifies all service NIDs, including those of the primary node and of failover nodes. Option <literal>--servicenode</literal> makes the MDT or OST treat all its service nodes equally. The first service node to load the target device becomes the primary service node. Other node NIDs will become failover locations for the target device.</para>
- <note>
- <para>If you have an MGS or MDT configured for failover, perform these steps:</para>
- <orderedlist>
- <listitem>
- <para>On the oss0 node, list the NIDs of all MGS nodes at <literal>mkfs</literal> time.</para>
- <screen>oss0# mkfs.lustre --fsname sunfs --mgsnode=10.0.0.1 \
- --mgsnode=10.0.0.2 --ost --index=0 /dev/sdb</screen>
- </listitem>
- <listitem>
- <para>On the client, mount the file system.</para>
- <para><screen>client# mount -t lustre 10.0.0.1:10.0.0.2:/sunfs /cfs/client/</screen></para>
- </listitem>
- </orderedlist>
- </note>
+ <para>Where multiple NIDs are specified separated by commas (for example,
+ <literal>10.67.73.200@tcp,192.168.10.1@tcp</literal>), the two NIDs refer to the same host,
+ and the Lustre software chooses the <emphasis>best</emphasis> one for communication. When a
+ pair of NIDs is separated by a colon (for example,
+ <literal>10.67.73.200@tcp:10.67.73.201@tcp</literal>), the two NIDs refer to two different
+ hosts and are treated as a failover pair (the Lustre software tries the first one, and if that
+ fails, it tries the second one.)</para>
+ <para>Two options to <literal>mkfs.lustre</literal> can be used to specify failover nodes.
+ Introduced in Lustre software release 2.0, the <literal>--servicenode</literal> option is used
+ to specify all service NIDs, including those for primary nodes and failover nodes. When the
+      <literal>--servicenode</literal> option is used, the first service node to load the target
+ device becomes the primary service node, while nodes corresponding to the other specified NIDs
+ become failover locations for the target device. An older option,
+      <literal>--failnode</literal>, specifies just the NIDs of failover nodes. For more
+ information about the <literal>--servicenode</literal> and <literal>--failnode</literal>
+ options, see <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="configuringfailover"
+ />.</para>
</section>
<section xml:id="dbdoclet.50438194_70905">
<title><indexterm><primary>operations</primary><secondary>erasing a file system</secondary></indexterm>Erasing a File System</title>
<?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="managingfailover">
- <title xml:id="managingfailover.title">Managing Failover</title>
- <para>This chapter describes failover in a Lustre system and includes the following sections:</para>
+ <title xml:id="managingfailover.title">Lustre File System Failover and Multiple-Mount Protection</title>
+ <para>This chapter describes the multiple-mount protection (MMP) feature, which protects the file
+ system from being mounted simultaneously to more than one node. It includes the following
+ sections:</para>
<itemizedlist>
<listitem>
<para><xref linkend="dbdoclet.50438213_13563"/></para>
</listitem>
+ <listitem>
+ <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_etn_4zf_tl"/></para>
+ </listitem>
</itemizedlist>
<note>
- <para>For information about high availability(HA) management software, see the documentation for:<itemizedlist>
- <listitem>
- <para>Red Hat Cluster Manager at <link
- xlink:href="http://www.redhat.com/software/rha/cluster/manager/"
- xmlns:xlink="http://www.w3.org/1999/xlink"
- >http://www.redhat.com/software/rha/cluster/manager/</link></para>
- </listitem>
- <listitem>
- <para>Pacemaker at <link
- xlink:href="http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html"
- xmlns:xlink="http://www.w3.org/1999/xlink"
- >http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html</link></para>
- </listitem>
- </itemizedlist></para>
+ <para>For information about configuring a Lustre file system for failover, see <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="configuringfailover"/></para>
</note>
<section xml:id="dbdoclet.50438213_13563">
<title>
- <indexterm><primary>multiple-mount protection</primary><see>failover</see></indexterm>
- <indexterm><primary>failover</primary></indexterm>
- Lustre Failover and Multiple-Mount Protection</title>
- <para>The failover functionality in Lustre is implemented with the multiple-mount protection (MMP) feature, which protects the file system from being mounted simultaneously to more than one node. This feature is important in a shared storage environment (for example, when a failover pair of OSTs share a partition).</para>
- <para>Lustre's backend file system, <literal>ldiskfs</literal>, supports the MMP mechanism. A block in the file system is updated by a <literal>kmmpd</literal> daemon at one second intervals, and a sequence number is written in this block. If the file system is cleanly unmounted, then a special "clean" sequence is written to this block. When mounting the file system, <literal>ldiskfs</literal> checks if the MMP block has a clean sequence or not.</para>
+ <indexterm>
+ <primary>multiple-mount protection</primary>
+ </indexterm> Overview of Multiple-Mount Protection</title>
+ <para>The multiple-mount protection (MMP) feature protects the Lustre file system from being
+ mounted simultaneously to more than one node. This feature is important in a shared storage
+ environment (for example, when a failover pair of OSSs share a LUN).</para>
+ <para>The backend file system, <literal>ldiskfs</literal>, supports the MMP mechanism. A block
+ in the file system is updated by a <literal>kmmpd</literal> daemon at one second intervals,
+ and a sequence number is written in this block. If the file system is cleanly unmounted, then
+ a special "clean" sequence is written to this block. When mounting the file system,
+ <literal>ldiskfs</literal> checks if the MMP block has a clean sequence or not.</para>
<para>Even if the MMP block has a clean sequence, <literal>ldiskfs</literal> waits for some interval to guard against the following situations:</para>
<itemizedlist>
<listitem>
<note>
<para>The MMP feature is only supported on Linux kernel versions newer than 2.6.9.</para>
</note>
- <section remap="h3">
- <title><indexterm><primary>failover</primary><secondary>multiple-mount protection</secondary></indexterm>Working with Multiple-Mount Protection</title>
- <para>On a new Lustre file system, MMP is automatically enabled by mkfs.lustre at format time if failover is being used and the kernel and <literal>e2fsprogs</literal> version support it. On an existing file system, a Lustre administrator can manually enable MMP when the file system is unmounted.</para>
- <para>Use the following commands to determine whether MMP is running in Lustre and to enable or disable the MMP feature.</para>
- <para>To determine if MMP is enabled, run:</para>
- <screen>dumpe2fs -h <replaceable>/dev/block_device</replaceable> | grep mmp</screen>
- <para>Here is a sample command:</para>
- <screen>dumpe2fs -h /dev/sdc | grep mmp
+ </section>
+ <section xml:id="section_etn_4zf_tl">
+ <title>Working with Multiple-Mount Protection</title>
+ <para>On a new Lustre file system, MMP is automatically enabled by
+ <literal>mkfs.lustre</literal> at format time if failover is being used and the kernel and
+ <literal>e2fsprogs</literal> version support it. On an existing file system, a Lustre file
+ system administrator can manually enable MMP when the file system is unmounted.</para>
+ <para>Use the following commands to determine whether MMP is running in the Lustre file system
+ and to enable or disable the MMP feature.</para>
+ <para>To determine if MMP is enabled, run:</para>
+ <screen>dumpe2fs -h <replaceable>/dev/block_device</replaceable> | grep mmp</screen>
+ <para>Here is a sample command:</para>
+ <screen>dumpe2fs -h /dev/sdc | grep mmp
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype extent mmp sparse_super large_file uninit_bg</screen>
- <para>To manually disable MMP, run:</para>
- <screen>tune2fs -O ^mmp <replaceable>/dev/block_device</replaceable></screen>
- <para>To manually enable MMP, run:</para>
- <screen>tune2fs -O mmp <replaceable>/dev/block_device</replaceable></screen>
- <para>When MMP is enabled, if <literal>ldiskfs</literal> detects multiple mount attempts after the file system is mounted, it blocks these later mount attempts and reports the time when the MMP block was last updated, the node name, and the device name of the node where the file system is currently mounted.</para>
- </section>
+ <para>To manually disable MMP, run:</para>
+ <screen>tune2fs -O ^mmp <replaceable>/dev/block_device</replaceable></screen>
+ <para>To manually enable MMP, run:</para>
+ <screen>tune2fs -O mmp <replaceable>/dev/block_device</replaceable></screen>
+ <para>When MMP is enabled, if <literal>ldiskfs</literal> detects multiple mount attempts after
+ the file system is mounted, it blocks these later mount attempts and reports the time when the
+ MMP block was last updated, the node name, and the device name of the node where the file
+ system is currently mounted.</para>
</section>
</chapter>
<section xml:id="dbdoclet.50438219_75432">
<title><indexterm><primary>mkfs.lustre</primary></indexterm>
mkfs.lustre</title>
- <para>The mkfs.lustre utility formats a disk for a Lustre service.</para>
+ <para>The <literal>mkfs.lustre</literal> utility formats a disk for a Lustre service.</para>
<section remap="h5">
<title>Synopsis</title>
<screen>mkfs.lustre <replaceable>target_type</replaceable> [options] <replaceable>device</replaceable></screen>
<para> <literal>--ost</literal></para>
</entry>
<entry>
- <para> Object Storage Target (OST)</para>
+ <para> Object storage target (OST)</para>
</entry>
</row>
<row>
<para> <literal>--mdt</literal></para>
</entry>
<entry>
- <para> Metadata Storage Target (MDT)</para>
+ <para> Metadata storage target (MDT)</para>
</entry>
</row>
<row>
<para> <literal>--mgs</literal></para>
</entry>
<entry>
- <para> Configuration Management Service (MGS), one per site. This service can be combined with one <literal>--mdt</literal> service by specifying both types.</para>
+ <para> Configuration management service (MGS), one per site. This service can be
+ combined with one <literal>--mdt</literal> service by specifying both
+ types.</para>
</entry>
</row>
</tbody>
</section>
<section remap="h5">
<title>Description</title>
- <para>mkfs.lustre is used to format a disk device for use as part of a Lustre file system. After formatting, a disk can be mounted to start the Lustre service defined by this command.</para>
- <para>When the file system is created, parameters can simply be added as a --param option to the mkfs.lustre command. See <xref linkend="dbdoclet.50438194_17237"/>.</para>
+ <para><literal>mkfs.lustre</literal> is used to format a disk device for use as part of a
+ Lustre file system. After formatting, a disk can be mounted to start the Lustre service
+ defined by this command.</para>
+ <para>When the file system is created, parameters can simply be added as a
+ <literal>--param</literal> option to the <literal>mkfs.lustre</literal> command. See <xref
+ linkend="dbdoclet.50438194_17237"/>.</para>
<informaltable frame="all">
<tgroup cols="3">
- <colspec colname="c1" colwidth="33*"/>
- <colspec colname="c2" colwidth="33*"/>
- <colspec colname="c3" colwidth="33*"/>
+ <colspec colname="c1" colwidth="1*"/>
+ <colspec colname="c2" colwidth="1*"/>
+ <colspec colname="c3" colwidth="3*"/>
<thead>
<row>
<entry nameend="c2" namest="c1">
<para> <literal>--comment=<replaceable>comment</replaceable></literal></para>
</entry>
<entry>
- <para> Sets a user comment about this disk, ignored by Lustre.</para>
+ <para> Sets a user comment about this disk, ignored by the Lustre software.</para>
</entry>
</row>
<row>
                <para> <literal>--device-size=<replaceable>#</replaceable>KB</literal></para>
</entry>
<entry>
- <para> Sets the device size for loop devices.</para>
+ <para>Sets the device size for loop devices.</para>
</entry>
</row>
<row>
<para> <literal>--dryrun</literal></para>
</entry>
<entry>
- <para> Only prints what would be done; it does not affect the disk.</para>
+ <para>Only prints what would be done; it does not affect the disk.</para>
</entry>
</row>
<row>
- <entry nameend="c2" namest="c1">
- <para> <literal>--failnode=<replaceable>nid,...</replaceable></literal></para>
- </entry>
- <entry>
- <para> Sets the NID(s) of a failover partner. This option can be repeated as needed.</para>
- <warning><para>This cannot be used with <literal>--servicenode</literal>.</para></warning>
- </entry>
+ <entry nameend="c2" namest="c1"
+ ><literal>--servicenode=<replaceable>nid,...</replaceable></literal></entry>
+ <entry>Sets the NID(s) of all service nodes, including primary and failover partner
+                  service nodes. The <literal>--servicenode</literal> option cannot be used with the
+                  <literal>--failnode</literal> option. See <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_92688"/> for
+ more details.</entry>
</row>
<row>
<entry nameend="c2" namest="c1">
- <para> <literal>--servicenode=<replaceable>nid,...</replaceable></literal></para>
+ <para> <literal>--failnode=<replaceable>nid,...</replaceable></literal></para>
</entry>
<entry>
- <para>Sets the NID(s) of all service node, including failover partner as well as
- primary node service nids. This option can be repeated as needed.</para>
- <warning><para>This cannot be used with <literal>--failnode</literal>.</para></warning>
+ <para>Sets the NID(s) of a failover service node for a primary server for a target.
+                    The <literal>--failnode</literal> option cannot be used with the
+                    <literal>--servicenode</literal> option. See <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_92688"/>
+ for more details.<note>
+ <para>When the <literal>--failnode</literal> option is used, certain
+ restrictions apply (see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="dbdoclet.50438188_92688"/>).</para>
+ </note></para>
</entry>
</row>
<row>
<para> <literal>--fsname=<replaceable>filesystem_name</replaceable></literal></para>
</entry>
<entry>
- <para>The name of the Lustre file system to which this node belongs. The file system
- name is limited to 8 characters. The default file system name is
- <literal>lustre</literal>.</para>
+ <para> The Lustre file system of which this service/node will be a part. The default
+ file system name is <literal>lustre</literal>.</para>
+ <note>
+ <para>The file system name is limited to 8 characters.</para>
+ </note>
</entry>
</row>
<row>
</entry>
<entry>
<para> Sets the mount options used when the backing file system is mounted.</para>
- <warning><para>Unlike earlier versions of mkfs.lustre, this version completely replaces the default mount options with those specified on the command line, and issues a warning on stderr if any default mount options are omitted.</para></warning>
+ <warning><para>Unlike earlier versions of <literal>mkfs.lustre</literal>, this version completely replaces
+ the default mount options with those specified on the command line, and issues a
+ warning on stderr if any default mount options are omitted.</para></warning>
<para>The defaults for ldiskfs are:</para>
<para>OST: <literal>errors=remount-ro</literal>;</para>
<para>MGS/MDT: <literal>errors=remount-ro,iopen_nopriv,user_xattr</literal></para>
- <para>Do not alter the default mount options unless you know what you are doing.</para>
+ <para>Use care when altering the default mount options.</para>
</entry>
</row>
<row>
<title>Examples</title>
<para>Creates a combined MGS and MDT for file system <literal>testfs</literal> on, e.g., node <literal>cfs21</literal>:</para>
<screen>mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda1</screen>
- <para>Creates an OST for file system <literal>testfs</literal> on any node (using the above MGS):</para>
+ <para>Creates an OST for file system <literal>testfs</literal> on any node (using the above
+ MGS):</para>
<screen>mkfs.lustre --fsname=testfs --mgsnode=cfs21@tcp0 --ost --index=0 /dev/sdb</screen>
<para>Creates a standalone MGS on, e.g., node <literal>cfs22</literal>:</para>
<screen>mkfs.lustre --mgs /dev/sda1</screen>
</row>
<row>
<entry>
- <para> <literal>--failnode=<replaceable>nid,...</replaceable></literal></para>
- </entry>
- <entry>
- <para> Sets the NID(s) of a failover partner. This option can be repeated as needed.</para>
- <warning><para>Cannot be used with <literal>--servicenode</literal>.</para></warning>
- </entry>
+ <literal>--servicenode=<replaceable>nid,...</replaceable></literal></entry>
+ <entry>Sets the NID(s) of all service nodes, including primary and failover partner
+                  service nodes. The <literal>--servicenode</literal> option cannot be used with the
+                  <literal>--failnode</literal> option. See <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_92688"/> for
+ more details.</entry>
</row>
<row>
<entry>
- <para> <literal>--servicenode=<replaceable>nid,...</replaceable></literal></para>
+ <para> <literal>--failnode=<replaceable>nid,...</replaceable></literal></para>
</entry>
<entry>
- <para> Sets the NID(s) of all service node, including failover partner as well as local service nids. This option can be repeated as needed.</para>
- <warning><para>: Cannot be used with <literal>--failnode</literal>.</para></warning>
+ <para>Sets the NID(s) of a failover service node for a primary server for a target.
+                    The <literal>--failnode</literal> option cannot be used with the
+                    <literal>--servicenode</literal> option. See <xref
+ xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_92688"/>
+ for more details.<note>
+ <para>When the <literal>--failnode</literal> option is used, certain
+ restrictions apply (see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+ linkend="dbdoclet.50438188_92688"/>).</para>
+ </note></para>
</entry>
</row>
<row>
<section xml:id="dbdoclet.50540653_59957">
<title><indexterm><primary>failover</primary></indexterm>
What is Failover?</title>
- <para>A computer system is ''highly available'' when the services it provides are available with minimal downtime. In a highly-available system, if a failure condition occurs, such as the loss of a server or a network or software fault, the system's services continue without interruption. Generally, we measure availability by the percentage of time the system is required to be available.</para>
- <para>Availability is accomplished by replicating hardware and/or software so that when a primary server fails or is unavailable, a standby server can be switched into its place to run applications and associated resources. This process, called <emphasis role="italic">failover</emphasis>, should be automatic and, in most cases, completely application-transparent.</para>
- <para>A failover hardware setup requires a pair of servers with a shared resource (typically a physical storage device, which may be based on SAN, NAS, hardware RAID, SCSI or FC technology). The method of sharing storage should be essentially transparent at the device level; the same physical logical unit number (LUN) should be visible from both servers. To ensure high availability at the physical storage level, we encourage the use of RAID arrays to protect against drive-level failures.</para>
+ <para>In a high-availability (HA) system, unscheduled downtime is minimized by using redundant
+ hardware and software components and software components that automate recovery when a failure
+ occurs. If a failure condition occurs, such as the loss of a server or storage device or a
+ network or software fault, the system's services continue with minimal interruption.
+ Generally, availability is specified as the percentage of time the system is required to be
+ available.</para>
+ <para>Availability is accomplished by replicating hardware and/or software so that when a
+ primary server fails or is unavailable, a standby server can be switched into its place to run
+ applications and associated resources. This process, called <emphasis role="italic"
+ >failover</emphasis>, is automatic in an HA system and, in most cases, completely
+ application-transparent.</para>
+ <para>A failover hardware setup requires a pair of servers with a shared resource (typically a
+ physical storage device, which may be based on SAN, NAS, hardware RAID, SCSI or Fibre Channel
+ (FC) technology). The method of sharing storage should be essentially transparent at the
+ device level; the same physical logical unit number (LUN) should be visible from both servers.
+ To ensure high availability at the physical storage level, we encourage the use of RAID arrays
+ to protect against drive-level failures.</para>
<note>
<para>The Lustre software does not provide redundancy for data; it depends exclusively on
redundancy of backing storage devices. The backing OST storage should be RAID 5 or,
- preferably, RAID 6 storage. MDT storage should be RAID 1 or RAID 0+1.</para>
+ preferably, RAID 6 storage. MDT storage should be RAID 1 or RAID 10.</para>
</note>
<section remap="h3">
<title><indexterm><primary>failover</primary><secondary>capabilities</secondary></indexterm>Failover Capabilities</title>
</itemizedlist>
<para>These capabilities can be provided by a variety of software and/or hardware solutions.
For more information about using power management software or hardware and high availability
- (HA) software with the Lustre software, see <xref linkend="configuringfailover"/>.</para>
+ (HA) software with a Lustre file system, see <xref linkend="configuringfailover"/>.</para>
<para>HA software is responsible for detecting failure of the primary Lustre server node and
- controlling the failover. The Lustre software works with any HA software that includes
    controlling the failover. The Lustre software works with any HA software that includes
resource (I/O) fencing. For proper resource fencing, the HA software must be able to
completely power off the failed server or disconnect it from the shared storage device. If
two active nodes have access to the same storage device, data may be severely
<para><emphasis role="bold">Active/active</emphasis> pair - In this configuration, both nodes are active, each providing a subset of resources. In case of a failure, the second node takes over resources from the failed node.</para>
</listitem>
</itemizedlist>
- <para>Before Lustre software release 2.4, MDSs are configured as an active/passive pair, while
- OSSs are deployed in an active/active configuration that provides redundancy without extra
- overhead. Often the standby MDS is the active MDS for another Lustre file system or the MGS,
- so no nodes are idle in the cluster.</para>
+ <para>In Lustre software releases previous to Lustre software release 2.4, MDSs can be
+ configured as an active/passive pair, while OSSs can be deployed in an active/active
+ configuration that provides redundancy without extra overhead. Often the standby MDS is the
+ active MDS for another Lustre file system or the MGS, so no nodes are idle in the
+ cluster.</para>
<para condition="l24">Lustre software release 2.4 introduces metadata targets for individual
sub-directories. Active-active failover configurations are available for MDSs that serve
MDTs on shared storage.</para>
<primary>failover</primary>
<secondary>and Lustre</secondary>
</indexterm>Failover Functionality in a Lustre File System</title>
- <para>The failover functionality provided in the Lustre software can be used for the following
+ <para>The failover functionality provided by the Lustre software can be used for the following
failover scenario. When a client attempts to do I/O to a failed Lustre target, it continues to
try until it receives an answer from any of the configured failover nodes for the Lustre
target. A user-space application does not detect anything unusual, except that the I/O may
take longer to complete.</para>
- <para>Lustre failover requires two nodes configured as a failover pair, which must share one or
- more storage devices. A Lustre file system can be configured to provide MDT or OST
- failover.</para>
+ <para>Failover in a Lustre file system requires that two nodes be configured as a failover pair,
+ which must share one or more storage devices. A Lustre file system can be configured to
+ provide MDT or OST failover.</para>
<itemizedlist>
<listitem>
- <para>For MDT failover, two MDSs are configured to serve the same MDT. Only one MDS node can serve an MDT at a time.</para>
- <para condition="l24">Lustre software release 2.4 allows multiple MDTs. By placing two or
+ <para>For MDT failover, two MDSs can be configured to serve the same MDT. Only one MDS node
+ can serve an MDT at a time.</para>
+ <para condition="l24">Lustresoftware release 2.4 allows multiple MDTs. By placing two or
more MDT partitions on storage shared by two MDSs, one MDS can fail and the remaining MDS
can begin serving the unserved MDT. This is described as an active/active failover
pair.</para>
</listitem>
<listitem>
- <para>For OST failover, multiple OSS nodes are configured to be able to serve the same OST. However, only one OSS node can serve the OST at a time. An OST can be moved between OSS nodes that have access to the same storage device using <literal>umount/mount</literal> commands.</para>
+      <para>For OST failover, multiple OSS nodes can be configured to be able to serve the same
+        OST. However, only one OSS node can serve the OST at a time. An OST can be moved between
+        OSS nodes that have access to the same storage device using
+        <literal>umount/mount</literal> commands, as shown in the sketch after this list.</para>
</listitem>
</itemizedlist>
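+  <para>For example, moving an OST between the nodes of a failover pair might look like the
+    following sketch (the device and mount point names are illustrative, and the shared device
+    must be visible to both OSS
+    nodes):<screen>oss0# umount /mnt/testfs/ost0
+oss1# mount -t lustre /dev/sdb /mnt/testfs/ost0</screen></para>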
- <para>To add a failover partner to a Lustre file system configuration, the
- <literal>--failnode</literal> or <literal>--servicenode</literal> option is used. This can
- be done at creation time (using <literal>mkfs.lustre</literal>) or later when the Lustre file
- system is active (using <literal>tunefs.lustre</literal>). For explanations of these
- utilities, see <xref linkend="dbdoclet.50438219_75432"/> and <xref
+ <para>The <literal>--servicenode</literal> option is used to set up nodes in a Lustre file
+ system for failover at creation time (using <literal>mkfs.lustre</literal>) or later when the
+ Lustre file system is active (using <literal>tunefs.lustre</literal>). For explanations of
+ these utilities, see <xref linkend="dbdoclet.50438219_75432"/> and <xref
linkend="dbdoclet.50438219_39574"/>.</para>
- <para>Lustre failover capability can be used to upgrade the Lustre software between successive minor versions without cluster downtime. For more information, see <xref linkend="upgradinglustre"/>.</para>
+ <para>Failover capability in a Lustre file system can be used to upgrade the Lustre software
+ between successive minor versions without cluster downtime. For more information, see <xref
+ linkend="upgradinglustre"/>.</para>
<para>For information about configuring failover, see <xref linkend="configuringfailover"/>.</para>
<note>
- <para>Failover functionality is provided only at the file system level in the Lustre
- software. In a complete failover solution, failover functionality for system-level
- components, such as node failure detection or power control, must be provided by a
- third-party tool.</para>
+ <para>The Lustre software provides failover functionality only at the file system level. In a
+ complete failover solution, failover functionality for system-level components, such as node
+ failure detection or power control, must be provided by a third-party tool.</para>
</note>
<caution>
<para>OST failover functionality does not protect against corruption caused by a disk failure.
- If the storage media (i.e., physical disk) used for an OST fails, the Lustre software does
- not provide a means to recover it. We strongly recommend that some form of RAID be used for
- OSTs. Failover functionality provided in the Lustre software assumes that the storage is
- reliable, so no extra reliability features are included.</para>
+ If the storage media (i.e., physical disk) used for an OST fails, it cannot be recovered by
+ functionality provided in the Lustre software. We strongly recommend that some form of RAID
+ be used for OSTs. Lustre functionality assumes that the storage is reliable, so it adds no
+ extra reliability features.</para>
</caution>
<section remap="h3">
<title><indexterm><primary>failover</primary><secondary>MDT</secondary></indexterm>MDT Failover Configuration (Active/Passive)</title>
<title xml:id="understandingfailover.fig.configmdts"> Lustre failover configuration for a active/active MDTs </title>
<mediaobject>
<imageobject>
- <imagedata scalefit="1" width="50%" fileref="./figures/MDTs_Failover.svg"/>
+ <imagedata scalefit="1" width="50%" fileref="figures/MDTs_Failover.png"/>
</imageobject>
<textobject>
<phrase>Lustre failover configuration for two MDTs</phrase>
</figure>
<para>In an active configuration, 50% of the available OSTs are assigned to one OSS and the remaining OSTs are assigned to the other OSS. Each OSS serves as the primary node for half the OSTs and as a failover node for the remaining OSTs.</para>
<para>In this mode, if one OSS fails, the other OSS takes over all of the failed OSTs. The clients attempt to connect to each OSS serving the OST, until one of them responds. Data on the OST is written synchronously, and the clients replay transactions that were in progress and uncommitted to disk before the OST failure.</para>
+ <para>For more information about configuring failover, see <xref linkend="configuringfailover"
+ />.</para>
</section>
</section>
</chapter>
--- /dev/null
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+
+<svg
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ xmlns:cc="http://creativecommons.org/ns#"
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns="http://www.w3.org/2000/svg"
+ xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+ xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+ width="300.34375"
+ height="252.86719"
+ id="svg2"
+ version="1.1"
+ inkscape:version="0.48.4 r9939"
+ sodipodi:docname="MDTs_Failover.png">
+ <defs
+ id="defs4" />
+ <sodipodi:namedview
+ id="base"
+ pagecolor="#ffffff"
+ bordercolor="#666666"
+ borderopacity="1.0"
+ inkscape:pageopacity="0.0"
+ inkscape:pageshadow="2"
+ inkscape:zoom="1.5860717"
+ inkscape:cx="140.02343"
+ inkscape:cy="200.22648"
+ inkscape:document-units="px"
+ inkscape:current-layer="layer1"
+ showgrid="true"
+ showguides="true"
+ inkscape:guide-bbox="true"
+ fit-margin-top="0.2"
+ fit-margin-left="0"
+ fit-margin-right="0"
+ fit-margin-bottom="0"
+ inkscape:window-width="1440"
+ inkscape:window-height="839"
+ inkscape:window-x="-9"
+ inkscape:window-y="-9"
+ inkscape:window-maximized="1"
+ units="in">
+ <inkscape:grid
+ type="xygrid"
+ id="grid3153"
+ empspacing="5"
+ visible="true"
+ enabled="true"
+ snapvisiblegridlinesonly="true"
+ originx="-219.97656px"
+ originy="-371.63331px" />
+ </sodipodi:namedview>
+ <metadata
+ id="metadata7">
+ <rdf:RDF>
+ <cc:Work
+ rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type
+ rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+ <dc:title></dc:title>
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <g
+ inkscape:groupmode="layer"
+ id="layer2"
+ inkscape:label="Layer"
+ transform="translate(-219.97656,-427.86218)">
+ <path
+ style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
+ d="m 290,567.36218 20,-55"
+ id="path3156"
+ inkscape:connector-curvature="0"
+ sodipodi:nodetypes="cc" />
+ <path
+ sodipodi:nodetypes="cc"
+ inkscape:connector-curvature="0"
+ id="path3926"
+ d="m 305,577.36218 85,-70"
+ style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
+ <path
+ style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
+ d="m 435,565.36218 -20,-58"
+ id="path3928"
+ inkscape:connector-curvature="0"
+ sodipodi:nodetypes="cc" />
+ <path
+ sodipodi:nodetypes="cc"
+ inkscape:connector-curvature="0"
+ id="path3930"
+ d="m 415,577.36218 -80,-70"
+ style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
+ <text
+ xml:space="preserve"
+ style="font-size:40px;font-style:normal;font-weight:normal;line-height:125%;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans"
+ x="295"
+ y="457.36218"
+ id="text3932"
+ sodipodi:linespacing="125%"><tspan
+ sodipodi:role="line"
+ id="tspan3934"
+ x="295"
+ y="457.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans">MDT0</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-size:40px;font-style:normal;font-weight:normal;line-height:125%;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans"
+ x="380"
+ y="457.36218"
+ id="text3932-3"
+ sodipodi:linespacing="125%"><tspan
+ sodipodi:role="line"
+ id="tspan3934-0"
+ x="380"
+ y="457.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans">MDT1</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-size:40px;font-style:normal;font-weight:normal;line-height:125%;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans"
+ x="260"
+ y="637.36218"
+ id="text3932-8"
+ sodipodi:linespacing="125%"><tspan
+ sodipodi:role="line"
+ id="tspan3934-4"
+ x="260"
+ y="637.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans">MDS0</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-size:40px;font-style:normal;font-weight:normal;line-height:125%;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans"
+ x="415"
+ y="637.36218"
+ id="text3932-8-4"
+ sodipodi:linespacing="125%"><tspan
+ sodipodi:role="line"
+ id="tspan3934-4-4"
+ x="415"
+ y="637.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans">MDS1</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-size:40px;font-style:normal;font-weight:normal;line-height:125%;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans"
+ x="220"
+ y="657.36218"
+ id="text3932-8-5"
+ sodipodi:linespacing="125%"><tspan
+ sodipodi:role="line"
+ id="tspan3934-4-42"
+ x="220"
+ y="657.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans">Active for MDT0, </tspan><tspan
+ sodipodi:role="line"
+ x="220"
+ y="677.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans"
+ id="tspan3996"> standby for MDT1</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-size:40px;font-style:normal;font-weight:normal;line-height:125%;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans"
+ x="385"
+ y="657.36218"
+ id="text3932-8-5-7"
+ sodipodi:linespacing="125%"><tspan
+ sodipodi:role="line"
+ id="tspan3934-4-42-7"
+ x="385"
+ y="657.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans">Active for MDT1, </tspan><tspan
+ sodipodi:role="line"
+ x="385"
+ y="677.36218"
+ style="font-size:16px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:Liberation Sans;-inkscape-font-specification:Liberation Sans"
+ id="tspan3996-5"> standby for MDT0</tspan></text>
+ </g>
+ <g
+ inkscape:label="Layer 1"
+ inkscape:groupmode="layer"
+ id="layer1"
+ transform="translate(-219.97656,-427.86218)">
+ <path
+ inkscape:connector-curvature="0"
+ id="path3092"
+ d="m 325.63111,516.07236 c -9.32466,2.68065 -12.31484,0.82334 -18.99634,0.44889 -7.5071,-0.42072 -14.71276,-8.39091 -18.50244,-15.74496 -3.51662,-6.82418 -3.70958,-17.04292 -0.44051,-23.3298 6.97202,-13.40822 20.62884,-20.18558 34.78675,-17.26335 7.98141,1.64738 12.19069,3.95382 17.26764,9.46165 5.87108,6.36937 7.48325,10.54237 7.54274,19.5239 0.0418,6.3151 -0.3695,8.24814 -2.60361,12.23546 0,0 -2.90778,4.85088 -6.32187,7.87995 -3.41409,3.02906 -3.40769,4.10761 -12.73236,6.78826 z m -1.79077,-11.98966 c 7.30462,-1.56656 8.03466,-2.65557 8.03466,-11.98544 0,-6.38095 -0.29279,-7.92857 -1.5,-7.92857 -1.20872,0 -1.5,1.55831 -1.5,8.02475 0,7.82922 -0.067,8.05072 -2.75,9.09059 -2.98431,1.15665 -19.9603,0.80159 -22,-0.46014 -0.8559,-0.52945 -1.25,-3.72437 -1.25,-10.13361 l 0,-9.36038 3.73858,1.03228 c 5.51873,1.52379 19.46478,0.55851 22.7169,-1.57236 3.29897,-2.16157 2.92419,-3.61355 -1.51503,-5.86958 -5.3484,-2.71807 -24.65451,-2.20176 -23.60451,0.63126 0.22773,0.61446 1.03281,0.93217 1.78906,0.70603 6.39049,-1.91095 22.875,-0.5337 22.875,1.91116 0,3.24428 -20.57655,3.81792 -25.32587,0.70605 -1.34469,-0.88107 -2.72147,-1.32538 -3.05951,-0.98734 -0.33804,0.33805 -0.61462,5.41988 -0.61462,11.29297 0,12.26136 0.64131,13.44017 8.10994,14.90699 6.27881,1.23314 10.08895,1.23202 15.8554,-0.005 z"
+ style="fill:#52b4d3"
+ sodipodi:nodetypes="zssssssszzcsssssscssssssscsscc" />
+ <path
+ inkscape:connector-curvature="0"
+ id="path3092-5"
+ d="m 415.01753,516.07236 c -9.32466,2.68065 -12.31484,0.82334 -18.99634,0.44889 -7.5071,-0.42072 -14.71276,-8.39091 -18.50244,-15.74496 -3.51662,-6.82418 -3.70958,-17.04292 -0.44051,-23.3298 6.97202,-13.40822 20.62884,-20.18558 34.78675,-17.26335 7.98141,1.64738 12.19069,3.95382 17.26764,9.46165 5.87108,6.36937 7.48325,10.54237 7.54274,19.5239 0.0418,6.3151 -0.3695,8.24814 -2.60361,12.23546 0,0 -2.90778,4.85088 -6.32187,7.87995 -3.41409,3.02906 -3.40769,4.10761 -12.73236,6.78826 z m -1.79077,-11.98966 c 7.30462,-1.56656 8.03466,-2.65557 8.03466,-11.98544 0,-6.38095 -0.29279,-7.92857 -1.5,-7.92857 -1.20872,0 -1.5,1.55831 -1.5,8.02475 0,7.82922 -0.067,8.05072 -2.75,9.09059 -2.98431,1.15665 -19.9603,0.80159 -22,-0.46014 -0.8559,-0.52945 -1.25,-3.72437 -1.25,-10.13361 l 0,-9.36038 3.73858,1.03228 c 5.51873,1.52379 19.46478,0.55851 22.7169,-1.57236 3.29897,-2.16157 2.92419,-3.61355 -1.51503,-5.86958 -5.3484,-2.71807 -24.65451,-2.20176 -23.60451,0.63126 0.22773,0.61446 1.03281,0.93217 1.78906,0.70603 6.39049,-1.91095 22.875,-0.5337 22.875,1.91116 0,3.24428 -20.57655,3.81792 -25.32587,0.70605 -1.34469,-0.88107 -2.72147,-1.32538 -3.05951,-0.98734 -0.33804,0.33805 -0.61462,5.41988 -0.61462,11.29297 0,12.26136 0.64131,13.44017 8.10994,14.90699 6.27881,1.23314 10.08895,1.23202 15.8554,-0.005 z"
+ style="fill:#52b4d3"
+ sodipodi:nodetypes="zssssssszzcsssssscssssssscsscc" />
+ <path
+ inkscape:connector-curvature="0"
+ id="path3088"
+ d="m 273.2251,620.86242 c -9.19938,-2.60094 -15.60559,-8.44179 -19.89538,-18.13958 -1.72548,-3.90074 -2.05873,-6.17143 -1.68084,-11.45292 1.00398,-14.03197 10.65486,-24.5457 24.93416,-27.16343 l 6.66116,-1.22115 3.17769,0.19426 c 0,0.49287 1.66657,1.20877 3.70349,1.5909 5.58469,1.04769 11.37286,4.35957 15.48991,8.86303 l 3.6934,4.04004 1.41285,2.81082 1.30628,3.65696 c 4.57788,12.81582 -2.35355,28.58078 -15.24827,34.68096 -6.68477,3.1624 -16.71526,4.07375 -23.55445,2.14011 z m 26.19679,-27.29332 c 0,-17.64578 0.85003,-16.45462 -8.75193,-12.26423 l -5.25193,2.292 -7.99807,-3.89951 c -4.39894,-2.14473 -7.99764,-4.14691 -7.99712,-4.44928 5.3e-4,-0.30238 2.36303,-1.51109 5.25,-2.68602 l 5.24905,-2.13623 6.72504,3.38623 c 7.28423,3.66781 7.77496,3.82624 7.77496,2.51011 0,-0.48187 -3.19521,-2.55517 -7.10046,-4.60732 l -7.10047,-3.73119 -6.64953,2.77106 c -10.66886,4.44604 -10.69427,4.24134 1.25368,10.09935 l 10.6382,5.21585 5.5078,-2.43581 c 3.0293,-1.33969 5.73267,-2.43581 6.00751,-2.43581 0.27484,0 0.37451,6.13256 0.22149,13.62791 l -0.27822,13.6279 -5.29305,2.3345 -5.29305,2.33449 0.29305,-7.18732 0.29305,-7.18732 4.5,-1.65644 c 4.37478,-1.61035 5.74478,-3.51046 4.6285,-6.41946 -0.46279,-1.20599 -1.04644,-1.19758 -4.05246,0.0584 -4.43072,1.85129 -5.57604,1.84645 -5.57604,-0.0235 0,-1.36917 -18.48575,-11.50912 -20.98183,-11.50912 -0.64057,0 -1.01817,5.00248 -1.01817,13.48884 l 0,13.48884 7.9778,4.01116 c 4.38779,2.20614 8.43779,4.01116 9,4.01116 2.76343,0 0.26688,-1.96782 -6.9778,-5.5 l -8,-3.90045 0,-11.29977 c 0,-6.21488 0.21839,-11.29978 0.4853,-11.29978 0.26692,0 4.31692,1.90029 9,4.22286 5.74384,2.84865 8.5147,4.78197 8.5147,5.94098 0,2.65288 1.57718,3.03967 5.57811,1.36797 4.2406,-1.77184 4.42189,-1.80174 4.42189,-0.72935 0,0.44135 -2.25,1.61154 -5,2.60043 l -5,1.79797 0,8.7329 c 0,4.8031 0.27864,9.01155 0.61921,9.35211 0.34056,0.34057 3.71556,-0.94009 7.5,-2.8459 l 6.88079,-3.46511 0,-15.60406 z"
+ style="fill:#59b124"
+ sodipodi:nodetypes="ssssccssccssssscscscssscsscssscccccsssssscssscssssssscscsccs" />
+ <path
+ inkscape:connector-curvature="0"
+ id="path3088-9"
+ d="m 428.80321,620.86242 c -9.19938,-2.60094 -15.60559,-8.44179 -19.89538,-18.13958 -1.72548,-3.90074 -2.05873,-6.17143 -1.68084,-11.45292 1.00398,-14.03197 10.65486,-24.5457 24.93416,-27.16343 l 6.66116,-1.22115 L 442,563.0796 c 0,0.49287 1.66657,1.20877 3.70349,1.5909 5.58469,1.04769 11.37286,4.35957 15.48991,8.86303 l 3.6934,4.04004 1.41285,2.81082 1.30628,3.65696 c 4.57788,12.81582 -2.35355,28.58078 -15.24827,34.68096 -6.68477,3.1624 -16.71526,4.07375 -23.55445,2.14011 z M 455,593.5691 c 0,-17.64578 0.85003,-16.45462 -8.75193,-12.26423 l -5.25193,2.292 -7.99807,-3.89951 c -4.39894,-2.14473 -7.99764,-4.14691 -7.99712,-4.44928 5.3e-4,-0.30238 2.36303,-1.51109 5.25,-2.68602 l 5.24905,-2.13623 6.72504,3.38623 c 7.28423,3.66781 7.77496,3.82624 7.77496,2.51011 0,-0.48187 -3.19521,-2.55517 -7.10046,-4.60732 l -7.10047,-3.73119 -6.64953,2.77106 c -10.66886,4.44604 -10.69427,4.24134 1.25368,10.09935 l 10.6382,5.21585 5.5078,-2.43581 c 3.0293,-1.33969 5.73267,-2.43581 6.00751,-2.43581 0.27484,0 0.37451,6.13256 0.22149,13.62791 l -0.27822,13.6279 -5.29305,2.3345 -5.29305,2.33449 0.29305,-7.18732 0.29305,-7.18732 4.5,-1.65644 c 4.37478,-1.61035 5.74478,-3.51046 4.6285,-6.41946 -0.46279,-1.20599 -1.04644,-1.19758 -4.05246,0.0584 -4.43072,1.85129 -5.57604,1.84645 -5.57604,-0.0235 0,-1.36917 -18.48575,-11.50912 -20.98183,-11.50912 -0.64057,0 -1.01817,5.00248 -1.01817,13.48884 l 0,13.48884 7.9778,4.01116 c 4.38779,2.20614 8.43779,4.01116 9,4.01116 2.76343,0 0.26688,-1.96782 -6.9778,-5.5 l -8,-3.90045 0,-11.29977 c 0,-6.21488 0.21839,-11.29978 0.4853,-11.29978 0.26692,0 4.31692,1.90029 9,4.22286 5.74384,2.84865 8.5147,4.78197 8.5147,5.94098 0,2.65288 1.57718,3.03967 5.57811,1.36797 4.2406,-1.77184 4.42189,-1.80174 4.42189,-0.72935 0,0.44135 -2.25,1.61154 -5,2.60043 l -5,1.79797 0,8.7329 c 0,4.8031 0.27864,9.01155 0.61921,9.35211 0.34056,0.34057 3.71556,-0.94009 7.5,-2.8459 L 455,609.1732 l 0,-15.60406 z"
+ style="fill:#59b124"
+ sodipodi:nodetypes="ssssccssccssssscscscssscsscssscccccsssssscssscssssssscscsccs" />
+ </g>
+</svg>