<?xml version='1.0' encoding='UTF-8'?>
<chapter xmlns="http://docbook.org/ns/docbook"
  xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
  xml:id="configuringfailover">
  <title xml:id="configuringfailover.title">Configuring Failover in a Lustre
    File System</title>
  <para>This chapter describes how to configure failover in a Lustre file
    system. It includes:</para>
  <itemizedlist>
    <listitem>
      <para><xref xmlns:xlink="http://www.w3.org/1999/xlink"
          linkend="high_availability"/></para>
    </listitem>
    <listitem>
      <para><xref xmlns:xlink="http://www.w3.org/1999/xlink"
          linkend="failover_setup"/></para>
    </listitem>
    <listitem>
      <para><xref xmlns:xlink="http://www.w3.org/1999/xlink"
          linkend="administering_failover"/></para>
    </listitem>
  </itemizedlist>
  <para>For an overview of failover functionality in a Lustre file system, see
    <xref xmlns:xlink="http://www.w3.org/1999/xlink"
      linkend="understandingfailover"/>.</para>
  <section xml:id="high_availability">
    <title><indexterm>
        <primary>High availability</primary>
        <see>failover</see>
      </indexterm><indexterm>
        <primary>failover</primary>
      </indexterm>Setting Up a Failover Environment</title>
    <para>The Lustre software provides failover mechanisms only at the layer
      of the Lustre file system. No failover functionality is provided for
      system-level components, such as failing hardware or applications, or
      even for the entire failure of a node, as would typically be provided in
      a complete failover solution. Failover functionality such as node
      monitoring, failure detection, and resource fencing must be provided by
      external HA software, such as PowerMan or the open source Corosync and
      Pacemaker packages provided by Linux operating system vendors. Corosync
      provides support for detecting failures, and Pacemaker provides the
      actions to take once a failure has been detected.</para>
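    <para>As a brief illustration (assuming a cluster that is already managed
      by Corosync and Pacemaker; these tools ship with the HA packages, not
      with Lustre), the health of the HA stack can be checked from any cluster
      node:<screen># show Pacemaker cluster and resource status once, then exit
crm_mon -1
# list the nodes currently seen by the Corosync membership layer
corosync-cmapctl | grep members</screen></para>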
    <section remap="h3">
      <title><indexterm>
          <primary>failover</primary>
          <secondary>power control device</secondary>
        </indexterm>Selecting Power Equipment</title>
      <para>Failover in a Lustre file system requires the use of a remote
        power control (RPC) mechanism, which comes in different configurations.
        For example, Lustre server nodes may be equipped with IPMI/BMC devices
        that allow remote power control. In the past, software or even
        “sneakerware” has been used, but these are not recommended. For
        recommended devices, refer to the list of supported RPC devices on the
        website for the PowerMan cluster power management utility:</para>
      <para><link xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:href="https://linux.die.net/man/7/powerman-devices">
          https://linux.die.net/man/7/powerman-devices</link></para>
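      <para>For example, where the servers have IPMI-capable BMCs, a
        management node can query and control server power with the standard
        <literal>ipmitool</literal> utility (the BMC hostname and credentials
        below are placeholders):<screen># query the current power state of a server through its BMC
ipmitool -I lanplus -H oss1-bmc.example.com -U admin -P secret chassis power status
# forcibly power the server off, as an HA agent would during fencing
ipmitool -I lanplus -H oss1-bmc.example.com -U admin -P secret chassis power off</screen></para>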
    </section>
    <section remap="h3">
      <title><indexterm>
          <primary>failover</primary>
          <secondary>power management software</secondary>
        </indexterm>Selecting Power Management Software</title>
      <para>Lustre failover requires RPC and management capability to verify
        that a failed node is shut down before I/O is directed to the failover
        node. This prevents the target from being mounted on both nodes at the
        same time, which would risk unrecoverable data corruption. A variety
        of power management tools will work. Two packages that have been
        commonly used with the Lustre software are PowerMan and Linux-HA (also
        known as STONITH).</para>
      <para>The PowerMan cluster power management utility is used to control
        RPC devices from a central location. PowerMan provides native support
        for several RPC varieties, and its Expect-like configuration language
        simplifies the addition of new devices. The latest versions of
        PowerMan are available at:</para>
      <para><link xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:href="https://github.com/chaos/powerman">
          https://github.com/chaos/powerman</link></para>
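      <para>A minimal sketch of PowerMan in use, assuming the site's RPC
        devices and node names have already been declared in
        <literal>/etc/powerman/powerman.conf</literal> (the node name below is
        a placeholder):<screen># query the on/off state of all configured nodes
powerman --query
# power a failed OSS off before its targets are failed over
powerman --off oss1</screen></para>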
      <para>STONITH, or “Shoot The Other Node In The Head”, is a set of power
        management tools provided with the Linux-HA package prior to Red Hat
        Enterprise Linux 6. Linux-HA has native support for many power control
        devices, is extensible (uses Expect scripts to automate control), and
        provides the software to detect and respond to failures. With Red Hat
        Enterprise Linux 6, Linux-HA is being replaced in the open source
        community by the combination of Corosync and Pacemaker. For Red Hat
        Enterprise Linux subscribers, cluster management using CMAN is
        available from Red Hat.</para>
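      <para>As an illustration of the same fencing concept in the
        Corosync/Pacemaker stack, an IPMI-based STONITH resource might be
        declared with <literal>pcs</literal> as sketched below (agent
        parameter names vary between fence-agents versions; the node name,
        BMC address, and credentials are placeholders):<screen># register an IPMI fencing device for node oss1
pcs stonith create fence-oss1 fence_ipmilan \
    ip=oss1-bmc.example.com username=admin password=secret \
    lanplus=1 pcmk_host_list=oss1</screen></para>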
    </section>
    <section remap="h3">
      <title><indexterm>
          <primary>failover</primary>
          <secondary>high-availability (HA) software</secondary>
        </indexterm>Selecting High-Availability (HA) Software</title>
      <para>The Lustre file system must be set up with high-availability (HA)
        software to enable a complete Lustre failover solution. Except for
        PowerMan, the HA software packages mentioned above provide both power
        management and cluster management. For information about setting up
        failover with Pacemaker, see:</para>
      <itemizedlist>
        <listitem>
          <para>Pacemaker Project website:
            <link xmlns:xlink="http://www.w3.org/1999/xlink"
              xlink:href="https://clusterlabs.org/">https://clusterlabs.org/</link></para>
        </listitem>
        <listitem>
          <para><emphasis role="italic">Using Pacemaker with a Lustre File
              System</emphasis>:
            <link xmlns:xlink="http://www.w3.org/1999/xlink"
              xlink:href="https://wiki.whamcloud.com/display/PUB/Using+Pacemaker+with+a+Lustre+File+System">
              https://wiki.whamcloud.com/display/PUB/Using+Pacemaker+with+a+Lustre+File+System</link></para>
        </listitem>
      </itemizedlist>
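      <para>As a minimal sketch of the approach described in the documents
        above (the resource name, mount point, and node names are
        placeholders), a Lustre target can be managed by Pacemaker as a
        <literal>Filesystem</literal> resource that is allowed to run on
        either node of its failover pair:<screen># define the OST as a cluster-managed file system resource
pcs resource create testfs-OST0000 ocf:heartbeat:Filesystem \
    device=/dev/sdb directory=/mnt/testfs-ost0 fstype=lustre
# prefer oss1 as the primary node; oss2 is the failover node
pcs constraint location testfs-OST0000 prefers oss1=100 oss2=50</screen></para>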
    </section>
  </section>
  <section xml:id="failover_setup">
    <title><indexterm>
        <primary>failover</primary>
        <secondary>setup</secondary>
      </indexterm>Preparing a Lustre File System for Failover</title>
    <para>To prepare a Lustre file system to be configured and managed as an
      HA system by a third-party HA application, each storage target (MGT,
      MDT, OST) must be associated with a second node to create a failover
      pair. This configuration information is then communicated by the MGS to
      a client when the client mounts the file system.</para>
    <para>The per-target configuration is relayed to the MGS at mount time.
      Some rules related to this are:<itemizedlist>
        <listitem>
          <para>When a target is <emphasis role="underline"><emphasis
              role="italic">initially</emphasis></emphasis> mounted, the MGS
            reads the configuration information from the target (such as mgt
            vs. ost, failnode, fsname) to configure the target into a Lustre
            file system. If the MGS is reading the initial mount
            configuration, the mounting node becomes that target's “primary”
            node.</para>
        </listitem>
        <listitem>
          <para>When a target is <emphasis role="underline"><emphasis
              role="italic">subsequently</emphasis></emphasis> mounted, the
            MGS reads the current configuration from the target and, as
            needed, will reconfigure the MGS database target
            information.</para>
        </listitem>
      </itemizedlist></para>
    <para>When the target is formatted using the
      <literal>mkfs.lustre</literal> command, the failover service node(s) for
      the target are designated using the <literal>--servicenode</literal>
      option. In the example below, an OST with index <literal>0</literal> in
      the file system <literal>testfs</literal> is formatted with two service
      nodes designated to serve as a failover
      pair:<screen>mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o2ib \
        --index=0 --servicenode=192.168.10.7@o2ib \
        --servicenode=192.168.10.8@o2ib \
        /dev/sdb</screen></para>
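    <para>If a target was originally formatted without failover information,
      service nodes can also be designated later with
      <literal>tunefs.lustre</literal> while the target is unmounted (a
      sketch, using the NIDs from the example above; depending on the Lustre
      version and whether the target is already registered with the MGS, a
      <literal>--writeconf</literal> may also be needed for the change to take
      effect):<screen>tunefs.lustre --servicenode=192.168.10.7@o2ib \
             --servicenode=192.168.10.8@o2ib /dev/sdb</screen></para>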
    <para>More than two potential service nodes can be designated for a
      target. The target can then be mounted on any of the designated service
      nodes.</para>
    <para>When HA is configured on a storage target, the Lustre software
      enables multi-mount protection (MMP) on that storage target. MMP
      prevents multiple nodes from simultaneously mounting and thus corrupting
      the data on the target. For more about MMP, see
      <xref xmlns:xlink="http://www.w3.org/1999/xlink"
        linkend="managingfailover"/>.</para>
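    <para>On ldiskfs-based targets, MMP is implemented as an ext4 feature
      flag, so its presence can be confirmed with the standard e2fsprogs
      tools (the device name below is a placeholder):<screen># "mmp" should appear in the feature list of a failover-enabled target
dumpe2fs -h /dev/sdb | grep -i mmp</screen></para>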
    <para>If the MGT has been formatted with multiple service nodes
      designated, this information must be conveyed to the Lustre client in
      the mount command used to mount the file system. In the example below,
      the NIDs of the two nodes designated as service nodes for the MGT are
      specified in the mount command executed on the
      client:<screen>mount -t lustre 10.10.120.1@tcp1:10.10.120.2@tcp1:/testfs /lustre/testfs</screen></para>
    <para>When a client mounts the file system, the MGS provides configuration
      information to the client for the MDT(s) and OST(s) in the file system,
      along with the NIDs for all service nodes associated with each target
      and the service node on which the target is mounted. Later, when the
      client attempts to access data on a target, it will try the NID for each
      specified service node until it connects to the target.</para>
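    <para>Whether a client can actually reach each of a target's designated
      service nodes can be verified from the client with <literal>lctl
      ping</literal> (NIDs as in the formatting example above):<screen># confirm LNet connectivity to both designated service nodes
lctl ping 192.168.10.7@o2ib
lctl ping 192.168.10.8@o2ib</screen></para>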
  </section>
  <section xml:id="administering_failover">
    <title>Administering Failover in a Lustre File System</title>
    <para>For additional information about administering failover features in
      a Lustre file system, see:<itemizedlist>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink"
              linkend="failover_ost"/></para>
        </listitem>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink"
              linkend="failover_nids"/></para>
        </listitem>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink"
              linkend="lustremaint.ChangeAddrFailoverNode"/></para>
        </listitem>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink"
              linkend="mkfs.lustre"/></para>
        </listitem>
      </itemizedlist></para>
  </section>
</chapter>
<!-- vim:expandtab:shiftwidth=2:tabstop=8: -->