<?xml version='1.0' encoding='UTF-8'?>
<chapter xmlns="http://docbook.org/ns/docbook"
  xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
  xml:id="configuringfailover">
  <title xml:id="configuringfailover.title">Configuring Failover in a Lustre
  File System</title>
  <para>This chapter describes how to configure failover in a Lustre file
  system. It includes:</para>
  <itemizedlist>
    <listitem>
      <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_82389"/></para>
    </listitem>
    <listitem>
      <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438188_92688"/></para>
    </listitem>
    <listitem>
      <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_tnq_kbr_xl"/></para>
    </listitem>
  </itemizedlist>
  <para>For an overview of failover functionality in a Lustre file system, see <xref
    xmlns:xlink="http://www.w3.org/1999/xlink" linkend="understandingfailover"/>.</para>
  <section xml:id="dbdoclet.50438188_82389">
    <title><indexterm>
      <primary>High availability</primary>
    </indexterm><indexterm>
      <primary>failover</primary>
    </indexterm>Setting Up a Failover Environment</title>
    <para>The Lustre software provides failover mechanisms only at the layer of the Lustre file
      system. No failover functionality is provided for system-level components, such as failing
      hardware or applications, or even for the entire failure of a node, as would typically be
      provided in a complete failover solution. Failover functionality such as node monitoring,
      failure detection, and resource fencing must be provided by external HA software, such as
      PowerMan or the open source Corosync and Pacemaker packages provided by Linux operating system
      vendors. Corosync provides support for detecting failures, and Pacemaker provides the actions
      to take once a failure has been detected.</para>
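    <para>As an illustration only, a Corosync/Pacemaker setup for a Lustre failover pair
      might define an IPMI fencing device and manage a target mount as a cluster resource
      using the <literal>pcs</literal> tool. This is a minimal sketch; the host names,
      addresses, credentials, and device paths below are hypothetical, and the exact fence
      agent parameters depend on the installed fence-agents version:</para>
    <screen># Fence device so Pacemaker can power off a failed peer before failover (IPMI assumed)
pcs stonith create fence-oss1 fence_ipmilan ip=10.0.0.101 \
    username=admin password=secret pcmk_host_list=oss1

# Manage the OST mount as a resource that can run on either node of the pair
pcs resource create testfs-OST0000 ocf:heartbeat:Filesystem \
    device=/dev/sdb directory=/mnt/testfs-ost0 fstype=lustre \
    op monitor interval=30s</screen>
    <para>In this arrangement, Corosync handles cluster membership and failure detection,
      and Pacemaker fences the failed node through the stonith device before starting the
      <literal>Filesystem</literal> resource on the surviving node.</para>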
    <title><indexterm>
      <primary>failover</primary>
      <secondary>power control device</secondary>
    </indexterm>Selecting Power Equipment</title>
    <para>Failover in a Lustre file system requires the use of a remote power control (RPC)
      mechanism, which comes in different configurations. For example, Lustre server nodes may be
      equipped with IPMI/BMC devices that allow remote power control. In the past, software or
      even “sneakerware” has been used, but these are not recommended. For recommended devices,
      refer to the list of supported RPC devices on the website for the PowerMan cluster power
      management utility:</para>
    <para><link xmlns:xlink="http://www.w3.org/1999/xlink"
      xlink:href="http://code.google.com/p/powerman/wiki/SupportedDevs"
      >http://code.google.com/p/powerman/wiki/SupportedDevs</link></para>
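    <para>By way of example, a minimal PowerMan configuration maps node names to an RPC
      device, after which nodes can be queried and power-cycled from a central host. This
      is a sketch only; the device type, BMC host names, and node names below are
      hypothetical, and the exact <literal>device</literal> line syntax is defined in
      <literal>powerman.conf(5)</literal>:</para>
    <screen># /etc/powerman/powerman.conf (sketch, assuming ipmipower support is installed)
include "/etc/powerman/ipmipower.dev"
device "ipmi0" "ipmipower" "/usr/sbin/ipmipower -h bmc-oss[1-2] |&amp;"
node "oss[1-2]" "ipmi0"

# Query power state, then power-cycle a node, from the management host
powerman -q oss1
powerman -c oss1</screen>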
    <title><indexterm>
      <primary>failover</primary>
      <secondary>power management software</secondary>
    </indexterm>Selecting Power Management Software</title>
    <para>Lustre failover requires RPC and management capability to verify that a failed node is
      shut down before I/O is directed to the failover node. This prevents a target from being
      mounted on two nodes at once, which risks unrecoverable data corruption. A variety of power
      management tools will work. Two packages that have been commonly used with the Lustre
      software are PowerMan and Linux-HA (aka STONITH).</para>
    <para>The PowerMan cluster power management utility is used to control RPC devices from a
      central location. PowerMan provides native support for several RPC varieties, and its
      Expect-like configuration simplifies the addition of new devices. The latest versions of
      PowerMan are available at:</para>
    <para><link xmlns:xlink="http://www.w3.org/1999/xlink"
      xlink:href="http://code.google.com/p/powerman/"
      >http://code.google.com/p/powerman/</link></para>
    <para>STONITH, or “Shoot The Other Node In The Head”, is a set of power management tools
      provided with the Linux-HA package prior to Red Hat Enterprise Linux 6. Linux-HA has native
      support for many power control devices, is extensible (uses Expect scripts to automate
      control), and provides the software to detect and respond to failures. With Red Hat
      Enterprise Linux 6, Linux-HA is being replaced in the open source community by the
      combination of Corosync and Pacemaker. For Red Hat Enterprise Linux subscribers, cluster
      management using CMAN is available from Red Hat.</para>
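    <para>Whichever package is chosen, it is worth verifying out-of-band power control
      manually before relying on it for fencing. For example, with an IPMI-based device, a
      status query along the following lines confirms the fence agent can reach the BMC
      (the address and credentials below are hypothetical):</para>
    <screen># Report the current power state of the node behind this BMC
fence_ipmilan -a 10.0.0.101 -l admin -p secret -o status</screen>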
    <title><indexterm>
      <primary>failover</primary>
      <secondary>high-availability (HA) software</secondary>
    </indexterm>Selecting High-Availability (HA) Software</title>
    <para>The Lustre file system must be set up with high-availability (HA) software to enable a
      complete Lustre failover solution. Except for PowerMan, the HA software packages mentioned
      above provide both power management and cluster management. For information about setting
      up failover with Pacemaker, see:</para>
    <itemizedlist>
      <listitem>
        <para>Pacemaker Project website: <link xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:href="http://clusterlabs.org/">http://clusterlabs.org/</link></para>
      </listitem>
      <listitem>
        <para>Article <emphasis role="italic">Using Pacemaker with a Lustre File
          System</emphasis>: <link xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:href="https://wiki.whamcloud.com/display/PUB/Using+Pacemaker+with+a+Lustre+File+System"
          >https://wiki.whamcloud.com/display/PUB/Using+Pacemaker+with+a+Lustre+File+System</link></para>
      </listitem>
    </itemizedlist>
  </section>
  <section xml:id="dbdoclet.50438188_92688">
    <title><indexterm>
      <primary>failover</primary>
      <secondary>setup</secondary>
    </indexterm>Preparing a Lustre File System for Failover</title>
    <para>To prepare a Lustre file system to be configured and managed as an HA system by a
      third-party HA application, each storage target (MGT, MDT, OST) must be associated with a
      second node to create a failover pair. This configuration information is then communicated by
      the MGS to a client when the client mounts the file system.</para>
    <para>The per-target configuration is relayed to the MGS at mount time. Some rules related to
      this are:<itemizedlist>
        <listitem>
          <para>When a target is <emphasis role="underline"><emphasis role="italic"
            >initially</emphasis></emphasis> mounted, the MGS reads the configuration
            information from the target (such as mgt vs. ost, failnode, fsname) to configure the
            target into a Lustre file system. If the MGS is reading the initial mount configuration,
            the mounting node becomes that target's “primary” node.</para>
        </listitem>
        <listitem>
          <para>When a target is <emphasis role="underline"><emphasis role="italic"
            >subsequently</emphasis></emphasis> mounted, the MGS reads the current configuration
            from the target and, as needed, will reconfigure the MGS database target
            information.</para>
        </listitem>
      </itemizedlist></para>
    <para>When the target is formatted using the <literal>mkfs.lustre</literal> command, the failover
      service node(s) for the target are designated using the <literal>--servicenode</literal>
      option. In the example below, an OST with index <literal>0</literal> in the file system
      <literal>testfs</literal> is formatted with two service nodes designated to serve as a
      failover
      pair:<screen>mkfs.lustre --reformat --ost --fsname testfs --mgsnode=192.168.10.1@o2ib \
      --index=0 --servicenode=192.168.10.7@o2ib \
      --servicenode=192.168.10.8@o2ib \
      /dev/sdb</screen></para>
    <para>More than two potential service nodes can be designated for a target. The target can then
      be mounted on any of the designated service nodes.</para>
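    <para>If an additional service node must be designated after the target has been
      formatted, the same option can be applied to the unmounted target with
      <literal>tunefs.lustre</literal>. This is a sketch; the NID and device name below are
      hypothetical:</para>
    <screen># Add a third service node to an existing, unmounted OST
tunefs.lustre --servicenode=192.168.10.9@o2ib /dev/sdb</screen>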
    <para>When HA is configured on a storage target, the Lustre software enables multi-mount
      protection (MMP) on that storage target. MMP prevents multiple nodes from simultaneously
      mounting and thus corrupting the data on the target. For more about MMP, see <xref
      xmlns:xlink="http://www.w3.org/1999/xlink" linkend="managingfailover"/>.</para>
    <para>If the MGT has been formatted with multiple service nodes designated, this information
      must be conveyed to the Lustre client in the mount command used to mount the file system. In
      the example below, NIDs for two MGSs that have been designated as service nodes for the MGT
      are specified in the mount command executed on the
      client:<screen>mount -t lustre 10.10.120.1@tcp1:10.10.120.2@tcp1:/testfs /lustre/testfs</screen></para>
    <para>When a client mounts the file system, the MGS provides configuration information to the
      client for the MDT(s) and OST(s) in the file system, along with the NIDs for all service nodes
      associated with each target and the service node on which the target is mounted. Later, when
      the client attempts to access data on a target, it tries the NID of each specified service
      node until it connects to the target.</para>
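    <para>On a client, the connection state and the list of NIDs known for each target can be
      inspected through the import parameter, for example (the file system and target names
      below are illustrative, and output varies by release):</para>
    <screen># Show the import state, including failover NIDs, for an OST as seen by this client
lctl get_param osc.testfs-OST0000-osc-*.import</screen>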
    <para>Previous to Lustre software release 2.0, the <literal>--failnode</literal> option to
      <literal>mkfs.lustre</literal> was used to designate a failover service node for the primary
      server of a target. When the <literal>--failnode</literal> option is used, certain
      restrictions apply:<itemizedlist>
        <listitem>
          <para>The target must be initially mounted on the primary service node, not the failover
            node designated by the <literal>--failnode</literal> option.</para>
        </listitem>
        <listitem>
          <para>If the <literal>tunefs.lustre --writeconf</literal> option is used to erase and
            regenerate the configuration log for the file system, a target cannot be initially
            mounted on a designated failnode.</para>
        </listitem>
        <listitem>
          <para>If a <literal>--failnode</literal> option is added to a target to designate a
            failover server for the target, the target must be re-mounted on the primary node before
            the <literal>--failnode</literal> option takes effect.</para>
        </listitem>
      </itemizedlist></para>
  </section>
  <section xml:id="section_tnq_kbr_xl">
    <title>Administering Failover in a Lustre File System</title>
    <para>For additional information about administering failover features in a Lustre file
      system, see:<itemizedlist>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438194_57420"/></para>
        </listitem>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438194_41817"/></para>
        </listitem>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="lustremaint.ChangeAddrFailoverNode"/></para>
        </listitem>
        <listitem>
          <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="dbdoclet.50438219_75432"/></para>
        </listitem>
      </itemizedlist></para>
  </section>
</chapter>
<!--
  vim:expandtab:shiftwidth=2:tabstop=8:
-->