1 <?xml version='1.0' encoding='utf-8'?>
2 <chapter xmlns="http://docbook.org/ns/docbook"
3 xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
4 xml:id="understandingfailover">
5 <title xml:id="understandingfailover.title">Understanding Failover in a
6 Lustre File System</title>
7 <para>This chapter describes failover in a Lustre file system. It
12 <xref linkend="dbdoclet.50540653_59957" />
17 <xref linkend="dbdoclet.50540653_97944" />
21 <section xml:id="dbdoclet.50540653_59957">
24 <primary>failover</primary>
25 </indexterm>What is Failover?</title>
26 <para>In a high-availability (HA) system, unscheduled downtime is minimized
27 by using redundant hardware and software components, together with software
28 that automates recovery when a failure occurs. If a failure condition
29 occurs, such as the loss of a server or storage device or a network or
30 software fault, the system's services continue with minimal interruption.
31 Generally, availability is specified as the percentage of time the system
32 is required to be available.</para>
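<para>As a worked illustration of availability expressed as a percentage, the
sketch below converts an availability target into the downtime it permits per
year. The function name and the use of a 365-day year are assumptions made for
this example only.</para>

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability target."""
    if availability_percent > 100.0 or 0.0 > availability_percent:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# Example targets, chosen only for illustration.
for target in (99.0, 99.9, 99.99):
    print(target, round(allowed_downtime_minutes(target), 1))
```

<para>Under these assumptions, a 99.9% target leaves roughly 525.6 minutes
(under nine hours) of downtime per year, which shows why recovery generally has
to be automated rather than manual.</para>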
33 <para>Availability is accomplished by replicating hardware and/or software
34 so that when a primary server fails or is unavailable, a standby server can
35 be switched into its place to run applications and associated resources.
37 <emphasis role="italic">failover</emphasis>, is automatic in an HA system
38 and, in most cases, completely application-transparent.</para>
39 <para>A failover hardware setup requires a pair of servers with a shared
40 resource (typically a physical storage device, which may be based on SAN,
41 NAS, hardware RAID, SCSI or Fibre Channel (FC) technology). The method of
42 sharing storage should be essentially transparent at the device level; the
43 same physical logical unit number (LUN) should be visible from both
44 servers. To ensure high availability at the physical storage level, we
45 encourage the use of RAID arrays to protect against drive-level
48 <para>The Lustre software does not provide redundancy for data; it
49 depends exclusively on redundancy of backing storage devices. The backing
50 OST storage should be RAID 5 or, preferably, RAID 6 storage. MDT storage
51 should be RAID 1 or RAID 10.</para>
56 <primary>failover</primary>
57 <secondary>capabilities</secondary>
58 </indexterm>Failover Capabilities</title>
59 <para>To establish a highly available Lustre file system, power
60 management software or hardware and high availability (HA) software are
61 used to provide the following failover capabilities:</para>
65 <emphasis role="bold">Resource fencing</emphasis> - Protects physical
66 storage from simultaneous access by two nodes.</para>
70 <emphasis role="bold">Resource management</emphasis> - Starts and
71 stops the Lustre resources as a part of failover, maintains the
72 cluster state, and carries out other resource management
77 <emphasis role="bold">Health monitoring</emphasis> - Verifies the
78 availability of hardware and network resources and responds to health
79 indications provided by the Lustre software.</para>
82 <para>These capabilities can be provided by a variety of software and/or
83 hardware solutions. For more information about using power management
84 software or hardware and high availability (HA) software with a Lustre
86 <xref linkend="configuringfailover" />.</para>
87 <para>HA software is responsible for detecting failure of the primary
88 Lustre server node and controlling the failover. The Lustre software works
89 with any HA software that includes resource (I/O) fencing. For proper
90 resource fencing, the HA software must be able to completely power off
91 the failed server or disconnect it from the shared storage device. If two
92 active nodes have access to the same storage device, data may be severely
98 <primary>failover</primary>
99 <secondary>configuration</secondary>
100 </indexterm>Types of Failover Configurations</title>
101 <para>Nodes in a cluster can be configured for failover in several ways.
102 They are often configured in pairs (for example, two OSSs attached to a
103 shared storage device), but other failover configurations are also
104 possible. Failover configurations include:</para>
108 <emphasis role="bold">Active/passive</emphasis> pair - In this
109 configuration, the active node provides resources and serves data,
110 while the passive node is usually standing by idle. If the active
111 node fails, the passive node takes over and becomes active.</para>
115 <emphasis role="bold">Active/active</emphasis> pair - In this
116 configuration, both nodes are active, each providing a subset of
117 resources. In case of a failure, the second node takes over resources
118 from the failed node.</para>
121 <para>If there is a single MDT in a file system, two MDSs can be
122 configured as an active/passive pair, while pairs of OSSs can be
123 deployed in an active/active configuration that improves OST availability
124 without extra overhead. Often the standby MDS is the active MDS for
125 another Lustre file system or the MGS, so no nodes are idle in the
126 cluster. If there are multiple MDTs in a file system, active/active
127 failover configurations are available for MDSs that serve MDTs on shared
131 <section xml:id="dbdoclet.50540653_97944">
134 <primary>failover</primary>
135 <secondary>and Lustre</secondary>
136 </indexterm>Failover Functionality in a Lustre File System</title>
137 <para>The failover functionality provided by the Lustre software can be
138 used for the following failover scenario. When a client attempts to do I/O
139 to a failed Lustre target, it continues to try until it receives an answer
140 from any of the configured failover nodes for the Lustre target. A
141 user-space application does not detect anything unusual, except that the
142 I/O may take longer to complete.</para>
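<para>The retry behavior described above can be modeled with a short sketch.
This is a simplified model, not the Lustre client implementation: the function
name, the NIDs, and the <literal>max_rounds</literal> bound (added only so the
example terminates) are all assumptions.</para>

```python
from typing import Callable, Optional, Sequence


def first_responding_node(
    failover_nodes: Sequence[str],
    node_is_up: Callable[[str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Cycle through a target's failover nodes until one answers.

    Returns the first node that responds, or None if no node answered
    within max_rounds passes over the list. The bound exists only so
    that this sketch terminates.
    """
    for _ in range(max_rounds):
        for node in failover_nodes:
            if node_is_up(node):
                return node
    return None


# Hypothetical NIDs for the two servers configured for one target.
nodes = ["192.168.10.21@tcp", "192.168.10.22@tcp"]
failed = {"192.168.10.21@tcp"}  # the primary server is down
print(first_responding_node(nodes, lambda nid: nid not in failed))
```

<para>From the application's point of view, the only visible effect in this
model is the extra time spent probing the failed node before the surviving node
answers.</para>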
143 <para>Failover in a Lustre file system requires that two nodes be
144 configured as a failover pair, which must share one or more storage
145 devices. A Lustre file system can be configured to provide MDT or OST
149 <para>For MDT failover, two MDSs can be configured to serve the same
150 MDT. Only one MDS node can serve a given MDT at a time.
151 By placing two or more MDT devices on storage shared by two MDSs,
152 one MDS can fail and the remaining MDS can begin serving the unserved
153 MDT. This is described as an active/active failover pair.</para>
156 <para>For OST failover, multiple OSS nodes can be configured to be able
157 to serve the same OST. However, only one OSS node can serve the OST at
158 a time. An OST can be moved between OSS nodes that have access to the
159 same storage device using
160 <literal>umount/mount</literal> commands.</para>
164 <literal>--servicenode</literal> option is used to set up nodes in a Lustre
165 file system for failover at creation time (using
166 <literal>mkfs.lustre</literal>) or later when the Lustre file system is
168 <literal>tunefs.lustre</literal>). For explanations of these utilities, see
170 <xref linkend="dbdoclet.50438219_75432" /> and
171 <xref linkend="dbdoclet.50438219_39574" />.</para>
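<para>As a hedged illustration of how <literal>--servicenode</literal> is
typically repeated once per node that may serve a target, the sketch below only
assembles an <literal>mkfs.lustre</literal> command line and prints it; it does
not format anything. The file system name, NIDs, index, and device path are
hypothetical values, and the helper function is not part of the Lustre
software.</para>

```python
import shlex
from typing import List, Sequence


def mkfs_ost_command(
    fsname: str,
    index: int,
    mgs_nid: str,
    service_nids: Sequence[str],
    device: str,
) -> List[str]:
    """Build (but do not run) an mkfs.lustre command line for one OST."""
    args = [
        "mkfs.lustre",
        f"--fsname={fsname}",
        "--ost",
        f"--index={index}",
        f"--mgsnode={mgs_nid}",
    ]
    # One --servicenode option per node that may serve this target.
    args.extend(f"--servicenode={nid}" for nid in service_nids)
    args.append(device)
    return args


# All values below are placeholders chosen for illustration.
command = mkfs_ost_command(
    fsname="testfs",
    index=0,
    mgs_nid="192.168.10.1@tcp",
    service_nids=["192.168.10.21@tcp", "192.168.10.22@tcp"],
    device="/dev/sdb",
)
print(shlex.join(command))
```

<para>Printing the command instead of executing it keeps the example safe to
run on any machine; the printed line would still need to be checked against the
utility documentation before being used on a real device.</para>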
172 <para>Failover capability in a Lustre file system can be used to upgrade
173 the Lustre software between successive minor versions without cluster
174 downtime. For more information, see
175 <xref linkend="upgradinglustre" />.</para>
176 <para>For information about configuring failover, see
177 <xref linkend="configuringfailover" />.</para>
179 <para>The Lustre software provides failover functionality only at the
180 file system level. In a complete failover solution, failover
181 functionality for system-level components, such as node failure detection
182 or power control, must be provided by a third-party tool.</para>
185 <para>OST failover functionality does not protect against corruption
186 caused by a disk failure. If the storage media (i.e., physical disk) used
187 for an OST fails, it cannot be recovered by functionality provided in the
188 Lustre software. We strongly recommend that some form of RAID be used for
189 OSTs. Lustre functionality assumes that the storage is reliable, so it
190 adds no extra reliability features.</para>
195 <primary>failover</primary>
196 <secondary>MDT</secondary>
197 </indexterm>MDT Failover Configuration (Active/Passive)</title>
198 <para>Two MDSs are typically configured as an active/passive failover
200 <xref linkend="understandingfailover.fig.configmdt" />. Note that both
201 nodes must have access to shared storage for the MDT(s) and the MGS. The
202 primary (active) MDS manages the Lustre system metadata resources. If the
203 primary MDS fails, the secondary (passive) MDS takes over these resources
204 and serves the MDTs and the MGS.</para>
206 <para>In an environment with multiple file systems, the MDSs can be
207 configured in a quasi active/active configuration, with each MDS
208 managing metadata for a subset of the Lustre file systems.</para>
210 <figure xml:id="understandingfailover.fig.configmdt">
211 <title>Lustre failover configuration for an active/passive MDT</title>
214 <imagedata fileref="./figures/MDT_Failover.png" />
217 <phrase>Lustre failover configuration for an MDT</phrase>
222 <section xml:id='dbdoclet.mdtactiveactive'>
225 <primary>failover</primary>
226 <secondary>MDT</secondary>
227 </indexterm>MDT Failover Configuration (Active/Active)</title>
228 <para>MDTs can be configured in an active/active failover
229 configuration. A failover cluster is built from two MDSs as shown in
230 <xref linkend="understandingfailover.fig.configmdts" />.</para>
231 <figure xml:id="understandingfailover.fig.configmdts">
232 <title>Lustre failover configuration for active/active MDTs</title>
235 <imagedata scalefit="1" width="50%"
236 fileref="figures/MDTs_Failover.png" />
239 <phrase>Lustre failover configuration for two MDTs</phrase>
247 <primary>failover</primary>
248 <secondary>OST</secondary>
249 </indexterm>OST Failover Configuration (Active/Active)</title>
250 <para>OSTs are usually configured in a load-balanced, active/active
251 failover configuration. A failover cluster is built from two OSSs as
253 <xref linkend="understandingfailover.fig.configost" />.</para>
255 <para>OSSs configured as a failover pair must have shared
258 <figure xml:id="understandingfailover.fig.configost">
259 <title>Lustre failover configuration for OSTs</title>
262 <imagedata scalefit="1" width="100%"
263 fileref="./figures/OST_Failover.png" />
266 <phrase>Lustre failover configuration for OSTs</phrase>
270 <para>In an active/active configuration, 50% of the available OSTs are assigned
271 to one OSS and the remaining OSTs are assigned to the other OSS. Each OSS
272 serves as the primary node for half the OSTs and as a failover node for
273 the remaining OSTs.</para>
274 <para>In this mode, if one OSS fails, the other OSS takes over all of the
275 failed OSTs. The clients attempt to connect to each OSS serving the OST,
276 until one of them responds. Data on the OST is written synchronously, and
277 the clients replay transactions that were in progress and uncommitted to
278 disk before the OST failure.</para>
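<para>The split of OSTs between two OSSs, and the takeover after one OSS fails,
can be sketched as a small model. The OSS and OST names are hypothetical, and
the alternating assignment is just one simple way to give each OSS about half
of the targets; it is not a policy imposed by the Lustre software.</para>

```python
from typing import Dict, List, Sequence, Tuple


def assign_primaries(
    osts: Sequence[str], oss_pair: Tuple[str, str]
) -> Dict[str, List[str]]:
    """Alternate OSTs between the two OSSs so each is primary for about half."""
    served: Dict[str, List[str]] = {oss: [] for oss in oss_pair}
    for position, ost in enumerate(osts):
        served[oss_pair[position % 2]].append(ost)
    return served


def fail_over(
    served: Dict[str, List[str]], failed_oss: str
) -> Dict[str, List[str]]:
    """Move every OST of the failed OSS to the surviving OSS of the pair."""
    survivor = next(oss for oss in served if oss != failed_oss)
    return {failed_oss: [], survivor: served[survivor] + served[failed_oss]}


# Hypothetical names for one failover pair and its four OSTs.
served = assign_primaries(
    ["OST0000", "OST0001", "OST0002", "OST0003"], ("oss1", "oss2")
)
print(served)
print(fail_over(served, "oss1"))
```

<para>In normal operation each OSS in this model carries half of the load;
after a failure the survivor serves all four OSTs, so a failover pair should be
sized with that doubled load in mind.</para>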
279 <para>For more information about configuring failover, see
280 <xref linkend="configuringfailover" />.</para>