UnderstandingFailover.xml

<?xml version='1.0' encoding='utf-8'?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
xml:id="understandingfailover">
  <title xml:id="understandingfailover.title">Understanding Failover in a
  Lustre File System</title>
  <para>This chapter describes failover in a Lustre file system. It
  includes:</para>
  <itemizedlist>
    <listitem>
      <para>
        <xref linkend="dbdoclet.50540653_59957" />
      </para>
    </listitem>
    <listitem>
      <para>
        <xref linkend="dbdoclet.50540653_97944" />
      </para>
    </listitem>
  </itemizedlist>
  <section xml:id="dbdoclet.50540653_59957">
    <title>
    <indexterm>
      <primary>failover</primary>
    </indexterm>What is Failover?</title>
    <para>In a high-availability (HA) system, unscheduled downtime is minimized
    by using redundant hardware and software components, together with
    software that automates recovery when a failure occurs. If a failure
    condition occurs, such as the loss of a server or storage device or a
    network or software fault, the system's services continue with minimal
    interruption. Generally, availability is specified as the percentage of
    time the system is required to be available.</para>
    <para>Availability is accomplished by replicating hardware and/or software
    so that when a primary server fails or is unavailable, a standby server can
    be switched into its place to run applications and associated resources.
    This process, called
    <emphasis role="italic">failover</emphasis>, is automatic in an HA system
    and, in most cases, completely application-transparent.</para>
    <para>A failover hardware setup requires a pair of servers with a shared
    resource (typically a physical storage device, which may be based on SAN,
    NAS, hardware RAID, SCSI or Fibre Channel (FC) technology). The method of
    sharing storage should be essentially transparent at the device level; the
    same physical logical unit number (LUN) should be visible from both
    servers. To ensure high availability at the physical storage level, we
    encourage the use of RAID arrays to protect against drive-level
    failures.</para>
    <note>
      <para>The Lustre software does not provide redundancy for data; it
      depends exclusively on redundancy of backing storage devices. The backing
      OST storage should be RAID 5 or, preferably, RAID 6 storage. MDT storage
      should be RAID 1 or RAID 10.</para>
    </note>
    <section remap="h3">
      <title>
      <indexterm>
        <primary>failover</primary>
        <secondary>capabilities</secondary>
      </indexterm>Failover Capabilities</title>
      <para>To establish a highly available Lustre file system, power
      management software or hardware and high availability (HA) software are
      used to provide the following failover capabilities:</para>
      <itemizedlist>
        <listitem>
          <para>
          <emphasis role="bold">Resource fencing</emphasis> - Protects physical
          storage from simultaneous access by two nodes.</para>
        </listitem>
        <listitem>
          <para>
          <emphasis role="bold">Resource management</emphasis> - Starts and
          stops the Lustre resources as a part of failover, maintains the
          cluster state, and carries out other resource management
          tasks.</para>
        </listitem>
        <listitem>
          <para>
          <emphasis role="bold">Health monitoring</emphasis> - Verifies the
          availability of hardware and network resources and responds to health
          indications provided by the Lustre software.</para>
        </listitem>
      </itemizedlist>
      <para>These capabilities can be provided by a variety of software and/or
      hardware solutions. For more information about using power management
      software or hardware and high availability (HA) software with a Lustre
      file system, see
      <xref linkend="configuringfailover" />.</para>
      <para>HA software is responsible for detecting failure of the primary
      Lustre server node and controlling the failover. The Lustre software
      works with any HA software that includes resource (I/O) fencing. For
      proper resource fencing, the HA software must be able to completely
      power off the failed server or disconnect it from the shared storage
      device. If two active nodes have access to the same storage device, data
      may be severely corrupted.</para>
    </section>
    <section remap="h3">
      <title>
      <indexterm>
        <primary>failover</primary>
        <secondary>configuration</secondary>
      </indexterm>Types of Failover Configurations</title>
      <para>Nodes in a cluster can be configured for failover in several ways.
      They are often configured in pairs (for example, two OSSs attached to a
      shared storage device), but other failover configurations are also
      possible. Failover configurations include:</para>
      <itemizedlist>
        <listitem>
          <para>
          <emphasis role="bold">Active/passive</emphasis> pair - In this
          configuration, the active node provides resources and serves data,
          while the passive node is usually standing by idle. If the active
          node fails, the passive node takes over and becomes active.</para>
        </listitem>
        <listitem>
          <para>
          <emphasis role="bold">Active/active</emphasis> pair - In this
          configuration, both nodes are active, each providing a subset of
          resources. In case of a failure, the surviving node takes over
          resources from the failed node.</para>
        </listitem>
      </itemizedlist>
      <para>In Lustre software releases earlier than release 2.4, MDSs can be
      configured as an active/passive pair, while OSSs can be deployed in an
      active/active configuration that provides redundancy without extra
      overhead. Often the standby MDS is the active MDS for another Lustre
      file system or the MGS, so no nodes are idle in the cluster.</para>
      <para condition="l24">Lustre software release 2.4 introduces metadata
      targets for individual sub-directories. Active/active failover
      configurations are available for MDSs that serve MDTs on shared
      storage.</para>
    </section>
  </section>
  <section xml:id="dbdoclet.50540653_97944">
    <title>
    <indexterm>
      <primary>failover</primary>
      <secondary>and Lustre</secondary>
    </indexterm>Failover Functionality in a Lustre File System</title>
    <para>The failover functionality provided by the Lustre software can be
    used for the following failover scenario. When a client attempts to do I/O
    to a failed Lustre target, it continues to try until it receives an answer
    from any of the configured failover nodes for the Lustre target. A
    user-space application does not detect anything unusual, except that the
    I/O may take longer to complete.</para>
    <para>Failover in a Lustre file system requires that two nodes be
    configured as a failover pair, which must share one or more storage
    devices. A Lustre file system can be configured to provide MDT or OST
    failover.</para>
    <itemizedlist>
      <listitem>
        <para>For MDT failover, two MDSs can be configured to serve the same
        MDT. Only one MDS node can serve an MDT at a time.</para>
        <para condition="l24">Lustre software release 2.4 allows multiple MDTs.
        By placing two or more MDT partitions on storage shared by two MDSs,
        one MDS can fail and the remaining MDS can begin serving the unserved
        MDT. This is described as an active/active failover pair.</para>
      </listitem>
      <listitem>
        <para>For OST failover, multiple OSS nodes can be configured to serve
        the same OST. However, only one OSS node can serve the OST at a time.
        An OST can be moved between OSS nodes that have access to the same
        storage device using
        <literal>umount/mount</literal> commands.</para>
      </listitem>
    </itemizedlist>
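    <para>As an illustrative sketch of moving an OST between two OSS nodes,
    the commands below unmount the target on one node and then mount it on the
    other. The host names (<literal>oss1</literal>,
    <literal>oss2</literal>), the device <literal>/dev/sdb</literal>, and the
    mount point are hypothetical; substitute the values used at your
    site.</para>
    <screen># Hypothetical hosts, device, and mount point.
oss1# umount /mnt/testfs-ost0
oss2# mount -t lustre /dev/sdb /mnt/testfs-ost0</screen>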
    <para>The
    <literal>--servicenode</literal> option is used to set up nodes in a Lustre
    file system for failover at creation time (using
    <literal>mkfs.lustre</literal>) or later when the Lustre file system is
    active (using
    <literal>tunefs.lustre</literal>). For explanations of these utilities, see
    <xref linkend="dbdoclet.50438219_75432" /> and
    <xref linkend="dbdoclet.50438219_39574" />.</para>
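    <para>A minimal sketch of both cases might look like the following. The
    file system name <literal>testfs</literal>, the NIDs, the host names, and
    the device paths are assumptions chosen for illustration only; the other
    options shown are described in the utility references.</para>
    <screen># At creation time: format a combined MGS/MDT that either node may serve.
mds1# mkfs.lustre --fsname=testfs --mgs --mdt --index=0 \
        --servicenode=192.168.10.1@tcp0 \
        --servicenode=192.168.10.2@tcp0 /dev/sda1

# Later: record the service nodes on an existing target.
oss1# tunefs.lustre --servicenode=192.168.10.3@tcp0 \
        --servicenode=192.168.10.4@tcp0 /dev/sdb</screen>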
    <para>Failover capability in a Lustre file system can be used to upgrade
    the Lustre software between successive minor versions without cluster
    downtime. For more information, see
    <xref linkend="upgradinglustre" />.</para>
    <para>For information about configuring failover, see
    <xref linkend="configuringfailover" />.</para>
    <note>
      <para>The Lustre software provides failover functionality only at the
      file system level. In a complete failover solution, failover
      functionality for system-level components, such as node failure detection
      or power control, must be provided by a third-party tool.</para>
    </note>
    <caution>
      <para>OST failover functionality does not protect against corruption
      caused by a disk failure. If the storage media (i.e., physical disk) used
      for an OST fails, it cannot be recovered by functionality provided in the
      Lustre software. We strongly recommend that some form of RAID be used for
      OSTs. Lustre functionality assumes that the storage is reliable, so it
      adds no extra reliability features.</para>
    </caution>
    <section remap="h3">
      <title>
      <indexterm>
        <primary>failover</primary>
        <secondary>MDT</secondary>
      </indexterm>MDT Failover Configuration (Active/Passive)</title>
      <para>Two MDSs are typically configured as an active/passive failover
      pair as shown in
      <xref linkend="understandingfailover.fig.configmdt" />. Note that both
      nodes must have access to shared storage for the MDT(s) and the MGS. The
      primary (active) MDS manages the Lustre system metadata resources. If the
      primary MDS fails, the secondary (passive) MDS takes over these resources
      and serves the MDTs and the MGS.</para>
      <note>
        <para>In an environment with multiple file systems, the MDSs can be
        configured in a quasi active/active configuration, with each MDS
        managing metadata for a subset of the Lustre file systems.</para>
      </note>
      <figure xml:id="understandingfailover.fig.configmdt">
        <title>Lustre failover configuration for an active/passive MDT</title>
        <mediaobject>
          <imageobject>
            <imagedata fileref="./figures/MDT_Failover.png" />
          </imageobject>
          <textobject>
            <phrase>Lustre failover configuration for an MDT</phrase>
          </textobject>
        </mediaobject>
      </figure>
    </section>
    <section xml:id='dbdoclet.mdtactiveactive' condition='l24'>
      <title>
      <indexterm>
        <primary>failover</primary>
        <secondary>MDT</secondary>
      </indexterm>MDT Failover Configuration (Active/Active)</title>
      <para>Multiple MDTs became available with the advent of Lustre software
      release 2.4. MDTs can be set up in an active/active failover
      configuration. A failover cluster is built from two MDSs as shown in
      <xref linkend="understandingfailover.fig.configmdts" />.</para>
      <figure xml:id="understandingfailover.fig.configmdts">
        <title>Lustre failover configuration for active/active MDTs</title>
        <mediaobject>
          <imageobject>
            <imagedata scalefit="1" width="50%"
            fileref="figures/MDTs_Failover.png" />
          </imageobject>
          <textobject>
            <phrase>Lustre failover configuration for two MDTs</phrase>
          </textobject>
        </mediaobject>
      </figure>
    </section>
    <section remap="h3">
      <title>
      <indexterm>
        <primary>failover</primary>
        <secondary>OST</secondary>
      </indexterm>OST Failover Configuration (Active/Active)</title>
      <para>OSTs are usually configured in a load-balanced, active/active
      failover configuration. A failover cluster is built from two OSSs as
      shown in
      <xref linkend="understandingfailover.fig.configost" />.</para>
      <note>
        <para>OSSs configured as a failover pair must have shared
        disks/RAID.</para>
      </note>
      <figure xml:id="understandingfailover.fig.configost">
        <title>Lustre failover configuration for OSTs</title>
        <mediaobject>
          <imageobject>
            <imagedata scalefit="1" width="100%"
            fileref="./figures/OST_Failover.png" />
          </imageobject>
          <textobject>
            <phrase>Lustre failover configuration for OSTs</phrase>
          </textobject>
        </mediaobject>
      </figure>
      <para>In an active/active configuration, 50% of the available OSTs are
      assigned to one OSS and the remaining OSTs are assigned to the other
      OSS. Each OSS serves as the primary node for half the OSTs and as a
      failover node for the remaining OSTs.</para>
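      <para>For illustration, with two OSTs on storage shared by a
      hypothetical OSS pair, the normal layout and a takeover might look like
      the sketch below. The host names, devices, and mount points are
      assumptions. In practice the takeover mount is normally issued by the
      HA software after the failed node has been fenced, not typed by
      hand.</para>
      <screen># Normal operation: each OSS serves half of the OSTs.
oss1# mount -t lustre /dev/sdb /mnt/testfs-ost0
oss2# mount -t lustre /dev/sdc /mnt/testfs-ost1

# After oss1 has failed and been fenced, oss2 also serves its OST.
oss2# mount -t lustre /dev/sdb /mnt/testfs-ost0</screen>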
      <para>In this mode, if one OSS fails, the other OSS takes over all of the
      failed OSTs. The clients attempt to connect to each OSS serving the OST,
      until one of them responds. Data on the OST is written synchronously, and
      the clients replay transactions that were in progress and uncommitted to
      disk before the OST failure.</para>
      <para>For more information about configuring failover, see
      <xref linkend="configuringfailover" />.</para>
    </section>
  </section>
</chapter>