Understanding Failover in a
Lustre File SystemThis chapter describes failover in a Lustre file system. It
includes:failoverWhat is Failover?In a high-availability (HA) system, unscheduled downtime is minimized
by using redundant hardware and software components and software components
that automate recovery when a failure occurs. If a failure condition
occurs, such as the loss of a server or storage device or a network or
software fault, the system's services continue with minimal interruption.
Generally, availability is specified as the percentage of time the system
is required to be available.Availability is accomplished by replicating hardware and/or software
so that when a primary server fails or is unavailable, a standby server can
be switched into its place to run applications and associated resources.
This process, called
failover, is automatic in an HA system
and, in most cases, completely application-transparent.A failover hardware setup requires a pair of servers with a shared
resource (typically a physical storage device, which may be based on SAN,
NAS, hardware RAID, SCSI or Fibre Channel (FC) technology). The method of
sharing storage should be essentially transparent at the device level; the
same physical logical unit number (LUN) should be visible from both
servers. To ensure high availability at the physical storage level, we
encourage the use of RAID arrays to protect against drive-level
failures.The Lustre software does not provide redundancy for data; it
depends exclusively on redundancy of backing storage devices. The backing
OST storage should be RAID 5 or, preferably, RAID 6 storage. MDT storage
should be RAID 1 or RAID 10.failovercapabilitiesFailover CapabilitiesTo establish a highly-available Lustre file system, power
management software or hardware and high availability (HA) software are
used to provide the following failover capabilities:Resource fencing- Protects physical
storage from simultaneous access by two nodes.Resource management- Starts and
stops the Lustre resources as a part of failover, maintains the
cluster state, and carries out other resource management
tasks.Health monitoring- Verifies the
availability of hardware and network resources and responds to health
indications provided by the Lustre software.These capabilities can be provided by a variety of software and/or
hardware solutions. For more information about using power management
software or hardware and high availability (HA) software with a Lustre
file system, see
.HA software is responsible for detecting failure of the primary
Lustre server node and controlling the failover.The Lustre software works
with any HA software that includes resource (I/O) fencing. For proper
resource fencing, the HA software must be able to completely power off
the failed server or disconnect it from the shared storage device. If two
active nodes have access to the same storage device, data may be severely
corrupted.failoverconfigurationTypes of Failover ConfigurationsNodes in a cluster can be configured for failover in several ways.
They are often configured in pairs (for example, two OSTs attached to a
shared storage device), but other failover configurations are also
possible. Failover configurations include:Active/passive pair - In this
configuration, the active node provides resources and serves data,
while the passive node is usually standing by idle. If the active
node fails, the passive node takes over and becomes active.Active/active pair - In this
configuration, both nodes are active, each providing a subset of
resources. In case of a failure, the second node takes over resources
from the failed node.If there is a single MDT in a filesystem, two MDSes can be
configured as an active/passive pair, while pairs of OSSes can be
deployed in an active/active configuration that improves OST availability
without extra overhead. Often the standby MDS is the active MDS for
another Lustre file system or the MGS, so no nodes are idle in the
cluster. If there are multiple MDTs in a filesystem, active-active
failover configurations are available for MDSs that serve MDTs on shared
storage.failoverand LustreFailover Functionality in a Lustre File SystemThe failover functionality provided by the Lustre software can be
used for the following failover scenario. When a client attempts to do I/O
to a failed Lustre target, it continues to try until it receives an answer
from any of the configured failover nodes for the Lustre target. A
user-space application does not detect anything unusual, except that the
I/O may take longer to complete.Failover in a Lustre file system requires that two nodes be
configured as a failover pair, which must share one or more storage
devices. A Lustre file system can be configured to provide MDT or OST
failover.For MDT failover, two MDSs can be configured to serve the same
MDT. Only one MDS node can serve any MDT at one time.
By placing two or more MDT devices on storage shared by two MDSs,
one MDS can fail and the remaining MDS can begin serving the unserved
MDT. This is described as an active/active failover pair.For OST failover, multiple OSS nodes can be configured to be able
to serve the same OST. However, only one OSS node can serve the OST at
a time. An OST can be moved between OSS nodes that have access to the
same storage device using
umount/mount commands.The
--servicenode option is used to set up nodes in a Lustre
file system for failover at creation time (using
mkfs.lustre) or later when the Lustre file system is
active (using
tunefs.lustre). For explanations of these utilities, see
and
.Failover capability in a Lustre file system can be used to upgrade
the Lustre software between successive minor versions without cluster
downtime. For more information, see
.For information about configuring failover, see
.The Lustre software provides failover functionality only at the
file system level. In a complete failover solution, failover
functionality for system-level components, such as node failure detection
or power control, must be provided by a third-party tool.OST failover functionality does not protect against corruption
caused by a disk failure. If the storage media (i.e., physical disk) used
for an OST fails, it cannot be recovered by functionality provided in the
Lustre software. We strongly recommend that some form of RAID be used for
OSTs. Lustre functionality assumes that the storage is reliable, so it
adds no extra reliability features.failoverMDTMDT Failover Configuration (Active/Passive)Two MDSs are typically configured as an active/passive failover
pair as shown in
. Note that both
nodes must have access to shared storage for the MDT(s) and the MGS. The
primary (active) MDS manages the Lustre system metadata resources. If the
primary MDS fails, the secondary (passive) MDS takes over these resources
and serves the MDTs and the MGS.In an environment with multiple file systems, the MDSs can be
configured in a quasi active/active configuration, with each MDS
managing metadata for a subset of the Lustre file system.failoverMDTMDT Failover Configuration (Active/Active)MDTs can be configured as an active/active failover
configuration. A failover cluster is built from two MDSs as shown in
.failoverOSTOST Failover Configuration (Active/Active)OSTs are usually configured in a load-balanced, active/active
failover configuration. A failover cluster is built from two OSSs as
shown in
.OSSs configured as a failover pair must have shared
disks/RAID.In an active configuration, 50% of the available OSTs are assigned
to one OSS and the remaining OSTs are assigned to the other OSS. Each OSS
serves as the primary node for half the OSTs and as a failover node for
the remaining OSTs.In this mode, if one OSS fails, the other OSS takes over all of the
failed OSTs. The clients attempt to connect to each OSS serving the OST,
until one of them responds. Data on the OST is written synchronously, and
the clients replay transactions that were in progress and uncommitted to
disk before the OST failure.For more information about configuring failover, see
.