From eca75f1118e3f5d6a105eab977fc3f5f4028e585 Mon Sep 17 00:00:00 2001 From: Richard Henwood Date: Wed, 18 May 2011 12:24:04 -0500 Subject: [PATCH] FIX: xrefs and tidying --- LustreRecovery.xml | 238 +++++++++++++++-------------------------------------- 1 file changed, 67 insertions(+), 171 deletions(-) diff --git a/LustreRecovery.xml b/LustreRecovery.xml index b30a519..24d4d08 100644 --- a/LustreRecovery.xml +++ b/LustreRecovery.xml @@ -1,86 +1,59 @@ - + - Lustre Recovery + Lustre Recovery This chapter describes how recovery is implemented in Lustre and includes the following sections: - Recovery Overview + + - + + - Metadata Replay + + - + + - Reply Reconstruction - - - - - - Version-based Recovery - - - - - - Commit on Share - - - + + - - - - - - Note -Usually the Lustre recovery process is transparent. For information about troubleshooting recovery when something goes wrong, see Chapter 27: Lustre Recovery. - - - - -
- <anchor xml:id="dbdoclet.50438268_pgfId-1289094" xreflabel=""/> -
- 30.1 <anchor xml:id="dbdoclet.50438268_58047" xreflabel=""/>Recovery Overview + +Usually the Lustre recovery process is transparent. For information about troubleshooting recovery when something goes wrong, see . + +
+ 30.1 Recovery Overview Lustre's recovery feature is responsible for dealing with node or network failure and returning the cluster to a consistent, performant state. Because Lustre allows servers to perform asynchronous update operations to the on-disk file system (i.e., the server can reply without waiting for the update to synchronously commit to disk), the clients may have state in memory that is newer than what the server can recover from disk after a crash. A handful of different types of failures can cause recovery to occur: Client (compute node) failure - - - + MDS failure (and failover) - - - + OST failure (and failover) - - - + Transient network partition - - - + Currently, all Lustre failure and recovery operations are based on the concept of connection failure; all imports or exports associated with a given connection are considered to fail if any of them fail. - For information on Lustre recovery, see Metadata Replay. For information on recovering from a corrupt file system, see Commit on Share. For information on resolving orphaned objects, a common issue after recovery, see Working with Orphaned Objects. + For information on Lustre recovery, see . For information on recovering from a corrupt file system, see . For information on resolving orphaned objects, a common issue after recovery, see (Working with Orphaned Objects).
<anchor xml:id="dbdoclet.50438268_pgfId-1287395" xreflabel=""/>30.1.1 <anchor xml:id="dbdoclet.50438268_96839" xreflabel=""/>Client <anchor xml:id="dbdoclet.50438268_marker-1287394" xreflabel=""/>Failure - Recovery from client failure in Lustre is based on lock revocation and other resources, so surviving clients can continue their work uninterrupted. If a client fails to timely respond to a blocking lock callback from the Distributed Lock Manager (DLM) or fails to communicate with the server in a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client's locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition, as well as an actual client node system failure. Network Partition describes this case in more detail. + Recovery from client failure in Lustre is based on lock revocation and other resources, so surviving clients can continue their work uninterrupted. If a client fails to timely respond to a blocking lock callback from the Distributed Lock Manager (DLM) or fails to communicate with the server in a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client's locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition, as well as an actual client node system failure. describes this case in more detail.
<anchor xml:id="dbdoclet.50438268_pgfId-1290714" xreflabel=""/>30.1.2 <anchor xml:id="dbdoclet.50438268_43796" xreflabel=""/>Client <anchor xml:id="dbdoclet.50438268_marker-1292164" xreflabel=""/>Eviction @@ -88,65 +61,44 @@ Reasons why a client might be evicted: Failure to respond to a server request in a timely manner - - Blocking lock callback (i.e., client holds lock that another client/server wants) - - - + Lock completion callback (i.e., client is granted lock previously held by another client) - - - + Lock glimpse callback (i.e., client is asked for size of object by another client) - - - + Server shutdown notification (with simplified interoperability) - - - + Failure to ping the server in a timely manner, unless the server is receiving no RPC traffic at all (which may indicate a network partition). - - - +
<anchor xml:id="dbdoclet.50438268_pgfId-1287398" xreflabel=""/>30.1.3 <anchor xml:id="dbdoclet.50438268_37508" xreflabel=""/>MDS Failure <anchor xml:id="dbdoclet.50438268_marker-1287397" xreflabel=""/>(Failover) Highly-available (HA) Lustre operation requires that the metadata server have a peer configured for failover, including the use of a shared storage device for the MDT backing file system. The actual mechanism for detecting peer failure, power off (STONITH) of the failed peer (to prevent it from continuing to modify the shared disk), and takeover of the Lustre MDS service on the backup node depends on external HA software such as Heartbeat. It is also possible to have MDS recovery with a single MDS node. In this case, recovery will take as long as is needed for the single MDS to be restarted. When clients detect an MDS failure (either by timeouts of in-flight requests or idle-time ping messages), they connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk. - The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the --failnode= option to mkfs.lustre or tunefs.lustre), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see Metadata Replay. + The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the --failnode= option to mkfs.lustre or tunefs.lustre), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see . Transaction numbers are used to ensure that operations are replayed in the order they were originally performed, so that they are guaranteed to succeed and present the same filesystem state as before the failure. In addition, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery operations are permitted. In addition, only clients that were connected at the time of MDS failure are permitted to reconnect during the recovery window, to avoid the introduction of state changes that might conflict with what is being replayed by previously-connected clients.
<anchor xml:id="dbdoclet.50438268_pgfId-1289241" xreflabel=""/>30.1.4 <anchor xml:id="dbdoclet.50438268_28881" xreflabel=""/>OST <anchor xml:id="dbdoclet.50438268_marker-1289240" xreflabel=""/>Failure (Failover) When an OST fails or has communication problems with the client, the default action is that the corresponding OSC enters recovery, and I/O requests going to that OST are blocked waiting for OST recovery or failover. It is possible to administratively mark the OSC as inactive on the client, in which case file operations that involve the failed OST will return an IO error (-EIO). Otherwise, the application waits until the OST has recovered or the client process is interrupted (e.g. ,with CTRL-C). - The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see Working with Orphaned Objects. + The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see (Working with Orphaned Objects). While the OSC to OST operation recovery protocol is the same as that between the MDC and MDT using the Metadata Replay protocol, typically the OST commits bulk write operations to disk synchronously and each reply indicates that the request is already committed and the data does not need to be saved for recovery. In some cases, the OST replies to the client before the operation is committed to disk (e.g. truncate, destroy, setattr, and I/O operations in very new versions of Lustre), and normal replay and resend handling is done, including resending of the bulk writes. In this case, the client keeps a copy of the data available in memory until the server indicates that the write has committed to disk. To force an OST recovery, unmount the OST and then mount it again. If the OST was connected to clients before it failed, then a recovery process starts after the remount, enabling clients to reconnect to the OST and replay transactions in their queue. When the OST is in recovery mode, all new client connections are refused until the recovery finishes. The recovery is complete when either all previously-connected clients reconnect and their transactions are replayed or a client connection attempt times out. If a connection attempt times out, then all clients waiting to reconnect (and their transactions) are lost. - - - - - - Note -If you know an OST will not recover a previously-connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:lctl --device <OST device number> abort_recovery To determine an OST’s device number and device name, run the lctl dl command. Sample lctl dl command output is shown below:7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159 In this example, 7 is the OST device number. The device name is ddn_data-OST0009. In most instances, the device name can be used in place of the device number. 
If you know an OST will not recover a previously-connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:
lctl --device <OST device number> abort_recovery
To determine an OST's device number and device name, run the lctl dl command. Sample lctl dl command output is shown below:
7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159
In this example, 7 is the OST device number. The device name is ddn_data-OST0009. In most instances, the device name can be used in place of the device number.
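Before aborting recovery, it can be useful to check how far recovery has progressed on the OST. The following is only a sketch; testfs-OST0001 and device number 7 are illustrative values that should be replaced with the names reported by lctl dl on your own system:
lctl get_param obdfilter.testfs-OST0001.recovery_status
lctl --device 7 abort_recovery
The recovery_status output includes fields such as status (RECOVERING or COMPLETE), recovery_start, time_remaining, connected_clients, and completed_clients, which indicate whether the remaining clients are likely to reconnect before the recovery window expires.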
<anchor xml:id="dbdoclet.50438268_pgfId-1289389" xreflabel=""/>30.1.5 <anchor xml:id="dbdoclet.50438268_96876" xreflabel=""/>Network <anchor xml:id="dbdoclet.50438268_marker-1289388" xreflabel=""/>Partition @@ -158,33 +110,25 @@ In the case of failed recovery, a client is evicted by the server and must reconnect after having flushed its saved state related to that server, as described in Client Eviction, above. Failed recovery might occur for a number of reasons, including: Failure of recovery - - Recovery fails if the operations of one client directly depend on the operations of another client that failed to participate in recovery. Otherwise, Version Based Recovery (VBR) allows recovery to proceed for all of the connected clients, and only missing clients are evicted. - - - + Manual abort of recovery - - - + Manual eviction by the administrator - - - +
-
- 30.2 <anchor xml:id="dbdoclet.50438268_65824" xreflabel=""/>Metadata <anchor xml:id="dbdoclet.50438268_marker-1292175" xreflabel=""/>Replay +
+ 30.2 Metadata <anchor xml:id="dbdoclet.50438268_marker-1292175" xreflabel=""/>Replay Highly available Lustre operation requires that the MDS have a peer configured for failover, including the use of a shared storage device for the MDS backing file system. When a client detects an MDS failure, it connects to the new MDS and uses the metadata replay protocol to replay its requests. Metadata replay ensures that the failover MDS re-accumulates state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.
@@ -235,8 +179,8 @@ Once all of the previously-shared state has been recovered on the server (the target file system is up-to-date with client cache and the server has recreated locks representing the locks held by the client), the client can resend any requests that did not receive an earlier reply. This processing is done like normal request processing, and, in some cases, the server may do reply reconstruction.
-
- 30.3 <anchor xml:id="dbdoclet.50438268_23736" xreflabel=""/>Reply <anchor xml:id="dbdoclet.50438268_marker-1292176" xreflabel=""/>Reconstruction +
+ 30.3 Reply <anchor xml:id="dbdoclet.50438268_marker-1292176" xreflabel=""/>Reconstruction When a reply is dropped, the MDS needs to be able to reconstruct the reply when the original request is re-sent. This must be done without repeating any non-idempotent operations, while preserving the integrity of the locking system. In the event of MDS failover, the information used to reconstruct the reply must be serialized on the disk in transactions that are joined or nested with those operating on the disk.
<anchor xml:id="dbdoclet.50438268_pgfId-1289741" xreflabel=""/>30.3.1 Required State @@ -244,21 +188,15 @@ XID of the request - - - + Resulting transno (if any) - - - + Result code (req->rq_status) - - - + For open requests, the "disposition" of the open must also be stored.
@@ -268,21 +206,15 @@ File handle - - - + Lock handle - - - + mds_body with information about the file created (for O_CREAT) - - - + The disposition, status and request data (re-sent intact by the client) are sufficient to determine which type of lock handle was granted, whether an open file handle was created, and which resource should be described in the mds_body.
@@ -299,66 +231,49 @@
-
- 30.4 <anchor xml:id="dbdoclet.50438268_80068" xreflabel=""/>Version-based <anchor xml:id="dbdoclet.50438268_marker-1288580" xreflabel=""/>Recovery +
+ 30.4 Version-based <anchor xml:id="dbdoclet.50438268_marker-1288580" xreflabel=""/>Recovery
The Version-based Recovery (VBR) feature improves Lustre reliability in cases where client requests (RPCs) fail to replay during recovery. There are two scenarios under which client RPCs are not replayed: (1) Non-functioning or isolated clients do not reconnect, and they cannot replay their RPCs, causing a gap in the replay sequence. These clients get errors and are evicted. (2) Functioning clients connect, but they cannot replay some or all of their RPCs that occurred after the gap caused by the non-functioning/isolated clients. These clients get errors (caused by the failed clients). With VBR, these requests have a better chance to replay because the "gaps" are only related to specific files that the missing client(s) changed.
In pre-VBR versions of Lustre, if the MGS or an OST went down and then recovered, a recovery process was triggered in which clients attempted to replay their requests. Clients were only allowed to replay RPCs in serial order. If a particular client could not replay its requests, those requests were lost, as were the requests of clients later in the sequence. The "downstream" clients never got to replay their requests because of the wait on the earlier client's RPCs. Eventually, the recovery period would time out (so the component could accept new requests), leaving some number of clients evicted and their requests and data lost.
With VBR, the recovery mechanism does not result in the loss of clients or their data, because changes in inode versions are tracked, and more clients are able to reintegrate into the cluster. With VBR, inode tracking looks like this:
Each inode (usually there are two inodes, a parent and a child) stores a version, that is, the number of the last transaction (transno) in which the inode was changed.
+
When an inode is about to be changed, a pre-operation version of the inode is saved in the client's data.
The client keeps the pre-operation inode version and the post-operation version (transaction number) for replay, and sends them in the event of a server failure.
+
If the pre-operation version matches, then the request is replayed. The post-operation version is assigned on all inodes modified in the request.
+
-Note
-An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a "rename" operation, four different inodes can be modified.
+ An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a "rename" operation, four different inodes can be modified.
During normal operation, the server:
Updates the versions of all inodes involved in a given operation
+
Returns the old and new inode versions to the client with the reply
+
When the recovery mechanism is underway, VBR follows these steps:
- 1. VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is gap in transactions due to a missed client.
- 2. The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.
- 3. When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.
+
+ VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is a gap in transactions due to a missed client.
+
+ The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.
+
+ When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.
+
VBR recovery is fully transparent to users. It may lead to slightly longer recovery times if the cluster loses several clients during server recovery.
<anchor xml:id="dbdoclet.50438268_pgfId-1287803" xreflabel=""/>30.4.1 <anchor xml:id="dbdoclet.50438268_marker-1288583" xreflabel=""/>VBR Messages @@ -372,22 +287,13 @@
<anchor xml:id="dbdoclet.50438268_pgfId-1287839" xreflabel=""/>30.4.2 Tips for <anchor xml:id="dbdoclet.50438268_marker-1288584" xreflabel=""/>Using VBR - VBR will be successful for clients which do not share data with other client. Therefore, the strategy for reliable use of VBR is to store a client’s data in its own directory, where possible. VBR can recover these clients, even if other clients are lost. + VBR will be successful for clients which do not share data with other client. Therefore, the strategy for reliable use of VBR is to store a client'™s data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.
-
- 30.5 <anchor xml:id="dbdoclet.50438268_83826" xreflabel=""/>Commit on <anchor xml:id="dbdoclet.50438268_marker-1292182" xreflabel=""/>Share +
+ 30.5 Commit on <anchor xml:id="dbdoclet.50438268_marker-1292182" xreflabel=""/>Share
The commit-on-share (COS) feature makes Lustre recovery more reliable by preventing missing clients from causing cascading evictions of other clients. With COS enabled, if some Lustre clients miss the recovery window after a reboot or a server failure, the remaining clients are not evicted.
-Note
-The commit-on-share feature is enabled, by default.
+ The commit-on-share feature is enabled by default.
<anchor xml:id="dbdoclet.50438268_pgfId-1292075" xreflabel=""/>30.5.1 Working with Commit on Share To illustrate how COS works, let's first look at the old recovery scenario. After a service restart, the MDS would boot and enter recovery mode. Clients began reconnecting and replaying their uncommitted transactions. Clients could replay transactions independently as long as their transactions did not depend on each other (one client's transactions did not depend on a different client's transactions). The MDS is able to determine whether one transaction is dependent on another transaction via the Version-based Recovery feature. @@ -403,17 +309,7 @@ To disable or enable COS when the file system is running, use: lctl set_param mdt.*.commit_on_sharing=0/1 - - - - - - Note -Enabling COS may cause the MDS to do a large number of synchronous disk operations, hurting performance. Placing the ldiskfs journal on a low-latency external device may improve file system performance. - - - - + Enabling COS may cause the MDS to do a large number of synchronous disk operations, hurting performance. Placing the ldiskfs journal on a low-latency external device may improve file system performance.
-
-- 1.8.3.1