From d09c3f9df81d4ac3d795ee9741aa5d6376bf5f89 Mon Sep 17 00:00:00 2001 From: James Nunez Date: Thu, 30 Jan 2014 16:39:44 -0700 Subject: [PATCH] LUDOC-155 lfsck: LFSCK Phase II Additions Added information about MDT-OST file layout consistency checking, LFSCK phase II, to the manual section about LFSCK. Also cleaned up a few typos, table column issues and minor corrections (including LUDOC-182). Signed-off-by: James Nunez Change-Id: I2333b383168c30e4ee54b6f2bb7f300df45c0f28 Reviewed-on: http://review.whamcloud.com/9068 Tested-by: Jenkins Reviewed-by: Fan Yong Reviewed-by: Richard Henwood --- TroubleShootingRecovery.xml | 200 +++++++++++++++++++++++++++++++++++--------- 1 file changed, 161 insertions(+), 39 deletions(-) diff --git a/TroubleShootingRecovery.xml b/TroubleShootingRecovery.xml index 2c2bb57..319fcfb 100644 --- a/TroubleShootingRecovery.xml +++ b/TroubleShootingRecovery.xml @@ -261,6 +261,10 @@ lfsck: fixed 0 errors restoring from a file-level MDT backup (), or in case the OI table is otherwise corrupted. Later phases of LFSCK will add further checks to the Lustre distributed file system state. + In Lustre software release 2.4, LFSCK can verify and repair FID-in-Dirent and LinkEA consistency. + + In Lustre software release 2.6, LFSCK can verify and repair MDT-OST file layout inconsistency. File layout inconsistencies between MDT-objects and OST-objects that are checked and corrected include dangling references, unreferenced OST-objects, mismatched references and multiple references. + Control and monitoring of LFSCK is through LFSCK and the /proc file system interfaces. LFSCK supports three types of interface: switch interfaces, status interfaces and adjustments interfaces. These interfaces are detailed below. @@ -270,30 +274,32 @@ lfsck: fixed 0 errors Manually Starting LFSCK
Synopsis - lctl lfsck_start -M | --device MDT_device \ + lctl lfsck_start -M | --device [MDT,OST]_device \ [-e | --error error_handle] \ [-h | --help] \ - [-m | --method iteration_method] \ [-n | --dryrun switch] \ [-r | --reset] \ [-s | --speed speed_limit] \ - [-t | --type lfsck_type[,lfsck_type...]] + [-A | --all] \ + [-t | --type lfsck_type[,lfsck_type...]] \ + [-w | --windows win_size] \ + [-o | --orphan]
Description - This is command is used by LFSCK after the MDT is mounted. + This command is used by LFSCK after the MDT is mounted.
Options The various lfsck_start options are listed and described below. For a complete list of available options, type lctl lfsck_start -h. - + - + Option @@ -307,7 +313,7 @@ lfsck: fixed 0 errors -M | --device - The MDT device to start LFSCK on. This will be a requirement when multiple MDTs are supported. + The MDT or OST device to start LFSCK/scrub on. @@ -323,39 +329,39 @@ lfsck: fixed 0 errors -h | --help - Operating help. + Operating help information. - -m | --method + -n | --dryrun - Method for scanning the MDT device. Currently only otable (object table based) iteration is supported. If it is not specified, the saved value (when resuming from checkpoint) will be used if present. + Perform a trial without making any changes. off (default) or on. - -n | --dryrun + -r | --reset - Perform a trial without making any changes. + Reset the start position for the object iteration to the beginning for the specified MDT. By default the iterator will resume scanning from the last checkpoint (saved periodically by LFSCK) provided it is available. - -r | --reset + -s | --speed - Reset the start position for the object iteration to the beginning for the specified MDT. By default the iterator will resume scanning from the last checkpoint (saved periodically by LFSCK) provided it is available. + Set the upper speed limit of LFSCK processing in objects per second. If it is not specified, the saved value (when resuming from checkpoint) or default value of 0 (0 = run as fast as possible) is used. Speed can be adjusted while LFSCK is running with the adjustment interface. - -s | --speed + -A | --all - Set the upper speed limit of LFSCK processing in objects per second. If it is not specified, the saved value (when resuming from checkpoint) or default value of 0 (0 = run as fast as possible) is used. Speed can be adjusted while LFSCK is running with the adjustment interface. + Start LFSCK on all devices via a single lctl command. 
It is not only used for layout consistency check/repair, but also for other LFSCK components, such as LFSCK for namespace consistency (LFSCK 1.5) and for DNE consistency check/repair in the future. @@ -364,8 +370,25 @@ lfsck: fixed 0 errors The type of checking/repairing that should be performed. The new LFSCK framework provides a single interface for a variety of system consistency checking/repairing operations including: -Without a specified option: check and repair object index (OI Scrub.) +Without a specified option, the LFSCK component(s) which ran last time and did not finish or the component(s) corresponding to some known system inconsistency, will be started. Anytime the LFSCK is triggered, the OI scrub will run automatically, so there is no need to specify OI_scrub. namespace: check and repair FID-in-Dirent and LinkEA consistency. +layout: check and repair MDT-OST inconsistency. + + + + + -w | --windows + + + The window size for the async request pipeline. + + + + + -o | --orphan + + + Handle orphan objects, such as orphan OST-objects for layout LFSCK. @@ -377,24 +400,25 @@ lfsck: fixed 0 errors Manually Stopping LFSCK
Synopsis - lctl lfsck_stop -M | --device MDT_device \ + lctl lfsck_stop -M | --device [MDT,OST]_device \ + [-A | --all] \ [-h | --help]
Description - This is command is used by LFSCK after the MDT is mounted. + This command is used by LFSCK after the MDT is mounted.
Options - The various lfsck_stop options are listed and described below. For a complete list of available options, type lctl lfsck_stop h. + The various lfsck_stop options are listed and described below. For a complete list of available options, type lctl lfsck_stop -h. - + - + Option @@ -408,7 +432,15 @@ lfsck: fixed 0 errors -M | --device - The MDT device to start LFSCK on. This will be a requirement when multiple MDTs are supported. + The MDT or OST device to stop LFSCK/scrub on. + + + + + -A | --all + + + Stop LFSCK on all devices. @@ -416,7 +448,7 @@ lfsck: fixed 0 errors -h | --help - Operating help. + Operating help information. @@ -431,7 +463,7 @@ lfsck: fixed 0 errors LFSCK status of OI Scrub via <literal>procfs</literal>
Synopsis - lctl get_param -n osd-ldisk.FSNAME-MDT_device.oi_scrub + lctl get_param -n osd-ldiskfs.FSNAME-MDT_device.oi_scrub
@@ -441,12 +473,12 @@ lfsck: fixed 0 errors
Output - + - + Information @@ -491,13 +523,16 @@ lfsck: fixed 0 errors Checked total number of objects scanned. Updated total number of objects repaired. Failed total number of objects that failed to be repaired. - Ignored total number of objects marked I_LUSTER_NOSCRUB. + No Scrub total number of objects marked LDISKFS_STATE_LUSTRE_NOSCRUB and skipped. IGIF total number of objects IGIF scanned. Prior Updated how many objects have been repaired which are triggered by parallel RPC. Success Count total number of completed OI_scrub runs on the device. Run Time how long the scrub has run, tally from the time of scanning from the beginning of the specified MDT device, not include the paused/failure time among checkpoints. Average Speed calculated by dividing Checked by run_time. Real-Time Speed the speed since last checkpoint if the OI_scrub is running. + Scanned total number of objects under /lost+found that have been scanned. + Repaired total number of objects under /lost+found that have been recovered. + Failed total number of objects under /lost+found failed to be scanned or failed to be recovered. @@ -515,17 +550,17 @@ lfsck: fixed 0 errors
Description - The namespace component is responsible for checking and repairing FID-in-Dirent and LinkEA consistency. The procfs interface for this component is in the MDD layer, named lfsck_namespace. To show the status of this component lctl get_param should be used as follows: + The namespace component is responsible for checking and repairing FID-in-Dirent and LinkEA consistency. The procfs interface for this component is in the MDD layer, named lfsck_namespace. To show the status of this component lctl get_param should be used as described in the synopsis.
Output - + - + Information @@ -550,7 +585,7 @@ lfsck: fixed 0 errors entries have been discovered), upgrade (from Lustre software release 1.8 IGIF format.) - Parameters: including dryrun and failout. + Parameters: including dryrun, all_targets and failout. Time Since Last Completed. Time Since Latest Start. Time Since Last Checkpoint. @@ -576,8 +611,95 @@ lfsck: fixed 0 errors Dirs total number of directories scanned. M-linked total number of multiple-linked objects that have been scanned. Nlinks Repaired total number of objects with nlink attributes that have been repaired. - Name-entry Added total number of objects that have had a name entry added back to the namespace. - Success Count the total number off completed LFSCK runs on the device. + Lost_found total number of objects that have had a name entry added back to the namespace. + Success Count the total number of completed LFSCK runs on the device. + Run Time Phase1 the duration of the LFSCK run during scanning-phase1. Excluding the time spent paused between checkpoints. + Run Time Phase2 the duration of the LFSCK run during scanning-phase2. Excluding the time spent paused between checkpoints. + Average Speed Phase1 calculated by dividing checked_phase1 by run_time_phase1. + Average Speed Phase2 calculated by dividing checked_phase2 by run_time_phase2. + Real-Time Speed Phase1 the speed since the last checkpoint if the LFSCK is running scanning-phase1. + Real-Time Speed Phase2 the speed since the last checkpoint if the LFSCK is running scanning-phase2. + + + + + + +
+
+
+ LFSCK status of layout via <literal>procfs</literal> +
+ Synopsis + lctl get_param -n mdd.FSNAME-MDT_device.lfsck_layout +lctl get_param -n obdfilter.FSNAME-OST_device.lfsck_layout + +
+
+ Description + The layout component is responsible for checking and repairing MDT-OST inconsistency. The procfs interface for this component is in the MDD layer, named lfsck_layout, and in the OBD layer, named lfsck_layout. To show the status of this component lctl get_param should be used as described in the synopsis. +
+
+ Output + + + + + + + + Information + + + Detail + + + + + + + General Information + + + + Name: lfsck_layout + LFSCK namespace magic. + LFSCK namespace version. + Status: one of the status - init, scanning-phase1, scanning-phase2, completed, failed, stopped, paused, crashed, partial, co-failed, co-stopped, or co-paused. + Flags: including - scanned-once (the first cycle scanning has been + completed), inconsistent (one + or more MDT-OST inconsistencies + have been discovered), + incomplete (some MDT or OST did not participate in the LFSCK or failed to finish the LFSCK) or crashed_lastid (the lastid files on the OST crashed and need to be rebuilt). + Parameters: including dryrun, all_targets and failout. + Time Since Last Completed. + Time Since Latest Start. + Time Since Last Checkpoint. + Latest Start Position: the position the checking began most recently. + Last Checkpoint Position. + First Failure Position: the position for the first object to be repaired. + Current Position. + + + + + + Statistics + + + + Success Count: the total number of completed LFSCK runs on the device. + Repaired Dangling: total number of MDT-objects with dangling references that have been repaired in the scanning-phase1. + Repaired Unmatched Pairs total number of unmatched MDT and OST-object pairs that have been repaired in the scanning-phase1. + Repaired Multiple Referenced total number of OST-objects with multiple references that have been repaired in the scanning-phase1. + Repaired Orphan total number of orphan OST-objects that have been repaired in the scanning-phase2. + Repaired Inconsistent Owner total number of OST-objects with incorrect owner information that have been repaired in the scanning-phase1. + Repaired Others total number of other inconsistencies repaired in the scanning phases. + Skipped Number of skipped objects. + Failed Phase1 total number of objects that failed to be repaired during scanning-phase1. + Failed Phase2 total number of objects that failed to be repaired during scanning-phase2. 
+ Checked Phase1 total number of objects scanned during scanning-phase1. + Checked Phase2 total number of objects scanned during scanning-phase2. Run Time Phase1 the duration of the LFSCK run during scanning-phase1. Excluding the time spent paused between checkpoints. Run Time Phase2 the duration of the LFSCK run during scanning-phase2. Excluding the time spent paused between checkpoints. Average Speed Phase1 calculated by dividing checked_phase1 by run_time_phase1. @@ -595,12 +717,12 @@ lfsck: fixed 0 errors
LFSCK adjustment interface -
+
Rate control
Synopsis - lctl set_param mdt.${FSNAME}-${MDT_device}.lfsck_speed_limit=N - + lctl set_param mdd.${FSNAME}-${MDT_device}.lfsck_speed_limit=N +lctl set_param obdfilter.${FSNAME}-${OST_device}.lfsck_speed_limit=N
Description @@ -609,7 +731,7 @@ lfsck: fixed 0 errors
Values - + @@ -648,7 +770,7 @@ lfsck: fixed 0 errors
Values - + -- 1.8.3.1