LFSCK: an online file system checker for Lustre
===============================================

LFSCK is an online tool to scan, check and repair a Lustre file system that can
be used with a file system that is mounted and in use. It checks for a large
variety of inconsistencies between metadata targets (MDTs) and object storage
targets (OSTs) and provides automatic correction where possible.

LFSCK does not check consistency of the on-disk format and assumes that it is
consistent. For ldiskfs, e2fsck from e2fsprogs should be used to ensure the
on-disk format is consistent. ZFS is designed to always have a valid on-disk
structure and as a result, no 'fsck' is necessary.

Quick usage instructions
===============================================

- start a standard scan

  LFSCK only runs on an MDS, and starts scanning automatically if an
  inconsistency is detected when the MDT service is started. The scan can be
  started manually on a running MDT using the command:

  # lctl lfsck_start --type namespace --type layout -M lustre-MDT0000

- reviewing the status of lfsck

  LFSCK only provides status from an MDS.

  # lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace

- stopping lfsck

  # lctl lfsck_stop -M lustre-MDT0000

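The three commands above share one device name and differ only in the
subcommand and flags. A minimal sketch of helpers that assemble those command
lines is shown here; the function names are hypothetical, only the flags that
appear in this document are used, and nothing is executed.

```python
from typing import List, Sequence


def lfsck_start_argv(device: str, types: Sequence[str]) -> List[str]:
    """Build the argument list for 'lctl lfsck_start', one --type per scan type."""
    argv = ["lctl", "lfsck_start"]
    for scan_type in types:
        argv += ["--type", scan_type]
    argv += ["-M", device]
    return argv


def lfsck_status_argv(device: str, scan_type: str) -> List[str]:
    """Build the argument list that reads the status of one scan type."""
    return ["lctl", "get_param", "-n", f"mdd.{device}.lfsck_{scan_type}"]


def lfsck_stop_argv(device: str) -> List[str]:
    """Build the argument list for 'lctl lfsck_stop'."""
    return ["lctl", "lfsck_stop", "-M", device]


if __name__ == "__main__":
    device = "lustre-MDT0000"  # example device name taken from this document
    for argv in (
        lfsck_start_argv(device, ["namespace", "layout"]),
        lfsck_status_argv(device, "namespace"),
        lfsck_stop_argv(device),
    ):
        print(" ".join(argv))
```

The lists can be passed to subprocess.run() on an MDS; building them separately
keeps the device name in one place.
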
Features
===============================================

* control of scanning rate.
* automatic checkpoint recovery of an interrupted scan.
* reconstruction of the FID-to-inode mapping after a file level restore or 1.8
* fixing FID-in-Dirent name entry to be consistent with the FID in the inode
* detection and repair including:
  * MDT-OST inconsistencies, including:
    * dangling references.
    * unreferenced OST objects.
    * mismatched references.
    * multiple references.
* monitoring using proc and lctl interfaces.

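The four kinds of MDT-OST inconsistency listed above can be illustrated with a
toy model. The sketch below is not LFSCK code. It assumes one plausible reading
of each term: a dangling reference points at a missing OST object, an
unreferenced object has no file pointing at it, a mismatched reference is not
confirmed by the object's recorded parent, and a multiple reference is one
object claimed by several files. All names and data structures are
hypothetical.

```python
from typing import Dict, List, Optional


def classify(layouts: Dict[str, List[str]],
             ost_objects: Dict[str, Optional[str]]) -> Dict[str, list]:
    """Classify inconsistencies in a toy model.

    layouts:     MDT file FID -> OST object IDs named in the file's layout
    ost_objects: OST object ID -> parent FID recorded on the object
    """
    report: Dict[str, list] = {
        "dangling": [], "unreferenced": [], "mismatched": [], "multiple": [],
    }

    # Invert the layouts: which files claim each OST object?
    referrers: Dict[str, List[str]] = {}
    for fid in sorted(layouts):
        for obj in layouts[fid]:
            referrers.setdefault(obj, []).append(fid)

    for obj in sorted(referrers):
        fids = referrers[obj]
        if obj not in ost_objects:
            # The layout names an object that does not exist on any OST.
            report["dangling"].extend((fid, obj) for fid in fids)
        elif len(fids) > 1:
            # More than one file claims the same object.
            report["multiple"].append((obj, fids))
        elif ost_objects[obj] != fids[0]:
            # The object exists but does not point back at the file.
            report["mismatched"].append((fids[0], obj))

    # Objects that exist on an OST but are claimed by no file.
    report["unreferenced"] = sorted(o for o in ost_objects if o not in referrers)
    return report


if __name__ == "__main__":
    layouts = {"f1": ["o1", "o2"], "f2": ["o3"], "f3": ["o3"],
               "f4": ["o4"], "f5": ["o9"]}
    ost_objects = {"o1": "f1", "o2": "f1", "o3": "f2", "o4": "fX", "o5": "f7"}
    for kind, items in classify(layouts, ost_objects).items():
        print(kind, items)
```
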
===============================================

Information about lfsck can be found in
/proc/fs/lustre/mdd/<fsname>-<mdt>/lfsck_{namespace,layout}

LFSCK master slave design
===============================================

The LFSCK master engine resides on the MDT, and is implemented as a kernel
thread in the LFSCK layer. The master engine is responsible for scanning on the
MDT and also controls slave engines on OSTs. Scanning on both MDTs and OSTs
occurs in two stages: first, consistency check and repair; second, orphan
identification and processing.

1. The master engine is started either by the user space command or when an
excessive number of MDT-OST inconsistency events is detected. On starting, the
master engine sends RPCs to related OSTs to start the slave engines.

2. The master engine on the MDS scans the MDT device using namespace iteration
(described below). For each striped file, it calls the registered LFSCK process
handlers to perform the relevant system consistency checks/repairs, which are
enumerated in the 'features' section. All objects on OSTs that are never
referenced during this scan (because, for example, they are orphans) are
recorded in an OST orphan object index on each OST.

3. After the MDT completes first-stage system scanning, the master engine sends
RPCs to OSTs that have relations to the MDT, to make the OST begin scanning.
The master engine waits for the slave engines to complete the first-stage
system scan and is signaled in turn by an RPC from each OST.

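The outcome of step 2 can be modelled as a set difference: every OST object
that the MDT scan never references becomes a candidate for that OST's orphan
object index. The following is a simplified, single-process sketch of that
idea. It does not model RPCs or kernel threads, and the function and type names
are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Stripe = Tuple[str, str]  # (OST name, object ID) as named in a file layout


def first_stage_scan(mdt_layouts: Dict[str, List[Stripe]],
                     ost_objects: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return, per OST, the objects never referenced during the MDT scan."""
    referenced: Dict[str, Set[str]] = {ost: set() for ost in ost_objects}

    # Walk every striped file and mark each OST object its layout names.
    for _fid, stripes in mdt_layouts.items():
        for ost, obj in stripes:
            if ost in referenced:
                referenced[ost].add(obj)

    # Whatever was never marked goes into that OST's orphan candidate list.
    return {ost: sorted(objs - referenced[ost])
            for ost, objs in ost_objects.items()}


if __name__ == "__main__":
    layouts = {
        "f1": [("OST0000", "a"), ("OST0001", "x")],
        "f2": [("OST0000", "b")],
    }
    objects = {"OST0000": {"a", "b", "c"}, "OST0001": {"x", "y", "z"}}
    print(first_stage_scan(layouts, objects))
```
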
The LFSCK slave engine resides on each OST, and is implemented as a kernel
thread in the LFSCK layer. This kernel thread drives the first-stage system

1. When the slave engine is triggered by the RPC from the master engine in the
first phase, the OST scans the local OST device to generate the in-memory OST

2. When the first-stage system scan (for both MDTs and OSTs) is complete, a list
of non-referenced OST objects is available. Only objects that are not accessed
during the first-stage scan are regarded as potential orphans.

3. In the second stage, the OSTs scan to resolve orphan objects in the file
system. The OST orphan object index is used as input to the second stage. For
each item in the index, the presence of a parent MDT object is verified. Orphan
objects will either be relinked to an existing file, if one is found, or moved
into a new file in .lustre/lost+found.

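The second-stage decision in step 3 comes down to one check per index entry:
does the parent recorded for the orphan still exist on the MDT? The sketch
below illustrates that decision on plain dictionaries. It is an assumption-laden
model rather than the actual implementation, and the action labels and function
name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Action = Tuple[str, str, Optional[str]]  # (action, object ID, parent FID)


def resolve_orphans(orphan_index: Dict[str, Optional[str]],
                    mdt_files: Dict[str, List[str]]) -> List[Action]:
    """Decide what to do with each entry of a toy OST orphan object index.

    orphan_index: object ID -> parent FID recorded on the object (or None)
    mdt_files:    FID -> object IDs in the file's layout (updated on relink)
    """
    actions: List[Action] = []
    for obj in sorted(orphan_index):
        parent = orphan_index[obj]
        if parent is not None and parent in mdt_files:
            # The parent MDT object exists: relink the orphan to that file.
            mdt_files[parent].append(obj)
            actions.append(("relink", obj, parent))
        else:
            # No usable parent: the object goes to a new file in lost+found.
            actions.append(("lost+found", obj, None))
    return actions


if __name__ == "__main__":
    files = {"f1": ["a"]}
    index = {"c": "f1", "y": "gone", "z": None}
    print(resolve_orphans(index, files))
    print(files)
```
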
If multiple MDTs are present, MDTs will check/repair MDT-OST consistency in
parallel. To avoid repeated scans of the OST device, the slave engine will not
begin second-stage system scans until all the master engines complete the
first-stage system scan. For each OST there is a single OST orphan object
index, regardless of how many MDTs take part in the MDT-OST consistency
check/repair.

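The rule that a slave waits for every master before its second stage behaves
like a barrier. A minimal sketch, assuming the slave knows in advance which
MDTs are taking part (the class and method names are hypothetical):

```python
from typing import Iterable, Set


class SlaveBarrier:
    """Tracks which master engines have not yet finished the first stage."""

    def __init__(self, masters: Iterable[str]) -> None:
        self.pending: Set[str] = set(masters)

    @property
    def ready(self) -> bool:
        """True once no master is still in its first-stage scan."""
        return not self.pending

    def master_done(self, mdt: str) -> bool:
        """Record that one master finished; return whether stage two may start."""
        self.pending.discard(mdt)
        return self.ready


if __name__ == "__main__":
    barrier = SlaveBarrier(["MDT0000", "MDT0001"])
    print(barrier.master_done("MDT0000"))  # one master is still scanning
    print(barrier.master_done("MDT0001"))  # all masters are done
```

Because the OST keeps a single orphan object index, one barrier per OST is
enough in this model.
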
Object traversal design reference
===============================================

Objects are traversed by LFSCK with two methods: inode traversal and namespace
traversal. For all types, the OST iterates through objects with inode
traversal. The MDT will choose the iteration appropriate to the scanning type
requested. Layout uses inode traversal, namespace uses namespace traversal.

* inode traversal

Two kernel threads are employed to maximize the performance of this operation.
One Object Storage Device (OSD) thread performs the inode table iteration,
which scans the MDT inode table and submits inode read requests asynchronously
to drive disk I/O efficiently. The second thread is the OI Scrub thread which
searches the OI table and updates related mapping entries. The two threads run
concurrently and iterate inodes in a pipeline.

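The two-thread pipeline can be sketched as a producer and a consumer joined by
a bounded queue. This is a user-space illustration using Python threads, not
the kernel implementation: the dictionaries standing in for the inode table and
the OI table, and the function name, are assumptions.

```python
import queue
import threading
from typing import Dict, List


def scrub_pipeline(inode_table: Dict[int, str],
                   oi_table: Dict[str, int]) -> List[str]:
    """Iterate inodes in one thread and fix OI mappings in another.

    inode_table: inode number -> FID stored in that inode
    oi_table:    FID -> inode number (repaired in place)
    Returns the FIDs whose mapping had to be updated.
    """
    work: queue.Queue = queue.Queue(maxsize=8)  # bounded, so the stages overlap
    done = object()                             # sentinel marking end of table
    repaired: List[str] = []

    def iterate() -> None:
        # Stage 1: walk the inode table in order and hand entries downstream.
        for ino in sorted(inode_table):
            work.put((ino, inode_table[ino]))
        work.put(done)

    def scrub() -> None:
        # Stage 2: compare each inode against the OI table and repair it.
        while True:
            item = work.get()
            if item is done:
                break
            ino, fid = item
            if oi_table.get(fid) != ino:
                oi_table[fid] = ino
                repaired.append(fid)

    threads = [threading.Thread(target=iterate), threading.Thread(target=scrub)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return repaired


if __name__ == "__main__":
    inodes = {11: "fidA", 12: "fidB", 13: "fidC"}
    oi = {"fidA": 11, "fidB": 99}  # fidB is stale, fidC is missing
    print(scrub_pipeline(inodes, oi), oi)
```
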
The Object Storage Device (OSD) is the abstract layer above a concrete back-end
file system (e.g. ext4, ZFS, Btrfs). Each OSD implementation differs
internally to support concrete file systems. In order to support OI Scrub the
inode iterator is presented via the OSD API as a virtual index that contains
all the inodes in the file system. Common interface calls are created to
implement inode table based iteration to enable support for additional concrete
file systems in the future.

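The idea of a virtual index with a common interface can be sketched as an
abstract iterator with interchangeable back ends. The two back ends below are
invented for illustration and do not describe how any real OSD stores its
inodes; all class and function names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class InodeIndex(ABC):
    """Common interface: iterate every inode number in the file system."""

    @abstractmethod
    def __iter__(self) -> Iterator[int]:
        ...


class BitmapBackend(InodeIndex):
    """Hypothetical back end that tracks allocated inodes in a bitmap."""

    def __init__(self, bitmap: List[bool]) -> None:
        self.bitmap = bitmap

    def __iter__(self) -> Iterator[int]:
        return (ino for ino, used in enumerate(self.bitmap) if used)


class ObjectListBackend(InodeIndex):
    """Hypothetical back end that exposes an unordered set of object numbers."""

    def __init__(self, object_ids: Iterable[int]) -> None:
        self.object_ids = set(object_ids)

    def __iter__(self) -> Iterator[int]:
        return iter(sorted(self.object_ids))


def count_inodes(index: InodeIndex) -> int:
    """A caller written against the interface works with either back end."""
    return sum(1 for _ in index)


if __name__ == "__main__":
    for backend in (BitmapBackend([False, True, True, False, True]),
                    ObjectListBackend({7, 3, 5})):
        print(type(backend).__name__, list(backend), count_inodes(backend))
```
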
* namespace traversal

In addition to inode traversal, there are directory based items that
need scanning for namespace consistency. For example, FID-in-Dirent and linkEA
are directory based features.

A naive approach to namespace traversal would be to descend recursively from
the file system root. However, this approach will typically generate random
I/O, which for performance reasons should be minimized. In addition, one must
consider operations (e.g. rename) taking place within a directory that is
currently being scanned. For these reasons a hybrid approach to scanning is

1. LFSCK begins inode traversal.

2. If a directory is discovered then namespace traversal begins. LFSCK does not
descend into sub-directories. LFSCK ignores rename operations during the
directory traversal because the subsequent inode traversal will guarantee
processing of renamed objects. Reading directory blocks is a small fraction of
the data needed for the inodes they reference. In addition, entries in the
directory are typically allocated following the directory inode on the disk so
for many directories the child inodes will already be available because of

3. Process each entry in the directory, checking that the FID-in-Dirent and the
FID in the object LMA are consistent. Repair if not. Check also that the linkEA
points back to the parent object. Check also that '.' and '..' entries are

4. Once all directory entries are exhausted, return to inode traversal.

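The four steps above can be sketched as a single loop over the inode table that
pauses at each directory to check its entries, without recursing. This is a toy
model with hypothetical types. It covers the FID-in-Dirent and linkEA checks
only and repairs just the directory entry, on the assumption that the FID in
the inode's LMA is the trusted value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Inode:
    fid: str                            # FID stored in the object's LMA
    link_parent: Optional[str] = None   # parent FID recorded in linkEA
    # For directories: name -> (child inode number, FID-in-Dirent)
    entries: Optional[Dict[str, Tuple[int, str]]] = None


def hybrid_scan(table: Dict[int, Inode]) -> List[Tuple[str, int, str]]:
    """Return (problem, directory inode number, entry name) tuples."""
    problems: List[Tuple[str, int, str]] = []
    for ino in sorted(table):                 # step 1: inode traversal
        inode = table[ino]
        if inode.entries is None:
            continue                          # not a directory
        for name in sorted(inode.entries):    # steps 2-3: scan entries only
            child_ino, dirent_fid = inode.entries[name]
            child = table[child_ino]
            if dirent_fid != child.fid:
                # Repair the directory entry to match the FID in the LMA.
                inode.entries[name] = (child_ino, child.fid)
                problems.append(("fid-in-dirent", ino, name))
            if child.link_parent != inode.fid:
                # The linkEA does not point back at this directory.
                problems.append(("linkea", ino, name))
        # step 4: entries exhausted, the outer loop resumes inode traversal
    return problems


if __name__ == "__main__":
    table = {
        1: Inode("root", None, {"a": (2, "fidA"), "d": (3, "WRONG")}),
        2: Inode("fidA", "root"),
        3: Inode("fidD", "root", {"b": (4, "fidB")}),
        4: Inode("fidB", "fidX"),
    }
    print(hybrid_scan(table))
```

Sub-directories are never entered from their parent: inode 3 is checked only
when the outer loop reaches it.
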
References
===============================================

source code: file:/lustre/lfsck/

operations manual: http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.lfsckadmin

useful links: http://insidehpc.com/2013/05/02/video-lfsck-online-lustre-file-system-checker/
              http://www.opensfs.org/wp-content/uploads/2013/04/Zhuravlev_LFSCK.pdf

Glossary
===============================================

OSD - Object storage device. A generic term for a storage device with an
interface that extends beyond a block-oriented device interface.

OI - Object Index. A table that maps FIDs to inodes. This table must be
regenerated if a file level restore is performed as inodes will change.

FID - File IDentifier. A Lustre file system identifies every file and object
with a unique 128-bit ID.

FID-in-Dirent - FID in Directory Entry. To enhance the performance of readdir,
the FID (and name) of a file are recorded in the current directory entry.

LMA - Lustre Metadata Attributes. A record of Lustre specific attributes, for

linkEA - Link Extended Attributes. When a file is created or hard-linked the
parent directory name and FID are recorded as extended attributes to the file.
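
The glossary states only that a FID is 128 bits wide. The sketch below assumes
the commonly described split into a 64-bit sequence, a 32-bit object ID and a
32-bit version, and shows that the three fields pack into exactly 16 bytes. The
byte order and helper names are choices made for this example.

```python
import struct
from typing import Tuple

# Assumed field layout: 64-bit sequence, 32-bit object ID, 32-bit version.
# Big-endian is an arbitrary choice for this example.
FID_FORMAT = ">QII"


def pack_fid(seq: int, oid: int, ver: int) -> bytes:
    """Pack the three fields into 16 bytes (128 bits)."""
    return struct.pack(FID_FORMAT, seq, oid, ver)


def unpack_fid(raw: bytes) -> Tuple[int, int, int]:
    """Recover (sequence, object ID, version) from 16 packed bytes."""
    return struct.unpack(FID_FORMAT, raw)


def format_fid(seq: int, oid: int, ver: int) -> str:
    """Render the fields in bracketed, colon-separated hexadecimal."""
    return f"[{seq:#x}:{oid:#x}:{ver:#x}]"


if __name__ == "__main__":
    raw = pack_fid(0x200000400, 0x1, 0x0)
    print(len(raw) * 8, "bits")
    print(format_fid(*unpack_fid(raw)))
```
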