LFSCK: an online file system checker for Lustre =============================================== LFSCK is an online tool to scan, check and repair a Lustre file system that can be used with a file system that is mounted and in use. It checks for a large variety of inconsistencies between meta data targets (MDTs) and object storage targets (OSTs) and provides automatic correction where possible. LFSCK does not check consistency of the on-disk format and assumes that it is consistent. For ldiskfs, e2fsck from e2fsprogs should be used to ensure the on disk format is consistent. ZFS is designed to always have a valid on-disk structure and as a result, no 'fsck' is necessary. Quick usage instructions =============================================== - start a standard scan LFSCK only runs on an MDS, and starts scanning automatically if an inconsistency is detected when the MDT service is started. The scan can be started manually on a running MDT using the command: # lctl lfsck_start --type namespace --type layout -M testfs-MDT0000 - reviewing the status of lfsck lfsck only provides status from a MDS. # lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace - stop a lfsck scan # lctl lfsck_stop -M lustre-MDT0000 Features =============================================== * online scanning. * control of scanning rate. * automatic checkpoint recovery of an interrupted scan. * reconstruciton of the FID-to-inode mapping after a file level restore or 1.8 upgrade * fixing FID-in-Dirent name entry to be consistent with the FID in the inode LMA. * detection and repair including: * MDT-OST inconsistencies, including: * dangling references. * unreferenced OST objects. * mismatched references. * multiple references. * monitoring using proc and lctl interfaces. /proc entries =============================================== Information about lfsck can be found in /proc/fs/lustre/mdd/-/lfsck_{namespace,layout} LFSCK master slave design =============================================== The LFSCK master engine resides on the MDT, and is implemented as a kernel thread in the LFSCK layer. The master engine is responsible for scanning on the MDT and also controls slave engines on OSTs. Scanning on both MDTs and OSTs occurs in two stages. These stages are firstly consistency check and repair and secondly orphan identification and processing. 1. The master engine is started either by the user space command or an excessive number of MDT-OST inconsistency events are detected. On starting, the master engine sends RPCs to related OSTs to start the slave engines. 2. The master engine on the MDS scans the MDT device using namespace iteration (described below). For each striped file, it calls the registered LFSCK process handlers to perform the relevant system consistency check/repair, which is are enumerated in the 'features' section. All objects on OSTs that are never referenced during this scan (because, for example, they are orphans) are recorded in an OST orphan object index on each OST. 3. After the MDT completes first-stage system scanning, the master engine sends RPCs to OSTs that have relations to the MDT, to make the OST begin scanning. The master engine waits for the slave engines to complete the first-stage system scan and is signaled in turn by an RPC from each OST. The LFSCK slave engine resides on each OST, and is implemented as a kernel thread in the LFSCK layer. This kernel thread drives the first-stage system scan on the OST. 1. When the slave engine is triggered by the RPC from the master engine in the first phase, the OST scans the local OST deviceto generate the in-memory OST orphan object index. 2. When the first-stage system scan (for both MDTs and OSTs) is complete a list of non-referenced OST-objects is available. Only objects that are not accessed during the first stage scan are regarded as potential orphans. 3. In the second stage, the OSTs scan to resolve orphan objects in the file system. The OST orphan object index is used as input to the second stage. For each item in the index, the presence of a parent MDT object is verified. Orphan objects will either be relinked to an existing file if found - or moved into a new file in .lustre/lost+found. If multiple MDTs are present, MDTs will check/repair MDT-OST consistency in parallel. To avoid scans of the OST device the slave engine will not begin second-stage system scans until all the master engines complete the first-stage system scan. For each OST there is a single OST orphan object index, regardless of how many MDTs are in the MDT-OST consistency check/repair. Object traversal design reference =============================================== Objects are traversed by LFSCK with two methods. inode traversal and namespace traversal. For all types, the OST iterates through objects with inode traversal. The MDT will choose the iteration appropriate to the scaning type requested. Layout uses inode traversal, namespace use namespace traversal. * inode traversal Two kernel threads are employed to maximize the performance of this operation. One Object Storage Device (OSD) thread performs the inode table iteration, which scans MDT inode table and submits inode read requests asynchronously to drive disk I/O efficiently. The second thread is the OI Scrub thread which searches the OI table and updates related mapping entries. The two threads run concurrently and iterate inodes in a pipeline. The Object Storage Device (OSD) is the abstract layer above a concrete back-end file system (i.e. ext4, ZFS, Btrfs, etc.). Each OSD implementation differs internally to support concrete file systems. In order to support OI Scrub the inode iterator is presented via the OSD API as a virtual index that contains all the inodes in the file system. Common interface calls are created to implement inode table based iteration to enable support for additional concrete file system in the future. * namespace traversal In addition to inode traversal, there are directory based items that need scanning for namespace consistency. For example, FID-in-Dirent and LinkEA are directory based features. A naive approach to namespace traversal would be to descend recursively from the file system root. However, this approach will typically generate random IO, which for performance reasons should be minimized. In addition, one must consider operations (i.e. rename) taking place within a directory that is currently being scanned. For these reasons a hybrid approach to scanning is employed. 1. LFSCK begins inode traversal. 2. If a directory is discovered then namespace traversal begins. LFSCK does not descend into sub-directories. LFSCK ignores rename operations during the directory traversal because the subsequent inode traversal will guarantee processing of renamed objects. Reading directory blocks is a small fraction of the data needed for the inodes they reference. In addition, entries in the directory are typically allocated following the directory inode on the disk so for many directories the children inodes will already be available because of pre-fetch. 3. Process each entry in the directory checking the FID-in-Dirent and the FID in the object LMA are consistent. Repair if not. Check also that the linkEA points back to the parent object. Check also that '.' and '..' entries are consistent. 4. Once all directory entries are exhausted, return to inode traversal. References =============================================== source code: file:/lustre/lfsck/ operations manual: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.lfsckadmin useful links: http://insidehpc.com/2013/05/02/video-lfsck-online-lustre-file-system-checker/ http://www.opensfs.org/wp-content/uploads/2013/04/Zhuravlev_LFSCK.pdf Glossary of terms =============================================== OSD - Object storage device. A generic term for a storage device with an interface that extends beyond a block-orientated device interface. OI - Object Index. A table that maps FIDs to inodes. This table must be regenerated if a file level restore is performed as inodes will change. FID - File IDentifier. A Lustre file system identifies every file and object with a unique 128-bit ID. FID-in-Dirent - FID in Directory Entry. To enhance the performance of readdir, the FID (and name) of a file are recorded in the current directory entry. LMA - Lustre Metadata Attributes. A record of Lustre specific attributes, for example HSM state. linkEA - Link Extended Attributes. When a file is created or hard-linked the parent directory name and FID are recorded as extended attributes to the file.