From 6c441812758e47b4698086173b583e3d6bd051b0 Mon Sep 17 00:00:00 2001 From: Richard Henwood Date: Fri, 5 Dec 2014 14:20:37 -0600 Subject: [PATCH] LUDOC-263 lfsck: review and update LFSCK documentation. LFSCK is functionally complete for release 2.7. Update the documentation to reflect this. Wrap files for readability. Change-Id: I16462cc2a27b14e4652e19378642af4c306a165b Signed-off-by: Richard Henwood Reviewed-on: http://review.whamcloud.com/12966 Tested-by: Jenkins Reviewed-by: Fan Yong --- BackupAndRestore.xml | 741 +++++++++++---- Glossary.xml | 2 +- TroubleShootingRecovery.xml | 2076 +++++++++++++++++++++++++++++-------------- UnderstandingLustre.xml | 1008 ++++++++++++--------- UpgradingLustre.xml | 554 +++++++----- 5 files changed, 2888 insertions(+), 1493 deletions(-) diff --git a/BackupAndRestore.xml b/BackupAndRestore.xml index 83202c3..c8a955f 100644 --- a/BackupAndRestore.xml +++ b/BackupAndRestore.xml @@ -1,146 +1,302 @@ - - - Backing Up and Restoring a File System - This chapter describes how to backup and restore at the file system-level, device-level and - file-level in a Lustre file system. Each backup approach is described in the the following - sections: + + + Backing Up and Restoring a File + System + This chapter describes how to backup and restore at the file + system-level, device-level and file-level in a Lustre file system. Each + backup approach is described in the the following sections: - + + + - + + + - + + + - + + + - + + +
- - <indexterm><primary>backup</primary></indexterm> - <indexterm><primary>restoring</primary><see>backup</see></indexterm> - <indexterm><primary>LVM</primary><see>backup</see></indexterm> - <indexterm><primary>rsync</primary><see>backup</see></indexterm> - Backing up a File System - Backing up a complete file system gives you full control over the files to back up, and - allows restoration of individual files as needed. File system-level backups are also the - easiest to integrate into existing backup solutions. - File system backups are performed from a Lustre client (or many clients working parallel in different directories) rather than on individual server nodes; this is no different than backing up any other file system. - However, due to the large size of most Lustre file systems, it is not always possible to get a complete backup. We recommend that you back up subsets of a file system. This includes subdirectories of the entire file system, filesets for a single user, files incremented by date, and so on. + + <indexterm> + <primary>backup</primary> + </indexterm> + <indexterm> + <primary>restoring</primary> + <see>backup</see> + </indexterm> + <indexterm> + <primary>LVM</primary> + <see>backup</see> + </indexterm> + <indexterm> + <primary>rsync</primary> + <see>backup</see> + </indexterm>Backing up a File System + Backing up a complete file system gives you full control over the + files to back up, and allows restoration of individual files as needed. + File system-level backups are also the easiest to integrate into existing + backup solutions. + File system backups are performed from a Lustre client (or many + clients working parallel in different directories) rather than on + individual server nodes; this is no different than backing up any other + file system. + However, due to the large size of most Lustre file systems, it is not + always possible to get a complete backup. We recommend that you back up + subsets of a file system. This includes subdirectories of the entire file + system, filesets for a single user, files incremented by date, and so + on. - In order to allow the file system namespace to scale for future applications, Lustre - software release 2.x internally uses a 128-bit file identifier for all files. To interface - with user applications, the Lustre software presents 64-bit inode numbers for the - stat(), fstat(), and readdir() - system calls on 64-bit applications, and 32-bit inode numbers to 32-bit applications. - Some 32-bit applications accessing Lustre file systems (on both 32-bit and 64-bit CPUs) - may experience problems with the stat(), fstat() - or readdir() system calls under certain circumstances, though the - Lustre client should return 32-bit inode numbers to these applications. - In particular, if the Lustre file system is exported from a 64-bit client via NFS to a - 32-bit client, the Linux NFS server will export 64-bit inode numbers to applications running - on the NFS client. If the 32-bit applications are not compiled with Large File Support - (LFS), then they return EOVERFLOW errors when accessing the Lustre files. - To avoid this problem, Linux NFS clients can use the kernel command-line option - "nfs.enable_ino64=0" in order to force the NFS client to - export 32-bit inode numbers to the client. - Workaround: We very strongly recommend that backups using tar(1) and other utilities that depend on the inode number to uniquely identify an inode to be run on 64-bit clients. 
The 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit inode number, and as a result these utilities may operate incorrectly on 32-bit clients. + In order to allow the file system namespace to scale for future + applications, Lustre software release 2.x internally uses a 128-bit file + identifier for all files. To interface with user applications, the Lustre + software presents 64-bit inode numbers for the + stat(), + fstat(), and + readdir() system calls on 64-bit applications, and + 32-bit inode numbers to 32-bit applications. + Some 32-bit applications accessing Lustre file systems (on both + 32-bit and 64-bit CPUs) may experience problems with the + stat(), + fstat() or + readdir() system calls under certain circumstances, + though the Lustre client should return 32-bit inode numbers to these + applications. + In particular, if the Lustre file system is exported from a 64-bit + client via NFS to a 32-bit client, the Linux NFS server will export + 64-bit inode numbers to applications running on the NFS client. If the + 32-bit applications are not compiled with Large File Support (LFS), then + they return + EOVERFLOW errors when accessing the Lustre files. To + avoid this problem, Linux NFS clients can use the kernel command-line + option " + nfs.enable_ino64=0" in order to force the NFS client + to export 32-bit inode numbers to the client. + + Workaround: We very strongly recommend + that backups using + tar(1) and other utilities that depend on the inode + number to uniquely identify an inode to be run on 64-bit clients. The + 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit + inode number, and as a result these utilities may operate incorrectly on + 32-bit clients.
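As a hedged illustration only (not part of the original patch), the nfs.enable_ino64=0 setting mentioned above can be applied on a Linux NFS client either on the kernel boot command line or as an NFS module option; the modprobe.d file name below is an assumption and may differ by distribution:
# Option 1: append to the kernel boot line in the boot loader configuration:
#   nfs.enable_ino64=0
# Option 2: set the equivalent module option and reload the nfs module (or reboot):
nfsclient# echo "options nfs enable_ino64=0" > /etc/modprobe.d/nfs-ino64.conf
nfsclient# reboot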
- <indexterm><primary>backup</primary><secondary>rsync</secondary></indexterm>Lustre_rsync - The lustre_rsync feature keeps the entire file system in sync on a backup by replicating the file system's changes to a second file system (the second file system need not be a Lustre file system, but it must be sufficiently large). lustre_rsync uses Lustre changelogs to efficiently synchronize the file systems without having to scan (directory walk) the Lustre file system. This efficiency is critically important for large file systems, and distinguishes the Lustre lustre_rsync feature from other replication/backup solutions. + + <indexterm> + <primary>backup</primary> + <secondary>rsync</secondary> + </indexterm>Lustre_rsync + The + lustre_rsync feature keeps the entire file system in + sync on a backup by replicating the file system's changes to a second + file system (the second file system need not be a Lustre file system, but + it must be sufficiently large). + lustre_rsync uses Lustre changelogs to efficiently + synchronize the file systems without having to scan (directory walk) the + Lustre file system. This efficiency is critically important for large + file systems, and distinguishes the Lustre + lustre_rsync feature from other replication/backup + solutions.
- <indexterm><primary>backup</primary><secondary>rsync</secondary><tertiary>using</tertiary></indexterm>Using Lustre_rsync - The lustre_rsync feature works by periodically running lustre_rsync, a userspace program used to synchronize changes in the Lustre file system onto the target file system. The lustre_rsync utility keeps a status file, which enables it to be safely interrupted and restarted without losing synchronization between the file systems. - The first time that lustre_rsync is run, the user must specify a set of parameters for the program to use. These parameters are described in the following table and in . On subsequent runs, these parameters are stored in the the status file, and only the name of the status file needs to be passed to lustre_rsync. - Before using lustre_rsync: + + <indexterm> + <primary>backup</primary> + <secondary>rsync</secondary> + <tertiary>using</tertiary> + </indexterm>Using Lustre_rsync + The + lustre_rsync feature works by periodically running + lustre_rsync, a userspace program used to + synchronize changes in the Lustre file system onto the target file + system. The + lustre_rsync utility keeps a status file, which + enables it to be safely interrupted and restarted without losing + synchronization between the file systems. + The first time that + lustre_rsync is run, the user must specify a set of + parameters for the program to use. These parameters are described in + the following table and in + . On subsequent runs, these + parameters are stored in the the status file, and only the name of the + status file needs to be passed to + lustre_rsync. + Before using + lustre_rsync: - Register the changelog user. For details, see the (changelog_register) parameter in the (lctl). + Register the changelog user. For details, see the + ( + changelog_register) parameter in the + ( + lctl). - AND - - Verify that the Lustre file system (source) and the replica file system (target) are identical before registering the changelog user. If the file systems are discrepant, use a utility, e.g. regular rsync (not lustre_rsync), to make them identical. + Verify that the Lustre file system (source) and the replica + file system (target) are identical + beforeregistering the changelog user. If the + file systems are discrepant, use a utility, e.g. regular + rsync(not + lustre_rsync), to make them identical. - The lustre_rsync utility uses the following parameters: + The + lustre_rsync utility uses the following + parameters: - - + + - Parameter + + Parameter + - Description + + Description + - --source=src + + --source= + src + - The path to the root of the Lustre file system (source) which will be synchronized. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. + The path to the root of the Lustre file system (source) + which will be synchronized. This is a mandatory option if a + valid status log created during a previous synchronization + operation ( + --statuslog) is not specified. - --target=tgt + + --target= + tgt + - The path to the root where the source file system will be synchronized (target). This is a mandatory option if the status log created during a previous synchronization operation (--statuslog) is not specified. This option can be repeated if multiple synchronization targets are desired. + The path to the root where the source file system will + be synchronized (target). 
This is a mandatory option if the + status log created during a previous synchronization + operation ( + --statuslog) is not specified. This option + can be repeated if multiple synchronization targets are + desired. - --mdt=mdt + + --mdt= + mdt + - The metadata device to be synchronized. A changelog user must be registered for this device. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. + The metadata device to be synchronized. A changelog + user must be registered for this device. This is a mandatory + option if a valid status log created during a previous + synchronization operation ( + --statuslog) is not specified. - --user=userid + + --user= + userid + - The changelog user ID for the specified MDT. To use lustre_rsync, the changelog user must be registered. For details, see the changelog_register parameter in (lctl). This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. + The changelog user ID for the specified MDT. To use + lustre_rsync, the changelog user must be + registered. For details, see the + changelog_register parameter in + ( + lctl). This is a mandatory option if a + valid status log created during a previous synchronization + operation ( + --statuslog) is not specified. - --statuslog=log + + --statuslog= + log + - A log file to which synchronization status is saved. When the lustre_rsync utility starts, if the status log from a previous synchronization operation is specified, then the state is read from the log and otherwise mandatory --source, --target and --mdt options can be skipped. Specifying the --source, --target and/or --mdt options, in addition to the --statuslog option, causes the specified parameters in the status log to be overridden. Command line options take precedence over options in the status log. + A log file to which synchronization status is saved. + When the + lustre_rsync utility starts, if the status + log from a previous synchronization operation is specified, + then the state is read from the log and otherwise mandatory + --source, + --target and + --mdt options can be skipped. Specifying + the + --source, + --target and/or + --mdt options, in addition to the + --statuslog option, causes the specified + parameters in the status log to be overridden. Command line + options take precedence over options in the status + log. - --xattr yes|no + --xattr + yes|no - Specifies whether extended attributes (xattrs) are synchronized or not. The default is to synchronize extended attributes. - - Disabling xattrs causes Lustre striping information not to be synchronized. - + Specifies whether extended attributes ( + xattrs) are synchronized or not. The + default is to synchronize extended attributes. + + + Disabling xattrs causes Lustre striping information + not to be synchronized. + + - --verbose + + --verbose + Produces verbose output. @@ -148,18 +304,28 @@ - --dry-run + + --dry-run + - Shows the output of lustre_rsync commands (copy, mkdir, etc.) on the target file system without actually executing them. + Shows the output of + lustre_rsync commands ( + copy, + mkdir, etc.) on the target file system + without actually executing them. - --abort-on-err + + --abort-on-err + - Stops processing the lustre_rsync operation if an error occurs. The default is to continue the operation. + Stops processing the + lustre_rsync operation if an error occurs. + The default is to continue the operation. 
@@ -167,12 +333,22 @@
- <indexterm><primary>backup</primary><secondary>rsync</secondary><tertiary>examples</tertiary></indexterm><literal>lustre_rsync</literal> Examples - Sample lustre_rsync commands are listed below. - Register a changelog user for an MDT (e.g. testfs-MDT0000). + + <indexterm> + <primary>backup</primary> + <secondary>rsync</secondary> + <tertiary>examples</tertiary> + </indexterm> + <literal>lustre_rsync</literal> Examples + Sample + lustre_rsync commands are listed below. + Register a changelog user for an MDT (e.g. + testfs-MDT0000). # lctl --device testfs-MDT0000 changelog_register testfs-MDT0000 -Registered changelog userid 'cl1' - Synchronize a Lustre file system (/mnt/lustre) to a target file system (/mnt/target). +Registered changelog userid 'cl1' + Synchronize a Lustre file system ( + /mnt/lustre) to a target file system ( + /mnt/target). $ lustre_rsync --source=/mnt/lustre --target=/mnt/target \ --mdt=testfs-MDT0000 --user=cl1 --statuslog sync.log --verbose Lustre filesystem: testfs @@ -185,7 +361,10 @@ Starting changelog record: 0 Errors: 0 lustre_rsync took 1 seconds Changelog records consumed: 22 - After the file system undergoes changes, synchronize the changes onto the target file system. Only the statuslog name needs to be specified, as it has all the parameters passed earlier. + After the file system undergoes changes, synchronize the changes + onto the target file system. Only the + statuslog name needs to be specified, as it has all + the parameters passed earlier. $ lustre_rsync --statuslog sync.log --verbose Replicating Lustre filesystem: testfs MDT device: testfs-MDT0000 @@ -197,7 +376,10 @@ Starting changelog record: 22 Errors: 0 lustre_rsync took 2 seconds Changelog records consumed: 42 - To synchronize a Lustre file system (/mnt/lustre) to two target file systems (/mnt/target1 and /mnt/target2). + To synchronize a Lustre file system ( + /mnt/lustre) to two target file systems ( + /mnt/target1 and + /mnt/target2). $ lustre_rsync --source=/mnt/lustre --target=/mnt/target1 \ --target=/mnt/target2 --mdt=testfs-MDT0000 --user=cl1 \ --statuslog sync.log @@ -205,117 +387,210 @@ Changelog records consumed: 42
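Once synchronization with lustre_rsync is retired, the changelog user can be deregistered so that the MDT stops retaining changelog records on its behalf. A minimal sketch, assuming the user ID cl1 and MDT name from the registration example above (verify the exact lctl syntax for your release):
# lctl --device testfs-MDT0000 changelog_deregister cl1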
- <indexterm><primary>backup</primary><secondary>MDS/OST device level</secondary></indexterm>Backing Up and Restoring an MDS or OST (Device Level) - In some cases, it is useful to do a full device-level backup of an individual device (MDT or OST), before replacing hardware, performing maintenance, etc. Doing full device-level backups ensures that all of the data and configuration files is preserved in the original state and is the easiest method of doing a backup. For the MDT file system, it may also be the fastest way to perform the backup and restore, since it can do large streaming read and write operations at the maximum bandwidth of the underlying devices. + + <indexterm> + <primary>backup</primary> + <secondary>MDS/OST device level</secondary> + </indexterm>Backing Up and Restoring an MDS or OST (Device Level) + In some cases, it is useful to do a full device-level backup of an + individual device (MDT or OST), before replacing hardware, performing + maintenance, etc. Doing full device-level backups ensures that all of the + data and configuration files is preserved in the original state and is the + easiest method of doing a backup. For the MDT file system, it may also be + the fastest way to perform the backup and restore, since it can do large + streaming read and write operations at the maximum bandwidth of the + underlying devices. - Keeping an updated full backup of the MDT is especially important because a permanent failure of the MDT file system renders the much larger amount of data in all the OSTs largely inaccessible and unusable. + Keeping an updated full backup of the MDT is especially important + because a permanent failure of the MDT file system renders the much + larger amount of data in all the OSTs largely inaccessible and + unusable. - In Lustre software release 2.0 through 2.2, the only successful way to backup and - restore an MDT is to do a device-level backup as is described in this section. File-level - restore of an MDT is not possible before Lustre software release 2.3, as the Object Index - (OI) file cannot be rebuilt after restore without the OI Scrub functionality. Since Lustre software release 2.3, Object Index files are - automatically rebuilt at first mount after a restore is detected (see LU-957), and file-level backup - is supported (see ). + In Lustre software release 2.0 through 2.2, the only successful way + to backup and restore an MDT is to do a device-level backup as is + described in this section. File-level restore of an MDT is not possible + before Lustre software release 2.3, as the Object Index (OI) file cannot + be rebuilt after restore without the OI Scrub functionality. + Since Lustre software release 2.3, + Object Index files are automatically rebuilt at first mount after a + restore is detected (see + LU-957), + and file-level backup is supported (see + ). - If hardware replacement is the reason for the backup or if a spare storage device is available, it is possible to do a raw copy of the MDT or OST from one block device to the other, as long as the new device is at least as large as the original device. To do this, run: + If hardware replacement is the reason for the backup or if a spare + storage device is available, it is possible to do a raw copy of the MDT or + OST from one block device to the other, as long as the new device is at + least as large as the original device. 
To do this, run: dd if=/dev/{original} of=/dev/{newdev} bs=1M - If hardware errors cause read problems on the original device, use the command below to allow as much data as possible to be read from the original device while skipping sections of the disk with errors: + If hardware errors cause read problems on the original device, use + the command below to allow as much data as possible to be read from the + original device while skipping sections of the disk with errors: dd if=/dev/{original} of=/dev/{newdev} bs=4k conv=sync,noerror / count={original size in 4kB blocks} - Even in the face of hardware errors, the ldiskfs - file system is very robust and it may be possible to recover the file - system data after running e2fsck -fy /dev/{newdev} on - the new device, along with ll_recover_lost_found_objs - for OST devices. + Even in the face of hardware errors, the + ldiskfs file system is very robust and it may be possible + to recover the file system data after running + e2fsck -fy /dev/{newdev} on the new device, along with + ll_recover_lost_found_objs for OST devices. With Lustre software version 2.6 and later, there is - no longer a need to run ll_recover_lost_found_objs on - the OSTs, since the LFSCK scanning will automatically - move objects from lost+found back into its correct - location on the OST after directory corruption. + no longer a need to run + ll_recover_lost_found_objs on the OSTs, since the + LFSCK scanning will automatically move objects from + lost+found back into its correct location on the OST + after directory corruption.
- <indexterm><primary>backup</primary><secondary>OST file system</secondary></indexterm><indexterm><primary>backup</primary><secondary>MDT file system</secondary></indexterm>Making a File-Level Backup of an OST or MDT File System - This procedure provides an alternative to backup or migrate the data of an OST or MDT at the file level. At the file-level, unused space is omitted from the backed up and the process may be completed quicker with smaller total backup size. Backing up a single OST device is not necessarily the best way to perform backups of the Lustre file system, since the files stored in the backup are not usable without metadata stored on the MDT and additional file stripes that may be on other OSTs. However, it is the preferred method for migration of OST devices, especially when it is desirable to reformat the underlying file system with different configuration options or to reduce fragmentation. + + <indexterm> + <primary>backup</primary> + <secondary>OST file system</secondary> + </indexterm> + <indexterm> + <primary>backup</primary> + <secondary>MDT file system</secondary> + </indexterm>Making a File-Level Backup of an OST or MDT File System + This procedure provides an alternative to backup or migrate the data + of an OST or MDT at the file level. At the file-level, unused space is + omitted from the backed up and the process may be completed quicker with + smaller total backup size. Backing up a single OST device is not + necessarily the best way to perform backups of the Lustre file system, + since the files stored in the backup are not usable without metadata stored + on the MDT and additional file stripes that may be on other OSTs. However, + it is the preferred method for migration of OST devices, especially when it + is desirable to reformat the underlying file system with different + configuration options or to reduce fragmentation. - Prior to Lustre software release 2.3, the only successful way to perform an MDT backup - and restore is to do a device-level backup as is described in . The ability to do MDT file-level backups is not - available for Lustre software release 2.0 through 2.2, because restoration of the Object - Index (OI) file does not return the MDT to a functioning state. Since - Lustre software release 2.3, Object Index files are automatically rebuilt at - first mount after a restore is detected (see LU-957), so file-level MDT - restore is supported. + Prior to Lustre software release 2.3, the only successful way to + perform an MDT backup and restore is to do a device-level backup as is + described in + . The ability to do MDT + file-level backups is not available for Lustre software release 2.0 + through 2.2, because restoration of the Object Index (OI) file does not + return the MDT to a functioning state. + Since Lustre software release 2.3, + Object Index files are automatically rebuilt at first mount after a + restore is detected (see + LU-957), + so file-level MDT restore is supported. - For Lustre software release 2.3 and newer with MDT file-level backup support, substitute - mdt for ost in the instructions below. + For Lustre software release 2.3 and newer with MDT file-level backup + support, substitute + mdt for + ost in the instructions below. - Make a mountpoint for the file system. + + Make a mountpoint for the file + system. + [oss]# mkdir -p /mnt/ost - Mount the file system. + + Mount the file system. + [oss]# mount -t ldiskfs /dev/{ostdev} /mnt/ost - Change to the mountpoint being backed up. 
+ + Change to the mountpoint being backed + up. + [oss]# cd /mnt/ost - Back up the extended attributes. - [oss]# getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak + + Back up the extended attributes. + + [oss]# getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak - If the tar(1) command supports the --xattr option, the getfattr step may be unnecessary as long as tar does a backup of the trusted.* attributes. However, completing this step is not harmful and can serve as an added safety measure. + If the + tar(1) command supports the + --xattr option, the + getfattr step may be unnecessary as long as tar + does a backup of the + trusted.* attributes. However, completing this step + is not harmful and can serve as an added safety measure. - In most distributions, the getfattr command is part of the attr package. If the getfattr command returns errors like Operation not supported, then the kernel does not correctly support EAs. Stop and use a different backup method. + In most distributions, the + getfattr command is part of the + attr package. If the + getfattr command returns errors like + Operation not supported, then the kernel does not + correctly support EAs. Stop and use a different backup method. - Verify that the ea-$date.bak file has properly backed up the EA data on the OST. - Without this attribute data, the restore process may be missing extra data that can be very useful in case of later file system corruption. Look at this file with more or a text editor. Each object file should have a corresponding item similar to this: + + Verify that the + ea-$date.bak file has properly backed up the EA + data on the OST. + + Without this attribute data, the restore process may be missing + extra data that can be very useful in case of later file system + corruption. Look at this file with more or a text editor. Each object + file should have a corresponding item similar to this: [oss]# file: O/0/d0/100992 trusted.fid= \ 0x0d822200000000004a8a73e500000000808a0100000000000000000000000000 - Back up all file system data. + + Back up all file system data. + [oss]# tar czvf {backup file}.tgz [--xattrs] --sparse . - The tar --sparse option is vital for backing up an MDT. In - order to have --sparse behave correctly, and complete the backup of - and MDT in finite time, the version of tar must be specified. Correctly functioning - versions of tar include the Lustre software enhanced version of tar at , the tar from a Red Hat Enterprise Linux distribution (version 6.3 or more recent) - and the GNU tar version 1.25 or more recent. + The tar + --sparse option is vital for backing up an MDT. In + order to have + --sparse behave correctly, and complete the backup + of and MDT in finite time, the version of tar must be specified. + Correctly functioning versions of tar include the Lustre software + enhanced version of tar at + , + the tar from a Red Hat Enterprise Linux distribution (version 6.3 or + more recent) and the GNU tar version 1.25 or more recent. - The tar --xattrs option is only available - in GNU tar distributions from Red Hat or Intel. + The tar + --xattrs option is only available in GNU tar + distributions from Red Hat or Intel. - Change directory out of the file system. + + Change directory out of the file + system. + [oss]# cd - - Unmount the file system. + + Unmount the file system. 
+ [oss]# umount /mnt/ost - When restoring an OST backup on a different node as part of an OST migration, you also have to change server NIDs and use the --writeconf command to re-generate the configuration logs. See (Changing a Server NID). + When restoring an OST backup on a different node as part of an + OST migration, you also have to change server NIDs and use the + --writeconf command to re-generate the + configuration logs. See + (Changing a Server NID).
-    <indexterm><primary>backup</primary><secondary>restoring file system backup</secondary></indexterm>Restoring a File-Level Backup
-    To restore data from a file-level backup, you need to format the device, restore the file data and then restore the EA data.
+    
+      <indexterm>
+        <primary>backup</primary>
+        <secondary>restoring file system backup</secondary>
+      </indexterm>Restoring a File-Level Backup
+    To restore data from a file-level backup, you need to format the
+    device, restore the file data and then restore the EA data.
     
       
         Format the new device.
@@ -341,12 +616,14 @@
         Restore the file system extended attributes.
         [oss]# setfattr --restore=ea-${date}.bak
-        If --xattrs option is supported by tar and specified in the step above, this step is redundant.
+        If
+        --xattrs option is supported by tar and specified
+        in the step above, this step is redundant.
       
       
         Verify that the extended attributes were restored.
-        [oss]# getfattr -d -m ".*" -e hex O/0/d0/100992 trusted.fid= \
+        [oss]# getfattr -d -m ".*" -e hex O/0/d0/100992 trusted.fid= \
 0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
@@ -358,41 +635,77 @@
         [oss]# umount /mnt/ost
       
     
-    If the file system was used between the time the backup was made and when it was restored, then the online LFSCK tool (part of Lustre code) will automatically be run to ensure the file system is coherent. If all of the device file systems were backed up at the same time after the entire Lustre file system was stopped, this is not necessary. In either case, the file system should be immediately usable even if LFSCK is not run, though there may be I/O errors reading from files that are present on the MDT but not the OSTs, and files that were created after the MDT backup will not be accessible/visible. See  for details on using LFSCK.
+    If the file system was used between the time the backup was made and
+    when it was restored, then the online
+    LFSCK tool (part of Lustre code after version 2.3)
+    will automatically be
+    run to ensure the file system is coherent. If all of the device file
+    systems were backed up at the same time after the entire Lustre file system
+    was stopped, this step is unnecessary. In either case, the file system will
+    be immediately usable, although there may be I/O errors reading
+    from files that are present on the MDT but not the OSTs, and files that
+    were created after the MDT backup will not be accessible or visible. See
+     for details on using LFSCK.
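After a file-level MDT restore, progress of the automatic Object Index rebuild can be watched from the MDS before returning the file system to service; a hedged sketch, assuming an ldiskfs MDT named testfs-MDT0000 (the osd-ldiskfs parameter name may vary between releases):
[mds]# lctl get_param -n osd-ldiskfs.testfs-MDT0000.oi_scrub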
- <indexterm> - <primary>backup</primary> - <secondary>using LVM</secondary> - </indexterm>Using LVM Snapshots with the Lustre File System - If you want to perform disk-based backups (because, for example, access to the backup system needs to be as fast as to the primary Lustre file system), you can use the Linux LVM snapshot tool to maintain multiple, incremental file system backups. - Because LVM snapshots cost CPU cycles as new files are written, taking snapshots of the main Lustre file system will probably result in unacceptable performance losses. You should create a new, backup Lustre file system and periodically (e.g., nightly) back up new/changed files to it. Periodic snapshots can be taken of this backup file system to create a series of "full" backups. + + <indexterm> + <primary>backup</primary> + <secondary>using LVM</secondary> + </indexterm>Using LVM Snapshots with the Lustre File System + If you want to perform disk-based backups (because, for example, + access to the backup system needs to be as fast as to the primary Lustre + file system), you can use the Linux LVM snapshot tool to maintain multiple, + incremental file system backups. + Because LVM snapshots cost CPU cycles as new files are written, + taking snapshots of the main Lustre file system will probably result in + unacceptable performance losses. You should create a new, backup Lustre + file system and periodically (e.g., nightly) back up new/changed files to + it. Periodic snapshots can be taken of this backup file system to create a + series of "full" backups. - Creating an LVM snapshot is not as reliable as making a separate backup, because the LVM snapshot shares the same disks as the primary MDT device, and depends on the primary MDT device for much of its data. If the primary MDT device becomes corrupted, this may result in the snapshot being corrupted. + Creating an LVM snapshot is not as reliable as making a separate + backup, because the LVM snapshot shares the same disks as the primary MDT + device, and depends on the primary MDT device for much of its data. If + the primary MDT device becomes corrupted, this may result in the snapshot + being corrupted.
- <indexterm><primary>backup</primary><secondary>using LVM</secondary><tertiary>creating</tertiary></indexterm>Creating an LVM-based Backup File System - Use this procedure to create a backup Lustre file system for use with the LVM snapshot mechanism. + + <indexterm> + <primary>backup</primary> + <secondary>using LVM</secondary> + <tertiary>creating</tertiary> + </indexterm>Creating an LVM-based Backup File System + Use this procedure to create a backup Lustre file system for use + with the LVM snapshot mechanism. Create LVM volumes for the MDT and OSTs. - Create LVM devices for your MDT and OST targets. Make sure not to use the entire disk for the targets; save some room for the snapshots. The snapshots start out as 0 size, but grow as you make changes to the current file system. If you expect to change 20% of the file system between backups, the most recent snapshot will be 20% of the target size, the next older one will be 40%, etc. Here is an example: + Create LVM devices for your MDT and OST targets. Make sure not + to use the entire disk for the targets; save some room for the + snapshots. The snapshots start out as 0 size, but grow as you make + changes to the current file system. If you expect to change 20% of + the file system between backups, the most recent snapshot will be 20% + of the target size, the next older one will be 40%, etc. Here is an + example: cfs21:~# pvcreate /dev/sda1 - Physical volume "/dev/sda1" successfully created + Physical volume "/dev/sda1" successfully created cfs21:~# vgcreate vgmain /dev/sda1 - Volume group "vgmain" successfully created + Volume group "vgmain" successfully created cfs21:~# lvcreate -L200G -nMDT0 vgmain - Logical volume "MDT0" created + Logical volume "MDT0" created cfs21:~# lvcreate -L200G -nOST0 vgmain - Logical volume "OST0" created + Logical volume "OST0" created cfs21:~# lvscan - ACTIVE '/dev/vgmain/MDT0' [200.00 GB] inherit - ACTIVE '/dev/vgmain/OST0' [200.00 GB] inherit + ACTIVE '/dev/vgmain/MDT0' [200.00 GB] inherit + ACTIVE '/dev/vgmain/OST0' [200.00 GB] inherit Format the LVM volumes as Lustre targets. - In this example, the backup file system is called main and - designates the current, most up-to-date backup. + In this example, the backup file system is called + main and designates the current, most up-to-date + backup. cfs21:~# mkfs.lustre --fsname=main --mdt --index=0 /dev/vgmain/MDT0 No management node specified, adding MGS to this MDT. Permanent disk data: @@ -413,7 +726,8 @@ checking for existing Lustre data mkfs_cmd = mkfs.ext2 -j -b 4096 -L main-MDT0000 -i 4096 -I 512 -q -O dir_index -F /dev/vgmain/MDT0 Writing CONFIGS/mountdata -cfs21:~# mkfs.lustre --mgsnode=cfs21 --fsname=main --ost --index=0 /dev/vgmain/OST0 +cfs21:~# mkfs.lustre --mgsnode=cfs21 --fsname=main --ost --index=0 +/dev/vgmain/OST0 Permanent disk data: Target: main-OST0000 Index: 0 @@ -435,13 +749,20 @@ checking for existing Lustre data Writing CONFIGS/mountdata cfs21:~# mount -t lustre /dev/vgmain/MDT0 /mnt/mdt cfs21:~# mount -t lustre /dev/vgmain/OST0 /mnt/ost -cfs21:~# mount -t lustre cfs21:/main /mnt/main +cfs21:~# mount -t lustre cfs21:/main /mnt/main +
- <indexterm><primary>backup</primary><secondary>new/changed files</secondary></indexterm>Backing up New/Changed Files to the Backup File System - At periodic intervals e.g., nightly, back up new and changed files to the LVM-based backup file system. + + <indexterm> + <primary>backup</primary> + <secondary>new/changed files</secondary> + </indexterm>Backing up New/Changed Files to the Backup File + System + At periodic intervals e.g., nightly, back up new and changed files + to the LVM-based backup file system. cfs21:~# cp /etc/passwd /mnt/main cfs21:~# cp /etc/fstab /mnt/main @@ -450,29 +771,60 @@ cfs21:~# ls /mnt/main fstab passwd
- <indexterm><primary>backup</primary><secondary>using LVM</secondary><tertiary>creating snapshots</tertiary></indexterm>Creating Snapshot Volumes - Whenever you want to make a "checkpoint" of the main Lustre file system, create LVM snapshots of all target MDT and OSTs in the LVM-based backup file system. You must decide the maximum size of a snapshot ahead of time, although you can dynamically change this later. The size of a daily snapshot is dependent on the amount of data changed daily in the main Lustre file system. It is likely that a two-day old snapshot will be twice as big as a one-day old snapshot. - You can create as many snapshots as you have room for in the volume group. If necessary, you can dynamically add disks to the volume group. - The snapshots of the target MDT and OSTs should be taken at the same point in time. Make sure that the cronjob updating the backup file system is not running, since that is the only thing writing to the disks. Here is an example: + + <indexterm> + <primary>backup</primary> + <secondary>using LVM</secondary> + <tertiary>creating snapshots</tertiary> + </indexterm>Creating Snapshot Volumes + Whenever you want to make a "checkpoint" of the main Lustre file + system, create LVM snapshots of all target MDT and OSTs in the LVM-based + backup file system. You must decide the maximum size of a snapshot ahead + of time, although you can dynamically change this later. The size of a + daily snapshot is dependent on the amount of data changed daily in the + main Lustre file system. It is likely that a two-day old snapshot will be + twice as big as a one-day old snapshot. + You can create as many snapshots as you have room for in the volume + group. If necessary, you can dynamically add disks to the volume + group. + The snapshots of the target MDT and OSTs should be taken at the + same point in time. Make sure that the cronjob updating the backup file + system is not running, since that is the only thing writing to the disks. + Here is an example: cfs21:~# modprobe dm-snapshot cfs21:~# lvcreate -L50M -s -n MDT0.b1 /dev/vgmain/MDT0 Rounding up size to full physical extent 52.00 MB - Logical volume "MDT0.b1" created + Logical volume "MDT0.b1" created cfs21:~# lvcreate -L50M -s -n OST0.b1 /dev/vgmain/OST0 Rounding up size to full physical extent 52.00 MB - Logical volume "OST0.b1" created - After the snapshots are taken, you can continue to back up new/changed files to "main". The snapshots will not contain the new files. + Logical volume "OST0.b1" created + + After the snapshots are taken, you can continue to back up + new/changed files to "main". The snapshots will not contain the new + files. cfs21:~# cp /etc/termcap /mnt/main cfs21:~# ls /mnt/main -fstab passwd termcap +fstab passwd termcap +
- <indexterm><primary>backup</primary><secondary>using LVM</secondary><tertiary>restoring</tertiary></indexterm>Restoring the File System From a Snapshot - Use this procedure to restore the file system from an LVM snapshot. + + <indexterm> + <primary>backup</primary> + <secondary>using LVM</secondary> + <tertiary>restoring</tertiary> + </indexterm>Restoring the File System From a Snapshot + Use this procedure to restore the file system from an LVM + snapshot. Rename the LVM snapshot. - Rename the file system snapshot from "main" to "back" so you can mount it without unmounting "main". This is recommended, but not required. Use the --reformat flag to tunefs.lustre to force the name change. For example: + Rename the file system snapshot from "main" to "back" so you + can mount it without unmounting "main". This is recommended, but not + required. Use the + --reformat flag to + tunefs.lustre to force the name change. For + example: cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/vgmain/MDT0.b1 checking for existing Lustre data found Lustre data @@ -518,9 +870,10 @@ Permanent disk data: (OST writeconf ) Persistent mount opts: errors=remount-ro,extents,mballoc Parameters: mgsnode=192.168.0.21@tcp -Writing CONFIGS/mountdata - When renaming a file system, we must also erase the last_rcvd file from the - snapshots +Writing CONFIGS/mountdata + + When renaming a file system, we must also erase the last_rcvd + file from the snapshots cfs21:~# mount -t ldiskfs /dev/vgmain/MDT0.b1 /mnt/mdtback cfs21:~# rm /mnt/mdtback/last_rcvd cfs21:~# umount /mnt/mdtback @@ -529,29 +882,45 @@ cfs21:~# rm /mnt/ostback/last_rcvd cfs21:~# umount /mnt/ostback - Mount the file system from the LVM snapshot. For example: + Mount the file system from the LVM snapshot. For + example: cfs21:~# mount -t lustre /dev/vgmain/MDT0.b1 /mnt/mdtback cfs21:~# mount -t lustre /dev/vgmain/OST0.b1 /mnt/ostback cfs21:~# mount -t lustre cfs21:/back /mnt/back - Note the old directory contents, as of the snapshot time. For example: + Note the old directory contents, as of the snapshot time. For + example: cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back -fstab passwds +fstab passwds +
- <indexterm><primary>backup</primary><secondary>using LVM</secondary><tertiary>deleting</tertiary></indexterm>Deleting Old Snapshots - To reclaim disk space, you can erase old snapshots as your backup policy dictates. Run: + + <indexterm> + <primary>backup</primary> + <secondary>using LVM</secondary> + <tertiary>deleting</tertiary> + </indexterm>Deleting Old Snapshots + To reclaim disk space, you can erase old snapshots as your backup + policy dictates. Run: lvremove /dev/vgmain/MDT0.b1
- <indexterm><primary>backup</primary><secondary>using LVM</secondary><tertiary>resizing</tertiary></indexterm>Changing Snapshot Volume Size - You can also extend or shrink snapshot volumes if you find your daily deltas are smaller or larger than expected. Run: + + <indexterm> + <primary>backup</primary> + <secondary>using LVM</secondary> + <tertiary>resizing</tertiary> + </indexterm>Changing Snapshot Volume Size + You can also extend or shrink snapshot volumes if you find your + daily deltas are smaller or larger than expected. Run: lvextend -L10G /dev/vgmain/MDT0.b1 - Extending snapshots seems to be broken in older LVM. It is working in LVM v2.02.01. + Extending snapshots seems to be broken in older LVM. It is + working in LVM v2.02.01.
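To shrink an over-provisioned snapshot, the matching LVM command can be used in the same way; a sketch only, with an example size:
lvreduce -L2G /dev/vgmain/MDT0.b1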
diff --git a/Glossary.xml b/Glossary.xml index a8c2be7..84416da 100644 --- a/Glossary.xml +++ b/Glossary.xml @@ -253,7 +253,7 @@ - lfsck + LFSCK Lustre file system check. A distributed version of a disk file system checker. diff --git a/TroubleShootingRecovery.xml b/TroubleShootingRecovery.xml index 987228b..3c24798 100644 --- a/TroubleShootingRecovery.xml +++ b/TroubleShootingRecovery.xml @@ -1,693 +1,1445 @@ - - Troubleshooting Recovery - This chapter describes what to do if something goes wrong during recovery. It describes: - - - - - - - - - - - - - - -
- <indexterm><primary>recovery</primary><secondary>corruption of backing ldiskfs file system</secondary></indexterm>Recovering from Errors or Corruption on a Backing ldiskfs File System - When an OSS, MDS, or MGS server crash occurs, it is not necessary to run e2fsck on the - file system. ldiskfs journaling ensures that the file system remains - consistent over a system crash. The backing file systems are never accessed directly - from the client, so client crashes are not relevant for server file system - consistency. - The only time it is REQUIRED that e2fsck be run on a device is when an event causes problems that ldiskfs journaling is unable to handle, such as a hardware device failure or I/O error. If the ldiskfs kernel code detects corruption on the disk, it mounts the file system as read-only to prevent further corruption, but still allows read access to the device. This appears as error "-30" (EROFS) in the syslogs on the server, e.g.: - Dec 29 14:11:32 mookie kernel: LDISKFS-fs error (device sdz): + + + Troubleshooting + Recovery + This chapter describes what to do if something goes wrong during + recovery. It describes: + + + + + + + + + + + + + + + + + + + + + + +
+ + <indexterm> + <primary>recovery</primary> + <secondary>corruption of backing ldiskfs file system</secondary> + </indexterm>Recovering from Errors or Corruption on a Backing ldiskfs File + System + When an OSS, MDS, or MGS server crash occurs, it is not necessary to + run e2fsck on the file system. + ldiskfs journaling ensures that the file system remains + consistent over a system crash. The backing file systems are never accessed + directly from the client, so client crashes are not relevant for server + file system consistency. + The only time it is REQUIRED that + e2fsck be run on a device is when an event causes + problems that ldiskfs journaling is unable to handle, such as a hardware + device failure or I/O error. If the ldiskfs kernel code detects corruption + on the disk, it mounts the file system as read-only to prevent further + corruption, but still allows read access to the device. This appears as + error "-30" ( + EROFS) in the syslogs on the server, e.g.: + Dec 29 14:11:32 mookie kernel: LDISKFS-fs error (device sdz): ldiskfs_lookup: unlinked inode 5384166 in dir #145170469 -Dec 29 14:11:32 mookie kernel: Remounting filesystem read-only - In such a situation, it is normally required that e2fsck only be run on the bad device before placing the device back into service. - In the vast majority of cases, the Lustre software can cope with any inconsistencies - found on the disk and between other devices in the file system. - - The offline LFSCK tool included with e2fsprogs is rarely required for Lustre file - system operation. - - For problem analysis, it is strongly recommended that e2fsck be run under a logger, like script, to record all of the output and changes that are made to the file system in case this information is needed later. - If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n option) to assess the type and extent of damage to the file system. The drawback is that in this mode, e2fsck does not recover the file system journal, so there may appear to be file system corruption when none really exists. - To address concern about whether corruption is real or only due to the journal not - being replayed, you can briefly mount and unmount the ldiskfs file - system directly on the node with the Lustre file system stopped, using a command similar - to: - mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost - This causes the journal to be recovered. - The e2fsck utility works well when fixing file system corruption - (better than similar file system recovery tools and a primary reason why - ldiskfs was chosen over other file systems). However, it is often - useful to identify the type of damage that has occurred so an ldiskfs - expert can make intelligent decisions about what needs fixing, in place of - e2fsck. - root# {stop lustre services for this device, if running} +Dec 29 14:11:32 mookie kernel: Remounting filesystem read-only + In such a situation, it is normally required that e2fsck only be run + on the bad device before placing the device back into service. + In the vast majority of cases, the Lustre software can cope with any + inconsistencies found on the disk and between other devices in the file + system. + + The legacy offline-LFSCK tool included with e2fsprogs is rarely + required for Lustre file system operation. offline-LFSCK is not to be + confused with LFSCK tool, which is part of Lustre and provides online + consistency checking. 
+ + For problem analysis, it is strongly recommended that + e2fsck be run under a logger, like script, to record all + of the output and changes that are made to the file system in case this + information is needed later. + If time permits, it is also a good idea to first run + e2fsck in non-fixing mode (-n option) to assess the type + and extent of damage to the file system. The drawback is that in this mode, + e2fsck does not recover the file system journal, so there + may appear to be file system corruption when none really exists. + To address concern about whether corruption is real or only due to + the journal not being replayed, you can briefly mount and unmount the + ldiskfs file system directly on the node with the Lustre + file system stopped, using a command similar to: + mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost + This causes the journal to be recovered. + The + e2fsck utility works well when fixing file system + corruption (better than similar file system recovery tools and a primary + reason why + ldiskfs was chosen over other file systems). However, it + is often useful to identify the type of damage that has occurred so an + ldiskfs expert can make intelligent decisions about what + needs fixing, in place of + e2fsck. + root# {stop lustre services for this device, if running} root# script /tmp/e2fsck.sda Script started, file is /tmp/e2fsck.sda root# mount -t ldiskfs /dev/sda /mnt/ost root# umount /mnt/ost -root# e2fsck -fn /dev/sda # don't fix file system, just check for corruption +root# e2fsck -fn /dev/sda # don't fix file system, just check for corruption : [e2fsck output] : -root# e2fsck -fp /dev/sda # fix errors with prudent answers (usually yes) - +root# e2fsck -fp /dev/sda # fix errors with prudent answers (usually yes) +
+
+  
+    
+    <indexterm>
+      <primary>recovery</primary>
+      <secondary>corruption of Lustre file system</secondary>
+    </indexterm>Recovering from Corruption in the Lustre File System
+    In cases where an ldiskfs MDT or OST becomes corrupt, you need to run
+    e2fsck to restore local file system consistency, then use
+    LFSCK to run a distributed check on the file system to
+    resolve any inconsistencies between the MDTs and OSTs, or among MDTs.
+    
+      
+        Stop the Lustre file system.
+      
+      
+        Run
+        e2fsck -f on the individual MDT/OST that had
+        problems to fix any local file system damage.
+        We recommend running
+        e2fsck under script, to create a log of changes made
+        to the file system in case it is needed later. After
+        e2fsck is run, bring up the file system, if
+        necessary, to reduce the outage window.
+      
+    
+ + <indexterm> + <primary>recovery</primary> + <secondary>orphaned objects</secondary> + </indexterm>Working with Orphaned Objects + The simplest problem to resolve is that of orphaned objects. When + the LFSCK layout check is run, these objects are linked to new files and + put into + .lustre/lost+found/MDTxxxx + in the Lustre file system + (where MDTxxxx is the index of the MDT on which the orphan was found), + where they can be examined and saved or deleted as necessary. + With Lustre version 2.7 and later, LFSCK will + identify and process orphan objects found on MDTs as well.
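Recovered orphan objects can be examined from any client before deciding whether to keep or delete them. A minimal sketch, assuming the file system is mounted on /mnt/lustre and the orphans were recovered on MDT index 0 (both values are illustrative assumptions):
client# ls -l /mnt/lustre/.lustre/lost+found/MDT0000/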
-
- <indexterm><primary>recovery</primary><secondary>corruption of Lustre file system</secondary></indexterm>Recovering from Corruption in the Lustre File System - In cases where an ldiskfs MDT or OST becomes corrupt, you need to run e2fsck to correct the local filesystem consistency, then use LFSCK to run a distributed check on the file system to resolve any inconsistencies between the MDTs and OSTs. - - - Stop the Lustre file system. - - - Run e2fsck -f on the individual MDS / OST that had problems to fix any local file system damage. - We recommend running e2fsck under script, to create a log of changes made to the file system in case it is needed later. After e2fsck is run, bring up the file system, if necessary, to reduce the outage window. - - -
- <indexterm><primary>recovery</primary><secondary>orphaned objects</secondary></indexterm>Working with Orphaned Objects - The easiest problem to resolve is that of orphaned objects. When the LFSCK layout check is run, these objects are linked to new files and put into .lustre/lost+found in the Lustre file system, where they can be examined and saved or deleted as necessary. -
-
-
- <indexterm><primary>recovery</primary><secondary>unavailable OST</secondary></indexterm>Recovering from an Unavailable OST - One problem encountered in a Lustre file system environment is - when an OST becomes unavailable due to a network partition, OSS node crash, etc. When - this happens, the OST's clients pause and wait for the OST to become available - again, either on the primary OSS or a failover OSS. When the OST comes back online, the - Lustre file system starts a recovery process to enable clients to reconnect to the OST. - Lustre servers put a limit on the time they will wait in recovery for clients to - reconnect. - During recovery, clients reconnect and replay their requests serially, in the same order they were done originally. Until a client receives a confirmation that a given transaction has been written to stable storage, the client holds on to the transaction, in case it needs to be replayed. Periodically, a progress message prints to the log, stating how_many/expected clients have reconnected. If the recovery is aborted, this log shows how many clients managed to reconnect. When all clients have completed recovery, or if the recovery timeout is reached, the recovery period ends and the OST resumes normal request processing. - If some clients fail to replay their requests during the recovery period, this will not stop the recovery from completing. You may have a situation where the OST recovers, but some clients are not able to participate in recovery (e.g. network problems or client failure), so they are evicted and their requests are not replayed. This would result in any operations on the evicted clients failing, including in-progress writes, which would cause cached writes to be lost. This is a normal outcome; the recovery cannot wait indefinitely, or the file system would be hung any time a client failed. The lost transactions are an unfortunate result of the recovery process. - - The failure of client recovery does not indicate or lead to - filesystem corruption. This is a normal event that is handled by - the MDT and OST, and should not result in any inconsistencies - between servers. - - - The version-based recovery (VBR) feature enables a failed client to be ''skipped'', so remaining clients can replay their requests, resulting in a more successful recovery from a downed OST. For more information about the VBR feature, see (Version-based Recovery). - -
-
- <indexterm><primary>recovery</primary><secondary>oiscrub</secondary></indexterm><indexterm><primary>recovery</primary><secondary>lfsck</secondary></indexterm>Checking the file system with LFSCK - LFSCK is an administrative tool introduced in Lustre software release 2.3 for checking - and repair of the attributes specific to a mounted Lustre file system. It is similar in - concept to an offline fsck repair tool for a local filesystem, - but LFSCK is implemented to run as part of the Lustre file system while the file - system is mounted and in use. This allows consistency of checking and repair by the - Lustre software without unnecessary downtime, and can be run on the largest Lustre file - systems. - In Lustre software release 2.3, LFSCK can verify and repair the Object Index (OI) - table that is used internally to map Lustre File Identifiers (FIDs) to MDT internal - inode numbers, through a process called OI Scrub. An OI Scrub is required after - restoring from a file-level MDT backup (), or - in case the OI table is otherwise corrupted. Later phases of LFSCK will add further - checks to the Lustre distributed file system state. - In Lustre software release 2.4, LFSCK namespace scanning can verify and repair the directory FID-in-Dirent and LinkEA consistency. - - In Lustre software release 2.6, LFSCK layout scanning can verify and repair MDT-OST file layout inconsistency. File layout inconsistencies between MDT-objects and OST-objects that are checked and corrected include dangling reference, unreferenced OST-objects, mismatched references and multiple references. - - Control and monitoring of LFSCK is through LFSCK and the /proc file system - interfaces. LFSCK supports three types of interface: switch interface, status - interface and adjustment interface. These interfaces are detailed below. +
+
+ + <indexterm> + <primary>recovery</primary> + <secondary>unavailable OST</secondary> + </indexterm>Recovering from an Unavailable OST + One problem encountered in a Lustre file system environment is when + an OST becomes unavailable due to a network partition, OSS node crash, etc. + When this happens, the OST's clients pause and wait for the OST to become + available again, either on the primary OSS or a failover OSS. When the OST + comes back online, the Lustre file system starts a recovery process to + enable clients to reconnect to the OST. Lustre servers put a limit on the + time they will wait in recovery for clients to reconnect. + During recovery, clients reconnect and replay their requests + serially, in the same order they were done originally. Until a client + receives a confirmation that a given transaction has been written to stable + storage, the client holds on to the transaction, in case it needs to be + replayed. Periodically, a progress message prints to the log, stating + how_many/expected clients have reconnected. If the recovery is aborted, + this log shows how many clients managed to reconnect. When all clients have + completed recovery, or if the recovery timeout is reached, the recovery + period ends and the OST resumes normal request processing. + If some clients fail to replay their requests during the recovery + period, this will not stop the recovery from completing. You may have a + situation where the OST recovers, but some clients are not able to + participate in recovery (e.g. network problems or client failure), so they + are evicted and their requests are not replayed. This would result in any + operations on the evicted clients failing, including in-progress writes, + which would cause cached writes to be lost. This is a normal outcome; the + recovery cannot wait indefinitely, or the file system would be hung any + time a client failed. The lost transactions are an unfortunate result of + the recovery process. + + The failure of client recovery does not indicate or lead to + filesystem corruption. This is a normal event that is handled by the MDT + and OST, and should not result in any inconsistencies between + servers. + + + The version-based recovery (VBR) feature enables a failed client to + be ''skipped'', so remaining clients can replay their requests, resulting + in a more successful recovery from a downed OST. For more information + about the VBR feature, see + (Version-based Recovery). + +
+
+ + <indexterm> + <primary>recovery</primary> + <secondary>oiscrub</secondary> + </indexterm> + <indexterm> + <primary>recovery</primary> + <secondary>LFSCK</secondary> + </indexterm>Checking the file system with LFSCK + LFSCK is an administrative tool introduced in Lustre + software release 2.3 for checking and repair of the attributes specific to a + mounted Lustre file system. It is similar in concept to an offline fsck repair + tool for a local filesystem, but LFSCK is implemented to run as part of the + Lustre file system while the file system is mounted and in use. This allows + consistency of checking and repair by the Lustre software without unnecessary + downtime, and can be run on the largest Lustre file systems with negligible + disruption to normal operations. + Since Lustre software release 2.3, LFSCK can verify + and repair the Object Index (OI) table that is used internally to map + Lustre File Identifiers (FIDs) to MDT internal ldiskfs inode numbers, in + an internal table called the OI Table. An OI Scrub traverses this the IO + Table and makes corrections where necessary. An OI Scrub is required after + restoring from a file-level MDT backup ( + ), or in case the OI Table is + otherwise corrupted. Later phases of LFSCK will add further checks to the + Lustre distributed file system state. + In Lustre software release 2.4, LFSCK namespace + scanning can verify and repair the directory FID-in-Dirent and LinkEA + consistency. + In Lustre software release 2.6, LFSCK layout scanning + can verify and repair MDT-OST file layout inconsistencies. File layout + inconsistencies between MDT-objects and OST-objects that are checked and + corrected include dangling reference, unreferenced OST-objects, mismatched + references and multiple references. + In Lustre software release 2.7, LFSCK layout scanning + is enhanced to support verify and repair inconsistencies between multiple + MDTs. + Control and monitoring of LFSCK is through LFSCK and the + /proc file system interfaces. LFSCK supports three types + of interface: switch interface, status interface, and adjustment interface. + These interfaces are detailed below.
- LFSCK switch interface + LFSCK switch interface +
+ Manually Starting LFSCK +
+ Description + LFSCK can be started after the MDT is mounted using the + lctl lfsck_start command. +
- Manually Starting LFSCK -
- Description - LFSCK can be started after the MDT is mounted using the lctl lfsck_start command. -
-
- Usage - lctl lfsck_start -M | --device [MDT,OST]_device \ + Usage + lctl lfsck_start -M | --device +[MDT,OST]_device \ [-A | --all] \ - [-c | --create_ostobj [on | off]] \ - [-C | --create_mdtobj [on | off]] \ - [-e | --error {continue | abort}] \ + [-c | --create_ostobj +[on | off]] \ + [-C | --create_mdtobj +[on | off]] \ + [-e | --error +{continue | abort}] \ [-h | --help] \ - [-n | --dryrun [on | off]] \ + [-n | --dryrun +[on | off]] \ [-o | --orphan] \ [-r | --reset] \ - [-s | --speed ops_per_sec_limit] \ - [-t | --type lfsck_type[,lfsck_type...]] \ - [-w | --window_size size] - -
-
- Options - The various lfsck_start options are listed and described below. For a complete list of available options, type lctl lfsck_start -h. - - - - - - - - Option - - - Description - - - - - - - -M | --device - - - The MDT or OST device to start LFSCK/scrub on. - - - - - -A | --all - - - Start LFSCK on all devices via a single lctl command. This applies to both layout and namespace consistency checking and repair. - - - - - -c | --create_ostobj - - - Create the lost OST-object for dangling LOV EA, off (default) or on. If not specified, then the default behaviour is to keep the dangling LOV EA there without creating the lost OST-object. - - - - - -C | --create_mdtobj - - - Create the lost MDT-object for dangling name entry, off (default) or on. If not specified, then the default behaviour is to keep the dangling name entry there without creating the lost MDT-object. - - - - - -e | --error - - - Error handle, continue (default) or abort. Specify whether the LFSCK will stop or not if fail to repair something. If it is not specified, the saved value (when resuming from checkpoint) will be used if present. This option cannot be changed if LFSCK is running. - - - - - -h | --help - - - Operating help information. - - - - - -n | --dryrun - - - Perform a trial without making any changes. off (default) or on. - - - - - -o | --orphan - - - Repair orphan OST-objects for layout LFSCK. - - - - - -r | --reset - - - Reset the start position for the object iteration to the beginning for the specified MDT. By default the iterator will resume scanning from the last checkpoint (saved periodically by LFSCK) provided it is available. - - - - - -s | --speed - - - Set the upper speed limit of LFSCK processing in objects per second. If it is not specified, the saved value (when resuming from checkpoint) or default value of 0 (0 = run as fast as possible) is used. Speed can be adjusted while LFSCK is running with the adjustment interface. - - - - - -t | --type - - - The type of checking/repairing that should be performed. The new LFSCK framework provides a single interface for a variety of system consistency checking/repairing operations including: -Without a specified option, the LFSCK component(s) which ran last time and did not finish or the component(s) corresponding to some known system inconsistency, will be started. Anytime the LFSCK is triggered, the OI scrub will run automatically, so there is no need to specify OI_scrub. -namespace: check and repair FID-in-Dirent and LinkEA consistency. Lustre-2.7 enhances namespace consistency verification under DNE mode. -layout: check and repair MDT-OST inconsistency. - - - - - -w | --window_size - - - The window size for the async request pipeline. The LFSCK async request pipeline's input/output may have quite different processing speeds, and there may be too many requests in the pipeline as to cause abnormal memory/network pressure. If not specified, then the default window size for the async request pipeline is 1024. - - - - - -
+ [-s | --speed +ops_per_sec_limit] \ + [-t | --type +lfsck_type[,lfsck_type...]] \ + [-w | --window_size +size] +
+
+ Options + The various + lfsck_start options are listed and described below. + For a complete list of available options, type + lctl lfsck_start -h. + + + + + + + + + Option + + + + + Description + + + + + + + + + -M | --device + + + + The MDT or OST device to start LFSCK on. + + + + + + -A | --all + + + + Start LFSCK on all devices. + This applies to both layout and + namespace consistency checking and repair. + + + + + + -c | --create_ostobj + + + + Create the lost OST-object for + dangling LOV EA, + off(default) or + on. If not specified, then the default + behaviour is to keep the dangling LOV EA there without + creating the lost OST-object. + + + + + + -C | --create_mdtobj + + + + Create the lost MDT-object for + dangling name entry, + off(default) or + on. If not specified, then the default + behaviour is to keep the dangling name entry there without + creating the lost MDT-object. + + + + + + -e | --error + + + + Error handle, + continue(default) or + abort. Specify whether the LFSCK will + stop or not if fails to repair something. If it is not + specified, the saved value (when resuming from checkpoint) + will be used if present. This option cannot be changed + while LFSCK is running. + + + + + + -h | --help + + + + Operating help information. + + + + + + -n | --dryrun + + + + Perform a trial without making any changes. + off(default) or + on. + + + + + + -o | --orphan + + + + Repair orphan OST-objects for layout + LFSCK. + + + + + + -r | --reset + + + + Reset the start position for the object iteration to + the beginning for the specified MDT. By default the + iterator will resume scanning from the last checkpoint + (saved periodically by LFSCK) provided it is + available. + + + + + + -s | --speed + + + + Set the upper speed limit of LFSCK processing in + objects per second. If it is not specified, the saved value + (when resuming from checkpoint) or default value of 0 (0 = + run as fast as possible) is used. Speed can be adjusted + while LFSCK is running with the adjustment + interface. + + + + + + -t | --type + + + + The type of checking/repairing that should be + performed. The new LFSCK framework provides a single + interface for a variety of system consistency + checking/repairing operations including: + Without a specified option, the LFSCK component(s) + which ran last time and did not finish or the component(s) + corresponding to some known system inconsistency, will be + started. Anytime the LFSCK is triggered, the OI scrub will + run automatically, so there is no need to specify + OI_scrub in that case. + + namespace: check and repair + FID-in-Dirent and LinkEA consistency. + Lustre-2.7 enhances + namespace consistency verification under DNE mode. + + layout: check and repair MDT-OST + inconsistency. + + + + + + -w | --window_size + + + + The window size for the async request + pipeline. The LFSCK async request pipeline's input/output + may have quite different processing speeds, and there may + be too many requests in the pipeline as to cause abnormal + memory/network pressure. If not specified, then the default + window size for the async request pipeline is 1024. + + + + + +
+
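      For example, assuming a hypothetical file system named
      testfs, the following command starts a namespace
      check and repair on a single MDT, limited to 1000 objects per
      second:
      lctl lfsck_start -M testfs-MDT0000 -t namespace -s 1000
      Similarly, a layout check that also repairs orphan OST-objects
      could be started on all devices with:
      lctl lfsck_start -A -t layout -o
      The device name and speed limit shown here are illustrative
      only; substitute values appropriate to the system being
      checked.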
+
+ Manually Stopping LFSCK +
+ Description + To stop LFSCK when the MDT is mounted, use the + lctl lfsck_stop command.
- Manually Stopping LFSCK -
- Description - To stop LFSCK when the MDT is mounted, use the lctl lfsck_stop command. -
-
- Usage - lctl lfsck_stop -M | --device [MDT,OST]_device \ + Usage + lctl lfsck_stop -M | --device +[MDT,OST]_device \ [-A | --all] \ - [-h | --help] - -
-
- Options - The various lfsck_stop options are listed and described below. For a complete list of available options, type lctl lfsck_stop -h. - - - - - - - - Option - - - Description - - - - - - - -M | --device - - - The MDT or OST device to stop LFSCK/scrub on. - - - - - -A | --all - - - Stop LFSCK on all devices. - - - - - -h | --help - - - Operating help information. - - - - - -
+ [-h | --help]
+
+ Options + The various + lfsck_stop options are listed and described below. + For a complete list of available options, type + lctl lfsck_stop -h. + + + + + + + + + Option + + + + + Description + + + + + + + + + -M | --device + + + + The MDT or OST device to stop LFSCK on. + + + + + + -A | --all + + + + Stop LFSCK on all devices. + + + + + + -h | --help + + + + Operating help information. + + + + + +
+
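      For example, a running LFSCK could be stopped on the same
      hypothetical testfs-MDT0000 device used above, or on
      all devices at once, with:
      lctl lfsck_stop -M testfs-MDT0000
      lctl lfsck_stop -A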
- LFSCK status interface + LFSCK status interface +
+ LFSCK status of OI Scrub via + <literal>procfs</literal> +
+ Description + For each LFSCK component there is a dedicated procfs interface + to trace the corresponding LFSCK component status. For OI Scrub, the + interface is the OSD layer procfs interface, named + oi_scrub. To display OI Scrub status, the standard + + lctl get_param command is used as shown in the + usage below. +
+
+ Usage + lctl get_param -n osd-ldiskfs.FSNAME-[MDT_device|OST_device].oi_scrub +
+
+ Output + + + + + + + + + Information + + + + + Detail + + + + + + + + General Information + + + + + Name: OI_scrub. + + + OI scrub magic id (an identifier unique to OI + scrub). + + + OI files count. + + + Status: one of the status - + init, + scanning, + completed, + failed, + stopped, + paused, or + crashed. + + + Flags: including - + recreated(OI file(s) is/are + removed/recreated), + inconsistent(restored from + file-level backup), + auto(triggered by non-UI mechanism), + and + upgrade(from Lustre software release + 1.8 IGIF format.) + + + Parameters: OI scrub parameters, like + failout. + + + Time Since Last Completed. + + + Time Since Latest Start. + + + Time Since Last Checkpoint. + + + Latest Start Position: the position for the + latest scrub started from. + + + Last Checkpoint Position. + + + First Failure Position: the position for the + first object to be repaired. + + + Current Position. + + + + + + + Statistics + + + + + + Checked total number of objects + scanned. + + + + Updated total number of objects + repaired. + + + + Failed total number of objects that + failed to be repaired. + + + + No Scrub total number of objects + marked + LDISKFS_STATE_LUSTRE_NOSCRUB and + skipped. + + + + IGIF total number of objects IGIF + scanned. + + + + Prior Updated how many objects have + been repaired which are triggered by parallel + RPC. + + + + Success Count total number of + completed OI_scrub runs on the device. + + + + Run Time how long the scrub has run, + tally from the time of scanning from the beginning of + the specified MDT device, not include the + paused/failure time among checkpoints. + + + + Average Speed calculated by dividing + Checked by + run_time. + + + + Real-Time Speed the speed since last + checkpoint if the OI_scrub is running. + + + + Scanned total number of objects under + /lost+found that have been scanned. + + + + Repaired total number of objects + under /lost+found that have been recovered. + + + + Failed total number of objects under + /lost+found failed to be scanned or failed to be + recovered. + + + + + + + +
+
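      For example, the OI Scrub status on the first MDT of a
      hypothetical file system named testfs could be
      displayed with:
      lctl get_param -n osd-ldiskfs.testfs-MDT0000.oi_scrub
      The output contains the general information and statistics
      fields described in the table above.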
+
+ LFSCK status of namespace via + <literal>procfs</literal> +
+ Description + The + namespace component is responsible for checks + described in . The + procfs interface for this component is in the + MDD layer, named + lfsck_namespace. To show the status of this + component, + lctl get_param should be used as described in the + usage below. +
- LFSCK status of OI Scrub via <literal>procfs</literal> -
- Description - For each LFSCK component there is a dedicated procfs interface to trace the corresponding LFSCK component status. For OI Scrub, the interface is the OSD layer procfs interface, named oi_scrub. To display OI Scrub status, the standard lctl get_param command is used as shown in the usage below. -
-
- Usage - lctl get_param -n osd-ldiskfs.FSNAME-MDT_device.oi_scrub - -
-
- Output - - - - - - - - Information - - - Detail - - - - - - - General Information - - - - Name: OI_scrub. - OI scrub magic id (an identifier unique to OI scrub). - OI files count. - Status: one of the status - init, scanning, completed, failed, stopped, paused, or crashed. - Flags: including - recreated (OI file(s) is/are removed/recreated), - inconsistent (restored from - file-level backup), auto - (triggered by non-UI mechanism), and - upgrade (from Lustre software - release 1.8 IGIF format.) - Parameters: OI scrub parameters, like failout. - Time Since Last Completed. - Time Since Latest Start. - Time Since Last Checkpoint. - Latest Start Position: the position for the latest scrub started from. - Last Checkpoint Position. - First Failure Position: the position for the first object to be repaired. - Current Position. - - - - - - Statistics - - - - Checked total number of objects scanned. - Updated total number of objects repaired. - Failed total number of objects that failed to be repaired. - No Scrub total number of objects marked LDISKFS_STATE_LUSTRE_NOSCRUB and skipped. - IGIF total number of objects IGIF scanned. - Prior Updated how many objects have been repaired which are triggered by parallel RPC. - Success Count total number of completed OI_scrub runs on the device. - Run Time how long the scrub has run, tally from the time of scanning from the beginning of the specified MDT device, not include the paused/failure time among checkpoints. - Average Speed calculated by dividing Checked by run_time. - Real-Time Speed the speed since last checkpoint if the OI_scrub is running. - Scanned total number of objects under /lost+found that have been scanned. - Repaired total number of objects under /lost+found that have been recovered. - Failed total number of objects under /lost+found failed to be scanned or failed to be recovered. - - - - - - -
      Usage
      lctl get_param -n mdd.FSNAME-MDT_device.lfsck_namespace
-
- LFSCK status of namespace via <literal>procfs</literal> -
- Description - The namespace component is responsible for checking and repairing FID-in-Dirent and LinkEA consistency. The procfs interface for this component is in the MDD layer, named lfsck_namespace. To show the status of this component, lctl get_param should be used as described in the usage below. -
-
- Usage - lctl get_param -n mdd.FSNAME-MDT_device.lfsck_namespace - -
-
- Output - - - - - - - - Information - - - Detail - - - - - - - General Information - - - - Name: lfsck_namespace - LFSCK namespace magic. - LFSCK namespace version.. - Status: one of the status - init, scanning-phase1, scanning-phase2, completed, failed, stopped, paused, partial, co-failed, co-stopped or co-paused. - Flags: including - scanned-once (the first cycle scanning has been - completed), inconsistent (one - or more inconsistent FID-in-Dirent or LinkEA - entries that have been discovered), - upgrade (from Lustre software - release 1.8 IGIF format.) - Parameters: including dryrun, all_targets, failout, broadcast, orphan, create_ostobjandcreate_mdtobj. - Time Since Last Completed. - Time Since Latest Start. - Time Since Last Checkpoint. - Latest Start Position: the position the checking began most recently. - Last Checkpoint Position. - First Failure Position: the position for the first object to be repaired. - Current Position. - - - - - - Statistics - - - - Checked Phase1 total number of objects scanned during scanning-phase1. - Checked Phase2 total number of objects scanned during scanning-phase2. - Updated Phase1 total number of objects repaired during scanning-phase1. - Updated Phase2 total number of objects repaired during scanning-phase2. - Failed Phase1 total number of objets that failed to be repaired during scanning-phase1. - Failed Phase2 total number of objets that failed to be repaired during scanning-phase2. - directories total number of directories scanned. - multiple_linked_checked total number of multiple-linked objects that have been scanned. - dirent_repaired total number of FID-in-dirent entries that have been repaired. - linkea_repaired total number of linkEA entries that have been repaired. - unknown_inconsistency total number of undefined inconsistencies found in scanning-phase2. - unmatched_pairs_repaired total number of unmatched pairs that have been repaired. - dangling_repaired total number of dangling name entries that have been found/repaired. - multi_referenced_repaired total number of multiple referenced name entries that have been found/repaired. - bad_file_type_repaired total number of name entries with bad file type that have been repaired. - lost_dirent_repaired total number of lost name entries that have been re-inserted. - striped_dirs_scanned total number of striped directories (master) that have been scanned. - striped_dirs_repaired total number of striped directories (master) that have been repaired. - striped_dirs_failed total number of striped directories (master) that have failed to be verified. - striped_dirs_disabled total number of striped directories (master) that have been disabled. - striped_dirs_skipped total number of striped directories (master) that have been skipped (for shards verification) because of lost master LMV EA. - striped_shards_scanned total number of striped directory shards (slave) that have been scanned. - striped_shards_repaired total number of striped directory shards (slave) that have been repaired. - striped_shards_failed total number of striped directory shards (slave) that have failed to be verified. - striped_shards_skipped total number of striped directory shards (slave) that have been skipped (for name hash verification) because LFSCK does not know whether the slave LMV EA is valid or not. - name_hash_repaired total number of name entries under striped directory with bad name hash that have been repaired. - nlinks_repaired total number of objects with nlink fixed. 
- mul_linked_repaired total number of multiple-linked objects that have been repaired. - local_lost_found_scanned total number of objects under /lost+found that have been scanned. - local_lost_found_moved total number of objects under /lost+found that have been moved to namespace visible directory. - local_lost_found_skipped total number of objects under /lost+found that have been skipped. - local_lost_found_failed total number of objects under /lost+found that have failed to be processed. - Success Count the total number of completed LFSCK runs on the device. - Run Time Phase1 the duration of the LFSCK run during scanning-phase1. Excluding the time spent paused between checkpoints. - Run Time Phase2 the duration of the LFSCK run during scanning-phase2. Excluding the time spent paused between checkpoints. - Average Speed Phase1 calculated by dividing checked_phase1 by run_time_phase1. - Average Speed Phase2 calculated by dividing checked_phase2 by run_time_phase1. - Real-Time Speed Phase1 the speed since the last checkpoint if the LFSCK is running scanning-phase1. - Real-Time Speed Phase2 the speed since the last checkpoint if the LFSCK is running scanning-phase2. - - - - - - -
+
+ Output + + + + + + + + + Information + + + + + Detail + + + + + + + + General Information + + + + + Name: + lfsck_namespace + + + LFSCK namespace magic. + + + LFSCK namespace version.. + + + Status: one of the status - + init, + scanning-phase1, + scanning-phase2, + completed, + failed, + stopped, + paused, + partial, + co-failed, + co-stopped or + co-paused. + + + Flags: including - + scanned-once(the first cycle + scanning has been completed), + inconsistent(one or more + inconsistent FID-in-Dirent or LinkEA entries that have + been discovered), + upgrade(from Lustre software release + 1.8 IGIF format.) + + + Parameters: including + dryrun, + all_targets, + failout, + broadcast, + orphan, + create_ostobj and + create_mdtobj. + + + Time Since Last Completed. + + + Time Since Latest Start. + + + Time Since Last Checkpoint. + + + Latest Start Position: the position the checking + began most recently. + + + Last Checkpoint Position. + + + First Failure Position: the position for the + first object to be repaired. + + + Current Position. + + + + + + + Statistics + + + + + + Checked Phase1 total number of + objects scanned during + scanning-phase1. + + + + Checked Phase2 total number of + objects scanned during + scanning-phase2. + + + + Updated Phase1 total number of + objects repaired during + scanning-phase1. + + + + Updated Phase2 total number of + objects repaired during + scanning-phase2. + + + + Failed Phase1 total number of objets + that failed to be repaired during + scanning-phase1. + + + + Failed Phase2 total number of objets + that failed to be repaired during + scanning-phase2. + + + + directories total number of + directories scanned. + + + + multiple_linked_checked total number + of multiple-linked objects that have been + scanned. + + + + dirent_repaired total number of + FID-in-dirent entries that have been repaired. + + + + linkea_repaired total number of + linkEA entries that have been repaired. + + + + unknown_inconsistency total number of + undefined inconsistencies found in + scanning-phase2. + + + + unmatched_pairs_repaired total number + of unmatched pairs that have been repaired. + + + + dangling_repaired total number of + dangling name entries that have been + found/repaired. + + + + multi_referenced_repaired total + number of multiple referenced name entries that have + been found/repaired. + + + + bad_file_type_repaired total number + of name entries with bad file type that have been + repaired. + + + + lost_dirent_repaired total number of + lost name entries that have been re-inserted. + + + + striped_dirs_scanned total number of + striped directories (master) that have been + scanned. + + + + striped_dirs_repaired total number of + striped directories (master) that have been + repaired. + + + + striped_dirs_failed total number of + striped directories (master) that have failed to be + verified. + + + + striped_dirs_disabled total number of + striped directories (master) that have been + disabled. + + + + striped_dirs_skipped total number of + striped directories (master) that have been skipped + (for shards verification) because of lost master LMV + EA. + + + + striped_shards_scanned total number + of striped directory shards (slave) that have been + scanned. + + + + striped_shards_repaired total number + of striped directory shards (slave) that have been + repaired. + + + + striped_shards_failed total number of + striped directory shards (slave) that have failed to be + verified. 
+ + + + striped_shards_skipped total number + of striped directory shards (slave) that have been + skipped (for name hash verification) because LFSCK does + not know whether the slave LMV EA is valid or + not. + + + + name_hash_repaired total number of + name entries under striped directory with bad name hash + that have been repaired. + + + + nlinks_repaired total number of + objects with nlink fixed. + + + + mul_linked_repaired total number of + multiple-linked objects that have been repaired. + + + + local_lost_found_scanned total number + of objects under /lost+found that have been + scanned. + + + + local_lost_found_moved total number + of objects under /lost+found that have been moved to + namespace visible directory. + + + + local_lost_found_skipped total number + of objects under /lost+found that have been + skipped. + + + + local_lost_found_failed total number + of objects under /lost+found that have failed to be + processed. + + + + Success Count the total number of + completed LFSCK runs on the device. + + + + Run Time Phase1 the duration of the + LFSCK run during + scanning-phase1. Excluding the time + spent paused between checkpoints. + + + + Run Time Phase2 the duration of the + LFSCK run during + scanning-phase2. Excluding the time + spent paused between checkpoints. + + + + Average Speed Phase1 calculated by + dividing + checked_phase1 by + run_time_phase1. + + + + Average Speed Phase2 calculated by + dividing + checked_phase2 by + run_time_phase1. + + + + Real-Time Speed Phase1 the speed + since the last checkpoint if the LFSCK is running + scanning-phase1. + + + + Real-Time Speed Phase2 the speed + since the last checkpoint if the LFSCK is running + scanning-phase2. + + + + + + +
-
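      For example, the namespace LFSCK status on the first MDT of a
      hypothetical file system named testfs could be
      displayed with:
      lctl get_param -n mdd.testfs-MDT0000.lfsck_namespace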
- LFSCK status of layout via <literal>procfs</literal> -
- Description - The layout component is responsible for checking and repairing MDT-OST inconsistency. The procfs interface for this component is in the MDD layer, named lfsck_layout, and in the OBD layer, named lfsck_layout. To show the status of this component lctl get_param should be used as described in the usage below. -
-
- Usage - lctl get_param -n mdd.FSNAME-MDT_device.lfsck_layout -lctl get_param -n obdfilter.FSNAME-OST_device.lfsck_layout - -
-
- Output - - - - - - - - Information - - - Detail - - - - - - - General Information - - - - Name: lfsck_layout - LFSCK namespace magic. - LFSCK namespace version.. - Status: one of the status - init, scanning-phase1, scanning-phase2, completed, failed, stopped, paused, crashed, partial, co-failed, co-stopped, or co-paused. - Flags: including - scanned-once (the first cycle scanning has been - completed), inconsistent (one - or more MDT-OST inconsistencies - have been discovered), - incomplete (some MDT or OST did not participate in the LFSCK or failed to finish the LFSCK) or crashed_lastid (the lastid files on the OST crashed and needs to be rebuilt). - Parameters: including dryrun, all_targets and failout. - Time Since Last Completed. - Time Since Latest Start. - Time Since Last Checkpoint. - Latest Start Position: the position the checking began most recently. - Last Checkpoint Position. - First Failure Position: the position for the first object to be repaired. - Current Position. - - - - - - Statistics - - - - Success Count: the total number of completed LFSCK runs on the device. - Repaired Dangling: total number of MDT-objects with dangling reference have been repaired in the scanning-phase1. - Repaired Unmatched Pairs total number of unmatched MDT and OST-object paris have been repaired in the scanning-phase1 - Repaired Multiple Referenced total number of OST-objects with multiple reference have been repaired in the scanning-phase1. - Repaired Orphan total number of orphan OST-objects have been repaired in the scanning-phase2. - Repaired Inconsistent Owner total number.of OST-objects with incorrect owner information have been repaired in the scanning-phase1. - Repaired Others total number of.other inconsistency repaired in the scanning phases. - Skipped Number of skipped objects. - Failed Phase1 total number of objects that failed to be repaired during scanning-phase1. - Failed Phase2 total number of objects that failed to be repaired during scanning-phase2. - Checked Phase1 total number of objects scanned during scanning-phase1. - Checked Phase2 total number of objects scanned during scanning-phase2. - Run Time Phase1 the duration of the LFSCK run during scanning-phase1. Excluding the time spent paused between checkpoints. - Run Time Phase2 the duration of the LFSCK run during scanning-phase2. Excluding the time spent paused between checkpoints. - Average Speed Phase1 calculated by dividing checked_phase1 by run_time_phase1. - Average Speed Phase2 calculated by dividing checked_phase2 by run_time_phase1. - Real-Time Speed Phase1 the speed since the last checkpoint if the LFSCK is running scanning-phase1. - Real-Time Speed Phase2 the speed since the last checkpoint if the LFSCK is running scanning-phase2. - - - - - - -
+
+
+ LFSCK status of layout via + <literal>procfs</literal> +
+ Description + The + layout component is responsible for checking and + repairing MDT-OST inconsistency. The + procfs interface for this component is in the MDD + layer, named + lfsck_layout, and in the OBD layer, named + lfsck_layout. To show the status of this component + lctl get_param should be used as described in the + usage below. +
+
      Usage
      lctl get_param -n mdd.FSNAME-MDT_device.lfsck_layout
lctl get_param -n obdfilter.FSNAME-OST_device.lfsck_layout
+
+ Output + + + + + + + + + Information + + + + + Detail + + + + + + + + General Information + + + + + Name: + lfsck_layout + + + LFSCK namespace magic. + + + LFSCK namespace version.. + + + Status: one of the status - + init, + scanning-phase1, + scanning-phase2, + completed, + failed, + stopped, + paused, + crashed, + partial, + co-failed, + co-stopped, or + co-paused. + + + Flags: including - + scanned-once(the first cycle + scanning has been completed), + inconsistent(one or more MDT-OST + inconsistencies have been discovered), + incomplete(some MDT or OST did not + participate in the LFSCK or failed to finish the LFSCK) + or + crashed_lastid(the lastid files on + the OST crashed and needs to be rebuilt). + + + Parameters: including + dryrun, + all_targets and + failout. + + + Time Since Last Completed. + + + Time Since Latest Start. + + + Time Since Last Checkpoint. + + + Latest Start Position: the position the checking + began most recently. + + + Last Checkpoint Position. + + + First Failure Position: the position for the + first object to be repaired. + + + Current Position. + + + + + + + Statistics + + + + + + Success Count: the total number of + completed LFSCK runs on the device. + + + + Repaired Dangling: total number of + MDT-objects with dangling reference have been repaired + in the scanning-phase1. + + + + Repaired Unmatched Pairs total number + of unmatched MDT and OST-object paris have been + repaired in the scanning-phase1 + + + + Repaired Multiple Referenced total + number of OST-objects with multiple reference have been + repaired in the scanning-phase1. + + + + Repaired Orphan total number of + orphan OST-objects have been repaired in the + scanning-phase2. + + + + Repaired Inconsistent Owner total + number.of OST-objects with incorrect owner information + have been repaired in the scanning-phase1. + + + + Repaired Others total number of.other + inconsistency repaired in the scanning phases. + + + + Skipped Number of skipped + objects. + + + + Failed Phase1 total number of objects + that failed to be repaired during + scanning-phase1. + + + + Failed Phase2 total number of objects + that failed to be repaired during + scanning-phase2. + + + + Checked Phase1 total number of + objects scanned during + scanning-phase1. + + + + Checked Phase2 total number of + objects scanned during + scanning-phase2. + + + + Run Time Phase1 the duration of the + LFSCK run during + scanning-phase1. Excluding the time + spent paused between checkpoints. + + + + Run Time Phase2 the duration of the + LFSCK run during + scanning-phase2. Excluding the time + spent paused between checkpoints. + + + + Average Speed Phase1 calculated by + dividing + checked_phase1 by + run_time_phase1. + + + + Average Speed Phase2 calculated by + dividing + checked_phase2 by + run_time_phase1. + + + + Real-Time Speed Phase1 the speed + since the last checkpoint if the LFSCK is running + scanning-phase1. + + + + Real-Time Speed Phase2 the speed + since the last checkpoint if the LFSCK is running + scanning-phase2. + + + + + + + +
+
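      For example, the layout LFSCK status for a hypothetical file
      system named testfs could be displayed on an MDS and
      on an OSS with:
      lctl get_param -n mdd.testfs-MDT0000.lfsck_layout
      lctl get_param -n obdfilter.testfs-OST0000.lfsck_layout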
- LFSCK adjustment interface -
- Rate control -
- Description - The LFSCK upper speed limit can be changed using lctl set_param as shown in the usage below. -
-
- Usage - lctl set_param mdd.${FSNAME}-${MDT_device}.lfsck_speed_limit=N -lctl set_param obdfilter.${FSNAME}-${OST_device}.lfsck_speed_limit=N -
-
- Values - - - - - - - - 0 - - - No speed limit (run at maximum speed.) - - - - - positive integer - - - Maximum number of objects to scan per second. - - - - - -
+ LFSCK adjustment interface +
+ Rate control +
+ Description + The LFSCK upper speed limit can be changed using + lctl set_param as shown in the usage below. +
+
      Usage
      lctl set_param mdd.${FSNAME}-${MDT_device}.lfsck_speed_limit=N
lctl set_param obdfilter.${FSNAME}-${OST_device}.lfsck_speed_limit=N
-
- Auto scrub -
- Description - The auto_scrub parameter controls whether OI scrub will be triggered when an inconsistency is detected during OI lookup. It can be set as described in the usage and values sections below. - There is also a noscrub mount option (see ) which can be used to disable automatic OI scrub upon detection of a file-level backup at mount time. If the noscrub mount option is specified, auto_scrub will also be disabled, so OI scrub will not be triggered when an OI inconsistency is detected. Auto scrub can be renabled after the mount using the command shown in the usage. Manually starting LFSCK after mounting provides finer control over the starting conditions. -
-
- Usage - lctl set_param osd_ldiskfs.${FSNAME}-${MDT_device}.auto_scrub=N - - where N is an integer as described below. -
-
- Values - - - - - - - - 0 - - - Do not start OI Scrub automatically. - - - - - positive integer - - - Automatically start OI Scrub if inconsistency is detected during OI lookup. - - - - - -
+
+ Values + + + + + + + + 0 + + + No speed limit (run at maximum speed.) + + + + + positive integer + + + Maximum number of objects to scan per second. + + + + +
+
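      For example, assuming a hypothetical file system named
      testfs, the following commands limit LFSCK on an MDT
      to 1000 objects per second and then remove the limit again
      (0 means no limit):
      lctl set_param mdd.testfs-MDT0000.lfsck_speed_limit=1000
      lctl set_param mdd.testfs-MDT0000.lfsck_speed_limit=0
      The value 1000 is illustrative only.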
+
+ Auto scrub +
+ Description + The + auto_scrub parameter controls whether OI scrub will + be triggered when an inconsistency is detected during OI lookup. It + can be set as described in the usage and values sections + below. + There is also a + noscrub mount option (see + ) which can be used to + disable automatic OI scrub upon detection of a file-level backup at + mount time. If the + noscrub mount option is specified, + auto_scrub will also be disabled, so OI scrub will + not be triggered when an OI inconsistency is detected. Auto scrub can + be renabled after the mount using the command shown in the usage. + Manually starting LFSCK after mounting provides finer control over + the starting conditions. +
+
+ Usage + lctl set_param osd_ldiskfs.${FSNAME}-${MDT_device}.auto_scrub=N + where + Nis an integer as described below. + Lustre software 2.5 and later supports + -P option that makes the + set_param permanent. +
+
+ Values + + + + + + + + 0 + + + Do not start OI Scrub automatically. + + + + + positive integer + + + Automatically start OI Scrub if inconsistency is + detected during OI lookup. + + + + + +
+
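      For example, automatic OI scrub could be re-enabled on the first
      MDT of a hypothetical file system named testfs
      with:
      lctl set_param osd_ldiskfs.testfs-MDT0000.auto_scrub=1
      As noted above, with Lustre software release 2.5 and later the
      -P option can be added to make the setting
      permanent.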
-
+
diff --git a/UnderstandingLustre.xml b/UnderstandingLustre.xml index 28bca1e..517ba23 100644 --- a/UnderstandingLustre.xml +++ b/UnderstandingLustre.xml @@ -1,88 +1,107 @@ - - - Understanding Lustre Architecture - This chapter describes the Lustre architecture and features of the Lustre file system. It - includes the following sections: + + + Understanding Lustre + Architecture + This chapter describes the Lustre architecture and features of the + Lustre file system. It includes the following sections: - + - + - +
- <indexterm> - <primary>Lustre</primary> - </indexterm>What a Lustre File System Is (and What It Isn't) - The Lustre architecture is a storage architecture for clusters. The central component of - the Lustre architecture is the Lustre file system, which is supported on the Linux operating - system and provides a POSIX* standard-compliant UNIX file system - interface. - The Lustre storage architecture is used for many different kinds of clusters. It is best - known for powering many of the largest high-performance computing (HPC) clusters worldwide, - with tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes - per second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system as a site-wide - global file system, serving dozens of clusters. - The ability of a Lustre file system to scale capacity and performance for any need reduces - the need to deploy many separate file systems, such as one for each compute cluster. Storage - management is simplified by avoiding the need to copy data between compute clusters. In - addition to aggregating storage capacity of many servers, the I/O throughput is also - aggregated and scales with additional servers. Moreover, throughput and/or capacity can be - easily increased by adding servers dynamically. - While a Lustre file system can function in many work environments, it is not necessarily - the best choice for all applications. It is best suited for uses that exceed the capacity that - a single server can provide, though in some use cases, a Lustre file system can perform better - with a single server than other file systems due to its strong locking and data - coherency. + + <indexterm> + <primary>Lustre</primary> + </indexterm>What a Lustre File System Is (and What It Isn't) + The Lustre architecture is a storage architecture for clusters. The + central component of the Lustre architecture is the Lustre file system, + which is supported on the Linux operating system and provides a POSIX + *standard-compliant UNIX file system + interface. + The Lustre storage architecture is used for many different kinds of + clusters. It is best known for powering many of the largest + high-performance computing (HPC) clusters worldwide, with tens of thousands + of client systems, petabytes (PB) of storage and hundreds of gigabytes per + second (GB/sec) of I/O throughput. Many HPC sites use a Lustre file system + as a site-wide global file system, serving dozens of clusters. + The ability of a Lustre file system to scale capacity and performance + for any need reduces the need to deploy many separate file systems, such as + one for each compute cluster. Storage management is simplified by avoiding + the need to copy data between compute clusters. In addition to aggregating + storage capacity of many servers, the I/O throughput is also aggregated and + scales with additional servers. Moreover, throughput and/or capacity can be + easily increased by adding servers dynamically. + While a Lustre file system can function in many work environments, it + is not necessarily the best choice for all applications. It is best suited + for uses that exceed the capacity that a single server can provide, though + in some use cases, a Lustre file system can perform better with a single + server than other file systems due to its strong locking and data + coherency. 
A Lustre file system is currently not particularly well suited for - "peer-to-peer" usage models where clients and servers are running on the same node, - each sharing a small amount of storage, due to the lack of data replication at the Lustre - software level. In such uses, if one client/server fails, then the data stored on that node - will not be accessible until the node is restarted. + "peer-to-peer" usage models where clients and servers are running on the + same node, each sharing a small amount of storage, due to the lack of data + replication at the Lustre software level. In such uses, if one + client/server fails, then the data stored on that node will not be + accessible until the node is restarted.
- <indexterm> - <primary>Lustre</primary> - <secondary>features</secondary> - </indexterm>Lustre Features - Lustre file systems run on a variety of vendor's kernels. For more details, see the - Lustre Test Matrix . - A Lustre installation can be scaled up or down with respect to the number of client - nodes, disk storage and bandwidth. Scalability and performance are dependent on available - disk and network bandwidth and the processing power of the servers in the system. A Lustre - file system can be deployed in a wide variety of configurations that can be scaled well - beyond the size and performance observed in production systems to date. - shows the practical range of scalability and - performance characteristics of a Lustre file system and some test results in production - systems. + + <indexterm> + <primary>Lustre</primary> + <secondary>features</secondary> + </indexterm>Lustre Features + Lustre file systems run on a variety of vendor's kernels. For more + details, see the Lustre Test Matrix + . + A Lustre installation can be scaled up or down with respect to the + number of client nodes, disk storage and bandwidth. Scalability and + performance are dependent on available disk and network bandwidth and the + processing power of the servers in the system. A Lustre file system can + be deployed in a wide variety of configurations that can be scaled well + beyond the size and performance observed in production systems to + date. + + shows the practical range of + scalability and performance characteristics of a Lustre file system and + some test results in production systems. - Lustre File System Scalability and - Performance + Lustre File System Scalability + and Performance - - - + + + - Feature + + Feature + - Current Practical Range + + Current Practical Range + - Tested in Production + + Tested in Production + @@ -90,55 +109,69 @@ - Client Scalability + Client Scalability + - 100-100000 + 100-100000 - 50000+ clients, many in the 10000 to 20000 range + 50000+ clients, many in the 10000 to 20000 range - Client Performance + + Client Performance + - Single client: + Single client: + I/O 90% of network bandwidth - Aggregate: + + Aggregate: + 2.5 TB/sec I/O - Single client: + Single client: + 2 GB/sec I/O, 1000 metadata ops/sec - Aggregate: - 240 GB/sec I/O + + Aggregate: + + 240 GB/sec I/O - OSS Scalability + OSS Scalability + - Single OSS: + Single OSS: + 1-32 OSTs per OSS, 128TB per OST - OSS count: + OSS count: + 500 OSSs, with up to 4000 OSTs - Single OSS: + Single OSS: + 8 OSTs per OSS, 16TB per OST - OSS count: + OSS count: + 450 OSSs with 1000 4TB OSTs 192 OSSs with 1344 8TB OSTs @@ -146,81 +179,99 @@ - OSS Performance + OSS Performance + - Single OSS: - 5 GB/sec + Single OSS: + + 5 GB/sec - Aggregate: - 2.5 TB/sec + Aggregate: + + 2.5 TB/sec - Single OSS: - 2.0+ GB/sec + Single OSS: + + 2.0+ GB/sec - Aggregate: - 240 GB/sec + Aggregate: + + 240 GB/sec - MDS Scalability + MDS Scalability + - Single MDT: - 4 billion files (ldiskfs), 256 trillion files (ZFS) + Single MDT: + + 4 billion files (ldiskfs), 256 trillion files + (ZFS) - MDS count: - 1 primary + 1 backup - Up to 4096 MDTs and up to 4096 MDSs + MDS count: + + 1 primary + 1 backup + Up to 4096 MDTs and up to 4096 + MDSs - Single MDT: - 1 billion files + Single MDT: + + 1 billion files - MDS count: - 1 primary + 1 backup + MDS count: + + 1 primary + 1 backup - MDS Performance + MDS Performance + - 35000/s create operations, - 100000/s metadata stat operations + 35000/s create operations, + 100000/s metadata stat operations 
- 15000/s create operations, - 35000/s metadata stat operations + 15000/s create operations, + 35000/s metadata stat operations - File system Scalability + File system Scalability + - Single File: + Single File: + 2.5 PB max file size - Aggregate: + Aggregate: + 512 PB space, 4 billion files - Single File: + Single File: + multi-TB max file size - Aggregate: + Aggregate: + 55 PB space, 1 billion files @@ -230,232 +281,306 @@ Other Lustre software features are: - Performance-enhanced ext4 file system: The Lustre - file system uses an improved version of the ext4 journaling file system to store data - and metadata. This version, called - ldiskfs, has been enhanced to improve performance and - provide additional functionality needed by the Lustre file system. + + Performance-enhanced ext4 file + system:The Lustre file system uses an improved version of + the ext4 journaling file system to store data and metadata. This + version, called + + ldiskfs + , has been enhanced to improve performance and provide + additional functionality needed by the Lustre file system. - With the Lustre software release 2.4 and later, it is also possible to use ZFS as the backing filesystem for Lustre for the MDT, OST, and MGS storage. This allows Lustre to leverage the scalability and data integrity features of ZFS for individual storage targets. + With the Lustre software release 2.4 and later, + it is also possible to use ZFS as the backing filesystem for Lustre + for the MDT, OST, and MGS storage. This allows Lustre to leverage the + scalability and data integrity features of ZFS for individual storage + targets. - POSIX standard compliance: The full POSIX test - suite passes in an identical manner to a local ext4 file system, with limited exceptions - on Lustre clients. In a cluster, most operations are atomic so that clients never see - stale data or metadata. The Lustre software supports mmap() file I/O. + + POSIX standard compliance:The full + POSIX test suite passes in an identical manner to a local ext4 file + system, with limited exceptions on Lustre clients. In a cluster, most + operations are atomic so that clients never see stale data or + metadata. The Lustre software supports mmap() file I/O. - High-performance heterogeneous networking: The - Lustre software supports a variety of high performance, low latency networks and permits - Remote Direct Memory Access (RDMA) for InfiniBand* (utilizing - OpenFabrics Enterprise Distribution (OFED*) and other - advanced networks for fast and efficient network transport. Multiple RDMA networks can - be bridged using Lustre routing for maximum performance. The Lustre software also - includes integrated network diagnostics. + + High-performance heterogeneous + networking:The Lustre software supports a variety of high + performance, low latency networks and permits Remote Direct Memory + Access (RDMA) for InfiniBand + *(utilizing OpenFabrics Enterprise + Distribution (OFED + *) and other advanced networks for fast + and efficient network transport. Multiple RDMA networks can be + bridged using Lustre routing for maximum performance. The Lustre + software also includes integrated network diagnostics. - High-availability: The Lustre file system supports - active/active failover using shared storage partitions for OSS targets (OSTs). Lustre - software release 2.3 and earlier releases offer active/passive failover using a shared - storage partition for the MDS target (MDT). 
The Lustre file system can work with a variety of high - availability (HA) managers to allow automated failover and has no single point of failure (NSPF). - This allows application transparent recovery. Multiple mount protection (MMP) provides integrated protection from - errors in highly-available systems that would otherwise cause file system - corruption. + + High-availability:The Lustre file + system supports active/active failover using shared storage + partitions for OSS targets (OSTs). Lustre software release 2.3 and + earlier releases offer active/passive failover using a shared storage + partition for the MDS target (MDT). The Lustre file system can work + with a variety of high availability (HA) managers to allow automated + failover and has no single point of failure (NSPF). This allows + application transparent recovery. Multiple mount protection (MMP) + provides integrated protection from errors in highly-available + systems that would otherwise cause file system corruption. With Lustre software release 2.4 or later - servers and clients it is possible to configure active/active - failover of multiple MDTs. This allows scaling the metadata - performance of Lustre filesystems with the addition of MDT storage - devices and MDS nodes. + servers and clients it is possible to configure active/active + failover of multiple MDTs. This allows scaling the metadata + performance of Lustre filesystems with the addition of MDT storage + devices and MDS nodes. - Security: By default TCP connections are only - allowed from privileged ports. UNIX group membership is verified on the MDS. + + Security:By default TCP connections + are only allowed from privileged ports. UNIX group membership is + verified on the MDS. - Access control list (ACL), extended attributes: the - Lustre security model follows that of a UNIX file system, enhanced with POSIX ACLs. - Noteworthy additional features include root squash. + + Access control list (ACL), extended + attributes:the Lustre security model follows that of a + UNIX file system, enhanced with POSIX ACLs. Noteworthy additional + features include root squash. - Interoperability: The Lustre file system runs on a - variety of CPU architectures and mixed-endian clusters and is interoperable between - successive major Lustre software releases. + + Interoperability:The Lustre file + system runs on a variety of CPU architectures and mixed-endian + clusters and is interoperable between successive major Lustre + software releases. - Object-based architecture: Clients are isolated - from the on-disk file structure enabling upgrading of the storage architecture without - affecting the client. + + Object-based architecture:Clients + are isolated from the on-disk file structure enabling upgrading of + the storage architecture without affecting the client. - Byte-granular file and fine-grained metadata - locking: Many clients can read and modify the same file or directory - concurrently. The Lustre distributed lock manager (LDLM) ensures that files are coherent - between all clients and servers in the file system. The MDT LDLM manages locks on inode - permissions and pathnames. Each OST has its own LDLM for locks on file stripes stored - thereon, which scales the locking performance as the file system grows. + + Byte-granular file and fine-grained metadata + locking:Many clients can read and modify the same file or + directory concurrently. 
The Lustre distributed lock manager (LDLM) + ensures that files are coherent between all clients and servers in + the file system. The MDT LDLM manages locks on inode permissions and + pathnames. Each OST has its own LDLM for locks on file stripes stored + thereon, which scales the locking performance as the file system + grows. - Quotas: User and group quotas are available for a - Lustre file system. + + Quotas:User and group quotas are + available for a Lustre file system. - Capacity growth: The size of a Lustre file system - and aggregate cluster bandwidth can be increased without interruption by adding a new - OSS with OSTs to the cluster. + + Capacity growth:The size of a Lustre + file system and aggregate cluster bandwidth can be increased without + interruption by adding a new OSS with OSTs to the cluster. - Controlled striping: The layout of files across - OSTs can be configured on a per file, per directory, or per file system basis. This - allows file I/O to be tuned to specific application requirements within a single file - system. The Lustre file system uses RAID-0 striping and balances space usage across - OSTs. + + Controlled striping:The layout of + files across OSTs can be configured on a per file, per directory, or + per file system basis. This allows file I/O to be tuned to specific + application requirements within a single file system. The Lustre file + system uses RAID-0 striping and balances space usage across + OSTs. - Network data integrity protection: A checksum of - all data sent from the client to the OSS protects against corruption during data - transfer. + + Network data integrity protection:A + checksum of all data sent from the client to the OSS protects against + corruption during data transfer. - MPI I/O: The Lustre architecture has a dedicated - MPI ADIO layer that optimizes parallel I/O to match the underlying file system - architecture. + + MPI I/O:The Lustre architecture has + a dedicated MPI ADIO layer that optimizes parallel I/O to match the + underlying file system architecture. - NFS and CIFS export: Lustre files can be - re-exported using NFS (via Linux knfsd) or CIFS (via Samba) enabling them to be shared - with non-Linux clients, such as Microsoft* - Windows* and Apple* Mac OS - X*. + + NFS and CIFS export:Lustre files can + be re-exported using NFS (via Linux knfsd) or CIFS (via Samba) + enabling them to be shared with non-Linux clients, such as Microsoft + *Windows + *and Apple + *Mac OS X + *. - Disaster recovery tool: The Lustre file system - provides an online distributed file system check (LFSCK) that can restore consistency between - storage components in case of a major file system error. A Lustre file system can - operate even in the presence of file system inconsistencies, and LFSCK can run while the filesystem is in use, so LFSCK is not required to complete - before returning the file system to production. + + Disaster recovery tool:The Lustre + file system provides an online distributed file system check (LFSCK) + that can restore consistency between storage components in case of a + major file system error. A Lustre file system can operate even in the + presence of file system inconsistencies, and LFSCK can run while the + filesystem is in use, so LFSCK is not required to complete before + returning the file system to production. - Performance monitoring: The Lustre file system - offers a variety of mechanisms to examine performance and tuning. 
+ + Performance monitoring:The Lustre + file system offers a variety of mechanisms to examine performance and + tuning. - Open source: The Lustre software is licensed under - the GPL 2.0 license for use with the Linux operating system. + + Open source:The Lustre software is + licensed under the GPL 2.0 license for use with the Linux operating + system.
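Several of the features listed above can be observed directly from a client node. As a brief, hedged illustration (the mount point and user name below are examples only, and the output on a real system will differ), the lfs utility reports per-target capacity and, where quotas are enabled, per-user usage:
client$ lfs df -h
client$ lfs quota -u jsmith /mnt/lustre
The first command lists the free and used space on each MDT and OST backing the file system, which is how capacity added by a new OSS becomes visible; the second reports block and inode usage against any limits set for the named user.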
- <indexterm> - <primary>Lustre</primary> - <secondary>components</secondary> - </indexterm>Lustre Components - An installation of the Lustre software includes a management server (MGS) and one or more - Lustre file systems interconnected with Lustre networking (LNET). - A basic configuration of Lustre file system components is shown in . + + <indexterm> + <primary>Lustre</primary> + <secondary>components</secondary> + </indexterm>Lustre Components + An installation of the Lustre software includes a management server + (MGS) and one or more Lustre file systems interconnected with Lustre + networking (LNET). + A basic configuration of Lustre file system components is shown in + .
- Lustre file system components in a basic - cluster + Lustre file system + components in a basic cluster - + - Lustre file system components in a basic cluster + Lustre file system components in a basic cluster
- <indexterm> - <primary>Lustre</primary> - <secondary>MGS</secondary> - </indexterm>Management Server (MGS) - The MGS stores configuration information for all the Lustre file systems in a cluster - and provides this information to other Lustre components. Each Lustre target contacts the - MGS to provide information, and Lustre clients contact the MGS to retrieve - information. - It is preferable that the MGS have its own storage space so that it can be managed - independently. However, the MGS can be co-located and share storage space with an MDS as - shown in . + + <indexterm> + <primary>Lustre</primary> + <secondary>MGS</secondary> + </indexterm>Management Server (MGS) + The MGS stores configuration information for all the Lustre file + systems in a cluster and provides this information to other Lustre + components. Each Lustre target contacts the MGS to provide information, + and Lustre clients contact the MGS to retrieve information. + It is preferable that the MGS have its own storage space so that it + can be managed independently. However, the MGS can be co-located and + share storage space with an MDS as shown in + .
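As a hedged sketch of the two configurations, the commands below show how a dedicated MGS target, or an MGS co-located with the first MDT, might be formatted. The device paths and file system name are illustrative only; see the configuration chapter for the authoritative procedure:
mgs# mkfs.lustre --mgs /dev/sda
mds# mkfs.lustre --fsname=lustre --mgs --mdt --index=0 /dev/sdb
The first form creates a standalone MGS that can serve several file systems; the second creates a single target that acts as both the MGS and MDT0000 for the file system named lustre.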
Lustre File System Components - Each Lustre file system consists of the following components: + Each Lustre file system consists of the following + components: - Metadata Server (MDS) - The MDS makes metadata - stored in one or more MDTs available to Lustre clients. Each MDS manages the names and - directories in the Lustre file system(s) and provides network request handling for one - or more local MDTs. + + Metadata Server (MDS)- The MDS makes + metadata stored in one or more MDTs available to Lustre clients. Each + MDS manages the names and directories in the Lustre file system(s) + and provides network request handling for one or more local + MDTs. - Metadata Target (MDT ) - For Lustre software - release 2.3 and earlier, each file system has one MDT. The MDT stores metadata (such as - filenames, directories, permissions and file layout) on storage attached to an MDS. Each - file system has one MDT. An MDT on a shared storage target can be available to multiple - MDSs, although only one can access it at a time. If an active MDS fails, a standby MDS - can serve the MDT and make it available to clients. This is referred to as MDS - failover. - Since Lustre software release 2.4, multiple MDTs are supported. Each - file system has at least one MDT. An MDT on a shared storage target can be available via - multiple MDSs, although only one MDS can export the MDT to the clients at one time. Two - MDS machines share storage for two or more MDTs. After the failure of one MDS, the - remaining MDS begins serving the MDT(s) of the failed MDS. + + Metadata Target (MDT) - For Lustre + software release 2.3 and earlier, each file system has one MDT. The + MDT stores metadata (such as filenames, directories, permissions and + file layout) on storage attached to an MDS. Each file system has one + MDT. An MDT on a shared storage target can be available to multiple + MDSs, although only one can access it at a time. If an active MDS + fails, a standby MDS can serve the MDT and make it available to + clients. This is referred to as MDS failover. + Since Lustre software release 2.4, multiple + MDTs are supported. Each file system has at least one MDT. An MDT on + a shared storage target can be available via multiple MDSs, although + only one MDS can export the MDT to the clients at one time. Two MDS + machines share storage for two or more MDTs. After the failure of one + MDS, the remaining MDS begins serving the MDT(s) of the failed + MDS. - Object Storage Servers (OSS) : The OSS provides - file I/O service and network request handling for one or more local OSTs. Typically, an - OSS serves between two and eight OSTs, up to 16 TB each. A typical configuration is an - MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a - large number of compute nodes. + + Object Storage Servers (OSS): The + OSS provides file I/O service and network request handling for one or + more local OSTs. Typically, an OSS serves between two and eight OSTs, + up to 16 TB each. A typical configuration is an MDT on a dedicated + node, two or more OSTs on each OSS node, and a client on each of a + large number of compute nodes. - Object Storage Target (OST) : User file data is - stored in one or more objects, each object on a separate OST in a Lustre file system. - The number of objects per file is configurable by the user and can be tuned to optimize - performance for a given workload. 
+ + Object Storage Target (OST): User + file data is stored in one or more objects, each object on a separate + OST in a Lustre file system. The number of objects per file is + configurable by the user and can be tuned to optimize performance for + a given workload. - Lustre clients : Lustre clients are computational, - visualization or desktop nodes that are running Lustre client software, allowing them to - mount the Lustre file system. + + Lustre clients: Lustre clients are + computational, visualization or desktop nodes that are running Lustre + client software, allowing them to mount the Lustre file + system. - The Lustre client software provides an interface between the Linux virtual file system - and the Lustre servers. The client software includes a management client (MGC), a metadata - client (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in - the file system. - A logical object volume (LOV) aggregates the OSCs to provide transparent access across - all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, - synchronized namespace. Several clients can write to different parts of the same file - simultaneously, while, at the same time, other clients can read from the file. - provides the requirements for - attached storage for each Lustre file system component and describes desirable - characteristics of the hardware used. + The Lustre client software provides an interface between the Linux + virtual file system and the Lustre servers. The client software includes + a management client (MGC), a metadata client (MDC), and multiple object + storage clients (OSCs), one corresponding to each OST in the file + system. + A logical object volume (LOV) aggregates the OSCs to provide + transparent access across all the OSTs. Thus, a client with the Lustre + file system mounted sees a single, coherent, synchronized namespace. + Several clients can write to different parts of the same file + simultaneously, while, at the same time, other clients can read from the + file. + + provides the + requirements for attached storage for each Lustre file system component + and describes desirable characteristics of the hardware used.
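The client-side devices described above can be listed with the lctl utility. The abbreviated output below is illustrative only and assumes a file system named lustre with one MDT and two OSTs:
client$ lctl dl
  0 UP mgc MGC10.2.0.1@tcp 59b7bade-... 5
  1 UP lov lustre-clilov-ffff88004edf3c00 4c8be054-... 4
  2 UP mdc lustre-MDT0000-mdc-ffff88004edf3c00 4c8be054-... 5
  3 UP osc lustre-OST0000-osc-ffff88004edf3c00 4c8be054-... 5
  4 UP osc lustre-OST0001-osc-ffff88004edf3c00 4c8be054-... 5
One MDC is created for each MDT and one OSC for each OST, and the LOV aggregates the OSCs into the single namespace seen by applications.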
- <indexterm> - <primary>Lustre</primary> - <secondary>requirements</secondary> - </indexterm>Storage and hardware requirements for Lustre file system components + + <indexterm> + <primary>Lustre</primary> + <secondary>requirements</secondary> + </indexterm>Storage and hardware requirements for Lustre file system + components - - - + + + - + + + - Required attached storage + + Required attached storage + - Desirable hardware characteristics + + Desirable hardware + characteristics + @@ -463,253 +588,308 @@ - MDSs + MDSs + - 1-2% of file system capacity + 1-2% of file system capacity - Adequate CPU power, plenty of memory, fast disk storage. + Adequate CPU power, plenty of memory, fast disk + storage. - OSSs + OSSs + - 1-16 TB per OST, 1-8 OSTs per OSS + 1-16 TB per OST, 1-8 OSTs per OSS - Good bus bandwidth. Recommended that storage be balanced evenly across - OSSs. + Good bus bandwidth. Recommended that storage be balanced + evenly across OSSs. - Clients + Clients + - None + None - Low latency, high bandwidth network. + Low latency, high bandwidth network.
- For additional hardware requirements and considerations, see . + For additional hardware requirements and considerations, see + .
- <indexterm> - <primary>Lustre</primary> - <secondary>LNET</secondary> - </indexterm>Lustre Networking (LNET) - Lustre Networking (LNET) is a custom networking API that provides the communication - infrastructure that handles metadata and file I/O data for the Lustre file system servers - and clients. For more information about LNET, see . + + <indexterm> + <primary>Lustre</primary> + <secondary>LNET</secondary> + </indexterm>Lustre Networking (LNET) + Lustre Networking (LNET) is a custom networking API that provides + the communication infrastructure that handles metadata and file I/O data + for the Lustre file system servers and clients. For more information + about LNET, see + .
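LNET is typically configured through kernel module options on each node. As a minimal, hypothetical example, a node with one Ethernet interface and one InfiniBand interface might carry a line such as the following in /etc/modprobe.d/lustre.conf (the interface names are examples only):
options lnet networks="tcp0(eth0),o2ib0(ib0)"
The LNET chapter referenced above describes the supported network types, routing, and tuning options in detail.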
- <indexterm> - <primary>Lustre</primary> - <secondary>cluster</secondary> - </indexterm>Lustre Cluster - At scale, a Lustre file system cluster can include hundreds of OSSs and thousands of - clients (see ). More than one type of - network can be used in a Lustre cluster. Shared storage between OSSs enables failover - capability. For more details about OSS failover, see . + + <indexterm> + <primary>Lustre</primary> + <secondary>cluster</secondary> + </indexterm>Lustre Cluster + At scale, a Lustre file system cluster can include hundreds of OSSs + and thousands of clients (see + ). More than one + type of network can be used in a Lustre cluster. Shared storage between + OSSs enables failover capability. For more details about OSS failover, + see + .
- <indexterm> - <primary>Lustre</primary> - <secondary>at scale</secondary> - </indexterm>Lustre cluster at scale + + <indexterm> + <primary>Lustre</primary> + <secondary>at scale</secondary> + </indexterm>Lustre cluster at scale - + - Lustre file system cluster at scale + Lustre file system cluster at scale
- <indexterm> - <primary>Lustre</primary> - <secondary>storage</secondary> - </indexterm> - <indexterm> - <primary>Lustre</primary> - <secondary>I/O</secondary> - </indexterm> Lustre File System Storage and I/O - In Lustre software release 2.0, Lustre file identifiers (FIDs) were introduced to replace - UNIX inode numbers for identifying files or objects. A FID is a 128-bit identifier that - contains a unique 64-bit sequence number, a 32-bit object ID (OID), and a 32-bit version - number. The sequence number is unique across all Lustre targets in a file system (OSTs and - MDTs). This change enabled future support for multiple MDTs (introduced in Lustre software - release 2.3) and ZFS (introduced in Lustre software release 2.4). - Also introduced in release 2.0 is a feature call FID-in-dirent (also known as dirdata) in - which the FID is stored as part of the name of the file in the parent directory. This feature - significantly improves performance for ls command executions by reducing - disk I/O. The FID-in-dirent is generated at the time the file is created. + + <indexterm> + <primary>Lustre</primary> + <secondary>storage</secondary> + </indexterm> + <indexterm> + <primary>Lustre</primary> + <secondary>I/O</secondary> + </indexterm>Lustre File System Storage and I/O + In Lustre software release 2.0, Lustre file identifiers (FIDs) were + introduced to replace UNIX inode numbers for identifying files or objects. + A FID is a 128-bit identifier that contains a unique 64-bit sequence + number, a 32-bit object ID (OID), and a 32-bit version number. The sequence + number is unique across all Lustre targets in a file system (OSTs and + MDTs). This change enabled future support for multiple MDTs (introduced in + Lustre software release 2.4) and ZFS (introduced in Lustre software release + 2.4). + Also introduced in release 2.0 is a feature call + FID-in-dirent(also known as + dirdata) in which the FID is stored as + part of the name of the file in the parent directory. This feature + significantly improves performance for + ls command executions by reducing disk I/O. The + FID-in-dirent is generated at the time the file is created. - The FID-in-dirent feature is not compatible with the Lustre software release 1.8 format. - Therefore, when an upgrade from Lustre software release 1.8 to a Lustre software release 2.x - is performed, the FID-in-dirent feature is not automatically enabled. For upgrades from - Lustre software release 1.8 to Lustre software releases 2.0 through 2.3, FID-in-dirent can - be enabled manually but only takes effect for new files. - For more information about upgrading from Lustre software release 1.8 and enabling - FID-in-dirent for existing files, see Chapter 16 “Upgrading a Lustre File System”. + The FID-in-dirent feature is not compatible with the Lustre + software release 1.8 format. Therefore, when an upgrade from Lustre + software release 1.8 to a Lustre software release 2.x is performed, the + FID-in-dirent feature is not automatically enabled. For upgrades from + Lustre software release 1.8 to Lustre software releases 2.0 through 2.3, + FID-in-dirent can be enabled manually but only takes effect for new + files. + For more information about upgrading from Lustre software release + 1.8 and enabling FID-in-dirent for existing files, see + Chapter 16 “Upgrading a Lustre File + System”. - The LFSCK 1.5 file system administration tool released with Lustre - software release 2.4 provides functionality that enables FID-in-dirent for existing files. 
It - includes the following functionality: - - Generates IGIF mode FIDs for existing release 1.8 files. - - - Verifies the FID-in-dirent for each file to determine when it doesn’t exist or is - invalid and then regenerates the FID-in-dirent if needed. - - - Verifies the linkEA entry for each file to determine when it is missing or invalid - and then regenerates the linkEA if needed. The linkEA - consists of the file name plus its parent FID and is stored as an extended attribute in - the file itself. Thus, the linkEA can be used to parse out the full path name of a file - from root. - - - Information about where file data is located on the OST(s) is stored as an extended - attribute called layout EA in an MDT object identified by the FID for the file (see ). If the file is - a data file (not a directory or symbol link), the MDT object points to 1-to-N OST object(s) on - the OST(s) that contain the file data. If the MDT file points to one object, all the file data - is stored in that object. If the MDT file points to more than one object, the file data is - striped across the objects using RAID 0, and each object - is stored on a different OST. (For more information about how striping is implemented in a - Lustre file system, see . + The LFSCK file system consistency checking tool + released with Lustre software release 2.4 provides functionality that + enables FID-in-dirent for existing files. It includes the following + functionality: + + + Generates IGIF mode FIDs for existing files from a 1.8 version + file system files. + + + Verifies the FID-in-dirent for each file and regenerates the + FID-in-dirent if it is invalid or missing. + + + Verifies the linkEA entry for each and regenerates the linkEA + if it is invalid or missing. The + linkEAconsists of the file name and + parent FID. It is stored as an extended attribute in the file + itself. Thus, the linkEA can be used to reconstruct the full path name of + a file. + + + Information about where file data is located on the OST(s) is stored + as an extended attribute called layout EA in an MDT object identified by + the FID for the file (see + ). If the file is a regular file (not a + directory or symbol link), the MDT object points to 1-to-N OST object(s) on + the OST(s) that contain the file data. If the MDT file points to one + object, all the file data is stored in that object. If the MDT file points + to more than one object, the file data is + stripedacross the objects using RAID 0, + and each object is stored on a different OST. (For more information about + how striping is implemented in a Lustre file system, see + .
Layout EA on MDT pointing to file data on OSTs - + - Layout EA on MDT pointing to file data on OSTs + Layout EA on MDT pointing to file data on OSTs
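The layout EA of an existing file can be displayed from a client with lfs getstripe. The abbreviated output below is a sketch for a hypothetical two-stripe file; the object ID and index values are illustrative:
client$ lfs getstripe /mnt/lustre/file1
/mnt/lustre/file1
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_offset:  0
        obdidx           objid           objid           group
             0            1234           0x4d2               0
             1            5678          0x162e               0
Each row of the object list identifies an OST index (obdidx) and the object on that OST holding part of the file data.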
- When a client wants to read from or write to a file, it first fetches the layout EA from - the MDT object for the file. The client then uses this information to perform I/O on the file, - directly interacting with the OSS nodes where the objects are stored. - This process is illustrated - in . + When a client wants to read from or write to a file, it first fetches + the layout EA from the MDT object for the file. The client then uses this + information to perform I/O on the file, directly interacting with the OSS + nodes where the objects are stored. + + This process is illustrated in + + .
Lustre client requesting file data - + - Lustre client requesting file data + Lustre client requesting file data
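Because files and objects are identified by FIDs, it is sometimes useful to translate between pathnames and FIDs when tracing client I/O. The sketch below uses an illustrative FID value and assumes a reasonably recent release 2.x client:
client$ lfs path2fid /mnt/lustre/file1
[0x200000400:0x1:0x0]
client$ lfs fid2path /mnt/lustre [0x200000400:0x1:0x0]
/mnt/lustre/file1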
- The available bandwidth of a Lustre file system is determined as follows: + The available bandwidth of a Lustre file system is determined as + follows: - The network bandwidth equals the aggregated bandwidth of the OSSs - to the targets. + The + network bandwidthequals the aggregated bandwidth + of the OSSs to the targets. - The disk bandwidth equals the sum of the disk bandwidths of the - storage targets (OSTs) up to the limit of the network bandwidth. + The + disk bandwidthequals the sum of the disk + bandwidths of the storage targets (OSTs) up to the limit of the network + bandwidth. - The aggregate bandwidth equals the minimum of the disk bandwidth - and the network bandwidth. + The + aggregate bandwidthequals the minimum of the disk + bandwidth and the network bandwidth. - The available file system space equals the sum of the available - space of all the OSTs. + The + available file system spaceequals the sum of the + available space of all the OSTs.
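As a purely hypothetical illustration of these rules, consider a cluster with 8 OSS nodes, each attached to the clients by a 2 GB/s network link and each serving 4 OSTs that sustain 400 MB/s of disk bandwidth. The network bandwidth is 8 x 2 GB/s = 16 GB/s, the disk bandwidth is 8 x 4 x 400 MB/s = 12.8 GB/s, and the aggregate bandwidth is therefore limited by the disks at about 12.8 GB/s. If each OST also has 20 TB of free space, the available file system space is 8 x 4 x 20 TB = 640 TB.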
- <indexterm> - <primary>Lustre</primary> - <secondary>striping</secondary> - </indexterm> - <indexterm> - <primary>striping</primary> - <secondary>overview</secondary> - </indexterm> Lustre File System and Striping - One of the main factors leading to the high performance of Lustre file systems is the - ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally - configure for each file the number of stripes, stripe size, and OSTs that are used. - Striping can be used to improve performance when the aggregate bandwidth to a single - file exceeds the bandwidth of a single OST. The ability to stripe is also useful when a - single OST does not have enough free space to hold an entire file. For more information - about benefits and drawbacks of file striping, see . - Striping allows segments or 'chunks' of data in a file to be stored on - different OSTs, as shown in . In the - Lustre file system, a RAID 0 pattern is used in which data is "striped" across a - certain number of objects. The number of objects in a single file is called the - stripe_count. - Each object contains a chunk of data from the file. When the chunk of data being written - to a particular object exceeds the stripe_size, the next chunk of data in - the file is stored on the next object. - Default values for stripe_count and stripe_size - are set for the file system. The default value for stripe_count is 1 - stripe for file and the default value for stripe_size is 1MB. The user - may change these values on a per directory or per file basis. For more details, see . - , the stripe_size - for File C is larger than the stripe_size for File A, allowing more data - to be stored in a single stripe for File C. The stripe_count for File A - is 3, resulting in data striped across three objects, while the - stripe_count for File B and File C is 1. - No space is reserved on the OST for unwritten data. File A in . + + Lustre + striping + + + striping + overview + Lustre File System and Striping + One of the main factors leading to the high performance of Lustre + file systems is the ability to stripe data across multiple OSTs in a + round-robin fashion. Users can optionally configure for each file the + number of stripes, stripe size, and OSTs that are used. + Striping can be used to improve performance when the aggregate + bandwidth to a single file exceeds the bandwidth of a single OST. The + ability to stripe is also useful when a single OST does not have enough + free space to hold an entire file. For more information about benefits + and drawbacks of file striping, see + . + Striping allows segments or 'chunks' of data in a file to be stored + on different OSTs, as shown in + . In the Lustre file + system, a RAID 0 pattern is used in which data is "striped" across a + certain number of objects. The number of objects in a single file is + called the + stripe_count. + Each object contains a chunk of data from the file. When the chunk + of data being written to a particular object exceeds the + stripe_size, the next chunk of data in the file is + stored on the next object. + Default values for + stripe_count and + stripe_size are set for the file system. The default + value for + stripe_count is 1 stripe for file and the default value + for + stripe_size is 1MB. The user may change these values on + a per directory or per file basis. For more details, see + . + + , the + stripe_size for File C is larger than the + stripe_size for File A, allowing more data to be stored + in a single stripe for File C. 
The + stripe_count for File A is 3, resulting in data striped + across three objects, while the + stripe_count for File B and File C is 1. + No space is reserved on the OST for unwritten data. File A in + .
- File striping on a Lustre file - system + File striping on a + Lustre file system - + - File striping pattern across three OSTs for three different data files. The file - is sparse and missing chunk 6. + File striping pattern across three OSTs for three different + data files. The file is sparse and missing chunk 6.
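The striping parameters described above are set and queried with the lfs utility. The example below is a hedged sketch using an illustrative directory and values; files created in the directory afterwards inherit the new default layout:
client$ lfs setstripe -c 3 -S 4M /mnt/lustre/dir1
client$ lfs getstripe -d /mnt/lustre/dir1
stripe_count:  3 stripe_size:   4194304 stripe_offset: -1
Here -c sets the stripe_count, -S sets the stripe_size, and a stripe_offset of -1 leaves the choice of starting OST to the MDS. The striping chapter referenced at the end of this section covers the complete set of options.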
- The maximum file size is not limited by the size of a single target. In a Lustre file - system, files can be striped across multiple objects (up to 2000), and each object can be - up to 16 TB in size with ldiskfs, or up to 256PB with ZFS. This leads to a maximum file size of 31.25 PB for ldiskfs or 8EB with ZFS. Note that - a Lustre file system can support files up to 2^63 bytes (8EB), limited - only by the space available on the OSTs. + The maximum file size is not limited by the size of a single + target. In a Lustre file system, files can be striped across multiple + objects (up to 2000), and each object can be up to 16 TB in size with + ldiskfs, or up to 256PB with ZFS. This leads to a maximum file size of + 31.25 PB for ldiskfs or 8EB with ZFS. Note that a Lustre file system can + support files up to 2^63 bytes (8EB), limited only by the space available + on the OSTs. - Versions of the Lustre software prior to Release 2.2 limited the maximum stripe count - for a single file to 160 OSTs. + Versions of the Lustre software prior to Release 2.2 limited the + maximum stripe count for a single file to 160 OSTs. - Although a single file can only be striped over 2000 objects, Lustre file systems can - have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated I/O - bandwidth to the objects in a file, which can be as much as a bandwidth of up to 2000 - servers. On systems with more than 2000 OSTs, clients can do I/O using multiple files to - utilize the full file system bandwidth. - For more information about striping, see . + Although a single file can only be striped over 2000 objects, + Lustre file systems can have thousands of OSTs. The I/O bandwidth to + access a single file is the aggregated I/O bandwidth to the objects in a + file, which can be as much as a bandwidth of up to 2000 servers. On + systems with more than 2000 OSTs, clients can do I/O using multiple files + to utilize the full file system bandwidth. + For more information about striping, see + .
diff --git a/UpgradingLustre.xml b/UpgradingLustre.xml index 1409d3d..637785f 100644 --- a/UpgradingLustre.xml +++ b/UpgradingLustre.xml @@ -1,144 +1,198 @@ - + + Upgrading a Lustre File System - This chapter describes interoperability between Lustre software releases. It also provides - procedures for upgrading from Lustre software release 1.8 to Lustre softeware release 2.x , - from a Lustre software release 2.x to a more recent Lustre software release 2.x (major release - upgrade), and from a a Lustre software release 2.x.y to a more recent Lustre software release - 2.x.y (minor release upgrade). It includes the following sections: + This chapter describes interoperability between Lustre software + releases. It also provides procedures for upgrading from Lustre software + release 1.8 to Lustre softeware release 2.x , from a Lustre software release + 2.x to a more recent Lustre software release 2.x (major release upgrade), and + from a a Lustre software release 2.x.y to a more recent Lustre software + release 2.x.y (minor release upgrade). It includes the following + sections: - + + + - + + + - + + +
- <indexterm> - <primary>Lustre</primary> - <secondary>upgrading</secondary> - <see>upgrading</see> - </indexterm><indexterm> - <primary>upgrading</primary> - </indexterm>Release Interoperability and Upgrade Requirements - Lustre software release 2.x (major) - upgrade: + + <indexterm> + <primary>Lustre</primary> + <secondary>upgrading</secondary> + <see>upgrading</see> + </indexterm> + <indexterm> + <primary>upgrading</primary> + </indexterm>Release Interoperability and Upgrade Requirements + + + Lustre software release 2.x (major) + upgrade: + + - All servers must be upgraded at the same time, while some or all clients may be - upgraded. + All servers must be upgraded at the same time, while some or + all clients may be upgraded. - All servers must be be upgraded to a Linux kernel supported by the Lustre software. - See the Linux Test Matrix at for a list of tested Lustre distributions. + All servers must be be upgraded to a Linux kernel supported by + the Lustre software. See the Linux Test Matrix at + for a list of tested Lustre + distributions. - Clients to be upgraded to the Lustre software release 2.4 or higher must be running - a compatible Linux distribution. See the Linux Test Matrix at for a - list of tested Linux distributions. + Clients to be upgraded to the Lustre software release 2.4 or + higher must be running a compatible Linux distribution. See the Linux + Test Matrix at + for a list of tested Linux + distributions. - - Lustre software release 2.x.y release - (minor) upgrade: + + + + + Lustre software release 2.x.y release (minor) + upgrade: + + - All servers must be upgraded at the same time, while some or all clients may be - upgraded. + All servers must be upgraded at the same time, while some or all + clients may be upgraded. - Rolling upgrades are supported for minor releases allowing individual servers and - clients to be upgraded without stopping the Lustre file system. + Rolling upgrades are supported for minor releases allowing + individual servers and clients to be upgraded without stopping the + Lustre file system.
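Before planning either type of upgrade, it is useful to record the kernel and Lustre versions currently running on each node so they can be checked against the test matrix. A minimal sketch follows; the exact package names vary by distribution and by server or client role:
server# uname -r
server# rpm -qa | grep -i lustre
server# lctl get_param version
On most releases the last command reports the running Lustre version; the same commands can be repeated on a client to confirm that the installed lustre-client packages match a tested kernel and release combination.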
- <indexterm> - <primary>upgrading</primary> - <secondary>major release (2.x to 2.x)</secondary> - </indexterm><indexterm> - <primary>wide striping</primary> - </indexterm><indexterm> - <primary>MDT</primary> - <secondary>multiple MDSs</secondary> - </indexterm><indexterm> - <primary>large_xattr</primary> - <secondary>ea_inode</secondary> - </indexterm><indexterm> - <primary>wide striping</primary> - <secondary>large_xattr</secondary> - <tertiary>ea_inode</tertiary> - </indexterm>Upgrading to Lustre Software Release 2.x (Major Release) - The procedure for upgrading from a Lustre software release 2.x to a more recent 2.x - release of the Lustre software is described in this section. + + <indexterm> + <primary>upgrading</primary> + <secondary>major release (2.x to 2.x)</secondary> + </indexterm> + <indexterm> + <primary>wide striping</primary> + </indexterm> + <indexterm> + <primary>MDT</primary> + <secondary>multiple MDSs</secondary> + </indexterm> + <indexterm> + <primary>large_xattr</primary> + <secondary>ea_inode</secondary> + </indexterm> + <indexterm> + <primary>wide striping</primary> + <secondary>large_xattr</secondary> + <tertiary>ea_inode</tertiary> + </indexterm>Upgrading to Lustre Software Release 2.x (Major + Release) + The procedure for upgrading from a Lustre software release 2.x to a + more recent 2.x release of the Lustre software is described in this + section. - This procedure can also be used to upgrade Lustre software release 1.8.6-wc1 or later to - any Lustre software release 2.x. To upgrade other versions of Lustre software release 1.8.x, - contact your support provider. + This procedure can also be used to upgrade Lustre software release + 1.8.6-wc1 or later to any Lustre software release 2.x. To upgrade other + versions of Lustre software release 1.8.x, contact your support + provider. - In Lustre software release 2.2, a feature has been added that allows - striping across up to 2000 OSTs. By default, this "wide striping" feature is disabled. It is - activated by setting the large_xattr or ea_inode - option on the MDT using either - mkfs.lustre or tune2fs. For example after upgrading - an existing file system to Lustre software release 2.2 or later, wide striping can be - enabled by running the following command on the MDT device before mounting - it:tune2fs -O large_xattrOnce the wide striping feature is enabled and in - use on the MDT, it is not possible to directly downgrade the MDT file system to an earlier - version of the Lustre software that does not support wide striping. To disable wide striping: - - Delete all wide-striped files. - OR - Use lfs_migrate with the option -c - stripe_count (set stripe_count - to 160) to move the files to another location. - - - Unmount the MDT. - - - Run the following command to turn off the large_xattr - option:tune2fs -O ^large_xattr - - - Using either mkfs.lustre or tune2fs with - large_xattr or ea_inode option reseults in - ea_inode in the file system feature list. - + In Lustre software release 2.2, a feature has been + added that allows striping across up to 2000 OSTs. By default, this "wide + striping" feature is disabled. It is activated by setting the + large_xattr or + ea_inode option on the MDT using either + mkfs.lustre or + tune2fs. 
For example after upgrading an existing file + system to Lustre software release 2.2 or later, wide striping can be + enabled by running the following command on the MDT device before + mounting it: + tune2fs -O large_xattr + Once the wide striping feature is enabled and in use on the MDT, it is + not possible to directly downgrade the MDT file system to an earlier + version of the Lustre software that does not support wide striping. To + disable wide striping: + + + Delete all wide-striped files. + OR + Use + lfs_migrate with the option + -c + stripe_count(set + stripe_countto 160) to move the files to + another location. + + + Unmount the MDT. + + + Run the following command to turn off the + large_xattr option: + tune2fs -O ^large_xattr + + Using either + mkfs.lustre or + tune2fs with + large_xattr or + ea_inode option reseults in + ea_inode in the file system feature list. + - To generate a list of all files with more than 160 stripes use lfs - find with the --stripe-count - option:lfs find ${mountpoint} --stripe-count=+160 + To generate a list of all files with more than 160 stripes use + lfs find with the + --stripe-count option: + lfs find ${mountpoint} --stripe-count=+160 - In Lustre software release 2.4, a new feature allows using multiple MDTs, which can each - serve one or more remote sub-directories in the file system. The root - directory is always located on MDT0. - Note that clients running a release prior to the Lustre software release 2.4 can only - see the namespace hosted by MDT0 and will return an IO error if an attempt is made to access - a directory on another MDT. + In Lustre software release 2.4, a new feature allows using multiple + MDTs, which can each serve one or more remote sub-directories in the file + system. The + root directory is always located on MDT0. + Note that clients running a release prior to the Lustre software + release 2.4 can only see the namespace hosted by MDT0 and will return an + IO error if an attempt is made to access a directory on another + MDT. - To upgrade a Lustre software release 2.x to a more recent major release, complete these - steps: + To upgrade a Lustre software release 2.x to a more recent major + release, complete these steps: - Create a complete, restorable file system backup. + Create a complete, restorable file system backup. - Before installing the Lustre software, back up ALL data. The Lustre software - contains kernel modifications that interact with storage devices and may introduce - security issues and data loss if not installed, configured, or administered properly. If - a full backup of the file system is not practical, a device-level backup of the MDT file - system is recommended. See for a procedure. + Before installing the Lustre software, back up ALL data. The + Lustre software contains kernel modifications that interact with + storage devices and may introduce security issues and data loss if + not installed, configured, or administered properly. If a full backup + of the file system is not practical, a device-level backup of the MDT + file system is recommended. See + for a procedure. - Shut down the file system by unmounting all clients and servers in the order shown - below (unmounting a block device causes the Lustre software to be shut down on that - node): + Shut down the file system by unmounting all clients and servers + in the order shown below (unmounting a block device causes the Lustre + software to be shut down on that node): Unmount the clients. 
On each client node, run: @@ -156,30 +210,37 @@ Upgrade the Linux operating system on all servers to a compatible - (tested) Linux distribution and reboot. See the Linux Test Matrix at . + (tested) Linux distribution and reboot. See the Linux Test Matrix at + . - Upgrade the Linux operating system on all clients to Red Hat Enterprise Linux 6 or - other compatible (tested) distribution and reboot. See the Linux Test Matrix at . + Upgrade the Linux operating system on all clients to Red Hat + Enterprise Linux 6 or other compatible (tested) distribution and + reboot. See the Linux Test Matrix at + . - Download the Lustre server RPMs for your platform from the Lustre Releases - repository. See for a list of required packages. + Download the Lustre server RPMs for your platform from the + + Lustre Releasesrepository. See + for a list of required packages. - Install the Lustre server packages on all Lustre servers (MGS, MDSs, and OSSs). + Install the Lustre server packages on all Lustre servers (MGS, + MDSs, and OSSs). - Log onto a Lustre server as the root user + Log onto a Lustre server as the + root user - Use the yum command to install the packages: + Use the + yum command to install the packages: - # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... + # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... @@ -194,28 +255,33 @@ - Download the Lustre client RPMs for your platform from the Lustre Releases - repository. See for a list of required packages. + Download the Lustre client RPMs for your platform from the + + Lustre Releasesrepository. See + for a list of required packages. - The version of the kernel running on a Lustre client must be the same as the version - of the lustre-client-modules-ver package - being installed. If not, a compatible kernel must be installed on the client before the - Lustre client packages are installed. + The version of the kernel running on a Lustre client must be + the same as the version of the + lustre-client-modules- + verpackage being installed. If not, a + compatible kernel must be installed on the client before the Lustre + client packages are installed. - Install the Lustre client packages on each of the Lustre clients to be - upgraded. + Install the Lustre client packages on each of the Lustre clients + to be upgraded. - Log onto a Lustre client as the root user. + Log onto a Lustre client as the + root user. - Use the yum command to install the packages: + Use the + yum command to install the packages: - # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... + # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... @@ -230,144 +296,166 @@ - (Optional) For upgrades to Lustre software release 2.2 or higher, to enable wide - striping on an existing MDT, run the following command on the MDT - :mdt# tune2fs -O large_xattr device - For more information about wide striping, see . + (Optional) For upgrades to Lustre software release 2.2 or higher, + to enable wide striping on an existing MDT, run the following command + on the MDT : + mdt# tune2fs -O large_xattr device + For more information about wide striping, see + . - (Optional) For upgrades to Lustre software release 2.4 or higher, to format an - additional MDT, complete these steps: - - Determine the index used for the first MDT (each MDT must have unique index). - Enter:client$ lctl dl | grep mdc + (Optional) For upgrades to Lustre software release 2.4 or higher, + to format an additional MDT, complete these steps: + + + Determine the index used for the first MDT (each MDT must + have unique index). 
Enter: + client$ lctl dl | grep mdc 36 UP mdc lustre-MDT0000-mdc-ffff88004edf3c00 4c8be054-144f-9359-b063-8477566eb84e 5 - In this example, the next available index is 1. - - - Add the new block device as a new MDT at the next available index by entering - (on one - line):mds# mkfs.lustre --reformat --fsname=filesystem_name --mdt \ - --mgsnode=mgsnode --index 1 /dev/mdt1_device - - - + In this example, the next available index is 1. + + + Add the new block device as a new MDT at the next available + index by entering (on one line): + mds# mkfs.lustre --reformat --fsname=filesystem_name --mdt \ + --mgsnode=mgsnode --index 1 +/dev/mdt1_device + + - (Optional) If you are upgrading to Lustre software release 2.3 or higher from Lustre - software release 2.2 or earlier and want to enable the quota feature, complete these - steps: - - Before setting up the file system, enter on both the MDS and - OSTs:tunefs.lustre --quota - - - When setting up the file system, - enter:conf_param $FSNAME.quota.mdt=$QUOTA_TYPE + (Optional) If you are upgrading to Lustre software release 2.3 or + higher from Lustre software release 2.2 or earlier and want to enable + the quota feature, complete these steps: + + + Before setting up the file system, enter on both the MDS and + OSTs: + tunefs.lustre --quota + + + When setting up the file system, enter: + conf_param $FSNAME.quota.mdt=$QUOTA_TYPE conf_param $FSNAME.quota.ost=$QUOTA_TYPE - - + + - (Optional) If you are upgrading from Lustre software release 1.8, you must manually - enable the FID-in-dirent feature. On the MDS, - enter:tune2fs –O dirdata /dev/mdtdev + (Optional) If you are upgrading from Lustre software release 1.8, + you must manually enable the FID-in-dirent feature. On the MDS, enter: + tune2fs –O dirdata /dev/mdtdev - This step is not reversible. Do not complete this step until you are sure you will - not be downgrading the Lustre software. + This step is not reversible. Do not complete this step until + you are sure you will not be downgrading the Lustre software. - This step only enables FID-in-dirent for newly created files. If you are upgrading to - Lustre software release 2.4, you can use LFSCK 1.5 to enable FID-in-dirent for existing - files. For more information about FID-in-dirent and related functionalities in LFSCK 1.5, - see . + This step only enables FID-in-dirent for newly + created files. If you are upgrading to Lustre software release 2.4, + you can use LFSCK to enable FID-in-dirent for existing files. For + more information about FID-in-dirent and related functionalities in + LFSCK, see . - Start the Lustre file system by starting the components in the order shown in the - following steps: + Start the Lustre file system by starting the components in the + order shown in the following steps: - Mount the MGT. On the MGS, runmgs# mount -a -t lustre + Mount the MGT. On the MGS, run + mgs# mount -a -t lustre - Mount the MDT(s). On each MDT, run:mds# mount -a -t lustre + Mount the MDT(s). On each MDT, run: + mds# mount -a -t lustre Mount all the OSTs. On each OSS node, run: oss# mount -a -t lustre - This command assumes that all the OSTs are listed in the - /etc/fstab file. OSTs that are not listed in the - /etc/fstab file, must be mounted individually by running the - mount command: - mount -t lustre /dev/block_device /mount_point + This command assumes that all the OSTs are listed in the + /etc/fstab file. 
OSTs that are not listed in + the + /etc/fstab file, must be mounted individually + by running the mount command: + mount -t lustre /dev/block_device/mount_point - Mount the file system on the clients. On each client node, run: + Mount the file system on the clients. On each client node, + run: client# mount -a -t lustre - The mounting order described in the steps above must be followed for the intial mount - and registration of a Lustre file system after an upgrade. For a normal start of a Lustre - file system, the mounting order is MGT, OSTs, MDT(s), clients. + The mounting order described in the steps above must be followed + for the intial mount and registration of a Lustre file system after an + upgrade. For a normal start of a Lustre file system, the mounting order + is MGT, OSTs, MDT(s), clients. - If you have a problem upgrading a Lustre file system, see for some ways - to get help. + If you have a problem upgrading a Lustre file system, see + for some ways to get help.
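If the file system was upgraded from Lustre software release 1.8 and the dirdata feature was enabled as described above, FID-in-dirent entries for files that existed before the upgrade can be generated afterwards by running a namespace LFSCK scan while the file system remains in use. The following is a hedged sketch assuming an MDT named lustre-MDT0000:
mds# lctl lfsck_start -M lustre-MDT0000 -t namespace
mds# lctl get_param mdd.lustre-MDT0000.lfsck_namespace
The second command reports the phase and status of the scan; see the LFSCK documentation referenced earlier for the full set of options and status fields.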
- <indexterm> - <primary>upgrading</primary> - <secondary>2.X.y to 2.X.y (minor release)</secondary> - </indexterm>Upgrading to Lustre Software Release 2.x.y (Minor Release) - Rolling upgrades are supported for upgrading from any Lustre software release 2.x.y to a - more recent Lustre software release 2.X.y. This allows the Lustre file system to continue to - run while individual servers (or their failover partners) and clients are upgraded one at a - time. The procedure for upgrading a Lustre software release 2.x.y to a more recent minor - release is described in this section. - To upgrade Lustre software release 2.x.y to a more recent minor release, complete these - steps: + + <indexterm> + <primary>upgrading</primary> + <secondary>2.X.y to 2.X.y (minor release)</secondary> + </indexterm>Upgrading to Lustre Software Release 2.x.y (Minor + Release) + Rolling upgrades are supported for upgrading from any Lustre software + release 2.x.y to a more recent Lustre software release 2.X.y. This allows + the Lustre file system to continue to run while individual servers (or + their failover partners) and clients are upgraded one at a time. The + procedure for upgrading a Lustre software release 2.x.y to a more recent + minor release is described in this section. + To upgrade Lustre software release 2.x.y to a more recent minor + release, complete these steps: - Create a complete, restorable file system backup. + Create a complete, restorable file system backup. - Before installing the Lustre software, back up ALL data. The Lustre software - contains kernel modifications that interact with storage devices and may introduce - security issues and data loss if not installed, configured, or administered properly. If - a full backup of the file system is not practical, a device-level backup of the MDT file - system is recommended. See for a procedure. + Before installing the Lustre software, back up ALL data. The + Lustre software contains kernel modifications that interact with + storage devices and may introduce security issues and data loss if + not installed, configured, or administered properly. If a full backup + of the file system is not practical, a device-level backup of the MDT + file system is recommended. See + for a procedure. - Download the Lustre server RPMs for your platform from the Lustre Releases - repository. See for a list of required packages. + Download the Lustre server RPMs for your platform from the + + Lustre Releasesrepository. See + for a list of required packages. - For a rolling upgrade, complete any procedures required to keep the Lustre file system - running while the server to be upgraded is offline, such as failing over a primary server - to its secondary partner. + For a rolling upgrade, complete any procedures required to keep + the Lustre file system running while the server to be upgraded is + offline, such as failing over a primary server to its secondary + partner. - Unmount the Lustre server to be upgraded (MGS, MDS, or OSS) + Unmount the Lustre server to be upgraded (MGS, MDS, or + OSS) Install the Lustre server packages on the Lustre server. - Log onto the Lustre server as the root user + Log onto the Lustre server as the + root user - Use the yum command to install the packages: + Use the + yum command to install the packages: - # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... + # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... 
@@ -378,7 +466,8 @@ conf_param $FSNAME.quota.ost=$QUOTA_TYPE Mount the Lustre server to restart the Lustre software on the - server:server# mount -a -t lustre + server: + server# mount -a -t lustre Repeat these steps on each Lustre server. @@ -386,22 +475,25 @@ conf_param $FSNAME.quota.ost=$QUOTA_TYPE - Download the Lustre client RPMs for your platform from the Lustre Releases - repository. See for a list of required packages. + Download the Lustre client RPMs for your platform from the + + Lustre Releasesrepository. See + for a list of required packages. - Install the Lustre client packages on each of the Lustre clients to be - upgraded. + Install the Lustre client packages on each of the Lustre clients + to be upgraded. - Log onto a Lustre client as the root user. + Log onto a Lustre client as the + root user. - Use the yum command to install the packages: + Use the + yum command to install the packages: - # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... + # yum --nogpgcheck install pkg1.rpm pkg2.rpm ... @@ -412,7 +504,8 @@ conf_param $FSNAME.quota.ost=$QUOTA_TYPE Mount the Lustre client to restart the Lustre software on the - client:client# mount -a -t lustre + client: + client# mount -a -t lustre Repeat these steps on each Lustre client. @@ -420,8 +513,9 @@ conf_param $FSNAME.quota.ost=$QUOTA_TYPE - If you have a problem upgrading a Lustre file system, see for some - suggestions for how to get help. + If you have a problem upgrading a Lustre file system, see + for some suggestions for how to get + help.
-- 1.8.3.1