From a41458cded319478e71c3577fd255b1daa2b0eb4 Mon Sep 17 00:00:00 2001 From: Richard Henwood Date: Tue, 17 May 2011 18:07:01 -0500 Subject: [PATCH] FIX: xrefs --- BackupAndRestore.xml | 281 ++++++++++++++++++----------------------------- ManagingFileSystemIO.xml | 254 ++++++++++-------------------------------- 2 files changed, 162 insertions(+), 373 deletions(-) diff --git a/BackupAndRestore.xml b/BackupAndRestore.xml index ff67bbe..f907cc7 100644 --- a/BackupAndRestore.xml +++ b/BackupAndRestore.xml @@ -1,57 +1,34 @@ - + - Backing Up and Restoring a File System + Backing Up and Restoring a File System Lustre provides backups at the file system-level, device-level and file-level. This chapter describes how to backup and restore on Lustre, and includes the following sections: - Backing up a File System - - - - - - Backing Up and Restoring an MDS or OST (Device Level) - - - - - - Making a File-Level Backup of an OST File System - - - - - - Restoring a File-Level Backup - - - - - - Using LVM Snapshots with Lustre - - - - + + + + + + + + + + + + + + -
- <anchor xml:id="dbdoclet.50438207_pgfId-1292650" xreflabel=""/> -
- 17.1 <anchor xml:id="dbdoclet.50438207_56395" xreflabel=""/>Backing up a File System + +
+ 17.1 Backing up a File System Backing up a complete file system gives you full control over the files to back up, and allows restoration of individual files as needed. File system-level backups are also the easiest to integrate into existing backup solutions. File system backups are performed from a Lustre client (or many clients working parallel in different directories) rather than on individual server nodes; this is no different than backing up any other file system. However, due to the large size of most Lustre file systems, it is not always possible to get a complete backup. We recommend that you back up subsets of a file system. This includes subdirectories of the entire file system, filesets for a single user, files incremented by date, and so on. - - - - - - Note -In order to allow Lustre to scale the filesystem namespace for future applications, Lustre 2.x internally uses a 128-bit file identifier for all files. To interface with user applications, Lustre presents 64-bit inode numbers for the stat(), fstat(), and readdir() system calls on 64-bit applications, and 32-bit inode numbers to 32-bit applications. Some 32-bit applications accessing Lustre filesystems (on both 32-bit and 64-bit CPUs) may experience problems with the stat(), fstat() or readdir() system calls under certain circumstances, though the Lustre client should return 32-bit inode numbers to these applications. In particular, if the Lustre filesystem is exported from a 64-bit client via NFS to a 32-bit client, the Linux NFS server will export 64-bit inode numbers to applications running on the NFS client. If the 32-bit applications are not compiled with Large File Support (LFS), then they return EOVERFLOW errors when accessing the Lustre files. To avoid this problem, Linux NFS clients can use the kernel command-line option "nfs.enable_ino64=0" in order to force the NFS client to export 32-bit inode numbers to the client.Workaround: We very strongly recommend that backups using tar(1) and other utilities that depend on the inode number to uniquely identify an inode to be run on 64-bit clients. The 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit inode number, and as a result these utilities may operate incorrectly on 32-bit clients. - - - - + + In order to allow Lustre to scale the filesystem namespace for future applications, Lustre 2.x internally uses a 128-bit file identifier for all files. To interface with user applications, Lustre presents 64-bit inode numbers for the stat(), fstat(), and readdir() system calls on 64-bit applications, and 32-bit inode numbers to 32-bit applications. Some 32-bit applications accessing Lustre filesystems (on both 32-bit and 64-bit CPUs) may experience problems with the stat(), fstat() or readdir() system calls under certain circumstances, though the Lustre client should return 32-bit inode numbers to these applications. In particular, if the Lustre filesystem is exported from a 64-bit client via NFS to a 32-bit client, the Linux NFS server will export 64-bit inode numbers to applications running on the NFS client. If the 32-bit applications are not compiled with Large File Support (LFS), then they return EOVERFLOW errors when accessing the Lustre files. 
To avoid this problem, Linux NFS clients can use the kernel command-line option "nfs.enable_ino64=0" to force the NFS client to export 32-bit inode numbers to the client. Workaround: we strongly recommend that backups using tar(1), and other utilities that depend on the inode number to uniquely identify a file, be run on 64-bit clients. The 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit inode number, and as a result these utilities may operate incorrectly on 32-bit clients.
<anchor xml:id="dbdoclet.50438207_pgfId-1293842" xreflabel=""/>17.1.1 Lustre_rsync The lustre_rsync feature keeps the entire file system in sync on a backup by replicating the file system’s changes to a second file system (the second file system need not be a Lustre file system, but it must be sufficiently large). Lustre_rsync uses Lustre changelogs to efficiently synchronize the file systems without having to scan (directory walk) the Lustre file system. This efficiency is critically important for large file systems, and distinguishes the Lustre lustre_rsync feature from other replication/backup solutions. @@ -61,25 +38,19 @@ The first time that lustre_rsync is run, the user must specify a set of parameters for the program to use. These parameters are described in the following table and in lustre_rsync. On subsequent runs, these parameters are stored in the the status file, and only the name of the status file needs to be passed to lustre_rsync. Before using lustre_rsync: - Register the changelog user. For details, see the changelog_register parameter in the lctl. - - - + Register the changelog user. For details, see the (changelog_register) parameter in the (lctl). - AND - Verify that the Lustre file system (source) and the replica file system (target) are identical before registering the changelog user. If the file systems are discrepant, use a utility, e.g. regular rsync (not lustre_rsync), to make them identical. - - - The lustre_rsync utility uses the following parameters: - - + + Parameter @@ -101,7 +72,7 @@ --user=<user id> - The changelog user ID for the specified MDT. To use lustre_rsync, the changelog user must be registered. For details, see the changelog_register parameter in lctl. This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. + The changelog user ID for the specified MDT. To use lustre_rsync, the changelog user must be registered. For details, see the changelog_register parameter in (lctl). This is a mandatory option if a valid status log created during a previous synchronization operation (--statuslog) is not specified. --statuslog=<log> @@ -109,7 +80,7 @@ --xattr <yes|no> - Specifies whether extended attributes (xattrs) are synchronized or not. The default is to synchronize extended attributes.Note - Disabling xattrs causes Lustre striping information not to be synchronized. + Specifies whether extended attributes (xattrs) are synchronized or not. The default is to synchronize extended attributes.Disabling xattrs causes Lustre striping information not to be synchronized. --verbose @@ -170,29 +141,14 @@ get2 \
-
- 17.2 <anchor xml:id="dbdoclet.50438207_71633" xreflabel=""/>Backing Up and Restoring an MDS or OST (Device Level) +
+ 17.2 Backing Up and Restoring an MDS or OST (Device Level) In some cases, it is useful to do a full device-level backup of an individual device (MDT or OST), before replacing hardware, performing maintenance, etc. Doing full device-level backups ensures that all of the data and configuration files is preserved in the original state and is the easiest method of doing a backup. For the MDT file system, it may also be the fastest way to perform the backup and restore, since it can do large streaming read and write operations at the maximum bandwidth of the underlying devices. - - - - - - Note -Keeping an updated full backup of the MDT is especially important because a permanent failure of the MDT file system renders the much larger amount of data in all the OSTs largely inaccessible and unusable. - - - - - - - - - - Note -In Lustre 2.0 and 2.1 the only correct way to perform an MDT backup and restore is to do a device-level backup as is described in this section. The ability to do MDT file-level backups is not functional in these releases because of the inability to restore the Object Index (OI) file correctly (see bug 22741 for details). - - - - + + Keeping an updated full backup of the MDT is especially important because a permanent failure of the MDT file system renders the much larger amount of data in all the OSTs largely inaccessible and unusable. + + In Lustre 2.0 and 2.1 the only correct way to perform an MDT backup and restore is to do a device-level backup as is described in this section. The ability to do MDT file-level backups is not functional in these releases because of the inability to restore the Object Index (OI) file correctly (see bug 22741 for details). + If hardware replacement is the reason for the backup or if a spare storage device is available, it is possible to do a raw copy of the MDT or OST from one block device to the other, as long as the new device is at least as large as the original device. To do this, run: dd if=/dev/{original} of=/dev/{new} bs=1M @@ -202,135 +158,110 @@ get2 \ Even in the face of hardware errors, the ldiskfs file system is very robust and it may be possible to recover the file system data after running e2fsck -f on the new device.
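If a spare device is not available, the same raw copy can instead be captured in a compressed image file and written back later. This is a sketch only, reusing the placeholder names above; the image location is an assumption:
dd if=/dev/{original} bs=1M | gzip -c > {backup file}.img.gz
gunzip -c {backup file}.img.gz | dd of=/dev/{new} bs=1M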
-
- 17.3 <anchor xml:id="dbdoclet.50438207_21638" xreflabel=""/>Making a File-Level Backup of an OST File System +
+ 17.3 Making a File-Level Backup of an OST File System This procedure provides another way to backup or migrate the data of an OST at the file level, so that the unused space of the OST does not need to be backed up. Backing up a single OST device is not necessarily the best way to perform backups of the Lustre file system, since the files stored in the backup are not usable without metadata stored on the MDT. However, it is the preferred method for migration of OST devices, especially when it is desirable to reformat the underlying file system with different configuration options or to reduce fragmentation. - - - - - - Note -In Lustre 2.0 and 2.1 the only correct way to perform an MDT backup and restore is to do a device-level backup as is described in this section. The ability to do MDT file-level backups is not functional in these releases because of the inability to restore the Object Index (OI) file correctly (see bug 22741 for details). - - - - - 1. Make a mountpoint for the file system. + + In Lustre 2.0 and 2.1 the only correct way to perform an MDT backup and restore is to do a device-level backup as is described in this section. The ability to do MDT file-level backups is not functional in these releases because of the inability to restore the Object Index (OI) file correctly (see bug 22741 for details). + + Make a mountpoint for the file system. [oss]# mkdir -p /mnt/ost - 2. Mount the file system. + + Mount the file system. [oss]# mount -t ldiskfs /dev/{ostdev} /mnt/ost - 3. Change to the mountpoint being backed up. + + Change to the mountpoint being backed up. [oss]# cd /mnt/ost - 4. Back up the extended attributes. + + Back up the extended attributes. [oss]# getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak - - - - - - Note -If the tar(1) command supports the --xattr option, the getfattr step may be unnecessary as long as it does a backup of the "trusted" attributes. However, completing this step is not harmful and can serve as an added safety measure. - - - - - - - - - - Note -In most distributions, the getfattr command is part of the "attr" package. If the getfattr command returns errors like Operation not supported, then the kernel does not correctly support EAs. Stop and use a different backup method. - - - - - 5. Verify that the ea-$date.bak file has properly backed up the EA data on the OST. + If the tar(1) command supports the --xattr option, the getfattr step may be unnecessary as long as it does a backup of the "trusted" attributes. However, completing this step is not harmful and can serve as an added safety measure. + In most distributions, the getfattr command is part of the "attr" package. If the getfattr command returns errors like Operation not supported, then the kernel does not correctly support EAs. Stop and use a different backup method. + + + Verify that the ea-$date.bak file has properly backed up the EA data on the OST. Without this attribute data, the restore process may be missing extra data that can be very useful in case of later file system corruption. Look at this file with more or a text editor. Each object file should hae a corresponding item similar to this: [oss]# file: O/0/d0/100992 trusted.fid= \ 0x0d822200000000004a8a73e500000000808a0100000000000000000000000000 - 6. Back up all file system data. + + Back up all file system data. [oss]# tar czvf {backup file}.tgz --sparse . - - - - - - Note -In Lustre 1.6.7 and later, the --sparse option reduces the size of the backup file. 
Be sure to use it so the tar command does not mistakenly create an archive full of zeros. - - - - - 7. Change directory out of the file system. + In Lustre 1.6.7 and later, the --sparse option reduces the size of the backup file. Be sure to use it so the tar command does not mistakenly create an archive full of zeros. + + + Change directory out of the file system. [oss]# cd - - 8. Unmount the file system. + + Unmount the file system. [oss]# umount /mnt/ost - - - - - - Note -When restoring an OST backup on a different node as part of an OST migration, you also have to change server NIDs and use the --writeconf command to re-generate the configuration logs. See Changing a Server NID. - - - - + When restoring an OST backup on a different node as part of an OST migration, you also have to change server NIDs and use the --writeconf command to re-generate the configuration logs. See (Changing a Server NID). + + + +
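Taken together, the steps above reduce to a short script. The following is an illustrative sketch only; the OST device, mount point and archive location are assumptions and should be adjusted for the local site:
#!/bin/bash
# Illustrative file-level OST backup following the steps above.
OSTDEV=/dev/sdb                        # assumed OST block device
MNT=/mnt/ost
ARCHIVE=/root/ost-$(date +%Y%m%d).tgz  # assumed destination outside the OST
mkdir -p $MNT
mount -t ldiskfs $OSTDEV $MNT
cd $MNT
# Save the extended attributes, then archive the data (the EA file is included).
getfattr -R -d -m '.*' -e hex -P . > ea-$(date +%Y%m%d).bak
tar czvf $ARCHIVE --sparse .
cd -
umount $MNT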
-
- 17.4 <anchor xml:id="dbdoclet.50438207_22325" xreflabel=""/>Restoring a File-Level Backup +
17.4 Restoring a File-Level Backup
To restore data from a file-level backup, you need to format the device, restore the file data and then restore the EA data.
+ Format the new device.
[oss]# mkfs.lustre --ost --index {OST index} {other options} {newdev}
+ Mount the file system.
[oss]# mount -t ldiskfs {newdev} /mnt/ost
+ Change to the new file system mount point.
[oss]# cd /mnt/ost
+ Restore the file system backup.
[oss]# tar xzvpf {backup file} --sparse
+ Restore the file system extended attributes.
[oss]# setfattr --restore=ea-${date}.bak
+ Verify that the extended attributes were restored.
[oss]# getfattr -d -m ".*" -e hex O/0/d0/100992
trusted.fid= \
0x0d822200000000004a8a73e500000000808a0100000000000000000000000000
+ Change directory out of the file system.
[oss]# cd -
+ Unmount the new file system.
[oss]# umount /mnt/ost
If the file system was used between the time the backup was made and when it was restored, then the lfsck tool (part of Lustre e2fsprogs) can optionally be run to ensure the file system is coherent. If all of the device file systems were backed up at the same time after the entire Lustre file system was stopped, this is not necessary. In either case, the file system should be immediately usable even if lfsck is not run, though there may be I/O errors reading from files that are present on the MDT but not the OSTs, and files that were created after the MDT backup will not be accessible/visible.
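If a coherency check is wanted after such a restore, the usual lfsck sequence looks roughly like the following outline. This is a sketch only: the database file locations and device names are assumptions, the database files must be reachable from each node that uses them, and the exact options depend on the installed Lustre e2fsprogs version.
[mds]# e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}
[oss]# e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /dev/{ostdev}
[client]# lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /mnt/lustre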
-
- 17.5 <anchor xml:id="dbdoclet.50438207_31553" xreflabel=""/>Using LVM Snapshots with Lustre +
+ 17.5 Using LVM Snapshots with Lustre If you want to perform disk-based backups (because, for example, access to the backup system needs to be as fast as to the primary Lustre file system), you can use the Linux LVM snapshot tool to maintain multiple, incremental file system backups. Because LVM snapshots cost CPU cycles as new files are written, taking snapshots of the main Lustre file system will probably result in unacceptable performance losses. You should create a new, backup Lustre file system and periodically (e.g., nightly) back up new/changed files to it. Periodic snapshots can be taken of this backup file system to create a series of "full" backups. - - - - - - Note -Creating an LVM snapshot is not as reliable as making a separate backup, because the LVM snapshot shares the same disks as the primary MDT device, and depends on the primary MDT device for much of its data. If the primary MDT device becomes corrupted, this may result in the snapshot being corrupted. - - - - + + Creating an LVM snapshot is not as reliable as making a separate backup, because the LVM snapshot shares the same disks as the primary MDT device, and depends on the primary MDT device for much of its data. If the primary MDT device becomes corrupted, this may result in the snapshot being corrupted. +
<anchor xml:id="dbdoclet.50438207_pgfId-1292752" xreflabel=""/>17.5.1 Creating an LVM-based Backup File System Use this procedure to create a backup Lustre file system for use with the LVM snapshot mechanism. - 1. Create LVM volumes for the MDT and OSTs. + + Create LVM volumes for the MDT and OSTs. Create LVM devices for your MDT and OST targets. Make sure not to use the entire disk for the targets; save some room for the snapshots. The snapshots start out as 0 size, but grow as you make changes to the current file system. If you expect to change 20% of the file system between backups, the most recent snapshot will be 20% of the target size, the next older one will be 40%, etc. Here is an example: cfs21:~# pvcreate /dev/sda1 Physical volume "/dev/sda1" successfully created @@ -344,8 +275,9 @@ get2 \ ACTIVE '/dev/volgroup/MDT' [200.00 MB] inherit ACTIVE '/dev/volgroup/OST0' [200.00 MB] inherit - 2. Format the LVM volumes as Lustre targets. - In this example, the backup file system is called “main†and designates the current, most up-to-date backup. + + Format the LVM volumes as Lustre targets. + In this example, the backup file system is called 'main' and designates the current, most up-to-date backup. cfs21:~# mkfs.lustre --mdt --fsname=main /dev/volgroup/MDT No management node specified, adding MGS to this MDT. Permanent disk data: @@ -389,6 +321,7 @@ index -F /dev/volgroup/MDT cfs21:~# mount -t lustre /dev/volgroup/OST0 /mnt/ost cfs21:~# mount -t lustre cfs21:/main /mnt/main +
<anchor xml:id="dbdoclet.50438207_pgfId-1292809" xreflabel=""/>17.5.2 Backing up New/Changed Files to the Backup File System @@ -423,7 +356,8 @@ index -F /dev/volgroup/MDT
<anchor xml:id="dbdoclet.50438207_pgfId-1292832" xreflabel=""/>17.5.4 Restoring the File System From a Snapshot Use this procedure to restore the file system from an LVM snapshot. - 1. Rename the LVM snapshot. + + Rename the LVM snapshot. Rename the file system snapshot from "main" to "back" so you can mount it without unmounting "main". This is recommended, but not required. Use the --reformat flag to tunefs.lustre to force the name change. For example: cfs21:~# tunefs.lustre --reformat --fsname=back --writeconf /dev/volgroup/M\ DTb1 @@ -482,18 +416,22 @@ ts cfs21:~# rm /mnt/ostback/last_rcvd cfs21:~# umount /mnt/ostback - 2. Mount the file system from the LVM snapshot. + + Mount the file system from the LVM snapshot. For example: cfs21:~# mount -t lustre /dev/volgroup/MDTb1 /mnt/mdtback \ cfs21:~# mount -t lustre /dev/volgroup/OSTb1 /mnt/ostback cfs21:~# mount -t lustre cfs21:/back /mnt/back + 3. Note the old directory contents, as of the snapshot time. For example: cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back fstab passwds + +
<anchor xml:id="dbdoclet.50438207_pgfId-1292898" xreflabel=""/>17.5.5 Deleting Old Snapshots @@ -506,18 +444,7 @@ ts You can also extend or shrink snapshot volumes if you find your daily deltas are smaller or larger than expected. Run: lvextend -L10G /dev/volgroup/MDTb1 - - - - - - Note - Extending snapshots seems to be broken in older LVM. It is working in LVM v2.02.01. - - - - -   + Extending snapshots seems to be broken in older LVM. It is working in LVM v2.02.01.
-
diff --git a/ManagingFileSystemIO.xml b/ManagingFileSystemIO.xml index a44effc..111d50b 100644 --- a/ManagingFileSystemIO.xml +++ b/ManagingFileSystemIO.xml @@ -1,44 +1,29 @@ - + - Managing the File System and I/O + Managing the File System and I/O This chapter describes file striping and I/O options, and includes the following sections: + - Handling Full OSTs + - + - Creating and Managing OST Pools + - + - Adding an OST to a Lustre File System - - - - - - Performing Direct I/O - - - - - - Other I/O Options - - - + -
- <anchor xml:id="dbdoclet.50438211_pgfId-1294597" xreflabel=""/> -
- 19.1 <anchor xml:id="dbdoclet.50438211_17536" xreflabel=""/>Handling <anchor xml:id="dbdoclet.50438211_marker-1295529" xreflabel=""/>Full OSTs + +
+ 19.1 Handling <anchor xml:id="dbdoclet.50438211_marker-1295529" xreflabel=""/>Full OSTs Sometimes a Lustre file system becomes unbalanced, often due to incorrectly-specified stripe settings, or when very large files are created that are not striped over all of the OSTs. If an OST is full and an attempt is made to write more information to the file system, an error occurs. The procedures below describe how to handle a full OST. The MDS will normally handle space balancing automatically at file creation time, and this procedure is normally not needed, but may be desirable in certain circumstances (e.g. when creating very large files that would consume more than the total free space of the full OSTs).
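A quick way to spot a full or imbalanced OST from a client is lfs df, as described in the section on checking OST space usage; for example (mount point assumed):
[client]# lfs df -h /mnt/lustre
OSTs at or near 100% use are candidates for the procedures below.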
@@ -78,12 +63,14 @@ nt=100
<anchor xml:id="dbdoclet.50438211_pgfId-1294633" xreflabel=""/>19.1.2 Taking a Full OST Offline To avoid running out of space in the file system, if the OST usage is imbalanced and one or more OSTs are close to being full while there are others that have a lot of space, the full OSTs may optionally be deactivated at the MDS to prevent the MDS from allocating new objects there. - 1. Log into the MDS server: + + Log into the MDS server: [root@LustreClient01 ~]# ssh root@192.168.0.10 root@192.168.0.10's password: Last login: Wed Nov 26 13:35:12 2008 from 192.168.0.6 - 2. Use the lctl dl command to show the status of all file system components: + + Use the lctl dl command to show the status of all file system components: [root@mds ~]# lctl dl 0 UP mgs MGS MGS 9 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5 @@ -97,10 +84,12 @@ nt=100 9 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5 10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5 - 3. Use lctl deactivate to take the full OST offline: + + Use lctl deactivate to take the full OST offline: [root@mds ~]# lctl --device 7 deactivate - 4. Display the status of the file system components: + + Display the status of the file system components: [root@mds ~]# lctl dl 0 UP mgs MGS MGS 9 1 UP mgc MGC192.168.0.10@tcp e384bb0e-680b-ce25-7bc9-81655dd1e813 5 @@ -114,26 +103,31 @@ nt=100 9 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5 10 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5 + The device list shows that OST0002 is now inactive. When new files are created in the file system, they will only use the remaining active OSTs. Either manual space rebalancing can be done by migrating data to other OSTs, as shown in the next section, or normal file deletion and creation can be allowed to passively rebalance the space usage.
<anchor xml:id="dbdoclet.50438211_pgfId-1294681" xreflabel=""/>19.1.3 Migrating Data within a File System As stripes cannot be moved within the file system, data must be migrated manually by copying and renaming the file, removing the original file, and renaming the new file with the original file name. The simplest way to do this is to use the lfs_migrate command (see lfs_migrate). However, the steps for migrating a file by hand are also shown here for reference. - 1. Identify the file(s) to be moved. + + Identify the file(s) to be moved. In the example below, output from the getstripe command indicates that the file test_2 is located entirely on OST2: [client]# lfs getstripe /mnt/lustre/test_2 /mnt/lustre/test_2 obdidx objid objid group 2 8 0x8 0 - 2. To move single object(s), create a new copy and remove the original. Enter: + + To move single object(s), create a new copy and remove the original. Enter: [client]# cp -a /mnt/lustre/test_2 /mnt/lustre/test_2.tmp [client]# mv /mnt/lustre/test_2.tmp /mnt/lustre/test_2 - 3. To migrate large files from one or more OSTs, enter: + + To migrate large files from one or more OSTs, enter: [client]# lfs find --ost {OST_UUID} -size +1G | lfs_migrate -y - 4. Check the file system balance. + + Check the file system balance. The df output in the example below shows a more balanced system compared to the df output in the example in Checking OST Space Usage. [client]# lfs df -h UUID bytes Used Available Use% \ @@ -156,6 +150,7 @@ nt=100 filesystem summary: 11.8G 7.3G 3.9G 61% \ /mnt/lustre +
<anchor xml:id="dbdoclet.50438211_pgfId-1296756" xreflabel=""/>19.1.4 Returning an Inactive OST Back Online @@ -176,57 +171,28 @@ nt=100
-
- 19.2 <anchor xml:id="dbdoclet.50438211_75549" xreflabel=""/>Creating and Managing <anchor xml:id="dbdoclet.50438211_marker-1295531" xreflabel=""/>OST Pools +
+ 19.2 Creating and Managing <anchor xml:id="dbdoclet.50438211_marker-1295531" xreflabel=""/>OST Pools The OST pools feature enables users to group OSTs together to make object placement more flexible. A 'pool' is the name associated with an arbitrary subset of OSTs in a Lustre cluster. OST pools follow these rules: An OST can be a member of multiple pools. - - - No ordering of OSTs in a pool is defined or implied. - - - Stripe allocation within a pool follows the same rules as the normal stripe allocator. - - - OST membership in a pool is flexible, and can change over time. - - - When an OST pool is defined, it can be used to allocate files. When file or directory striping is set to a pool, only OSTs in the pool are candidates for striping. If a stripe_index is specified which refers to an OST that is not a member of the pool, an error is returned. OST pools are used only at file creation. If the definition of a pool changes (an OST is added or removed or the pool is destroyed), already-created files are not affected. - - - - - - Note -An error (EINVAL) results if you create a file using an empty pool. - - - - - - - - - - Note -If a directory has pool striping set and the pool is subsequently removed, the new files created in this directory have the (non-pool) default striping pattern for that directory applied and no error is returned. - - - - + An error (EINVAL) results if you create a file using an empty pool. + If a directory has pool striping set and the pool is subsequently removed, the new files created in this directory have the (non-pool) default striping pattern for that directory applied and no error is returned. +
<anchor xml:id="dbdoclet.50438211_pgfId-1293517" xreflabel=""/>19.2.1 <anchor xml:id="dbdoclet.50438211_71392" xreflabel=""/>Working with OST Pools OST pools are defined in the configuration log on the MGS. Use the lctl command to: @@ -234,55 +200,20 @@ nt=100 Create/destroy a pool - - - Add/remove OSTs in a pool - - - List pools and OSTs in a specific pool - - - The lctl command MUST be run on the MGS. Another requirement for managing OST pools is to either have the MDT and MGS on the same node or have a Lustre client mounted on the MGS node, if it is separate from the MDS. This is needed to validate the pool commands being run are correct. - - - - - - - - - - - - - - - - Caution -Running the writeconf command on the MDS erases all pools information (as well as any other parameters set using lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed using a script, so they can be reproduced easily after a writeconf is performed. - - - - + Running the writeconf command on the MDS erases all pools information (as well as any other parameters set using lctl conf_param). We recommend that the pools definitions (and conf_param settings) be executed using a script, so they can be reproduced easily after a writeconf is performed. + To create a new pool, run: lctl pool_new <fsname>.<poolname> - - - - - - Note -The pool name is an ASCII string up to 16 characters. - - - - + The pool name is an ASCII string up to 16 characters. + To add the named OST to a pool, run: lctl pool_add <fsname>.<poolname> <ost_list> @@ -291,44 +222,21 @@ nt=100 <ost_list> is <fsname->OST<index_range>[_UUID] - - - <index_range> is <ost_index_start>-<ost_index_end>[,<index_range>] or <ost_index_start>-<ost_index_end>/<step> - - - If the leading <fsname> and/or ending _UUID are missing, they are automatically added. For example, to add even-numbered OSTs to pool1 on file system lustre, run a single command (add) to add many OSTs to the pool at one time: lctl pool_add lustre.pool1 OST[0-10/2] - - - - - - Note -Each time an OST is added to a pool, a new llog configuration record is created. For convenience, you can run a single command. - - - - + Each time an OST is added to a pool, a new llog configuration record is created. For convenience, you can run a single command. To remove a named OST from a pool, run: lctl pool_remove <fsname>.<poolname> <ost_list> To destroy a pool, run: lctl pool_destroy <fsname>.<poolname> - - - - - - Note -All OSTs must be removed from a pool before it can be destroyed. - - - - + All OSTs must be removed from a pool before it can be destroyed. + To list pools in the named file system, run: lctl pool_list <fsname> | <pathname> @@ -346,26 +254,9 @@ nt=100 [--count|-c stripe_count] [--pool|-p pool_name] <dir|filename> - - - - - - Note -If you specify striping with an invalid pool name, because the pool does not exist or the pool name was mistyped, lfs setstripe returns an error. Run lfs pool_list to make sure the pool exists and the pool name is entered correctly. - - - - - - - - - - Note -The --pool option for lfs setstripe is compatible with other modifiers. For example, you can set striping on a directory to use an explicit starting index. - - - - + If you specify striping with an invalid pool name, because the pool does not exist or the pool name was mistyped, lfs setstripe returns an error. Run lfs pool_list to make sure the pool exists and the pool name is entered correctly. + The --pool option for lfs setstripe is compatible with other modifiers. 
For example, you can set striping on a directory to use an explicit starting index. +
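As an illustration, the commands below create a pool, add two OSTs to it, and then combine the pool with an explicit stripe count and starting index on a directory. The file system name, pool name and path are assumptions:
lctl pool_new lustre.pool1
lctl pool_add lustre.pool1 OST[0-1]
lctl pool_list lustre.pool1
lfs setstripe --pool pool1 --count 2 --offset 0 /mnt/lustre/dir1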
@@ -375,32 +266,26 @@ nt=100 A directory or file can be given an extended attribute (EA), that restricts striping to a pool. - - - Pools can be used to group OSTs with the same technology or performance (slower or faster), or that are preferred for certain jobs. Examples are SATA OSTs versus SAS OSTs or remote OSTs versus local OSTs. - - - A file created in an OST pool tracks the pool by keeping the pool name in the file LOV EA. - - -
-
- 19.3 <anchor xml:id="dbdoclet.50438211_11204" xreflabel=""/>Adding an OST to a Lustre File System +
19.3 Adding an OST to a Lustre File System
To add an OST to an existing Lustre file system:
+ Add the new OST by running the following commands:
$ mkfs.lustre --fsname=spfs --ost --mgsnode=mds16@tcp0 /dev/sda
$ mkdir -p /mnt/test/ost0
$ mount -t lustre /dev/sda /mnt/test/ost0
+ Migrate the data (if necessary).
The file system is quite unbalanced when new empty OSTs are added. New file creations are automatically balanced. If this is a scratch file system or files are pruned at a regular interval, then no further work may be needed. Files existing prior to the expansion can be rebalanced with an in-place copy, which can be done with a simple script.
The basic method is to copy existing files to a temporary file, then move the temporary file over the old one. This should not be attempted with files which are currently being written to by users or applications. This operation redistributes the stripes over the entire set of OSTs.
A very clever migration script would do the following:
@@ -408,37 +293,23 @@ nt=100
Examine the current distribution of data.
Calculate how much data should move from each full OST to the empty ones.
Search for files on a given full OST (using lfs getstripe).
Force the new destination OST (using lfs setstripe).
Copy only enough files to address the imbalance.
If a Lustre administrator wants to explore this approach further, per-OST disk-usage statistics can be found under /proc/fs/lustre/osc/*/rpc_stats. A rough sketch of such a script is shown below.
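The sketch relies on lfs_migrate rather than explicit lfs setstripe calls, and it simply stops after a fixed number of large files instead of computing the exact amount of data to move; the OST UUID, size threshold, file count and mount point are all assumptions:
#!/bin/bash
# Illustrative rebalancing sketch following the outline above.
FULL_OST=lustre-OST0002_UUID   # assumed UUID of the full OST
LIMIT=100                      # move at most this many files per run
lfs df -h /mnt/lustre          # examine the current distribution
lfs find /mnt/lustre --ost $FULL_OST -size +1G | head -n $LIMIT | lfs_migrate -y
lfs df -h /mnt/lustre          # confirm the balance has improved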
-
- 19.4 <anchor xml:id="dbdoclet.50438211_80295" xreflabel=""/>Performing <anchor xml:id="dbdoclet.50438211_marker-1291962" xreflabel=""/>Direct I/O +
+ 19.4 Performing <anchor xml:id="dbdoclet.50438211_marker-1291962" xreflabel=""/>Direct I/O Lustre supports the O_DIRECT flag to open. Applications using the read() and write() calls must supply buffers aligned on a page boundary (usually 4 K). If the alignment is not correct, the call returns -EINVAL. Direct I/O may help performance in cases where the client is doing a large amount of I/O and is CPU-bound (CPU utilization 100%).
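A simple way to exercise direct I/O from the command line is dd with its direct flags, which keeps transfers page-aligned; an illustrative example (path and sizes are assumptions):
[client]# dd if=/dev/zero of=/mnt/lustre/dio_test bs=1M count=128 oflag=direct
[client]# dd if=/mnt/lustre/dio_test of=/dev/null bs=1M iflag=direct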
@@ -449,8 +320,8 @@ nt=100 To remove this flag, use chattr -i
-
- 19.5 <anchor xml:id="dbdoclet.50438211_61024" xreflabel=""/>Other I/O Options +
+ 19.5 Other I/O Options This section describes other I/O options, including checksums.
<anchor xml:id="dbdoclet.50438211_pgfId-1291975" xreflabel=""/>19.5.1 Lustre <anchor xml:id="dbdoclet.50438211_marker-1291974" xreflabel=""/>Checksums @@ -480,16 +351,8 @@ from 192.168.1.1@tcp inum 8991479/2386814769 object 1127239/0 extent [10240\ $ echo <algorithm name> /proc/fs/lustre/osc/<fsname>-OST<index>- \osc-*/checksum_\ type - - - - - - Note -The in-memory checksum always uses the adler32 algorithm, if available, and only falls back to crc32 if adler32 cannot be used. - - - - + The in-memory checksum always uses the adler32 algorithm, if available, and only falls back to crc32 if adler32 cannot be used. + In the following example, the cat command is used to determine that Lustre is using the adler32 checksum algorithm. Then the echo command is used to change the checksum algorithm to crc32. A second cat command confirms that the crc32 checksum algorithm is now in use. $ cat /proc/fs/lustre/osc/lustre-OST0000-osc- \ffff81012b2c48e0/checksum_ty\ pe @@ -502,6 +365,5 @@ pe
-
-- 1.8.3.1