From 6f39a1bf0de1939a2a3357a3f65c1e9de8baec4f Mon Sep 17 00:00:00 2001 From: Andreas Dilger Date: Thu, 20 Mar 2014 18:24:38 -0600 Subject: [PATCH] LUDOC-56 zfs: discuss ZFS in parts of the manual Added ZFS filesystem limits to the Lustre limits tables, since they are significantly different from the ldiskfs filesystem limits. Include mention of ZFS in a few key places. This is just a starting point, since a full discussion of formatting and maintaining ZFS is still needed in the manual. Signed-off-by: Andreas Dilger Signed-off-by: Richard Henwood Change-Id: Icaea5657e43ec58466f4ea8047fbec65b2805701 Reviewed-on: http://review.whamcloud.com/9740 Tested-by: Jenkins --- BackupAndRestore.xml | 5 +++ ConfiguringLustre.xml | 2 +- ConfiguringQuotas.xml | 2 +- LustreRecovery.xml | 2 +- SettingUpLustreSystem.xml | 49 ++++++++++++++++------------ SystemConfigurationUtilities.xml | 70 +++++----------------------------------- TroubleShootingRecovery.xml | 2 +- UnderstandingLustre.xml | 45 ++++++++++++++------------ 8 files changed, 71 insertions(+), 106 deletions(-) diff --git a/BackupAndRestore.xml b/BackupAndRestore.xml index b41c8db..83202c3 100644 --- a/BackupAndRestore.xml +++ b/BackupAndRestore.xml @@ -230,6 +230,11 @@ Changelog records consumed: 42 system data after running e2fsck -fy /dev/{newdev} on the new device, along with ll_recover_lost_found_objs for OST devices. + With Lustre software release 2.6 and later, there is + no longer a need to run ll_recover_lost_found_objs on + the OSTs, since LFSCK scanning will automatically + move objects from lost+found back into their correct + locations on the OST after directory corruption.
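+ <para>As an illustration only, the device-level check described above
+ might look like the following on a restored ldiskfs target; the device
+ name is a placeholder, and on releases before Lustre software release
+ 2.6 an OST device would additionally be processed with
+ ll_recover_lost_found_objs as described in the system configuration
+ utilities chapter:</para>
+ <screen># check and repair the restored backing file system
+ e2fsck -fy /dev/sdb</screen>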
<indexterm><primary>backup</primary><secondary>OST file system</secondary></indexterm><indexterm><primary>backup</primary><secondary>MDT file system</secondary></indexterm>Making a File-Level Backup of an OST or MDT File System diff --git a/ConfiguringLustre.xml b/ConfiguringLustre.xml index 2932e78..d7f4725 100644 --- a/ConfiguringLustre.xml +++ b/ConfiguringLustre.xml @@ -91,7 +91,7 @@ Create the OST. On the OSS node, run: mkfs.lustre --fsname=fsname --mgsnode=MGS_NID --ost --index=OST_index /dev/block_device - When you create an OST, you are formatting a ldiskfs file system on a block storage device like you would with any local file system. + When you create an OST, you are formatting an ldiskfs or ZFS file system on a block storage device, as you would with any local file system. You can have as many OSTs per OSS as the hardware or drivers allow. For more information about storage and memory requirements for a Lustre file system, see . You can only configure one OST per block device. You should create an OST that uses the raw block device and does not use partitioning. You should specify the OST index number at format time in order to simplify translating the OST number in error messages or file striping to the OSS node and block device later on. diff --git a/ConfiguringQuotas.xml b/ConfiguringQuotas.xml index 810c86e..3ba656a 100644 --- a/ConfiguringQuotas.xml +++ b/ConfiguringQuotas.xml @@ -68,7 +68,7 @@ relies on the backend file system to maintain per-user/group block and inode usage: - For ldiskfs backend, mkfs.lustre now creates empty quota files + For ldiskfs backends, mkfs.lustre now creates empty quota files and enables the QUOTA feature flag in the superblock which turns quota accounting on at mount time automatically. e2fsck was also modified to fix the quota files when the QUOTA feature flag is present. diff --git a/LustreRecovery.xml b/LustreRecovery.xml index a524d8e..36205b9 100644 --- a/LustreRecovery.xml +++ b/LustreRecovery.xml @@ -188,7 +188,7 @@ Each client request processed by the server that involves any state change (metadata update, file open, write, etc., depending on server type) is assigned a transaction number by the server that is a target-unique, monotonically increasing, server-wide 64-bit integer. The transaction number for each file system-modifying request is sent back to the client along with the reply to that client request. The transaction numbers allow the client and server to unambiguously order every modification to the file system in case recovery is needed. Each reply sent to a client (regardless of request type) also contains the last committed transaction number that indicates the highest transaction number committed to the - file system. The ldiskfs backing file system that the Lustre software + file system. The ldiskfs and ZFS backing file systems that the Lustre software uses enforce the requirement that any earlier disk operation will always be committed to disk before a later disk operation, so the last committed transaction number also reports that any requests with a lower transaction number have been committed to disk. diff --git a/SettingUpLustreSystem.xml b/SettingUpLustreSystem.xml index 83de2ee..c64be79 100644 --- a/SettingUpLustreSystem.xml +++ b/SettingUpLustreSystem.xml @@ -128,6 +128,15 @@ unusable for general storage. Thus, at least 400 MB of space is used on each OST before any file object data is saved.
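+ <para>Once the file system is formatted and mounted, the space and
+ inodes actually available on each target can be checked from a client
+ with lfs df; the mount point shown is a placeholder:</para>
+ <screen>lfs df -h /mnt/testfs    # free and used space per MDT and OST
+ lfs df -i /mnt/testfs    # free and used inodes per MDT and OST</screen>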
+ With a ZFS backing file system for the MDT or OST, + the space allocation for inodes and file data is dynamic, and inodes are + allocated as needed. A minimum of 2KB of usable space (before RAID) is + needed for each inode, exclusive of other overhead such as directories, + internal log files, extended attributes, ACLs, etc. + Since the size of extended attributes and ACLs is highly dependent on + kernel versions and site-specific policies, it is best to overestimate + the amount of space needed for the desired number of inodes; any + excess space will be used to store additional inodes.
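+ <para>As a sketch only, formatting an OST with a ZFS backing file
+ system might look like the following; the file system name, MGS NID,
+ pool name, and device names are placeholders:</para>
+ <screen># create the OST as a dataset in a new mirrored ZFS pool
+ mkfs.lustre --fsname=testfs --mgsnode=10.2.0.1@tcp --ost --index=0 \
+     --backfstype=zfs ostpool/ost0 mirror /dev/sdb /dev/sdc</screen>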
<indexterm> <primary>setup</primary> @@ -389,7 +398,7 @@ system, but a single MDS can host multiple MDTs, each one for a separate file system.</para> <para condition="l24">The Lustre software release 2.4 and later requires one MDT for - the root. Upto 4095 additional MDTs can be added to the file system and attached + the file system root. Up to 4095 additional MDTs can be added to the file system and attached into the namespace with remote directories.</para> </entry> </row> @@ -410,11 +419,11 @@ <para> Maximum OST size</para> </entry> <entry> - <para> 128TB </para> + <para> 128TB (ldiskfs), 256TB (ZFS)</para> </entry> <entry> <para>This is not a <emphasis>hard</emphasis> limit. Larger OSTs are possible but - today typical production systems do not go beyond 128TB per OST. </para> + today typical production systems do not go beyond the stated limits per OST. </para> </entry> </row> <row> @@ -425,7 +434,7 @@ <para> 131072</para> </entry> <entry> - <para>The maximum number of clients is a constant that can be changed at compile time.</para> + <para>The maximum number of clients is a constant that can be changed at compile time. Up to 30000 clients have been used in production.</para> </entry> </row> <row> @@ -433,10 +442,10 @@ <para> Maximum size of a file system</para> </entry> <entry> - <para> 512 PB</para> + <para> 512 PB (ldiskfs), 1EB (ZFS)</para> </entry> <entry> - <para>Each OST or MDT on 64-bit kernel servers can have a file system up to 128 TB. On 32-bit systems, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST on 32-bit kernel servers.</para> + <para>Each OST or MDT on 64-bit kernel servers can have a file system up to the per-target limits listed above. On 32-bit systems, due to page cache limits, 16TB is the maximum block device size, which in turn applies to the size of OST on 32-bit kernel servers.</para> <para>You can have multiple OST file systems on a single OSS node.</para> </entry> </row> @@ -476,12 +485,13 @@ <row> <entry> <para> Maximum object size</para> </entry> <entry> - <para> 16 TB</para> + <para> 16TB (ldiskfs), 256TB (ZFS)</para> </entry> <entry> <para>The amount of data that can be stored in a single object. An object - corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies. - Files can consist of up to 2000 stripes, each 16 TB in size. </para> + corresponds to a stripe. The ldiskfs limit of 16 TB for a single object applies. + For ZFS, the limit is the size of the underlying OST. + Files can consist of up to 2000 stripes, each of which can be up to the maximum object size. </para> </entry> </row> <row> @@ -491,14 +501,13 @@ <entry> <para> 16 TB on 32-bit systems</para> <para> </para> - <para> 31.25 PB on 64-bit systems</para> + <para> 31.25 PB on 64-bit ldiskfs systems, 8EB on 64-bit ZFS systems</para> </entry> <entry> <para>Individual files have a hard limit of nearly 16 TB on 32-bit systems imposed by the kernel memory subsystem. On 64-bit systems this limit does not exist. - Hence, files can be 64-bits in size. An additional size limit of up to the number - of stripes is imposed, where each stripe is 16 TB.</para> - <para>A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit systems. 
The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.</para> + Hence, files can be up to 2^63 bytes (8EB) in size if the backing file system can support large enough objects.</para> + <para>A single file can have a maximum of 2000 stripes, which gives an upper single file limit of 31.25 PB for 64-bit ldiskfs systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.</para> </entry> </row> @@ -506,7 +515,7 @@ <para> Maximum number of files or subdirectories in a single directory</para> </entry> <entry> - <para> 10 million files</para> + <para> 10 million files (ldiskfs), 2^48 files (ZFS)</para> </entry> <entry> <para>The Lustre software uses the ldiskfs hashed directory code, which has a limit @@ -521,14 +530,14 @@ <para> Maximum number of files in the file system</para> </entry> <entry> - <para> 4 billion</para> - <para condition='l24'>4096 * 4 billion</para> + <para> 4 billion (ldiskfs), 256 trillion (ZFS)</para> + <para condition='l24'>4096 times the per-MDT limit</para> </entry> <entry> <para>The ldiskfs file system imposes an upper limit of 4 billion inodes. By default, the MDS file system is formatted with 2KB of space per inode, meaning 1 billion inodes per file system of 2 TB.</para> <para>This can be increased initially, at the time of MDS file system creation. For more information, see <xref linkend="settinguplustresystem"/>.</para> - <para condition="l24">Each additional MDT can hold up to 4 billion additional files, depending - on available inodes and the distribution directories and files in the file + <para condition="l24">Each additional MDT can hold up to the above maximum number of additional files, depending + on available space and the distribution of directories and files in the file system.</para> </entry> </row> @@ -540,7 +549,7 @@ <para> 255 bytes (filename)</para> </entry> <entry> - <para>This limit is 255 bytes for a single filename, the same as in an ldiskfs file system.</para> + <para>This limit is 255 bytes for a single filename, the same as the limit in the underlying file systems.</para> </entry> </row> <row> @@ -559,7 +568,7 @@ <para> Maximum number of open files for a Lustre file system</para> </entry> <entry> - <para> None</para> + <para> No limit</para> </entry> <entry> <para>The Lustre software does not impose a maximum for the number of open files, diff --git a/SystemConfigurationUtilities.xml b/SystemConfigurationUtilities.xml index fc39f27..d29681b 100644 --- a/SystemConfigurationUtilities.xml +++ b/SystemConfigurationUtilities.xml @@ -820,8 +820,11 @@ ll_recover_lost_found_objs
Description - The first time Lustre writes to an object, it saves the MDS inode number and the objid as an extended attribute on the object, so in case of directory corruption of the OST, it is possible to recover the objects. Running e2fsck fixes the corrupted OST directory, but it puts all of the objects into a lost and found directory, where they are inaccessible to Lustre. Use the ll_recover_lost_found_objs utility to recover all (or at least most) objects from a lost and found directory and return them to the O/0/d* directories. - To use ll_recover_lost_found_objs, mount the file system locally (using the -t ldiskfs command), run the utility and then unmount it again. The OST must not be mounted by Lustre when ll_recover_lost_found_objs is run. + The first time Lustre modifies an object, it saves the MDS inode number and the objid as an extended attribute on the object, so in case of directory corruption of the OST, it is possible to recover the objects. Running e2fsck fixes the corrupted OST directory, but it puts all of the objects into a lost and found directory, where they are inaccessible to Lustre. Use the ll_recover_lost_found_objs utility to recover all (or at least most) objects from a lost and found directory and return them to the O/0/d* directories. + To use ll_recover_lost_found_objs, mount the file system locally (using mount -t ldiskfs or mount -t zfs), run the utility, and then unmount the file system again. The OST must not be mounted by Lustre when ll_recover_lost_found_objs is run. + This utility is not needed with Lustre software release 2.6 and later, + since online LFSCK scanning will move objects + from lost+found back to their proper places on the OST.
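+ <para>A typical invocation might look like the following sketch; the
+ device name and mount point are placeholders:</para>
+ <screen># mount the OST backing file system locally (ldiskfs shown)
+ mount -t ldiskfs /dev/sdb /mnt/ost
+ # move objects from lost+found back to the O/0/d* directories
+ ll_recover_lost_found_objs -v -d /mnt/ost/lost+found
+ # unmount before the OST is mounted again by Lustre
+ umount /mnt/ost</screen>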
Options @@ -911,7 +914,7 @@ Timestamp Read-delta ReadRate Write-delta WriteRate
<indexterm><primary>llog_reader</primary></indexterm> llog_reader - The llog_reader utility parses Lustre's on-disk configuration logs. + The llog_reader utility translates a Lustre configuration log into human-readable form.
Synopsis llog_reader filename @@ -919,7 +922,7 @@ llog_reader
Description The llog_reader utility parses the binary format of Lustre's on-disk configuration logs. Llog_reader can only read logs; use tunefs.lustre to write to them. - To examine a log file on a stopped Lustre server, mount its backing file system as ldiskfs, then use llog_reader to dump the log file's contents, for example: + To examine a log file on a stopped Lustre server, mount its backing file system as ldiskfs or zfs, then use llog_reader to dump the log file's contents, for example: mount -t ldiskfs /dev/sda /mnt/mgs llog_reader /mnt/mgs/CONFIGS/tfs-client To examine the same log file on a running Lustre server, use the ldiskfs-enabled debugfs utility (called debug.ldiskfs on some distributions) to extract the file, for example: @@ -1559,7 +1562,7 @@ mkfs.lustre --backfstype=fstype - Forces a particular format for the backing file system (such as ext3, ldiskfs). + Forces a particular format for the backing file system such as ldiskfs (the default) or zfs. @@ -2474,10 +2477,6 @@ Application Profiling Utilities The following utilities are located in /usr/bin. lustre_req_history.sh The lustre_req_history.sh utility (run from a client), assembles as much Lustre RPC request history as possible from the local node and from the servers that were contacted, providing a better picture of the coordinated network activity. - llstat.sh - The llstat.sh utility handles a wider range of statistics files, and has command line switches to produce more graphable output. - plot-llstat.sh - The plot-llstat.sh utility plots the output from llstat.sh using gnuplot.
More /proc Statistics for Application Profiling @@ -2626,59 +2625,6 @@ wait quit EOF - Feature Requests - The loadgen utility is intended to grow into a more comprehensive test tool; feature requests are encouraged. The current feature requests include: - - - Locking simulation - - - - - Many (echo) clients cache locks for the specified resource at the same time. - - - Many (echo) clients enqueue locks for the specified resource simultaneously. - - - - - obdsurvey functionality - - - - - Fold the Lustre I/O kit's obdsurvey script functionality into loadgen - - - - -
-
- <indexterm><primary>llog_reader</primary></indexterm> -llog_reader - The llog_reader utility translates a Lustre configuration log into human-readable form. -
-
- Synopsis - llog_reader filename - -
-
- Description - llog_reader parses the binary format of Lustre's on-disk configuration logs. It can only read the logs. Use tunefs.lustre to write to them. - To examine a log file on a stopped Lustre server, mount its backing file system as ldiskfs, then use llog_reader to dump the log file's contents. For example: - mount -t ldiskfs /dev/sda /mnt/mgs -llog_reader /mnt/mgs/CONFIGS/tfs-client - - To examine the same log file on a running Lustre server, use the ldiskfs-enabled debugfs utility (called debug.ldiskfs on some distributions) to extract the file. For example: - debugfs -c -R 'dump CONFIGS/tfs-client /tmp/tfs-client' /dev/sda -llog_reader /tmp/tfs-client - - - Although they are stored in the CONFIGS directory, mountdata files do not use the config log format and will confuse llog_reader. - - See Also
<indexterm><primary>lr_reader</primary></indexterm> diff --git a/TroubleShootingRecovery.xml b/TroubleShootingRecovery.xml index db703e8..b48ffd5 100644 --- a/TroubleShootingRecovery.xml +++ b/TroubleShootingRecovery.xml @@ -16,7 +16,7 @@ </listitem> </itemizedlist> <section xml:id="dbdoclet.50438225_71141"> - <title><indexterm><primary>recovery</primary><secondary>corruption of backing file system</secondary></indexterm>Recovering from Errors or Corruption on a Backing File System + <title><indexterm><primary>recovery</primary><secondary>corruption of backing ldiskfs file system</secondary></indexterm>Recovering from Errors or Corruption on a Backing ldiskfs File System When an OSS, MDS, or MGS server crash occurs, it is not necessary to run e2fsck on the file system. ldiskfs journaling ensures that the file system remains consistent over a system crash. The backing file systems are never accessed directly diff --git a/UnderstandingLustre.xml b/UnderstandingLustre.xml index 0a2f0b0..28bca1e 100644 --- a/UnderstandingLustre.xml +++ b/UnderstandingLustre.xml @@ -172,19 +172,17 @@ - Single MDS: - 4 billion files + Single MDT: + 4 billion files (ldiskfs), 256 trillion files (ZFS) MDS count: 1 primary + 1 backup - Since Lustre software release 2.4: - - Up to 4096 MDSs and up to 4096 MDTs + Up to 4096 MDTs and up to 4096 MDSs - Single MDS: - 750 million files + Single MDT: + 1 billion files MDS count: 1 primary + 1 backup @@ -223,7 +221,7 @@ multi-TB max file size Aggregate: - 10 PB space, 750 million files + 55 PB space, 1 billion files @@ -234,11 +232,14 @@ Performance-enhanced ext4 file system: The Lustre file system uses an improved version of the ext4 journaling file system to store data - and metadata. This version, called ldiskfs, has been enhanced to improve performance and + and metadata. This version, called + ldiskfs, has been enhanced to improve performance and provide additional functionality needed by the Lustre file system. + With Lustre software release 2.4 and later, it is also possible to use ZFS as the backing file system for MDT, OST, and MGS storage. This allows Lustre to leverage the scalability and data integrity features of ZFS for individual storage targets. + + POSIX standard compliance: The full POSIX test suite passes in an identical manner to a local ext4 file system, with limited exceptions on Lustre clients. In a cluster, most operations are atomic so that clients never see @@ -257,16 +258,20 @@ High-availability: The Lustre file system supports active/active failover using shared storage partitions for OSS targets (OSTs). Lustre software release 2.3 and earlier releases offer active/passive failover using a shared - storage partition for the MDS target (MDT). - With Lustre software release 2.4 or later servers and clients it is - possible to configure active/active failover of multiple MDTs. This allows application - transparent recovery. The Lustre file system can work with a variety of high - availability (HA) managers to allow automated failover and has no single point of - failure (NSPF). Multiple mount protection (MMP) provides integrated protection from + storage partition for the MDS target (MDT). The Lustre file system can work with a variety of high + availability (HA) managers to allow automated failover and has no single point of failure (NSPF). + This allows application-transparent recovery. 
Multiple mount protection (MMP) provides integrated protection from errors in highly-available systems that would otherwise cause file system corruption. + With Lustre software release 2.4 or later + servers and clients, it is possible to configure active/active + failover of multiple MDTs. This allows scaling the metadata + performance of Lustre file systems with the addition of MDT storage + devices and MDS nodes. + + Security: By default TCP connections are only allowed from privileged ports. UNIX group membership is verified on the MDS. @@ -690,10 +695,10 @@ The maximum file size is not limited by the size of a single target. In a Lustre file - system, files can be striped across multiple objects (up to 2000), and each object can be - up to 16 TB in size with ldiskfs. This leads to a maximum file size of 31.25 PB. (Note that - a Lustre file system can support files up to 2^64 bytes depending on the backing storage - used by OSTs.) + system, files can be striped across multiple objects (up to 2000), and each object can be + up to 16 TB in size with ldiskfs, or up to 256PB with ZFS. This leads to a maximum file size of 31.25 PB for ldiskfs or 8EB with ZFS. Note that + a Lustre file system can support files up to 2^63 bytes (8EB), limited + only by the space available on the OSTs. Versions of the Lustre software prior to Release 2.2 limited the maximum stripe count for a single file to 160 OSTs. -- 1.8.3.1