From 75ebff14a4d8714411bf755755b0823a2987dc44 Mon Sep 17 00:00:00 2001 From: Andreas Dilger Date: Thu, 9 Jul 2015 05:08:02 -0600 Subject: [PATCH] LUDOC-296 protocol: remove internal details from descriptions [2015-07-09] Tidy up some of the details in the removal of the implementation details. [2015-07-08] Remove the internal implementation details from descriptions, in particular for obd_import and obd_export. Some fields in those structures will be useful in the future to describe the connection state machine, but are commented out for now. Remove the lustre_file_ids.txt file, since it largely duplicated the file-id.txt file. Change-Id: I79f499fa16af80b237706535b185c39ea40e7b3f Signed-off-by: Andreas Dilger Signed-off-by: Andrew C. Uselton Reviewed-on: http://review.whamcloud.com/15543 Tested-by: Jenkins --- Makefile | 1 - connection.txt | 20 ++-- data_types.txt | 165 ++++++++++++++++------------- early_lock_cancellation.txt | 26 +++-- export.txt | 177 ++++++++++++++++++-------------- extract_setattr.txt | 2 +- file_id.txt | 123 ++++++++++++++++++++-- file_system_operations.txt | 4 +- getattr.txt | 4 +- glossary.txt | 28 ++--- import.txt | 186 +++++++++++++++++++-------------- layout_intent.txt | 4 +- ldlm.txt | 3 +- ldlm_bl_callback.txt | 4 +- ldlm_cancel.txt | 4 +- ldlm_cp_callback.txt | 6 +- ldlm_enqueue.txt | 23 ++--- ldlm_gl_callback.txt | 6 +- ldlm_lock_desc.txt | 26 +++-- ldlm_request.txt | 1 - ldlm_resource_desc.txt | 8 +- ldlm_resource_id.txt | 2 +- llog.txt | 66 +++++++++++- lov_mds_md.txt | 87 ++++++++++++---- lustre_file_ids.txt | 26 ----- lustre_handle.txt | 14 ++- lustre_rpcs.txt | 22 ++-- mds_connect.txt | 62 ++++++++--- mds_disconnect.txt | 2 +- mds_getstatus.txt | 18 ++-- mds_getxattr.txt | 3 +- mds_reint.txt | 8 +- mds_statfs.txt | 10 +- mdt_body.txt | 13 ++- mdt_rec_reint.txt | 11 +- mgs_disconnect.txt | 7 +- mount.txt | 30 +++--- obd_connect_data.txt | 245 ++++++++++++++++++++++++-------------------- obd_statfs.txt | 21 ++-- ost_connect.txt | 
4 +- ost_disconnect.txt | 5 +- ost_lvb.txt | 2 +- ost_setattr_structs.txt | 19 ---- ost_statfs.txt | 7 +- ptlrpc_body.txt | 17 +-- security.txt | 3 +- 46 files changed, 929 insertions(+), 596 deletions(-) delete mode 100644 lustre_file_ids.txt diff --git a/Makefile b/Makefile index 991c260..07be478 100644 --- a/Makefile +++ b/Makefile @@ -113,7 +113,6 @@ TEXT = protocol.txt \ llog_origin_handle_next_block.txt \ llog_origin_handle_read_header.txt \ data_types.txt \ - lustre_file_ids.txt \ lustre_handle.txt \ ptlrpc_body.txt \ mdt_structs.txt \ diff --git a/connection.txt b/connection.txt index cc0b739..fe6dd65 100644 --- a/connection.txt +++ b/connection.txt @@ -8,8 +8,8 @@ example of two such entities are a client and a target on some server. The target is identified by name to the client through an interaction with the management server. The client then 'connects' to the given target on the indicated server by sending the appropriate -version of the *_CONNECT message (MGS_CONNECT, MDS_CONNECT, or -OST_CONNECT - colectively *_CONNECT) and receiving back the +version of the various *_CONNECT message (MGS_CONNECT, MDS_CONNECT, or +OST_CONNECT) and receiving back the corresponding *_CONNECT reply. The server creates an 'export' for the connection between the target and the client, and the export holds the server state information for that connection. When the client gets the @@ -22,7 +22,8 @@ There are also connections between the servers: Each MDS and OSS has a connection to the MGS, where the MDS (respectively the OSS) plays the role of the client in the above discussion. That is, the MDS initiates the connection and has an import for the MGS, while the MGS has an -export for each MDS. Each MDS connects to each OST, with an import on +export for each MDS. Each MDS creates an 'object storage proxy' (OSP) +connection to each OST, with an import on the MDS and an export on the OSS. 
This connection supports requests from the MDS to the OST to create and destroy data objects, to set attributes (such as permission bits), and for 'statfs' information for @@ -30,10 +31,10 @@ precreation needs. Each OSS also connects to the first MDS to get access to auxiliary services, with an import on the OSS and an export on the first MDS. The auxiliary services are: the File ID Location Database (FLDB), the quota master service, and the sequence -controller. This connection for auxiliary services is a 'lightweight' -one in that it has no replay functionality and consumes no space on -the MDS for client data. Each MDS connects also to all other MDSs for -DNE needs. +controller. This connection for auxiliary services is a 'light weight +proxy' (LWP) connection in that it has no replay functionality and +consumes no space on the MDS for client data. Each MDS connects also +to all other MDSs for DNE needs. Finally, for some communications the roles of message initiation and message reply are reversed. This is the case, for instance, with @@ -43,11 +44,6 @@ other end of the connection maintains a 'reverse-import'. The reverse-import uses the same structure as a regular import, and the reverse-export uses the same structure as a regular export. -################################################################# -Fixme: Move the obd_connect_data.txt include to where it first -gets introduced. Perhaps in obd_connect_client.txt? -################################################################# - include::obd_connect_data.txt[] include::import.txt[] diff --git a/data_types.txt b/data_types.txt index c959e52..b0a0f05 100644 --- a/data_types.txt +++ b/data_types.txt @@ -29,17 +29,20 @@ A grant value is part of a client's state for a given target. It provides an upper bound to the amount of dirty cache data the client will allow that is destined for the target. 
The value is established by agreement between the server and the client and represents a -guarantee by the server that the target storage has space for the -dirty data. The client can ask for additional grant, which the server -may provide depending on how full the target is. +guarantee by the server that the target storage has enough free space +for at least the amount of granted dirty data. The client can ask for +additional grant with each write RPC, which the server may provide +depending on how much available (ungranted and unallocated) space the +target has. LOV Index ^^^^^^^^^ [[lov-index]] -Each target is assigned an LOV index (by the 'mkfs' command line) as -the target is added to the file system. This value is stored in the -MGS in order to identify its role in the file system. +Each target is assigned an LOV index (by the 'mkfs.lustre' command line) +as the target is added to the file system. This value is stored by the +target locally as well as on the MGS in order to serve as a unique +identifier in the file system. Transaction Number ^^^^^^^^^^^^^^^^^^ @@ -48,9 +51,9 @@ Transaction Number For each target there is a sequence of values (a strictly increasing series of numbers) where each operation that can modify the file system is assigned the next number in the series. This is the -transaction number, and it imposes a strict serial ordering to all of -the file system modifying operations. For file system modifying -requests the server generates the next value in the sequence and +transaction number, and it imposes a strict serial ordering for all +file system modifying operations. For file system modifying +requests the server assigns the next value in the sequence and informs the client of the value in the 'pb_transno' field of the 'ptlrpc_body' of its reply to the client's request. 
For replys to requests that do not modify the file system the 'pb_transno' field in @@ -63,33 +66,10 @@ I have not figured out how so called 'eadata' buffers are handled, yet. I am told that this is not just for extended attributes, but is a generic structure. -Lustre Capabilities -^^^^^^^^^^^^^^^^^^^ - -A 'lustre_capa' structure conveys details about the capabilities -supported (or requested) between a client and a given target. I am -told that this is deprecated in later version of Lustre. - ----- -#define CAPA_HMAC_MAX_LEN 64 -struct lustre_capa { - struct lu_fid lc_fid; - __u64 lc_opc; - __u64 lc_uid; - __u64 lc_gid; - __u32 lc_flags; - __u32 lc_keyid; - __u32 lc_timeout; - __u32 lc_expiry; - __u8 lc_hmac[CAPA_HMAC_MAX_LEN]; -} ----- +Also, see <>. -include::lustre_file_ids.txt[] - - -MGS Configuration Reference -^^^^^^^^^^^^^^^^^^^^^^^^^^^ +MGS Configuration Data +^^^^^^^^^^^^^^^^^^^^^^ ---- #define MTI_NAME_MAXLEN 64 @@ -104,10 +84,13 @@ struct mgs_config_body { ---- The 'mgs_config_body' structure has information identifying to the MGS -which Lustre file system the client is asking about. - -MGS Configuration Data -^^^^^^^^^^^^^^^^^^^^^^ +which Lustre file system the client is requesting configuration information +from. 'mcb_name' contains the filesystem name (fsname). 'mcb_offset' +contains the next record number in the configuration llog to process +(see <> for details), not the byte offset or bulk transfer units. +'mcb_bits' is the log2 of the units of minimum bulk transfer size, +typically 4096 or 8192 bytes, while 'mcb_units' is the maximum number of +2^mcb_bits sized units that can be transferred in a single request. ---- struct mgs_config_res { @@ -116,18 +99,23 @@ struct mgs_config_res { }; ---- -The 'mgs_config_res' structure returns information about the Lustre -file system. +The 'mgs_config_res' structure returns information describing the +replied configuration llog data requested in 'mgs_config_body'. 
+'mcr_offset' contains the last configuration record number returned +by this reply. 'mcr_size' contains the maximum record index in the +entire configuration llog. When 'mcr_offset' equals 'mcr_size' there +are no more records to process in the log. include::lustre_handle.txt[] Lustre Message Header ^^^^^^^^^^^^^^^^^^^^^ -[[lustre-message-header]] +[[struct-lustre-msg]] Every message has an initial header that informs the receiver about -the size of the rest of the message to follow along with a few other -details. +the number of buffers and their size for the rest of the message to +follow, along with other important information about the request or +reply message. ---- #define LUSTRE_MSG_MAGIC_V2 0x0BD00BD3 @@ -146,53 +134,67 @@ struct lustre_msg_v2 { #define lustre_msg lustre_msg_v2 ---- -The 'lm_buffcount' field gives the number of buffers that will follow +The 'lm_bufcount' field holds the number of buffers that will follow the header. The header and sequence of buffers constitutes one message. Each of the buffers is a sequence of bytes whose contents -corresponds to one of the structures described in this section. There -will always be at least one, and no message has more than eight. +corresponds to one of the structures described in this section. Each +message will always have at least one buffer, and no message can have +more than thirty-one buffers. The 'lm_secflvr' field gives an indication of whether any sort of cyptographic encoding of the subsequent buffers will be in force. The value is zero if there is no "crypto" and gives a code identifying the "flavor" of crypto if it is employed. Further, if crypto is employed -there will only be one buffer following (i.e. buffcount = 1), and that -buffer is an encoding of what would otherwise have been the sequence -of buffers normally following the header. This document will defer all -discussion of cryptography. An chapter is planned that will address it -separately. 
+there will only be one buffer following (i.e. 'lm_bufcount' = 1), and +that buffer holds an encoding of what would otherwise have been the +sequence of buffers normally following the header. Cryptography will +be discussed in a separate chapter. -The 'lm_magic' field is a "magic" value (LUSTRE_MSG_MAGIC_V2) that is +The 'lm_magic' field is a "magic" value (LUSTRE_MSG_MAGIC_V2 = 0x0BD00BD3, +'OBD' for 'object based device') that is checked in order to positively identify that the message is intended for the use to which it is being put. That is, we are indeed dealing with a Lustre message, and not, for example, corrupted memory or a bad pointer. -The 'lm_repsize' field is an indication from the sender of an action -request of the maximum available space that has been set aside for -any reply to the request. A reply that attempts to use more than that -much space will be discarded. - -The 'lm_cksum' has to do with the <> settings for the -cluster. Fixme: This may not be in current use. We need to verify. - -The 'lm_flags' field can be set to enable adaptive timeouts support -with the value MSGHDR_AT_SUPPORT. +The 'lm_repsize' field in a request indicates the maximum available +space that has been reserved for any reply to the request. A reply +that attempts to use more than the reserved space will be discarded. + +The 'lm_cksum' field contains a checksum of the 'ptlrpc_body' buffer +to allow the receiver to verify that the message is intact. This is +used to verify that an 'early reply' has not been overwritten by the +actual reply message. The 'MSGHDR_CKSUM_INCOMPAT18' flag has been set +in requests since Lustre 1.8 +(the server will send early reply messages with the appropriate 'lm_cksum' +if it understands the flag) +and is mandatory in Lustre 2.8 and later. + +The 'lm_flags' field contains flags that affect the low-level RPC +protocol. 
The 'MSGHDR_AT_SUPPORT' (0x1) bit indicates that the sender +understands adaptive timeouts and can receive 'early reply' messages +to extend its waiting period rather than timing out. This flag was +introduced in Lustre 1.6. The 'MSGHDR_CKSUM_INCOMPAT18' (0x2) bit +indicates that 'lm_cksum' is computed on the full 'ptlrpc_body' +message buffer rather than on the original 'ptlrpc_body_v2' structure +size (88 bytes). It was introduced in Lustre 1.8 and is mandatory +for all requests in Lustre 2.8 and later. The 'lm_padding*' fields are reserved for future use. -The array of 'lm_bufflens' values has 'lm_bufcount' entries. Each -entry corresponds to, and gives the length of, one of the buffers that -will follow. - -The entire header is required to be a multiple of eight bytes -long. Thus there may need to an extra four bytes of padding after the -'lm_bufflens' array if that array has an odd number of entries. +The array of 'lm_buflens' values has 'lm_bufcount' entries. Each +entry corresponds to, and gives the length in bytes of, one of the +buffers that will follow. The entire header, and each of the buffers, +is required to be a multiple of eight bytes long to ensure the buffers +are properly aligned to hold __u64 values. Thus there may be an extra +four bytes of padding after the 'lm_buflens' array if that array has +an odd number of entries. include::ptlrpc_body.txt[] Object Based Disk UUID ^^^^^^^^^^^^^^^^^^^^^^ +[[struct-obd-uuid]] ---- #define UUID_MAX 40 @@ -201,8 +203,21 @@ struct obd_uuid { }; ---- +The 'obd_uuid' contains an ASCII-formatted string that identifies +the entity uniquely within the filesystem. Clients use an RFC-4122 +hexadecimal UUID of the form ''de305d54-75b4-431b-adb2-eb6b9e546014'' +that is randomly generated. Servers may use a string-based identifier +of the form ''fsname-TGTindx_UUID''. + +File IDentifier (FID) +^^^^^^^^^^^^^^^^^^^^^ + +See <>. 
+ OST ID ^^^^^^ +[[struct-ost-id]] +The 'ost_id' identifies a single object on a particular OST. ---- struct ost_id { @@ -212,6 +227,14 @@ struct ost_id { __u64 oi_seq; } oi; struct lu_fid oi_fid; - } LUSTRE_ANONYMOUS_UNION_NAME; + }; }; ---- + +The 'ost_id' structure contains an identifier for a single OST object. +The 'oi' structure holds the OST object identifier as used with Lustre +1.8 and earlier, where the 'oi_seq' field is typically zero, and the +'oi_id' field is an integer identifying an object on a particular +OST (which is identified separately). Since Lustre 2.5 it is possible +for OST objects to also be identified with a unique FID that identifies +both the OST on which it resides as well as the object identifier itself. diff --git a/early_lock_cancellation.txt b/early_lock_cancellation.txt index ee4dbea..cbe4939 100644 --- a/early_lock_cancellation.txt +++ b/early_lock_cancellation.txt @@ -2,17 +2,29 @@ Early Lock Cancellation ^^^^^^^^^^^^^^^^^^^^^^^ [[early-lock-cancellation]] -In some circumstances (see <>) a client holding +In some circumstances (see for example <>) a +client holding a lock "knows" that the operation it is initiating will necessitate canceling a lock that it holds. In such a circumstance the lock is said to be "in conflict" with the operation. For example, a side -effect of the GNU 'fileutils' is that it always issues a 'stat' prior -to other update operations, so when issuing an update to a resource -attribute on the MDT a) the client will already have a read lock on -the resource, and b) the MDT will then need a lock on that same +effect of GNU 'fileutils' is that it often issues a 'stat()' on a file +prior to other update operations such as 'unlink()', so when issuing +an update to a resource attribute on the MDT +a) the client will already have a read lock on the resource, and +b) the MDT will then need a lock on that same resource in order to perform the update. 
Rather than waiting for an additional round of RPCs to carry out the lock cancellation (on the client) the RPC requesting the update can include a cancellation of -the lock as part of its request. This proactive lock cancellation is -called "early lock cancellation" or sometimes ELC. +one or more locks as part of its request. This proactive lock cancellation +is called "early lock cancellation" or sometimes ELC. +ELC can also proactively cancel unused cached locks on the client +from the lock LRU list to avoid the need for later lock callbacks +(if the lock is needed on another client) or to avoid extra lock +cancellation RPCs if the number of cached locks on the client becomes +too large. + +ELC lock cancellation is only possible for locks that reside on the +target to which the RPC is being sent. It is not possible to cancel +locks via ELC on a different target, even if that target is serviced +by the same server. diff --git a/export.txt b/export.txt index 6393388..141d0d9 100644 --- a/export.txt +++ b/export.txt @@ -1,7 +1,6 @@ - Export ^^^^^^ - +[[obd-export]] An 'obd_export' structure for a given target is created on a server for each client that connects to that target. The exports for all the clients for a given target are managed together. The export represents @@ -9,84 +8,94 @@ the connection state between the client and target as well as the current state of any ongoing activity. Thus each pending request will have a reference to the export. The export is discarded if the connection goes away, but only after all the references to it have -been cleaned up. The state information for each export is also -maintained on disk. In the event of a server failure, that or another -server can read the export date from disk to enable recovery. +been cleaned up. Some state information for each export is also +maintained on disk. 
In the event of a server failure and recovery, +the server can read the export data from disk to allow the client +to reconnect and participate in recovery, otherwise a client without +any export data will not be allowed to participate in recovery. ---- struct obd_export { - struct portals_handle exp_handle; - atomic_t exp_refcount; - atomic_t exp_rpc_count; - atomic_t exp_cb_count; - atomic_t exp_replay_count; - atomic_t exp_locks_count; -#if LUSTRE_TRACKS_LOCK_EXP_REFS - cfs_list_t exp_locks_list; - spinlock_t exp_locks_list_guard; -#endif - struct obd_uuid exp_client_uuid; - cfs_list_t exp_obd_chain; - cfs_hlist_node_t exp_uuid_hash; - cfs_hlist_node_t exp_nid_hash; - cfs_list_t exp_obd_chain_timed; - struct obd_device *exp_obd; - struct obd_import *exp_imp_reverse; - struct nid_stat *exp_nid_stats; - struct ptlrpc_connection *exp_connection; - __u32 exp_conn_cnt; - cfs_hash_t *exp_lock_hash; - cfs_hash_t *exp_flock_hash; - cfs_list_t exp_outstanding_replies; - cfs_list_t exp_uncommitted_replies; - spinlock_t exp_uncommitted_replies_lock; - __u64 exp_last_committed; - cfs_time_t exp_last_request_time; - cfs_list_t exp_req_replay_queue; - spinlock_t exp_lock; - struct obd_connect_data exp_connect_data; - enum obd_option exp_flags; - unsigned long - exp_failed:1, - exp_in_recovery:1, - exp_disconnected:1, - exp_connecting:1, - exp_delayed:1, - exp_vbr_failed:1, - exp_req_replay_needed:1, - exp_lock_replay_needed:1, - exp_need_sync:1, - exp_flvr_changed:1, - exp_flvr_adapt:1, - exp_libclient:1, - exp_need_mne_swab:1; - enum lustre_sec_part exp_sp_peer; - struct sptlrpc_flavor exp_flvr; - struct sptlrpc_flavor exp_flvr_old[2]; - cfs_time_t exp_flvr_expire[2]; - spinlock_t exp_rpc_lock; - cfs_list_t exp_hp_rpcs; - cfs_list_t exp_reg_rpcs; - cfs_list_t exp_bl_list; - spinlock_t exp_bl_list_lock; - union { - struct tg_export_data eu_target_data; - struct mdt_export_data eu_mdt_data; - struct filter_export_data eu_filter_data; - struct ec_export_data eu_ec_data; - 
struct mgs_export_data eu_mgs_data; - } u; - struct nodemap *exp_nodemap; + struct portals_handle exp_handle; + struct obd_uuid exp_client_uuid; + struct obd_connect_data exp_connect_data; }; ---- -The 'exp_handle' is a little extra information as compared with a -'struct lustre_handle', which is just the cookie. The cookie that the -server generates to uniquely identify this connection gets put into -this structure along with their information about the device in -question. This is the cookie the *_CONNECT reply sends back to the -client and is then stored int he client's import. +////////////////////////////////////////////////////////////////////// +////vvvv +This is the full obd_export. + +struct obd_export { + struct portals_handle exp_handle; + atomic_t exp_refcount; + atomic_t exp_rpc_count; + atomic_t exp_cb_count; + atomic_t exp_replay_count; + atomic_t exp_locks_count; + struct obd_uuid exp_client_uuid; + cfs_list_t exp_obd_chain; + cfs_hlist_node_t exp_uuid_hash; + cfs_hlist_node_t exp_nid_hash; + cfs_list_t exp_obd_chain_timed; + struct obd_device *exp_obd; + struct obd_import *exp_imp_reverse; + struct nid_stat *exp_nid_stats; + struct ptlrpc_connection *exp_connection; + __u32 exp_conn_cnt; + cfs_hash_t *exp_lock_hash; + cfs_hash_t *exp_flock_hash; + cfs_list_t exp_outstanding_replies; + cfs_list_t exp_uncommitted_replies; + spinlock_t exp_uncommitted_replies_lock; + __u64 exp_last_committed; + cfs_time_t exp_last_request_time; + cfs_list_t exp_req_replay_queue; + spinlock_t exp_lock; + struct obd_connect_data exp_connect_data; + enum obd_option exp_flags; + unsigned long + exp_failed:1, + exp_in_recovery:1, + exp_disconnected:1, + exp_connecting:1, + exp_delayed:1, + exp_vbr_failed:1, + exp_req_replay_needed:1, + exp_lock_replay_needed:1, + exp_need_sync:1, + exp_flvr_changed:1, + exp_flvr_adapt:1, + exp_libclient:1, + exp_need_mne_swab:1; + enum lustre_sec_part exp_sp_peer; + struct sptlrpc_flavor exp_flvr; + struct sptlrpc_flavor exp_flvr_old[2]; + 
cfs_time_t exp_flvr_expire[2]; + spinlock_t exp_rpc_lock; + cfs_list_t exp_hp_rpcs; + cfs_list_t exp_reg_rpcs; + cfs_list_t exp_bl_list; + spinlock_t exp_bl_list_lock; + union { + struct tg_export_data eu_target_data; + struct mdt_export_data eu_mdt_data; + struct filter_export_data eu_filter_data; + struct ec_export_data eu_ec_data; + struct mgs_export_data eu_mgs_data; + } u; + struct nodemap *exp_nodemap; +}; +////^^^^ +////////////////////////////////////////////////////////////////////// + +The 'exp_handle' holds the cookie that the server generates at +*_CONNECT time to uniquely identify this connection from the client. +This cookie is also sent back to the client in the reply and is +then stored in the client's import. +////////////////////////////////////////////////////////////////////// +////vvvv The 'exp_refcount' gets incremented whenever some aspect of the export is "in use". The arrival of an otherwise unprocessed message for this target will increment the refcount. A reference by an LDLM lock that @@ -99,10 +108,16 @@ that export. The reference counter is also used for debug purposes. Similarly, the 'exp_locks_list' and 'exp_locks_list_guard' are further debug info that list the actual locks accounted for in 'exp_locks_count'. +////^^^^ +////////////////////////////////////////////////////////////////////// -The 'exp_client_uuid' gives the UUID of the client connected to this -export. Fixme: when and how does the UUID get generated? +The 'exp_client_uuid' holds the UUID of the client connected to this +export. This UUID is randomly generated by the client and the same +UUID is used by the client for connecting to all servers, so that +the servers may identify the client amongst themselves if necessary. +////////////////////////////////////////////////////////////////////// +////vvvv The server maintains all the exports for a given target on a circular list. Each export's place on that list is maintained in the 'exp_obd_chain'. 
A common activity is to look up the export based on @@ -132,7 +147,6 @@ The 'exp_connection' points to the connection information for this export. This is the information about the actual networking pathway(s) that get used for communication. - The 'exp_conn_cnt' notes the connection count value from the client at the time of the connection. In the event that more than one connection request is issued before the connection is established then the @@ -176,12 +190,18 @@ list 'exp_req_replay_queue'. The 'exp_lock' spin-lock is used for access control to the exports flags, as well as the 'exp_outstanding_replies' list and the revers import, if any. +////^^^^ +////////////////////////////////////////////////////////////////////// The 'exp_connect_data' refers to an 'obd_connect_data' structure for the connection established between this target and the client this -export refers to. See also the corresponding entry in the import and +export refers to. The 'exp_connect_data' describes the mutually +supported features that were negotiated between the client and server +at connect time. See also the corresponding entry in the import and in the connect messages passed between the hosts. +////////////////////////////////////////////////////////////////////// +////vvvv The 'exp_flags' field encodes three directives as follows: ---- enum obd_option { @@ -297,4 +317,5 @@ an MGS. The 'exp_bl_lock_at' field supports adaptive timeouts which will be discussed separately. (fixme: so discuss it somewhere.) - +////^^^^ +////////////////////////////////////////////////////////////////////// diff --git a/extract_setattr.txt b/extract_setattr.txt index 46adb2c..46db630 100644 --- a/extract_setattr.txt +++ b/extract_setattr.txt @@ -56,7 +56,7 @@ the protocol document. 
Miscellaneous Structured Data Types ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -include::lustre_file_ids.txt[] +<> include::lustre_handle.txt[] diff --git a/file_id.txt b/file_id.txt index 80726cb..aa26478 100644 --- a/file_id.txt +++ b/file_id.txt @@ -1,21 +1,124 @@ Lustre File Identifiers ~~~~~~~~~~~~~~~~~~~~~~~ -[[fids]] +[[struct-lu-fid]] ----- +Each resource stored on a target is assigned an identifier that is +unique to that resource. + +[source,c] +-------- struct lu_fid { __u64 f_seq; __u32 f_oid; __u32 f_ver; }; +-------- + +A file identifier ('FID') is a 128-bit number that uniquely identifies a +single file or directory on the MDTs or OSTs within a single Lustre file +system. The FID for a Lustre file or directory is the FID from the +corresponding MDT entry for the file. Each of the data resources for that +file will also have a FID, one for each OST resource on which the +file stores data. + +The 'f_seq' field holds the sequence number, or SEQ, and is used in +conjunction with the 'FID location database' (FLDB) to +determine on which target the resource is located. All resources with the +same 'f_seq' value will be on the same target. A target can have more +than one 'f_seq' value assigned to it. + +The 'f_oid' field holds the unique 'object identifier' (OID) within the +sequence to identify a particular object. + +The 'f_ver' value identifies which version of a resource is being +identified, in the event that the resource is being updated, and +different hosts might be referring to different versions of the same +resource. It has never been used as of Lustre 2.8. 
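The field descriptions above can be illustrated with a small userspace sketch. It is not taken from the Lustre sources: the fixed-width types stand in for `__u64`/`__u32`, and the helper names `fid_same_target` and `fid_format`, as well as the bracketed hexadecimal print format, are assumptions chosen for illustration.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-in for struct lu_fid (uint64_t/uint32_t replace __u64/__u32). */
struct lu_fid {
	uint64_t f_seq;  /* sequence number (SEQ), resolved to a target via the FLDB */
	uint32_t f_oid;  /* object identifier (OID) within the sequence */
	uint32_t f_ver;  /* version; described above as never used as of Lustre 2.8 */
};

/* A FID is a 128-bit value, so the structure must be exactly 16 bytes. */
_Static_assert(sizeof(struct lu_fid) == 16, "lu_fid must be 128 bits");

/* Hypothetical helper: two FIDs with the same f_seq reside on the same target. */
static bool fid_same_target(const struct lu_fid *a, const struct lu_fid *b)
{
	return a->f_seq == b->f_seq;
}

/* Hypothetical helper: render a FID as "[seq:oid:ver]" in hexadecimal. */
static int fid_format(const struct lu_fid *fid, char *buf, size_t len)
{
	return snprintf(buf, len, "[0x%" PRIx64 ":0x%" PRIx32 ":0x%" PRIx32 "]",
			fid->f_seq, fid->f_oid, fid->f_ver);
}
```

Note that `fid_same_target` only tests the one-way implication stated above: equal 'f_seq' values imply the same target, but a target may own several sequences, so unequal values do not prove that two objects live on different targets.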
+ +---- +enum fid_seq { + FID_SEQ_OST_MDT0 = 0, + FID_SEQ_LLOG = 1, /* unnamed llogs */ + FID_SEQ_ECHO = 2, + FID_SEQ_UNUSED_START = 3, + FID_SEQ_UNUSED_END = 9, + FID_SEQ_LLOG_NAME = 10, /* named llogs */ + FID_SEQ_RSVD = 11, + FID_SEQ_IGIF = 12, + FID_SEQ_IGIF_MAX = 0x0ffffffffULL, + FID_SEQ_IDIF = 0x100000000ULL, + FID_SEQ_IDIF_MAX = 0x1ffffffffULL, + /* Normal FID sequence starts from this value, i.e. 1<<33 */ + FID_SEQ_START = 0x200000000ULL, + /* sequence for local pre-defined FIDs listed in local_oid */ + FID_SEQ_LOCAL_FILE = 0x200000001ULL, + FID_SEQ_DOT_LUSTRE = 0x200000002ULL, + /* sequence is used for local named objects FIDs generated + * by local_object_storage library */ + FID_SEQ_LOCAL_NAME = 0x200000003ULL, + /* Because current FLD will only cache the fid sequence, instead + * of oid on the client side, if the FID needs to be exposed to + * clients sides, it needs to make sure all of fids under one + * sequence will be located in one MDT. */ + FID_SEQ_SPECIAL = 0x200000004ULL, + FID_SEQ_QUOTA = 0x200000005ULL, + FID_SEQ_QUOTA_GLB = 0x200000006ULL, + FID_SEQ_ROOT = 0x200000007ULL, /* Located on MDT0 */ + FID_SEQ_LAYOUT_RBTREE = 0x200000008ULL, + /* sequence is used for update logs of cross-MDT operation */ + FID_SEQ_UPDATE_LOG = 0x200000009ULL, + /* Sequence is used for the directory under which update logs + * are created. */ + FID_SEQ_UPDATE_LOG_DIR = 0x20000000aULL, + FID_SEQ_NORMAL = 0x200000400ULL, + FID_SEQ_LOV_DEFAULT = 0xffffffffffffffffULL +}; ---- -File IDs ('fids') are 128-bit numbers that uniquely identify files and -directories on the MDTs and OSTs of a Lustre file system. The fid for -a Lustre file or directory is the fid from the corresponding MDT entry -for the file. Each of the data objects for that file will also have a -fid for each corresponding piece of the file on each of the -OSTs. Encoded in the fid is the target on which that file metadata or -file fragment resides. 
The map from fid to target is in the File -Location DataBase (FLDB). +There are several reserved ranges of FID sequence values, to allow for +interoperability with older Lustre filesystems, to identify "well known" +objects for internal or external use, as well as for future expansion. + +The 'FID_SEQ_OST_MDT0' (0x0) range is reserved for OST objects created +by MDT0 in non-DNE filesystems. Since all such OST objects used an +'f_seq' value of zero these FIDs are not unique across the filesystem, +but the reservation of 'FID_SEQ_OST_MDT0' allows these FIDs to co-exist +with other FIDs in the same 128-bit identifier space. + +The 'FID_SEQ_LLOG' (0x1) range is reserved for unnamed Lustre log (llog) +files, used only internally on the MDS since Lustre 2.4, but previously +exposed over the network. + +The 'FID_SEQ_ECHO' (0x2) range is used for temporary objects for +testing purposes such as obdfilter-survey. + +The 'FID_SEQ_LLOG_NAME' (0xa) range is used for named llog files such +as configuration logs and the ChangeLog. + +The 'FID_SEQ_IGIF' (0xc-0xffffffff) range is reserved for 'inode and +generation in FID' (IGIF) inodes allocated by MDSes before Lustre 2.0. +This corresponds to the 4 billion maximum inode number that could be +allocated for such filesystems. The 'f_oid' field for IGIF FIDs +contains the inode generation number, and as such there is normally only +a single object for each 'f_seq' value. + +The 'FID_SEQ_IDIF' (0x100000000-0x1ffffffff) range is reserved for +mapping OST objects that were created by MDT0 using 'FID_SEQ_OST_MDT0' +to filesystem-unique FIDs. The second 16-bit field (bits 16-31) of the +'f_seq' field contains the OST index (0-65535). The low 16-bit field +(bits 0-15) of 'f_seq' contains the high (bits 32-47) bits of the OST +object ID, and the 32-bit 'f_oid' field contains the low 32 bits of +the OST object ID. 
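The IDIF bit layout described above can be sketched as follows. This is a minimal illustration, assuming the 'FID_SEQ_IDIF' and 'FID_SEQ_IDIF_MAX' values from the 'fid_seq' enum; the helper names `idif_pack`, `idif_ost_index`, `idif_object_id`, and `fid_seq_is_idif` are hypothetical and are not claimed to match the Lustre sources.

```c
#include <stdint.h>

#define FID_SEQ_IDIF     0x100000000ULL
#define FID_SEQ_IDIF_MAX 0x1ffffffffULL

/* Userspace stand-in for struct lu_fid (uint64_t/uint32_t replace __u64/__u32). */
struct lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Hypothetical helper: build an IDIF FID from an OST index and a 48-bit object ID. */
static struct lu_fid idif_pack(uint16_t ost_idx, uint64_t objid)
{
	struct lu_fid fid;

	/* bits 16-31 of f_seq: OST index; bits 0-15: object ID bits 32-47 */
	fid.f_seq = FID_SEQ_IDIF | ((uint64_t)ost_idx << 16) |
		    ((objid >> 32) & 0xffff);
	/* f_oid: low 32 bits of the object ID */
	fid.f_oid = (uint32_t)(objid & 0xffffffffULL);
	fid.f_ver = 0;
	return fid;
}

/* Hypothetical helper: recover the OST index from bits 16-31 of f_seq. */
static uint16_t idif_ost_index(const struct lu_fid *fid)
{
	return (uint16_t)((fid->f_seq >> 16) & 0xffff);
}

/* Hypothetical helper: recover the 48-bit OST object ID. */
static uint64_t idif_object_id(const struct lu_fid *fid)
{
	return ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
}

/* Hypothetical helper: true if the sequence lies in the reserved IDIF range. */
static int fid_seq_is_idif(uint64_t seq)
{
	return seq >= FID_SEQ_IDIF && seq <= FID_SEQ_IDIF_MAX;
}
```

For example, OST index 5 with object ID 0x123456789abc packs to 'f_seq' 0x100051234 and 'f_oid' 0x56789abc. The largest index and object ID (65535 and 0xffffffffffff) pack to exactly 'FID_SEQ_IDIF_MAX', which is why the range ends at 0x1ffffffff.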
+ +The 'FID_SEQ_LOCAL_FILE' (0x200000001) range is reserved for "well +known" objects internal to the server and is not exposed to the network. + +The 'FID_SEQ_DOT_LUSTRE' (0x200000002) range is reserved for files +under the hidden ".lustre" directory in the root of the filesystem. + +The 'FID_SEQ_LOCAL_NAME' (0x200000003) range is reserved for objects +internal to the server that are allocated by name. +The 'FID_SEQ_NORMAL' (0x200000400+) range is used for normal object +identifiers. These objects are visible in the namespace if allocated +by an MDT, or may be OST objects. diff --git a/file_system_operations.txt b/file_system_operations.txt index 722b192..f4a62a9 100644 --- a/file_system_operations.txt +++ b/file_system_operations.txt @@ -4,8 +4,8 @@ File System Operations [[file-system-operations]] Lustre is a POSIX compliant file system that provides namespace and -data storage services to clients. It implements all the usual file -system functionality including creating, writing, reading, and +data storage services to clients. It implements the normal file system +functionality including creating, writing, reading, and removing files and directories. These file system operations are implemented via <>, which carry out communication and coordination with the servers. In this section diff --git a/getattr.txt b/getattr.txt index b44234c..73912bf 100644 --- a/getattr.txt +++ b/getattr.txt @@ -1,6 +1,6 @@ Getattr ~~~~~~~ - +[[getattr]] The 'getattr' VFS method is used to examine the attributes associated with a resource (it is an inode operation). 
The attributes are the same ones returned by a 'stat' operation: mode, uid, guid, size, @@ -90,7 +90,7 @@ is these flags: | OBD_MD_FLDIREA | dir extended attributes | OBD_MD_FLMODEASIZE | mode size | OBD_MD_MEA | -| OBD_MD_FLACL | ACL +| OBD_MD_FLACL | access control list (ACL) | OBD_MD_FLMDSCAPA | |==== diff --git a/glossary.txt b/glossary.txt index 35cb51d..92194fb 100644 --- a/glossary.txt +++ b/glossary.txt @@ -22,9 +22,9 @@ consistent (cache-coherent) view of the data objects in the file system. early reply:: -A message returned in response to a request message that allows the -target to extend timeout values without otherwise indicating the -request has progressed. +A partial message reply returned in response to a request message +that allows the target to indicate to the requestor to increase its +timeout value without otherwise indicating the request has progressed. export:: The term for the per-client shared state information held by a server. @@ -36,14 +36,14 @@ The term for the per-target shared state information held by a client. Inodes and Extended Attributes:: Metadata about a Lustre file is encoded in the extended attributes of -an inode in the back-end file system on the MDS. Some of that +an inode in the back-end file system on the MDT. Some of that information establishes the stripe layout of the file. The size of the -stripe layout information varies among Lustre file systems. The amount +stripe layout information may be different for each file. The amount of space reserved for layout information is a small as possible given -the maximum stripe count possible on the file system. Clients, servers, +the maximum stripe count on the file system. Clients, servers, and the distributed lock manager will all need to be aware of this -size, which is communicated in the 'ocd_ibis_known' filed of the -'obd_connect_data' structure. +size, which is communicated in the 'ocd_max_easize' field of the +<> structure.
LNet:: A lower level protocol employed by PtlRPC to abstract the mechanisms @@ -68,13 +68,15 @@ Metadata Server:: A metadata server (MDS) is a computer responsible for running the Lustre kernel services in support of managing the POSIX-compliant name space and the indices associating files in that name space with the -locations of their corresponding objects. As of v2.4 there can be -multiple MDSs in a Lustre file system. +locations of their corresponding objects. As of Lustre 2.4 there can be +multiple MDSs in a single Lustre file system. Metadata Target:: A metadata target (MDT) is the service provided by an MDS that mediates the management of name space indices on the underlying file -system hardware. As of v2.4 there can be multiple MDTs on an MDS. +system hardware. As of Lustre 2.4 there can be multiple MDTs on an MDS. +An MDT can be served by different MDS nodes at different times, but +will only be available from a single MDS at any given time. Object Based Disk:: Object Based Disk (OBD) is the term used for any target, MDT or OST. @@ -88,7 +90,9 @@ system. Object Storage Target:: An object storage target (OST) is the service provided by an OSS that mediates the placement of data objects on the specific underlying file -system hardware. There can be multiple OSTs on a given OSS. +system hardware. There can be multiple OSTs on a given OSS. An OST +may be served by different OSS nodes at different times, but will +only be available from one OSS at any given time. protocol:: An agreed upon formalism for communicating between two entities, such diff --git a/import.txt b/import.txt index 0981794..555e12c 100644 --- a/import.txt +++ b/import.txt @@ -1,83 +1,101 @@ Import ^^^^^^ +[[obd-import]] + +The 'obd_import' structure holds the connection state between each +client and each target it is connected to.
---- +struct obd_import { + enum lustre_imp_state imp_state; + int imp_generation; + __u32 imp_conn_cnt; + struct lustre_handle imp_remote_handle; + struct obd_connect_data imp_connect_data; +}; +---- + +////////////////////////////////////////////////////////////////////// +This is the rest of the info associated with obd_import: + #define IMP_STATE_HIST_LEN 16 struct import_state_hist { enum lustre_imp_state ish_state; time_t ish_time; }; struct obd_import { - struct portals_handle imp_handle; - atomic_t imp_refcount; - struct lustre_handle imp_dlm_handle; - struct ptlrpc_connection *imp_connection; - struct ptlrpc_client *imp_client; - cfs_list_t imp_pinger_chain; - cfs_list_t imp_zombie_chain; - cfs_list_t imp_replay_list; - cfs_list_t imp_sending_list; - cfs_list_t imp_delayed_list; - cfs_list_t imp_committed_list; - cfs_list_t *imp_replay_cursor; - struct obd_device *imp_obd; - struct ptlrpc_sec *imp_sec; - struct mutex imp_sec_mutex; - cfs_time_t imp_sec_expire; - wait_queue_head_t imp_recovery_waitq; - atomic_t imp_inflight; - atomic_t imp_unregistering; - atomic_t imp_replay_inflight; - atomic_t imp_inval_count; - atomic_t imp_timeouts; - enum lustre_imp_state imp_state; - struct import_state_hist imp_state_hist[IMP_STATE_HIST_LEN]; - int imp_state_hist_idx; - int imp_generation; - __u32 imp_conn_cnt; - int imp_last_generation_checked; - __u64 imp_last_replay_transno; - __u64 imp_peer_committed_transno; - __u64 imp_last_transno_checked; - struct lustre_handle imp_remote_handle; - cfs_time_t imp_next_ping; - __u64 imp_last_success_conn; - cfs_list_t imp_conn_list; - struct obd_import_conn *imp_conn_current; - spinlock_t imp_lock; - /* flags */ - unsigned long - imp_no_timeout:1, - imp_invalid:1, - imp_deactive:1, - imp_replayable:1, - imp_dlm_fake:1, - imp_server_timeout:1, - imp_delayed_recovery:1, - imp_no_lock_replay:1, - imp_vbr_failed:1, - imp_force_verify:1, - imp_force_next_verify:1, - imp_pingable:1, - imp_resend_replay:1, - imp_no_pinger_recover:1, 
- imp_need_mne_swab:1, - imp_force_reconnect:1, - imp_connect_tried:1; - __u32 imp_connect_op; - struct obd_connect_data imp_connect_data; - __u64 imp_connect_flags_orig; - int imp_connect_error; - __u32 imp_msg_magic; - __u32 imp_msghdr_flags; /* adjusted based on server capability */ - struct ptlrpc_request_pool *imp_rq_pool; /* emergency request pool */ - struct imp_at imp_at; /* adaptive timeout data */ - time_t imp_last_reply_time; /* for health check */ + struct portals_handle imp_handle; + atomic_t imp_refcount; + struct lustre_handle imp_dlm_handle; + struct ptlrpc_connection *imp_connection; + struct ptlrpc_client *imp_client; + cfs_list_t imp_pinger_chain; + cfs_list_t imp_zombie_chain; + cfs_list_t imp_replay_list; + cfs_list_t imp_sending_list; + cfs_list_t imp_delayed_list; + cfs_list_t imp_committed_list; + cfs_list_t *imp_replay_cursor; + struct obd_device *imp_obd; + struct ptlrpc_sec *imp_sec; + struct mutex imp_sec_mutex; + cfs_time_t imp_sec_expire; + wait_queue_head_t imp_recovery_waitq; + atomic_t imp_inflight; + atomic_t imp_unregistering; + atomic_t imp_replay_inflight; + atomic_t imp_inval_count; + atomic_t imp_timeouts; + enum lustre_imp_state imp_state; + struct import_state_hist imp_state_hist[IMP_STATE_HIST_LEN]; + int imp_state_hist_idx; + int imp_generation; + __u32 imp_conn_cnt; + int imp_last_generation_checked; + __u64 imp_last_replay_transno; + __u64 imp_peer_committed_transno; + __u64 imp_last_transno_checked; + struct lustre_handle imp_remote_handle; + cfs_time_t imp_next_ping; + __u64 imp_last_success_conn; + cfs_list_t imp_conn_list; + struct obd_import_conn *imp_conn_current; + spinlock_t imp_lock; + /* flags */ + unsigned long + imp_no_timeout:1, + imp_invalid:1, + imp_deactive:1, + imp_replayable:1, + imp_dlm_fake:1, + imp_server_timeout:1, + imp_delayed_recovery:1, + imp_no_lock_replay:1, + imp_vbr_failed:1, + imp_force_verify:1, + imp_force_next_verify:1, + imp_pingable:1, + imp_resend_replay:1, + imp_no_pinger_recover:1, 
+ imp_need_mne_swab:1, + imp_force_reconnect:1, + imp_connect_tried:1; + __u32 imp_connect_op; + struct obd_connect_data imp_connect_data; + __u64 imp_connect_flags_orig; + int imp_connect_error; + __u32 imp_msg_magic; + __u32 imp_msghdr_flags; /* adjusted based on server capability */ + struct ptlrpc_request_pool *imp_rq_pool; /* emergency request pool */ + struct imp_at imp_at; /* adaptive timeout data */ + time_t imp_last_reply_time; /* for health check */ }; ----- +////////////////////////////////////////////////////////////////////// +////////////////////////////////////////////////////////////////////// +////vvvv The 'imp_handle' value is the unique id for the import, and is used as -a hash key to gain access to it. It is not used in any of the Lustre +a hash key to it. It is not used in any of the Lustre protocol messages, but rather is just for internal reference. The 'imp_refcount' is also for internal use. The value is incremented @@ -166,6 +184,8 @@ multiple threads waiting on this process to complete. The 'imp_timeout' field is a counter that is incremented every time there is a timeout in communication with the target. +////^^^^ +////////////////////////////////////////////////////////////////////// The 'imp_state' tracks the state of the import. It draws from the enumerated set of values: @@ -185,12 +205,17 @@ enumerated set of values: | LUSTRE_IMP_FULL | 9 | LUSTRE_IMP_EVICTED | 10 |===== + +////////////////////////////////////////////////////////////////////// +////vvvv fixme: what are the transitions between these states? The 'imp_state_hist' array maintains a list of the last 16 (IMP_STATE_HIST_LEN) states the import was in, along with the time it entered each (fixme: or is it when it left that state?). The list is maintained in a circular manner, so the 'imp_state_hist_idx' points to the entry in the list for the most recently visited state. 
+////^^^^ +////////////////////////////////////////////////////////////////////// The 'imp_generation' and 'imp_conn_cnt' fields are monotonically increasing counters. Every time a connection request is sent to the @@ -198,6 +223,8 @@ target the 'imp_conn_cnt' counter is incremented, and every time a reply is received for the connection request the 'imp_generation' counter is incremented. +////////////////////////////////////////////////////////////////////// +////vvvv The 'imp_last_generation_checked' implements an optimization. When a replay process has successfully traversed the reply list the 'imp_generation' value is noted here. If the generation has not @@ -205,27 +232,32 @@ incremented then the replay list does not need to be traversed again. During replay the 'imp_last_replay_transno' is set to the transaction number of the last request being replayed, and -'imp_peer_committed_transno is set to the 'pb_last_committed' value -(of the 'ptlrpc_body') from replies if that value is higher than the +'imp_peer_committed_transno' is set to the 'pb_last_committed' value +(of the <>) from replies if that value is higher than the previous 'imp_peer_committed_transno'. The 'imp_last_transno_checked' field implements an optimization. It is set to the -'imp_last_replay_transno' as its replay is initiated. If -'imp_last_transno_checked' is still 'imp_last_replay_transno' and -'imp_generation' is still 'imp_last_generation_checked' then there +'imp_last_replay_transno' as its replay is initiated. + +If 'imp_last_transno_checked' is still 'imp_last_replay_transno' and +'imp_generation' is still 'imp_last_generation_checked' then there are no additional requests ready to be removed from the replay list. Furthermore, 'imp_last_transno_checked' may no longer be needed, since the committed transactions are now maintained on a separate list. 
+////^^^^ +////////////////////////////////////////////////////////////////////// The 'imp_remote_handle' is the handle sent by the target in a connection reply message to uniquely identify the export for this target and client that is maintained on the server. This is the handle used in all subsequent messages to the target. -There are two separate ping intervals (fixme: what are the -values?). If there are no uncommitted messages for the target then the -default ping interval is used to set the 'imp_next_ping' to the time +////////////////////////////////////////////////////////////////////// +////vvvv +There are two separate ping intervals. If there are no uncommitted +messages for the target then the default ping interval, based on the +Adaptive Timeout value, is used to set the 'imp_next_ping' to the time the next ping needs to be sent. If there are uncommitted requests then -a "short interval" is used to set the time for the next ping. +a "short interval" of 7s is used to set the time for the next ping. The 'imp_last_success_conn' value is set to the time of the last successful connection. fixme: The source says it is in 64 bit @@ -289,3 +321,5 @@ or 'imp_pingable' flags? During recovery, the client sets the the current value of 'imp_replay_last_transno'. The 'imp_need_mne_swab' flag indicates a version dependent circumstance where swabbing was inadvertently left out of one processing step. +////^^^^ +////////////////////////////////////////////////////////////////////// diff --git a/layout_intent.txt b/layout_intent.txt index 250cd54..9d75e47 100644 --- a/layout_intent.txt +++ b/layout_intent.txt @@ -28,5 +28,5 @@ enum { }; ---- -The other fields - li_flags, li_start, adn li_end - are reserved for -future use, but do not currently play arole. +The other fields - 'li_flags', 'li_start', and 'li_end' - are reserved for +future use. 
diff --git a/ldlm.txt b/ldlm.txt index 97840be..51e3299 100644 --- a/ldlm.txt +++ b/ldlm.txt @@ -7,8 +7,7 @@ become relevant to the discussion of various file system operations. ################################################################# Fixme: Move the ldlm sturucture includes to where they first -gets introduced. In the sections that have ldlm_enqueue -operations. +get introduced. In the sections that have ldlm_enqueue operations. ################################################################# include::layout_intent.txt[] diff --git a/ldlm_bl_callback.txt b/ldlm_bl_callback.txt index cc3be06..5a26eec 100644 --- a/ldlm_bl_callback.txt +++ b/ldlm_bl_callback.txt @@ -29,9 +29,9 @@ just an acknowledgment of receipt, and doesn't carry any further information about the lock or the resource. 'ptlrpc_body':: -RPC descriptor. +RPC descriptor. <> 'ldlm_request':: Description of the lock being requested. Which resource is the target, -what lock is current, and what lock desired. +what lock is current, and what lock desired. <> diff --git a/ldlm_cancel.txt b/ldlm_cancel.txt index 1f0a254..80b186f 100644 --- a/ldlm_cancel.txt +++ b/ldlm_cancel.txt @@ -21,10 +21,10 @@ art: ////////////////////////////////////////////////////////////////////// 'ptlrpc_body':: -RPC descriptor. +RPC descriptor. <> 'ldlm_request':: The request RPC identifies the lock being canceled. Only the first 'lock_handle' field is used. The rest of the 'ldlm_request' sturcture -is not used. +is not used. <> diff --git a/ldlm_cp_callback.txt b/ldlm_cp_callback.txt index eb5a69c..071db0f 100644 --- a/ldlm_cp_callback.txt +++ b/ldlm_cp_callback.txt @@ -27,12 +27,12 @@ resource. The reply is just an acknowledgment of receipt, and doesn't carry any further information about the lock or the resource. 'ptlrpc_body':: -RPC descriptor. +RPC descriptor <>. 'ldlm_request':: Description of the lock being requested. Which resource is the target, -what lock is current, and what lock desired. 
+what lock is current, and what lock desired. <> 'ost_lvb':: -Attribute data associated with a resource on an OST. +Attribute data associated with a resource on an OST. <> diff --git a/ldlm_enqueue.txt b/ldlm_enqueue.txt index 18b1e7b..aa36ad6 100644 --- a/ldlm_enqueue.txt +++ b/ldlm_enqueue.txt @@ -87,14 +87,15 @@ art: ////////////////////////////////////////////////////////////////////// 'ptlrpc_body':: -RPC descriptor. +RPC descriptor <>. 'ldlm_request':: Description of the lock being requested. Which resource is the target, -what lock is current, and what lock desired. +what lock is current, and what lock desired. <> 'ldlm_intent':: Description of the intent being included with the lock request. +<> 'layout_intent':: Description of the layout information that is the subject of a layout @@ -103,7 +104,7 @@ intent. 'mdt_body':: In a request, an indication (in the 'mbo_valid' field) of what attributes the requester would like. In a reply, (again based on -'mbo_valid') the values being returned. +'mbo_valid') the values being returned. <> 'lustre_capa':: So called "capabilities" structure. This is deprecated in recent @@ -115,14 +116,10 @@ A text field supplying the name of the desired resource. 'ldlm_reply':: Resembling the 'ldlm_request', but in this case indicating what the -LDLM actually granted as well as relevant policy data. - -'mdt_md':: -Layout data for the resource. This buffer is optional and will appear -as zero length in some packets. +LDLM actually granted as well as relevant policy data. <> 'mdt_body':: -Metadata about a given resource. +Metadata about a given resource. <> 'ACL data':: Access Control List data associated with a resource. @@ -146,18 +143,18 @@ Lock Value Block:: A Lock Value Block (LVB) is included in the LDLM_ENQUEUE reply message when one of three things needs to be communicated back to the requester. 
The three alternatives are captured by the -'ost_lvb'structure, the 'lov_mds_md' structure, and one other related +'ost_lvb' structure, the 'lov_mds_md' structure, and one other related to quotas (and not yet incorporated into this document). LDLM_ENQUEUE reply RPCs may include a zero length instance of an LVB. 'ost_lvb':: An LVB to communicate attribute data for an extent associated with a resource on a lock. It is returned from an OST to a client requesting -an extent lock. +an extent lock. <> 'lov_mds_md':: -An LVB to communicate layout data associated with a resource. It is +Layout data associated with a resource. It is returned from an MDT to a client requesting a lock a lock with a layout intent. In an intent request (as opposed to a reply and as yet unimplemanted) it will modify the layout. It will not be included -(zero length) in requests in current releases. +(zero length) in requests in current releases. <> diff --git a/ldlm_gl_callback.txt b/ldlm_gl_callback.txt index 0115ad2..24564c0 100644 --- a/ldlm_gl_callback.txt +++ b/ldlm_gl_callback.txt @@ -28,12 +28,12 @@ and notify the requester of size and time attributes once that is done. The reply updates the attributes on the requester. 'ptlrpc_body':: -RPC descriptor. +RPC descriptor. <> 'ldlm_request':: Description of the lock being requested. Which resource is the target, -what lock is current, and what lock desired. +what lock is current, and what lock desired. <> 'ost_lvb':: -Attribute data associated with a resource on an OST. +Attribute data associated with a resource on an OST. <> diff --git a/ldlm_lock_desc.txt b/ldlm_lock_desc.txt index 755f442..59d03aa 100644 --- a/ldlm_lock_desc.txt +++ b/ldlm_lock_desc.txt @@ -9,9 +9,9 @@ lock being requested or granted. 
It appears in ---- struct ldlm_lock_desc { struct ldlm_resource_desc l_resource; - ldlm_mode_t l_req_mode; - ldlm_mode_t l_granted_mode; - ldlm_wire_policy_data_t l_policy_data; + enum ldlm_mode l_req_mode; + enum ldlm_mode l_granted_mode; + union ldlm_wire_policy_data l_policy_data; }; ---- @@ -24,7 +24,7 @@ being requested and the kind of lock that has been granted. The field values are: ---- -typedef enum { +enum ldlm_mode { LCK_EX = 1, /* exclusive */ LCK_PW = 2, /* privileged write */ LCK_PR = 4, /* privileged read */ @@ -33,8 +33,9 @@ typedef enum { LCK_NL = 32, /* */ LCK_GROUP = 64, /* */ LCK_COS = 128, /* */ -} ldlm_mode_t; +}; ---- +[[struct-ldlm-mode]] Despite the fact that the lock modes are not overlapping, these lock modes are exclusive. In addition the mode value 0 is the MINMODE, @@ -48,14 +49,14 @@ fewer privileges were granted than requested, and the The 'l_policy_data' field gives the kind of resource being requested/granted. It is a union of these struct definitions: -[[ldlm-wire-policy-data-t]] +[[struct-ldlm-wire-policy-data]] ---- -typedef union { +union ldlm_wire_policy_data { struct ldlm_extent l_extent; struct ldlm_flock_wire l_flock; struct ldlm_inodebits l_inodebits; -} ldlm_wire_policy_data_t; +}; ---- ---- @@ -64,6 +65,9 @@ struct ldlm_extent { __u64 end; __u64 gid; }; +---- +[[struct-ldlm-extent]] +---- struct ldlm_flock_wire { __u64 lfw_start; __u64 lfw_end; @@ -71,14 +75,16 @@ struct ldlm_flock_wire { __u32 lfw_padding; __u32 lfw_pid; }; +---- +[[struct-ldlm-flock-wire]] +---- struct ldlm_inodebits { __u64 bits; }; ---- +[[struct-ldlm-inodebits]] Thus the lock may be on an 'extent', a contiguous sequence of bytes in a regular file; an 'flock wire', whatever to heck that is; or a portion of an inode. For a "plain" lock (or one with no type at all) the 'l_policy_data' field has zero length. 
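As a rough illustration of the paragraph above, the sketch below maps a lock type to the member of 'union ldlm_wire_policy_data' that would carry its policy data. The 'enum ldlm_type' values are the ones given for 'ldlm_resource_desc'; the function name 'ldlm_policy_member' is hypothetical and is not part of Lustre.

```c
#include <stddef.h>

/* Lock types as listed for struct ldlm_resource_desc. */
enum ldlm_type {
	LDLM_PLAIN  = 10,
	LDLM_EXTENT = 11,
	LDLM_FLOCK  = 12,
	LDLM_IBITS  = 13,
};

/*
 * Hypothetical helper: name the union member that is meaningful for a
 * given lock type, or NULL when 'l_policy_data' carries nothing (plain
 * locks and unrecognized types).
 */
static inline const char *ldlm_policy_member(enum ldlm_type type)
{
	switch (type) {
	case LDLM_EXTENT:
		return "l_extent";    /* byte range in a regular file */
	case LDLM_FLOCK:
		return "l_flock";     /* struct ldlm_flock_wire */
	case LDLM_IBITS:
		return "l_inodebits"; /* portion of an inode */
	case LDLM_PLAIN:
	default:
		return NULL;          /* zero-length policy data */
	}
}
```

A receiver would typically switch on 'lr_type' in the same way before reading any member of the union, since the union itself does not record which member is valid.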
- - diff --git a/ldlm_request.txt b/ldlm_request.txt index 93f9f5d..241dc74 100644 --- a/ldlm_request.txt +++ b/ldlm_request.txt @@ -1,4 +1,3 @@ - LDLM Request ^^^^^^^^^^^^ [[struct-ldlm-request]] diff --git a/ldlm_resource_desc.txt b/ldlm_resource_desc.txt index c50781f..2b48f31 100644 --- a/ldlm_resource_desc.txt +++ b/ldlm_resource_desc.txt @@ -7,7 +7,7 @@ being locked, along with what sort of thing it is. ---- struct ldlm_resource_desc { - ldlm_type_t lr_type; + enum ldlm_type lr_type; __u32 lr_padding; /* also fix lustre_swab_ldlm_resource_desc */ struct ldlm_res_id lr_name; }; @@ -32,11 +32,11 @@ objects of the locking operation. See the discussion of <>. ---- -typedef enum { +enum ldlm_type { LDLM_PLAIN = 10, LDLM_EXTENT = 11, LDLM_FLOCK = 12, LDLM_IBITS = 13, -} ldlm_type_t; +}; ---- - +[[struct-ldlm-type]] diff --git a/ldlm_resource_id.txt b/ldlm_resource_id.txt index 563f1b1..293b4f9 100644 --- a/ldlm_resource_id.txt +++ b/ldlm_resource_id.txt @@ -12,5 +12,5 @@ struct ldlm_res_id { The 'name' array holds identifiers for the resource in question. Those identifiers may be the elements of a 'struct lu_fid' file ID, or they -may be other uniquely identifying values for the resource. See <>. +may be other uniquely identifying values for the resource. See <>. diff --git a/llog.txt b/llog.txt index edbc606..5f2d010 100644 --- a/llog.txt +++ b/llog.txt @@ -2,8 +2,20 @@ The Lustre Log Facility ~~~~~~~~~~~~~~~~~~~~~~~ [[llog]] +The Lustre log (llog) file may contain a number of different types of +data structures, including redo records for uncommitted distributed +transactions such as unlink or ownership changes, configuration records +for targets and clients, or ChangeLog records to track changes to the +filesystem for external consumption, among others. + +Each llog file begins with an 'llog_log_hdr' that describes the llog +file itself, followed by a series of log records that are appended +sequentially to the file.
Each record, including the header itself, +begins with an 'llog_rec_hdr' and ends with an 'llog_rec_tail'. + LLOG Log ID ^^^^^^^^^^^ +[[struct-llog-logid]] ---- struct llog_logid { @@ -12,9 +24,12 @@ struct llog_logid { }; ---- +The 'llog_logid' structure is used to identify a single Lustre log file. +It holds a <> in 'lgl_oi', which is typically a FID. + LLog Information ^^^^^^^^^^^^^^^^ - +[[struct-llogd-body]] ---- struct llogd_body { struct llog_logid lgd_logid; @@ -38,7 +53,7 @@ enum llog_flag { LLog Record Header ^^^^^^^^^^^^^^^^^^ - +[[struct-llog-rec-hdr]] ---- struct llog_rec_hdr { __u32 lrh_len; @@ -48,9 +63,36 @@ struct llog_rec_hdr { }; ---- +The 'llog_rec_hdr' is at the start of each llog record and describes +the log record. 'lrh_len' holds the record size in bytes, including +the header and tail. 'lrh_index' is the record index within the llog +file and is sequentially increasing for each subsequent record. It +can be used to determine the offset within the llog file when searching +for an arbitrary record within the file. 'lrh_type' describes the type +of data stored in this record. + +---- +enum llog_op_type { + LLOG_PAD_MAGIC = LLOG_OP_MAGIC | 0x00000, + OST_SZ_REC = LLOG_OP_MAGIC | 0x00f00, + MDS_UNLINK64_REC = LLOG_OP_MAGIC | 0x90000 | (MDS_REINT << 8) | + REINT_UNLINK, + MDS_SETATTR64_REC = LLOG_OP_MAGIC | 0x90000 | (MDS_REINT << 8) | + REINT_SETATTR, + OBD_CFG_REC = LLOG_OP_MAGIC | 0x20000, + LLOG_GEN_REC = LLOG_OP_MAGIC | 0x40000, + CHANGELOG_REC = LLOG_OP_MAGIC | 0x60000, + CHANGELOG_USER_REC = LLOG_OP_MAGIC | 0x70000, + HSM_AGENT_REC = LLOG_OP_MAGIC | 0x80000, + UPDATE_REC = LLOG_OP_MAGIC | 0xa0000, + LLOG_HDR_MAGIC = LLOG_OP_MAGIC | 0x45539, + LLOG_LOGID_MAGIC = LLOG_OP_MAGIC | 0x4553b, +}; +---- + LLog Record Tail ^^^^^^^^^^^^^^^^ - +[[struct-llog-rec-tail]] ---- struct llog_rec_tail { __u32 lrt_len; @@ -58,9 +100,14 @@ struct llog_rec_tail { }; ---- +The 'llog_rec_tail' is at the end of each llog record. 
The 'lrt_len' +and 'lrt_index' fields must be the same as 'lrh_len' and 'lrh_index' +in the header. They can be used to verify record integrity, as well +as allowing processing the llog records in reverse order. + LLog Log Header Information ^^^^^^^^^^^^^^^^^^^^^^^^^^^ - +[[struct-llog-log-hdr]] ---- struct llog_log_hdr { struct llog_rec_hdr llh_hdr; @@ -78,3 +125,14 @@ struct llog_log_hdr { }; ---- +The llog records start and end on a record size boundary, typically +8192 bytes, or as stored in 'llh_size', which allows some degree of +random access within the llog file, even with variable record sizes. +It is possible to interpolate the offset of an arbitrary record +within the file by estimating the byte offset of a particular record +index using 'llh_count' and the llog file size and aligning it to +the chunk boundary 'llh_size'. The record index in the 'llog_rec_hdr' +of the first record in that chunk can be used to further refine the +estimate of the offset of the desired index down to a single chunk, +and then sequential access can be used to find the actual record. + diff --git a/lov_mds_md.txt b/lov_mds_md.txt index b0f1cdd..97933cb 100644 --- a/lov_mds_md.txt +++ b/lov_mds_md.txt @@ -2,13 +2,16 @@ MDS Lock Value Block ^^^^^^^^^^^^^^^^^^^^ [[struct-lov-mds-md]] -The 'lov_mds_md' structure is a Lock Value Block (LVB) for layout -locks. In replies to lock requests and other situations requiring +The 'lov_mds_md' structure contains the layout of a single file. +In replies to lock requests and other situations requiring layout information from an MDT the 'lov_mds_md' information provides -details about the layout of a file across the OSTs. +details about the layout of a file across the OSTs. There may be +different types of layouts for different files, either 'lov_mds_md_v1' +or 'lov_mds_md_v3' as of Lustre 2.7, though they are very similar in +structure. 
---- -struct lov_mds_md { +struct lov_mds_md_v1 { __u32 lmm_magic; __u32 lmm_pattern; struct ost_id lmm_oi; @@ -17,31 +20,77 @@ struct lov_mds_md { __u16 lmm_layout_gen; struct lov_ost_data_v1 lmm_objects[0]; }; + +struct lov_mds_md_v3 { + __u32 lmm_magic; + __u32 lmm_pattern; + struct ost_id lmm_oi; + __u32 lmm_stripe_size; + __u16 lmm_stripe_count; + __u16 lmm_layout_gen; + char lmm_pool_name[16]; + struct lov_ost_data_v1 lmm_objects[0]; +}; ---- -The 'lmm_magic' field is filled in with the value LOV_MAGIC -(0x0BD10BD0) when the structure is in use. If the structure is in a +The 'lmm_magic' field is filled in with the magic value that describes +the type of layout in use for a particular file. The rest of the layout +structure may vary depending on the magic value, but the 'lmm_magic' +value must be the first member of any valid file layout. The values +LOV_MAGIC_V1 (0x0BD10BD0) and LOV_MAGIC_V3 (0x0BD30BD0) are most +commonly used when the structure is in use. If the structure is in a buffer of an RPC without the magic number in place, then the rest of the structure is ignored. -The 'lmm_pattern' field is only ever set to LOV_PATTERN_RAID0 -(0x001). +The 'lmm_pattern' field describes the layout pattern within a particular +LOV_MAGIC layout type. Currently in Lustre 2.7 the most common pattern +is LOV_PATTERN_RAID0 (0x001). The high 16 bits of the 'lmm_pattern' field +may also contain flags that modify the layout. LOV_PATTERN_F_HOLE means +that there is a hole (missing object) within the layout, which is normally +caused by corruption or loss of the file layout that had to be rebuilt +by LFSCK. LOV_PATTERN_F_RELEASED is used by HSM to indicate that the +file data is not resident in the filesystem, but rather in an external +archive, so the layout is only a stub that describes the layout to use +when the file is restored. -The 'lmm_oi' field gives the LOV object ID for the first OST of the -layout. This is the OST where striping will begin. 
+The 'lmm_oi' field gives the identifier of the MDT object to which this +file belongs. See <> for more details. -The 'lmm_stripe_size' field give the stripe size for the object. This -is how many bytes will be on a particular OST before going to the next -stripe. +The 'lmm_stripe_size' field gives the stripe size for the layout in bytes. +This is how many bytes will be on a particular OST object before going +to the next stripe. -The 'lmm_stripe_count' field gives how many OSTs the file is striped +The 'lmm_stripe_count' field gives how many OST objects the file is striped over. -The 'lmm_layout_gen' field ges updated as the layout of the obeject is -updated. This way out-of-date references to the layout can be -recognized. +The 'lmm_layout_gen' field is updated as the layout of the object is +updated. If the 'lmm_layout_gen' is modified, then clients can detect +the layout has changed when refreshing the layout after having lost the +layout lock. + +The 'lmm_pool_name' field is only in LOV_MAGIC_V3 layouts, and contains +the name of the OST pool used during the creation of the layout. + +For instantiated layouts, the 'lmm_objects' array contains 'lmm_stripe_count' +'lov_ost_data_v1' structures with the per-stripe data for each OST object +in the layout. For uninstantiated layout templates, there will not be any +entries in the 'lmm_objects' array, which can be determined by the overall +layout size. + +---- +struct lov_ost_data_v1 { /* per-stripe data structure (little-endian)*/ + struct ost_id l_ost_oi; /* OST object ID */ + __u32 l_ost_gen; /* generation of this l_ost_idx */ + __u32 l_ost_idx; /* OST index in LOV (lov_tgt_desc->tgts) */ +}; +---- +[[struct-lov-ost-data-v1]] -The 'lmm_objects' array gives per-stripe data for more complex -(non-uniform) layouts. +The 'lov_ost_data_v1' structure is used for both LOV_MAGIC_V1 and +LOV_MAGIC_V3 types to describe the OST objects allocated to this +layout, one per object.
+'l_ost_oi' identifies the object on the OST specified by 'l_ost_idx'. +It may contain an OST object ID or a FID as described in <>. +The 'l_ost_gen' field is currently unused. diff --git a/lustre_file_ids.txt b/lustre_file_ids.txt deleted file mode 100644 index 6dc9ae5..0000000 --- a/lustre_file_ids.txt +++ /dev/null @@ -1,26 +0,0 @@ -Lustre File IDs -^^^^^^^^^^^^^^^ -[[file-id]] - -Each resource stored on a target is assigned an identifier that is -unique to that resource. - ----- -struct lu_fid { - __u64 f_seq; - __u32 f_oid; - __u32 f_ver; -}; ----- - -The 'f_seq' field identifies the target. That is, all the resources -with a common 'f_seq' will be on the same target. A target can have -more than one 'f_seq' value assigned to it. - -The 'f_oid' gives the specific value for a given resource that is -unique to that resource on that target. - -The 'f_ver' value identifies which version of a resource is being -identified, in the event that the resource is being updated, and -different hosts might be referring to different versions of the same -resource. diff --git a/lustre_handle.txt b/lustre_handle.txt index 996121a..722e26b 100644 --- a/lustre_handle.txt +++ b/lustre_handle.txt @@ -2,9 +2,17 @@ Lustre Handle ^^^^^^^^^^^^^ [[struct-lustre-handle]] -A Lustre handle is an opaque cookie that identifies some local object -(e.g. connection, open file, DLM lock, etc) to another node or another -layer in the software stack. +A 'lustre_handle' is an opaque cookie that identifies some local object +to another node or another layer in the software stack. It is not the +physical address of the data structure in memory to avoid memory +corruption in case the object has been freed, but rather a cookie in a +lookup table that provides a layer of indirection that can safely +determine if the object still exists or not. + +A 'lustre_handle' can identify a variety of different types of objects, +such as client-target connections, open file handles, DLM lock handles, +etc.
The meaning of the handle is dependent on the context in which it +is used. ---- struct lustre_handle { diff --git a/lustre_rpcs.txt b/lustre_rpcs.txt index c13df8f..50cb5f2 100644 --- a/lustre_rpcs.txt +++ b/lustre_rpcs.txt @@ -2,15 +2,19 @@ Lustre RPCs ----------- [[lustre-rpcs]] -Lustre operations are denoted by the 'pb_opc' op-code field of the -RPC descriptor. Each operation is implemented as a pair of messages, -with the 'pb_type' field set to PTLRPC_MSG_REQUEST for requests -initiating the operation, and PTLRPC_MSG_REPLY for replies. Note that -as a general matter, the receipt by a client of the reply message only -assures the client hat the server has initiated the action, if -any. See the discussion on <> -and <> for how the client is given confirmation that a -request has been completed. +Lustre RPC operations are implemented as a pair of messages, a request +and its reply. The 'pb_type' field is set to PTLRPC_MSG_REQUEST for +requests initiating the operation, and normally PTLRPC_MSG_REPLY for +replies unless the message encountered a fatal error before it could +be processed, in which case it will contain PTLRPC_MSG_ERR. + +The type of operation requested is denoted by the 'pb_opc' op-code field +of the RPC request. Note that as a general matter, the receipt by a +client of the reply message only assures the client that the server +has initiated the action, if any, but does not guarantee that any +modification has been committed to persistent storage. See the +discussion on <> and <> for +how the client is given confirmation that a request has been completed. 
include::ost_setattr.txt[] diff --git a/mds_connect.txt b/mds_connect.txt index 99dfa4b..55437e4 100644 --- a/mds_connect.txt +++ b/mds_connect.txt @@ -5,8 +5,11 @@ RPC 38: MDS CONNECT - Client connection to an MDS .MDS_CONNECT (38) [options="header"] |==== -| request | reply -| obd_connect_client | obd_connect_server +| request +| ptlrpc_body | tgt_uuid | client_uuid | lustre_handle | obd_connect_data | +| +| reply +| ptlrpc_body | obd_connect_data | |==== N.B. This is nearly identical to the explanation for OST_CONNECT and @@ -18,31 +21,62 @@ When a client initiates a connection to a specific target on an MDS, it does so by sending an 'obd_connect_client' message and awaiting the reply from the MDS of an 'obd_connect_server' message. From a previous interaction with the MGS the client knows the UUID of the target MDT, -and can fill that value into the 'obd_connect_client' message. +and must fill that value into the 'tgt_uuid' buffer of the request. -The 'ocd_connect_flags' field is set to (fixme: what?) reflecting the -capabilities appropriate to the client. The 'ocd_brw_size' is set to the -largest value for the size of an RPC that the client can handle. The -'ocd_ibits_known' and 'ocd_checksum_types' values are set to what the client -considers appropriate. Other fields in the descriptor and -'obd_connect_data' structures are zero, as is the 'lustre_handle' -element. +The 'client_uuid' buffer holds the randomly-generated 128-bit UUID of +the client. The 'client_uuid' is unique for each mount of the client. +Even if the same client mounts the same filesystem multiple times it +will generate a new 'client_uuid' value for each mount. + +The 'lustre_handle' buffer contains the cookie for this connection, +and is zero for a new mount. If the client is reconnecting to a +server after a loss of communication, the 'lustre_handle' contains +the connection cookie previously assigned by the server and returned +to the client in the reply. 
This allows the server to determine if +the client connection matches any previous connection from this client. + +The 'ocd_connect_flags' field is initialized to the set of features +that the client supports for metadata targets. For Lustre 2.7 clients +these features are OBD_CONNECT_IBITS, OBD_CONNECT_NODEVOH, +OBD_CONNECT_ATTRFID, OBD_CONNECT_VERSION, OBD_CONNECT_BRW_SIZE, +OBD_CONNECT_CANCELSET, OBD_CONNECT_FID, OBD_CONNECT_AT, OBD_CONNECT_LOV_V3, +OBD_CONNECT_VBR, OBD_CONNECT_FULL20, OBD_CONNECT_64BITHASH, +OBD_CONNECT_EINPROGRESS, OBD_CONNECT_JOBSTATS, OBD_CONNECT_LVB_TYPE, +OBD_CONNECT_LAYOUTLOCK, OBD_CONNECT_PINGLESS, OBD_CONNECT_MAX_EASIZE, +OBD_CONNECT_FLOCK_DEAD, OBD_CONNECT_DISP_STRIPE, OBD_CONNECT_LFSCK, and +OBD_CONNECT_OPEN_BY_FID, as described in <>. + +The 'ocd_brw_size' field is set to the largest size in bytes that the +client can use for bulk transfers. The 'ocd_ibits_known' field is set to +the supported set of IBITS locks; as of Lustre 2.7 these are +MDS_INODELOCK_LOOKUP, MDS_INODELOCK_UPDATE, MDS_INODELOCK_OPEN, +MDS_INODELOCK_LAYOUT, MDS_INODELOCK_PERM, and MDS_INODELOCK_XATTR, +as described in <>. + +The 'ocd_version' field contains the version of Lustre +running on the client. Other fields in the 'obd_connect_data' +structures are zero. Once the server receives the 'obd_connect_client' message on behalf of the given target it replies with an 'obd_connect_server' message. In -that message the server sends the 'pb__handle' to uniquely -identify the connection for subsequent communication. The client notes -that handle in its import for the given target. +the reply message the server sets the 'pb_handle' to uniquely +identify this connection for subsequent communication. The client notes +that handle in its import for the given target. The reply also returns +the 'obd_connect_data' as interpreted by the MDS. Any features that +the MDS does not support are masked out of 'ocd_connect_flags'.
Supported +features that have assigned fields are filled in by the MDS. fixme: Are there circumstances that could lead to the 'status' value in the reply being non-zero? What would lead to that and what error values would result? +ans: yes, at least the standard errors The target maintains the last committed transaction for a client in its export for that client. If this is the first connection, then that last transaction value would just be zero. If there were previous transactions for the client, then the transaction number for the last -such committed transaction is put in the 'pb_last_committed' field. +such committed transaction is put in the <> +'pb_last_committed' field. In a connection request the operation is not file system modifying, so the 'pb_transno' value will be zero in the reply as well. diff --git a/mds_disconnect.txt b/mds_disconnect.txt index 65507df..e48989c 100644 --- a/mds_disconnect.txt +++ b/mds_disconnect.txt @@ -10,5 +10,5 @@ RPC 39: MDS DISCONNECT - Disconnect client from an MDS |==== The information exchanged in a DISCONNECT message is that normally -conveyed in the RPC descriptor. +conveyed in the RPC descriptor, as described in <>. diff --git a/mds_getstatus.txt b/mds_getstatus.txt index 8600d65..82f6fd5 100644 --- a/mds_getstatus.txt +++ b/mds_getstatus.txt @@ -2,20 +2,20 @@ RPC 40: MDS GETSTATUS - get the status from a target ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [[mds-getstatus-rpc]] -The MDS_GETSTATUS command targets a specific MDT. If there are several, -the client will need to send a separate message for each. +The MDS_GETSTATUS request is used to determine the filesystem ROOT FID +during initial mount. It is always sent to MDT0000 initially. 
.MDS_GETSTATUS (40) [options="header"] |==== -| request | reply -| mdt_body_only | mdt_body_capa +| request | reply +| mdt_body | mdt_body |==== -The 'mdt_body_only' request message is one that provides information -about a specific MDT, identifying which (among several possible) -target is being asked about. +The request message 'mdt_body' is normally empty. + +The reply message 'mdt_body' contains only the FID of the filesystem +ROOT in 'mbo_fid1'. The client can then use the returned FID to +fetch inode attributes for the root inode of the local mountpoint. -In the reply there is additional information about the MDT's -capabilities. diff --git a/mds_getxattr.txt b/mds_getxattr.txt index f6528f8..c6960e0 100644 --- a/mds_getxattr.txt +++ b/mds_getxattr.txt @@ -2,8 +2,7 @@ RPC 49: MDS_GETXATTR ~~~~~~~~~~~~~~~~~~~~ [[mds-getxattr-rpc]] -An RPC that provides information about extended attributes for a -resource. +An RPC that fetches extended attributes for a resource. .MDS_GETXATTR Generic Packet Structure image::mds-getxattr-generic.png["MDS_GETXATTR Generic Packet Structure",height=100] diff --git a/mds_reint.txt b/mds_reint.txt index 91bf47d..2491047 100644 --- a/mds_reint.txt +++ b/mds_reint.txt @@ -1,10 +1,10 @@ RPC 36: MDS_REINT ~~~~~~~~~~~~~~~~~ -[[mds-reint-rpm]] +[[mds-reint-rpc]] -An RPC that implements an operation that will change the information -on an MDT. There are a variety of operations all gathered under the -MDS_REINT 'opcode'. +An RPC that implements an operation that will change the state of +an object on an MDT. There are a variety of operations all gathered +under the MDS_REINT 'opcode'. ---- typedef enum { diff --git a/mds_statfs.txt b/mds_statfs.txt index 9121e2d..1a12e06 100644 --- a/mds_statfs.txt +++ b/mds_statfs.txt @@ -3,9 +3,11 @@ RPC 41: MDS_STATFS [[mds-statfs-rpc]] MDS_STATFS is an RPC that queries data about the underlying file -system for a given MDT. 
It is generated in response to an explicit -call for 'statfs' information from the VFS via the 'statfs(2)' -function. +system for a given MDT. +//// +It is generated in response to an explicit call for 'statfs' +information from the VFS via the 'statfs(2)' function. +//// The MDS_STATFS request message is a so-called "empty" message in that it only has a buffer for the 'ptlrpc_body' with the 'pb_opc' value @@ -35,7 +37,7 @@ code if it doesn't. 'ptlrpc_body':: RPC descriptor. Only the 'pb_opc' value (MDS_STATFS = 41) is directly relevant to the MDS_STATFS request message. The rest of the 'ptlrpc_body' fields handle generic information about the -RPC, as discussed in <>, including generic error +RPC, as discussed in <>, including generic error conditions. In a normal reply ('pb_type' = PTL_RPC_MSG_REPLY) the 'pb_status' field is 0. The one error that can be returned in 'pb_status' that is speficially from OST_STATFS' handling is -ENOMEM, diff --git a/mdt_body.txt b/mdt_body.txt index 1e31729..ecce92d 100644 --- a/mdt_body.txt +++ b/mdt_body.txt @@ -3,7 +3,7 @@ MDT Body [[struct-mdt-body]] An 'mdt_body' structure conveys information about the metadata for a -resource. +single resource, typically an MDT inode. ---- struct mdt_body { @@ -48,7 +48,7 @@ For an operation that involves two resources both the 'mbo_fid1' and 'mbo_fid2' fields will be filled in. If the 'mdt_body' is part of an RPC affecting or involving only a single resource then 'mbo_fid1' will designate that resource and 'mbo_fid2' will be cleared (see -<>). +<>). The 'mbo_handle' field indicates the identity of an open file related to the operation, if any. If there is no lock then it is just 0. @@ -100,9 +100,12 @@ The 'mbo_nlink' field gives the number of hard links to the resource. The 'mbo_eadatasize' field gives the size of the extended attributes, which hold such information as the resource's layout. -The 'mbo_aclsize' field supports POSIX access control lists (ACLs). 
+The 'mbo_aclsize' field indicates the size of the POSIX access control +lists (ACLs) in bytes. -The 'mbo_max_mdsize' field +The 'mbo_max_mdsize' field indicates the maximum size of the file layout, +as described in <> -The 'mbo_max_cookiesize' field is unused. +The 'mbo_max_cookiesize' field is unused since Lustre 2.4. It formerly +held the maximum permissible size of llog cookies for destroyed OST objects. diff --git a/mdt_rec_reint.txt b/mdt_rec_reint.txt index 16dab6b..c099e13 100644 --- a/mdt_rec_reint.txt +++ b/mdt_rec_reint.txt @@ -23,9 +23,9 @@ struct mdt_rec_reint { __u32 rr_suppgid2_h; struct lu_fid rr_fid1; struct lu_fid rr_fid2; - obd_time rr_mtime; - obd_time rr_atime; - obd_time rr_ctime; + __u64 rr_mtime; + __u64 rr_atime; + __u64 rr_ctime; __u64 rr_size; __u64 rr_blocks; __u32 rr_bias; @@ -76,8 +76,9 @@ The 'rr_fid1' and 'rr_fid2' fields specify the file IDs for the resources being operated upon. If only one resource is being acted upon then 'rr_fid2' is not used. -The 'rr_mtime' 'rr_atime' and 'rr_ctime' fields give the values of the -time attributes. The 'obd_time' type is also a '__u64'. +The 'rr_mtime', 'rr_atime', and 'rr_ctime' fields give the object +timestamps in seconds for the last modification time, the last +access time, and the last metadata change time, respectively. The 'rr_size' field gives the value of the size attribute. diff --git a/mgs_disconnect.txt b/mgs_disconnect.txt index 79d5e8d..89e5174 100644 --- a/mgs_disconnect.txt +++ b/mgs_disconnect.txt @@ -10,8 +10,7 @@ RPC 251: MGS DISCONNECT - Disconnect client from an MGS |==== N.B. The usual 'struct req_format' definition does not exist for -MGS_DISCONNECT. It will take a more careful code review to verify that -it also has 'empty' messages gong back and forth. +MGS_DISCONNECT. -The information exchanged in a DISCONNECT message is that normally -conveyed in the RPC descriptor.
+The information exchanged in a DISCONNECT message is only that normally +conveyed in the RPC descriptor, as described in <>. diff --git a/mount.txt b/mount.txt index 1cf0659..9fc3e93 100644 --- a/mount.txt +++ b/mount.txt @@ -3,11 +3,11 @@ Mount Before any other interaction can take place between a client and a Lustre file system the client must 'mount' the file system, and Lustre -services must already be in place (on the servers). A file system -mount may be initiated at the Linux shell command line, which in turn -invokes the 'mount()' system call. Kernel modules for Lustre -exchange a series of messages with the servers, beginning with -messages that retrieve details about the file system from the +services must already be started on the servers. A file system +mount may be initiated via the 'mount()' system call, passing the +name of the MGS and Lustre filesystem to the kernel. The Lustre +client exchanges a series of messages with the servers, beginning with +messages that retrieve the file system configuration <> from the management server (MGS). This provides the client with the identities of all the metadata servers (MDSs) and targets (MDTs) as well as all the object storage servers (OSSs) and targets (OSTs). The client then @@ -69,13 +69,14 @@ the server this connection state is referred to as an 'export', and again the server has an export for each client that has connected to it. There a separate export for each client for each target. -The client begins by carrying out the MGS_CONNECT Lustre operation, -which establishes the connection (creates the import and the export) -between the client and the MGS. The connect message from the client -includes a 'handle' to uniquely identify itself (subsequent messages -to the LDLM will refer to that client-handle). The connection data -from the client also proposes the set of <> appropriate to connecting to an MGS. 
+The client begins by carrying out the <> +Lustre operation, which establishes the connection (creates the +import and the export) between the client and the MGS. The connect +message from the client includes a 'lustre_handle' to uniquely +identify itself. Subsequent request messages to the MGS will refer +to that client-handle to verify that the client has already connected. +The connection data from the client also proposes the set of +<> appropriate to connecting to an MGS. .Flags for the client connection to an MGS [options="header"] @@ -99,11 +100,12 @@ Once the connection is established the client gets configuration information for the file system from the MGS in four stages. First, the two exchange messages establishing the file system wide security policy that will be followed in all subsequent communications. Second, -the client gets a bitmap instructing it as to which among the +the client gets a configuration <> starting with a bitmap +instructing it as to which among the configuration records on the MGS it needs. Third, reading those records from the MGS gives the client the list of all the servers and targets it will need to communicate with. Fourth, the client reads -cluster wide configuration data (the sort that might be set at the +the cluster wide configuration data (the sort that might be set at the client command line with a 'lctl conf_param' command). The following paragraphs go into these four stages in more detail. diff --git a/obd_connect_data.txt b/obd_connect_data.txt index 0978bd5..0ca5076 100644 --- a/obd_connect_data.txt +++ b/obd_connect_data.txt @@ -1,9 +1,19 @@ OBD Connect Data ^^^^^^^^^^^^^^^^ -[[obd-connect-data]] +[[struct-obd-connect-data]] An 'obd_connect_data' structure accompanies every connect operation in -both the request message and in the reply message. +both the request message and in the reply message. It contains the +mutually supported features that are negotiated between the client and +server at *_CONNECT time. 
+ +At *_CONNECT time the client sends in its request 'ocd_connect_flags' +the flags for all features that it understands and intends to use (for +example 'OBD_CONNECT_RDONLY' is optional depending on client mount +options). The request also contains other fields that are only valid +if the matching flag is set. The server replies in 'ocd_connect_flags' +with the subset of feature flags that it understands and intends to honour. +The server may set fields in the reply for mutually-understood features. ---- struct obd_connect_data { @@ -45,7 +55,7 @@ The 'ocd_connect_flags' field encodes the connect flags giving the capabilities of a connection between client and target. Several of those flags (noted in comments above and the discussion below) actually control whether the remaining fields of 'obd_connect_data' -get used. The [[connect-flags]] flags are: +get used. The [[obd-connect-flags]] flags are: ---- #define OBD_CONNECT_RDONLY 0x1ULL /*client has read-only access*/ @@ -57,23 +67,15 @@ get used. The [[connect-flags]] flags are: #define OBD_CONNECT_REQPORTAL 0x40ULL /*Separate non-IO req portal */ #define OBD_CONNECT_ACL 0x80ULL /*access control lists */ #define OBD_CONNECT_XATTR 0x100ULL /*client use extended attr */ -#define OBD_CONNECT_CROW 0x200ULL /*MDS+OST create obj on write*/ #define OBD_CONNECT_TRUNCLOCK 0x400ULL /*locks on server for punch */ #define OBD_CONNECT_TRANSNO 0x800ULL /*replay sends init transno */ #define OBD_CONNECT_IBITS 0x1000ULL /*support for inodebits locks*/ -#define OBD_CONNECT_JOIN 0x2000ULL /*files can be concatenated. 
- *We do not support JOIN FILE - *anymore, reserve this flags - *just for preventing such bit - *to be reused.*/ #define OBD_CONNECT_ATTRFID 0x4000ULL /*Server can GetAttr By Fid*/ #define OBD_CONNECT_NODEVOH 0x8000ULL /*No open hndl on specl nodes*/ #define OBD_CONNECT_RMT_CLIENT 0x10000ULL /*Remote client */ #define OBD_CONNECT_RMT_CLIENT_FORCE 0x20000ULL /*Remote client by force */ #define OBD_CONNECT_BRW_SIZE 0x40000ULL /*Max bytes per rpc */ #define OBD_CONNECT_QUOTA64 0x80000ULL /*Not used since 2.4 */ -#define OBD_CONNECT_MDS_CAPA 0x100000ULL /*MDS capability */ -#define OBD_CONNECT_OSS_CAPA 0x200000ULL /*OSS capability */ #define OBD_CONNECT_CANCELSET 0x400000ULL /*Early batched cancels. */ #define OBD_CONNECT_SOM 0x800000ULL /*Size on MDS */ #define OBD_CONNECT_AT 0x1000000ULL /*client uses AT */ @@ -116,31 +118,63 @@ get used. The [[connect-flags]] flags are: Each flag corresponds to a particular capability that the client and target together will honor. A client will send a message including some subset of these capabilities during a connection request to a -specific target. It tells the server what capabilities it has. The +specific target. This tells the server what capabilities it has. The server then replies with the subset of those capabilities it agrees to honor (for the given target). -If the OBD_CONNECT_VERSION flag is set then the 'ocd_version' field is -honored. The 'ocd_version' gives an encoding of the Lustre -version. For example, Version 2.7.32 would be hexadecimal number -0x02073200. +If the 'OBD_CONNECT_VERSION' flag is set then the 'ocd_version' field is +valid. The 'ocd_version' gives an encoding of the Lustre +version. For example, Version 2.7.55 would be hexadecimal number +0x02075500. If the OBD_CONNECT_GRANT flag is set then the 'ocd_grant' field is -honored. The 'ocd_grant' value in a reply (to a connection request) +valid. The 'ocd_grant' value in a reply (to a connection request) sets the client's grant. 
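The feature negotiation described for 'ocd_connect_flags', together with the rule that the server never increases a requested 'ocd_brw_size', could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Lustre implementation: 'struct connect_sketch' and 'negotiate_connect()' are hypothetical names, the structure carries only two of the real fields, and clearing 'ocd_brw_size' when its flag is not agreed is an assumption based on the statement that a field is only valid if the matching flag is set. The flag values are copied from the definitions in this document.

```c
#include <stdint.h>

/* Flag values copied from the OBD_CONNECT_* definitions in this document. */
#define OBD_CONNECT_RDONLY   0x1ULL
#define OBD_CONNECT_ACL      0x80ULL
#define OBD_CONNECT_IBITS    0x1000ULL
#define OBD_CONNECT_BRW_SIZE 0x40000ULL

/* Hypothetical, trimmed-down stand-in for struct obd_connect_data. */
struct connect_sketch {
	uint64_t ocd_connect_flags; /* features proposed or agreed */
	uint32_t ocd_brw_size;      /* maximum bulk transfer size in bytes */
};

/*
 * Hypothetical helper: build the reply for a connect request.
 * The reply keeps only the flags both sides understand, and the bulk
 * transfer size is accepted or reduced, never increased.
 */
static struct connect_sketch
negotiate_connect(struct connect_sketch req, uint64_t server_flags,
		  uint32_t server_max_brw)
{
	struct connect_sketch reply = req;

	/* The reply holds the subset of requested flags the server honours. */
	reply.ocd_connect_flags = req.ocd_connect_flags & server_flags;

	if (reply.ocd_connect_flags & OBD_CONNECT_BRW_SIZE) {
		/* Accept the requested size or limit it further. */
		if (reply.ocd_brw_size > server_max_brw)
			reply.ocd_brw_size = server_max_brw;
	} else {
		/* Assumption: a field without its flag is left zeroed. */
		reply.ocd_brw_size = 0;
	}
	return reply;
}
```

In this sketch a client that proposes a flag the server does not support simply sees that flag missing from the reply, and a client that asks for a 4 MiB bulk size from a server limited to 1 MiB receives 1 MiB back.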
If the OBD_CONNECT_INDEX flag is set then the 'ocd_index' field is -honored. The 'ocd_index' value is set in a reply to a connection -request. It holds the LOV index of the target. +valid. The 'ocd_index' value is set in a request to hold the LOV +index of the OST or the LMV index of the MDT. The server should +refuse connections to targets for which the 'ocd_index' does not +match the actual target index. If the OBD_CONNECT_BRW_SIZE flag is set then the 'ocd_brw_size' field -is honored. The 'ocd_brw_size' value sets the size of the maximum -supported RPC. The client proposes a value in its connection request, -and the server's reply will either agree or further limit the size. +is valid. The 'ocd_brw_size' value sets the maximum supported bulk +transfer size. The client proposes a value in its connection request, +and the server's reply will either accept the requested size or +further limit the size. The server will not increase the client's +requested maximum bulk transfer size. If the OBD_CONNECT_IBITS flag is set then the 'ocd_ibits_known' field -is honored. The 'ocd_ibits_known' value determines the handling of -locks on inodes. See the discussion of inodes and extended attributes. +is valid. The 'ocd_ibits_known' flags determine the handling of +locks on MDS inodes. The OBD_CONNECT_IBITS flag was introduced in +Lustre 1.4 and is mandatory for MDS_CONNECT RPCs. See also the discussion +of inodes and extended attributes. +[[mds-inode-bits-locks]] + +---- +#define MDS_INODELOCK_LOOKUP 0x000001 /* For namespace, dentry etc, and also + * was used to protect permission (mode, + * owner, group etc) before 2.4. 
*/ +#define MDS_INODELOCK_UPDATE 0x000002 /* size, links, timestamps */ +#define MDS_INODELOCK_OPEN 0x000004 /* For opened files */ +#define MDS_INODELOCK_LAYOUT 0x000008 /* for layout */ +#define MDS_INODELOCK_PERM 0x000010 /* separate permission bits */ +#define MDS_INODELOCK_XATTR 0x000020 /* extended attributes */ +---- + +////////////////////////////////////////////////////////////////////// +/* The PERM bit is added int 2.4, and it is used to protect permission(mode, + * owner, group, acl etc), so to separate the permission from LOOKUP lock. + * Because for remote directories(in DNE), these locks will be granted by + * different MDTs(different ldlm namespace). + * + * For local directory, MDT will always grant UPDATE_LOCK|PERM_LOCK together. + * For Remote directory, the master MDT, where the remote directory is, will + * grant UPDATE_LOCK|PERM_LOCK, and the remote MDT, where the name entry is, + * will grant LOOKUP_LOCK. */ +#define MDS_INODELOCK_PERM 0x000010 +#define MDS_INODELOCK_XATTR 0x000020 /* extended attributes */ +////////////////////////////////////////////////////////////////////// If the OBD_CONNECT_GRANT_PARAM flag is set then the 'ocd_blocksize', 'ocd_inodespace', and 'ocd_grant_extent' fields are honored. A server @@ -164,22 +198,23 @@ inform the client of the transaction number at which it should begin replay. If the OBD_CONNECT_MDS flag is set then the 'ocd_group' field is -honored. When an MDT connects to an OST the 'ocd_group' field informs +valid. When an MDT connects to an OST the 'ocd_group' field informs the OSS of the MDT's index. Objects on that OST for that MDT will be in a common namespace served by that MDT. If the OBD_CONNECT_CKSUM flag is set then the 'ocd_cksum_types' field -is honored. The client uses the 'ocd_checksum_types' field to propose -to the server the client's available (presumably hardware assisted) +is valid. 
The client uses the 'ocd_cksum_types' field to propose +to the server the client's available (possibly hardware assisted) checksum mechanisms. The server replies with the checksum types it has -available. Finally, the client will employ the fastest of the agreed -mechanisms. +available and that are most efficient on the server. The client may +employ any of the replied checksum algorithms for a given bulk transfer, +but will typically select the fastest of the agreed algorithms. If the OBD_CONNECT_MAX_EASIZE flag is set then the 'ocd_max_easize' -field is honored. The server uses 'ocd_max_easize' to inform the +field is valid. The server uses 'ocd_max_easize' to inform the client about the amount of space that can be allocated in each inode for extended attributes. The 'ocd_max_easize' specifically refers to -the space used for striping information. This allows the client to +the space used for layout information. This allows the client to determine the maximum layout size (and hence stripe count) that can be stored on the MDT. @@ -192,9 +227,10 @@ supports imperative recovery. If the OBD_CONNECT_MAXBYTES flag is set then the 'ocd_maxbytes' field is honored. An OSS uses the 'ocd_maxbytes' value to inform the client -of the maximum OST object size for this target. A stripe on any OST -for a multi-striped file cannot be larger than the minimum maxbytes -value. +of the maximum OST object size for this target. A file that is striped +uniformly across multiple OST objects (RAID-0) cannot be larger than the +number of stripes times the minimum 'ocd_maxbytes' value from any of its +constituent objects. The additional space in the 'obd_connect_data' structure is unused and reserved for future use. @@ -209,127 +245,119 @@ from this client. If the OBD_CONNECT_SRVLOCK flag is set then the client and server support lockless IO. The server will take locks for client IO requests with the OBD_BRW_SRVLOCK flag in the 'niobuf_remote' structure -flags. This is used for Direct IO.
The client takes no LDLM lock and +flags. This is used for Direct IO or when there is significant lock +contention on a single OST object. The client takes no LDLM lock and delegates locking to the server. If the OBD_CONNECT_ACL flag is set then the server supports the ACL -mount option for its filesystem. The client supports this mount option -as well, in that case. +mount option for its filesystem. If the server is mounted with ACL +support but the client does not pass OBD_CONNECT_ACL then the client +mount is refused. If the OBD_CONNECT_XATTR flag is set then the server supports user -extended attributes. This is defined by the mount options of the -servers of the backend file systems and is reflected on the client -side by the same mount option but for the Lustre file system itself. +extended attributes. This is requested by the client if mounted +with the appropriate mount option, but is enabled or disabled by the +mount options of the backend file system of MDT0000. If the OBD_CONNECT_TRUNCLOCK flag is set then the client and the server support lockless truncate. This is realized in an OST_PUNCH RPC -by setting the 'obdo' sturcture's 'o_flag' field to include the +by setting the 'obdo' structure's 'o_flag' field to include the OBD_FL_SRVLOCK. In that circumstance the client takes no lock, and the -server must take a lock on the resource. +server must take a lock on the resource while performing the truncate. If the OBD_CONNECT_ATTRFID flag is set then the server supports getattr requests by FID of file instead of name. This reduces -unnecessary RPCs for DNE. +unnecessary RPCs for DNE and for file attribute revalidation after +a lock cancellation. If the OBD_CONNECT_NODEVOH flag is set then the server provides no -open handle for special inodes. +open handle for block and character special inodes. If the OBD_CONNECT_RMT_CLIENT is set then the client is set as 'remote' with respect to the server. 
The client is considered as 'local' if the user/group database on the client is identical to that on the server, otherwise the client is set as 'remote'. This terminology is part of Lustre Kerberos feature which is not supported -now. +as of Lustre 2.7. If the OBD_CONNECT_RMT_CLIENT_FORCE is set then client is set as remote client forcefully. If the server security level doesn't support remote clients then this client connect reply will return an -EACCESS error. -If the OBD_CONNECT_MDS_CAPA is set then MDS supports capability. -Capabilities are part of Lustre Kerberos. The MDS prepares the -capability when a file is opened and sends it to a client. A client -has to present a capability when it wishes to perform an operation on -that object. - -If the OBD_CONNECT_OSS_CAPA is set then OSS supports capability. -Capabilities are part of Lustre Kerberos. When the clients request the -OSS to perform a modification operations on objects the capability -authorizes these operations. - If the OBD_CONNECT_CANCELSET is set then early batched cancels are enabled. The ELC (Early Lock Cancel) feature allows client locks to be cancelled prior the cancellation callback if it is clear that lock is not needed anymore, for example after rename, after removing file or directory, link, etc. This can reduce amount of RPCs significantly. -If the OBD_CONNECT_AT is set then client and server use Adaptive -Timeout while request processing. Servers keep track of the RPCs +If the OBD_CONNECT_AT is set then client and server use 'Adaptive +Timeouts' during request processing. Servers keep track of the RPC processing time and report this information back to clients to estimate the time needed for future requests and set appropriate RPC timeouts. If the OBD_CONNECT_LRU_RESIZE is set then the LRU self-adjusting is -enabled. This is set by the Lustre configurable option ---enable-lru-resize, and is enabled by default. +enabled. -If the OBD_CONNECT_FID is set then FID support is required by -server. 
This compatibility flag was introduced in Lustre 2.0. All -servers and clients are using FIDs nowadays. This flag is always set -on server and used to filter out clients without FID support. +If the OBD_CONNECT_FID is set then FID support is understood by the +client and server. This flag was introduced in Lustre 2.0 and +is required when connecting to any 2.0 or later server. It is +understood by Lustre 1.8 clients. If the OBD_CONNECT_VBR is set then version based recovery is used on -server. The VBR uses an object version to track its changes on the -server and to decide if the replay can be applied during recovery -based on that version. This helps to complete recovery even if some -clients were missed or evicted. That flag is always set on server -since Lustre 1.8 and is used just to notify the server if client -doesn't support VBR. - -If the OBD_CONNECT_LOV_V3 is set then the client supports LOV vs -EA. This type of the LOV extended attribute was introduced along with -OST pools support and changed the internal structure of that EA. The -OBD_CONNECT_LOV_V3 flag notifies a server if client doesn't support +the server. VBR stores the most recent transaction in which each +object was modified to track its changes on the server and to decide +if a replayed RPC can be applied during recovery or not. This helps +to complete recovery even if some clients were missed or evicted. +That flag is always set by clients since Lustre 1.8 and is used to +notify the server if the client supports VBR. + +If the OBD_CONNECT_LOV_V3 is set then the client supports LOV_MAGIC_V3 +(0x0BD30BD0) style layouts. This type of layout was introduced +along with OST pools support and added the 'lov_mds_md_v3' layout. The +OBD_CONNECT_LOV_V3 flag notifies a server if the client supports this type of LOV EA to handle requests from it properly. If the OBD_CONNECT_GRANT_SHRINK is set then the client can release grant space when idle.
If the OBD_CONNECT_SKIP_ORPHAN is set then OST doesn't reuse orphan -object ids after recovery. This connection flag is used between MDS +object IDs after recovery. This connection flag is used between MDS and OST to agree about an object pre-creation policy after MDS -recovery. If some of precreated objects weren't used but an MDT was -restarted then an OST may re-use not used objects for new pre-create -request or may not. The latter is preferred and is used by default -nowadays. +recovery. If set, then if some of precreated objects weren't used +when an MDT was restarted then an OST must destroy any unused objects +rather than re-use those objects. If the OBD_CONNECT_FULL20 is set then the client is Lustre 2.x client. Clients that are using old 1.8 format protocol conventions are not -allowed to connect. This flag should be set on all connections since -2.0, it is no longer affects behaviour and will be disabled completely -once Lustre interoperation with old clients is no longer needed. +allowed to connect to 2.x servers. This flag should be set on all +connections since Lustre 1.8.5. If the OBD_CONNECT_LAYOUTLOCK is set then the client supports layout -lock. The server will not grant a layout lock to the old clients -having no such flag. +lock, which allows the server to revoke the layout of a file from +a client if the layout has changed (e.g. migration between OSTs or +changes in replication state). The server will not grant a layout +lock to the old clients that do not support this feature. If the OBD_CONNECT_64BITHASH is set then the client supports 64-bit -directory hash. The server will also use 64-bit hash mode while -working with ldiskfs. +directory readdir cookie. The server will also use 64-bit hash mode +while working with ldiskfs. If the OBD_CONNECT_JOBSTATS is set then the client fills jobid in -'ptlrpc_body' so server can provide extended statistics per jobid. +'ptlrpc_body' so server can provide extended RPC statistics per jobid. 
If the OBD_CONNECT_UMASK is set then create uses client umask. This is -default flag for MDS but not for OST. +default flag for MDS but is not relevant for OSTs. -If the OBD_CONNECT_LVB_TYPE is set then the variable type of LVB is -supported by a client. This flag was introduced along with DNE to -recognize DNE-aware clients. +If the OBD_CONNECT_LVB_TYPE is set then the variable type of lock +value block (LVB) is supported by a client. This flag was introduced +to allow the MDS to return data with an IBITS lock, in addition to +the OST object attributes returned with an EXTENT lock. If the OBD_CONNECT_LIGHTWEIGHT is set then this connection is the 'lightweight' one. A lightweight connection has no entry in last_rcvd -file, so no recovery is possible, at the same time a lightweight +file, so no recovery is possible. A new lightweight connection can be set up while the target is in recovery, locks can still be acquired through this connection, although they won't be replayed. Such type of connection is used by services like quota @@ -350,15 +378,16 @@ stripe' disposition for open request from the client. This helps to optimize a recovery of open requests. If the OBD_CONNECT_OPEN_BY_FID is set then an open by FID won't pack -the name in a request. This is used by DNE. +the name in a request. This is used by HSM or other ChangeLog consumers +for accessing objects by their FID via .lustre/fid/ instead of by name. -If the OBD_CONNECT_MDS_MDS is set then the current connection is a -MDS-MDS one. Such connections are distinguished because they provide -more functionality specific to MDS-MDS interoperation. +If the OBD_CONNECT_MDS_MDS is set then the current connection is an +intra-MDS one. Such connections are distinguished because they provide +more functionality specific to MDS-MDS interoperation such as OUT RPCs. If the OBD_CONNECT_IMP_RECOV is set then the Imperative Recovery is supported. 
Imperative recovery means the clients are notified -explicitly when and where a failed target has restarted. +explicitly when and where a target has restarted after failure. The OBD_CONNECT_REQPORTAL was used to specify that client may use OST_REQUEST_PORTAL for requests to don't interfere with IO portal, @@ -366,21 +395,13 @@ e.g. for MDS-OST interaction. Now it is default request portal for OSC and this flag does nothing though it is still set on client side during connection process. -The OBD_CONNECT_CROW flag was used for create-on-write functionality -on OST, when data objects were created upon first write from the -client. This wasn't implemented because of complex recovery problems. - The OBD_CONNECT_SOM flag was used to signal that MDS is capable to store file size in file attributes, so client may get it directly from MDS avoiding glimpse request to OSTs. This feature was implemented as -demo feature and wasn't enabled by default. Finally it was disabled in +demo feature and wasn't enabled by default. Finally it was removed in Lustre 2.7 because it causes quite complex recovery cases to handle with relatevely small benefits. -The OBD_CONNECT_JOIN flag was used for the 'join files' feature, which -allowed files to be concatenated. Lustre no longer supports that -feature. - The OBD_CONNECT_QUOTA64 was used prior Lustre 2.4 for quota purposes, it is obsoleted due to new quota design. @@ -388,8 +409,8 @@ The OBD_CONNECT_REAL is not real connection flag but used locally on client to distinguish real connection from local connections between layers. -The OBD_CONNECT_CHANGE_QS was used prior Lustre 2.4 for quota needs -and it is obsoleted now due to new quota design. +The OBD_CONNECT_CHANGE_QS was used prior Lustre 2.4 for quota, but +it is obsoleted now due to new quota design. If the OBD_CONNECT_EINPROGRESS is set then client handles -EINPROGRESS RPC error properly. 
The quota design requires that client must resend @@ -403,9 +424,9 @@ policy and 2.x servers recognize them correctly. Meanwhile this flag is not checked anywhere, so does nothing. If the OBD_CONNECT_NANOSEC_TIME is set then nanosecond timestamps are -enabled. This flag is not used nowadays, but reserved for future use. +enabled. This flag is not used yet, but reserved for future use. If the OBD_CONNECT_SHORTIO is set then short IO feature is enabled on server. The server will avoid bulk IO for small amount of data but -data is incapsulated into ptlrpc request/reply. This flag is just -reserved for future use and does nothing nowadays. +data is encapsulated into ptlrpc request/reply. This flag is reserved +for future use and does nothing yet. diff --git a/obd_statfs.txt b/obd_statfs.txt index 2b25d8b..1559099 100644 --- a/obd_statfs.txt +++ b/obd_statfs.txt @@ -42,7 +42,7 @@ system: [options="header"] |==== | f_type | value -| EXT?_SUPER_MAGIC (ldiskfs) | 0xEF53 +| EXT4_SUPER_MAGIC (ldiskfs) | 0xEF53 | ZFS_SUPER_MAGIC | 0x2fc12fc1 |==== @@ -52,7 +52,9 @@ units of os_bsize. The 'os_bfree' field is the number of blocks not currently in use. The 'os_bavail' is the number of blocks available to be allocated to -new files. +new files. The number of available blocks is typically lower than +the number of free blocks due to <> and blocks reserved for +target internal usage such as metadata. The 'os_files' field is the total number of files on the target, both allocated and free. For some OSD types this is a static number, and @@ -66,15 +68,15 @@ space changes. The 'os_fsid' is intended to be the target backing device UUID in ASCII format. The current osd-ldiskfs and osd-zfs implementations -don't fill this in. +do not fill this field in. The 'os_bsize' field is the block size in bytes. This is for computing -the total, free, and available space in combination with os_blocks, -os_bfree, and os_bavail respectively. 
It does not necessarily +the total, free, and available space in combination with 'os_blocks', +'os_bfree', and 'os_bavail' respectively. It does not necessarily represent the minimum or optimal IO size. The 'os_namelen' field gives the maximum name length for files on the -back-end file system. +back-end file system in bytes. The 'os_maxbytes' field is the maximum size of a single object (i.e. the maximum byte offset that can be written to). This is the @@ -95,9 +97,10 @@ file system. It can be: In normal operation the 'os_state' value is returned as 0x0. If the back-end file system has a RAID configuration that is degraded or rebuilding the state is returned with the OS_STATE_DEGRADED (0x1) flag -set. If the file system has been set to read-only, for whatever -reason, then the state is returned with the OS_STATE_READONLY (0x2) -flag set, for example if it was explicitly mounted read-only, or +set. If the file system has been set to read-only, either manually at +mount or automatically due to detected corruption of the underlying +target filesystem, then 'os_state' is returned with OS_STATE_READONLY (0x2) +set, for example if it was explicitly mounted read-only, or corruption has been detected at runtime in the backing filesystem. The 'os_fprecreated' field counts the number of pre-created objects diff --git a/ost_connect.txt b/ost_connect.txt index 0d05db2..484e622 100644 --- a/ost_connect.txt +++ b/ost_connect.txt @@ -39,8 +39,8 @@ last transaction value would just be zero. If there were previous transactions for the client, then the transaction number for the last such committed transaction is put in the 'pb_last_committed' field. -In a connection request the operation is not file system modifying, so -the 'pb_transno' value will be zero in the reply as well. +In a connection request the operation does not modify the filesystem, +so the 'pb_transno' value will be zero in the reply as well. 
fixme: there is still some work to be done about how the fields are managed. diff --git a/ost_disconnect.txt b/ost_disconnect.txt index b7d1716..37592ab 100644 --- a/ost_disconnect.txt +++ b/ost_disconnect.txt @@ -9,6 +9,7 @@ RPC 9: OST DISCONNECT - Disconnect client from an OST | empty | empty |==== -The information exchanged in a DISCONNECT message is that normally -conveyed in the RPC descriptor. +The information exchanged in an OST_DISCONNECT request message is +only that normally conveyed in the RPC descriptor, as described in +<>. diff --git a/ost_lvb.txt b/ost_lvb.txt index 60d29b6..845fd9e 100644 --- a/ost_lvb.txt +++ b/ost_lvb.txt @@ -4,7 +4,7 @@ OST Lock Value Block The 'ost_lvb' structure is a "lock value block", and encompasses attribute data for resources on the OST. It is an optional part of an -LDLM_ENQUEUE reply RPC . +LDLM_ENQUEUE reply RPC for an MDT as well. ---- struct ost_lvb { diff --git a/ost_setattr_structs.txt b/ost_setattr_structs.txt index 441b54f..022500d 100644 --- a/ost_setattr_structs.txt +++ b/ost_setattr_structs.txt @@ -117,22 +117,3 @@ for the 'mdt_body' field 'mbo_valid' (see <>). The two fields here that are specific to the 'ost_body' case are the 'o_oi' and the 'o_stripe_idx', which give the identity of the OST and the stripe index assigned to it. - -OST ID -^^^^^^ -[[struct-ost-id]] - -The 'ost_id' identifies an object on a particular OST. - ----- -struct ost_id { - union { - struct { - __u64 oi_id; - __u64 oi_seq; - } oi; - struct lu_fid oi_fid; - }; -}; ----- - diff --git a/ost_statfs.txt b/ost_statfs.txt index 3df95d3..dda2a61 100644 --- a/ost_statfs.txt +++ b/ost_statfs.txt @@ -7,8 +7,7 @@ underlying file system for a given OST. It's form and use are nearly identical to the MDS_STATFS RPC. Refer to <> for details. 
The only differences in OST_STATFS are that it has a distinct 'pb_opc' value, it carries information about an OST (instead of an -MDT), and it can be generated in this one additional way: There is a -regularly scheduled process that 'pings' each OST from each MDT using -OST_STATFS, so that the MDT can remain current on the available space -on the OSTs. +MDT). The MDT may send regular OST_STATFS RPCs to each OST in order +to keep its information about free space and utilization updated. +That allows the MDS to make more optimal file allocation decisions. diff --git a/ptlrpc_body.txt b/ptlrpc_body.txt index 1f5999a..f8585f5 100644 --- a/ptlrpc_body.txt +++ b/ptlrpc_body.txt @@ -125,10 +125,11 @@ layers supporting the exchange of RPCs can be in good working order when 'pb_status' = -ENOTCONN is returned in an RPC reply message. The connection refered to by that status is the Lustre connection. That connection is part of the shared state between Lustre clients and -servers that gets established via MDS_CONNECT and OST_CONNECT RPCs, -and can be lost due to an 'eviction'. So, even when that Lusre -connection is lost (or has not been established, yet), RPC messages -can be exchanged. +servers that gets established via *_CONNECT RPCs, +and can be lost due to an 'eviction' in the face of temporary connection +failure or in case of an unresponsive client (from the server's point +of view). So, even when that Lustre connection is lost (or has not been +established, yet), RPC messages such as *_CONNECT can be exchanged. The 'pb_version' identifies the version of the Lustre protocol and is derived from the following constants. The lower two bytes give the @@ -274,7 +275,8 @@ set in any kind of reply message including pings and non-modifying transactions. If 'pb_last_committed' is larger than, or equal to, any of the client's uncommitted requests (see 'pb_transno' below) then the server is confirming those requests have been committed to stable -storage. 
At that point the client will free the request structures. +storage. At that point the client may free those request structures +as they will no longer be necessary for replay during recovery. The 'pb_transno' value is always zero in a new request. It is also zero for replies to operations that do not modify the file system. For @@ -363,8 +365,9 @@ allowed during server recovery. MGS_CONNECT_ASYNC is currently unused. MSG_CONNECT_NEXT_VER indicates that the client can understand the next -higher protocol version, and the server can reply to the connect with -that RPC version if it is supported, otherwise it will reply with the +higher protocol version in addition to the currently used protocol, and +the server can reply to the connect with that higher RPC version if +it is supported, otherwise it will reply with the same RPC version as the request. This allows RPC protocol versions to be negotiated during a transition period (e.g. upgrade from RPC from LUSTRE_MSG_MAGIC_V1 to LUSTRE_MSG_MAGIC_V2). diff --git a/security.txt b/security.txt index 2928162..5a8aefc 100644 --- a/security.txt +++ b/security.txt @@ -4,5 +4,4 @@ Security The discussion of security is deferred until later. It is a side issue to most of what is being documented here. It is seldom employed, and -may have older and buggy code. At least I think that's what Mike was -telling me. +may have older and buggy code. -- 1.8.3.1