From: Andrew C. Uselton
Date: Mon, 2 Mar 2015 18:00:28 +0000 (-0800)
Subject: LUDOC-270 protocol: Update the outline and add support files
X-Git-Url: https://git.whamcloud.com/?a=commitdiff_plain;h=refs%2Fchanges%2F38%2F13938%2F8;p=doc%2Fprotocol.git

LUDOC-270 protocol: Update the outline and add support files

This patch modifies the outline of the document:
- Organizing the presentation a little better
- Removing empty sections until they are ready to be included
- Breaking main topics out into separate files

The organization is bottom-up: it starts with a Data Types section,
builds on that by showing how messages are structured, then addresses
how Lustre operations use those messages, and finally shows how file
system activity uses Lustre operations to carry out its work.

Signed-off-by: Andrew C. Uselton
Change-Id: I46cb03351ad0a66540b1723b864ec6ca158000e3
Reviewed-on: http://review.whamcloud.com/13938
Tested-by: Jenkins
---

diff --git a/Makefile b/Makefile
index ed552c0..193d015 100644
--- a/Makefile
+++ b/Makefile
@@ -1,4 +1,18 @@
 FIGURES = figures/mkdir1.png
+TEXT = protocol.txt \
+	introduction.txt \
+	data_types.txt \
+	connection.txt \
+	timeouts.txt \
+	file_id.txt \
+	ldlm.txt \
+	llog.txt \
+	recovery.txt \
+	security.txt \
+	lustre_messages.txt \
+	lustre_operations.txt \
+	file_system_operations.txt \
+	glossary.txt
 
 .SUFFIXES : .gnuplot .gv .pdf .png
 
@@ -6,14 +20,14 @@ FIGURES = figures/mkdir1.png
 all: protocol.html protocol.pdf
 
 .PHONY: check
-check: protocol.txt
+check: $(TEXT)
 	@echo "Are there lines with trailing white space?"
 	build/whitespace.sh $<
 
-protocol.html: $(FIGURES) protocol.txt
+protocol.html: $(FIGURES) $(TEXT)
 	asciidoc protocol.txt
 
-protocol.pdf: $(FIGURES) protocol.txt
+protocol.pdf: $(FIGURES) $(TEXT)
 	a2x -f pdf --fop protocol.txt
 
 .gv.png:
diff --git a/connection.txt b/connection.txt
new file mode 100644
index 0000000..0a7d1c2
--- /dev/null
+++ b/connection.txt
@@ -0,0 +1,838 @@
+Connections Between Lustre Entities
+-----------------------------------
+[[connection]]
+
+The Lustre protocol is connection-based in that any two communicating
+entities maintain shared, coordinated state information. The most
+common example of two such entities is a client and a target on some
+server. The target is identified by name to the client through an
+interaction with the management server. The client then 'connects' to
+the given target on the indicated server by sending the appropriate
+version of the *_CONNECT message (MGS_CONNECT, MDS_CONNECT, or
+OST_CONNECT - collectively *_CONNECT) and receiving back the
+corresponding *_CONNECT reply. The server creates an 'export' for the
+connection between the target and the client, and the export holds the
+server state information for that connection. When the client gets the
+reply it creates an 'import', and the import holds the client state
+information for that connection. Note that if a server has N targets
+and M clients have connected to them, the server will have N x M
+exports and each client will have N imports.
+
+There are also connections between the servers: Each MDS and OSS has a
+connection to the MGS, where the MDS (respectively the OSS) plays the
+role of the client in the above discussion. That is, the MDS initiates
+the connection and has an import for the MGS, while the MGS has an
+export for each MDS. Each MDS connects to each OST, with an import on
+the MDS and an export on the OSS. This connection supports requests
+from the MDS to the OST for 'statfs' information such as size and
+access time values.
Each OSS also connects to the first MDS to get +access to auxiliary services, with an import on the OSS and an export +on the first MDS. The auxiliary services are: the File ID Location +Database (FLDB), the quota master service, and the sequence +controller. + +Finally, for some communications the roles of message initiation and +message reply are reversed. This is the case, for instance, with +call-back operations. In that case the entity which would normally +have an import has, instead, a 'reverse-export' and the +other end of the connection maintains a 'reverse-import'. The +reverse-import uses the same structure as a regular import, and the +reverse-export uses the same structure as a regular export. + +Connection Structures +~~~~~~~~~~~~~~~~~~~~~ + +Connect Data +^^^^^^^^^^^^ + +An 'obd_connect_data' structure accompanies every connect operation in +both the request message and in the reply message. + +---- +struct obd_connect_data { + __u64 ocd_connect_flags; + __u32 ocd_version; /* OBD_CONNECT_VERSION */ + __u32 ocd_grant; /* OBD_CONNECT_GRANT */ + __u32 ocd_index; /* OBD_CONNECT_INDEX */ + __u32 ocd_brw_size; /* OBD_CONNECT_BRW_SIZE */ + __u64 ocd_ibits_known; /* OBD_CONNECT_IBITS */ + __u8 ocd_blocksize; /* OBD_CONNECT_GRANT_PARAM */ + __u8 ocd_inodespace; /* OBD_CONNECT_GRANT_PARAM */ + __u16 ocd_grant_extent; /* OBD_CONNECT_GRANT_PARAM */ + __u32 ocd_unused; + __u64 ocd_transno; /* OBD_CONNECT_TRANSNO */ + __u32 ocd_group; /* OBD_CONNECT_MDS */ + __u32 ocd_cksum_types; /* OBD_CONNECT_CKSUM */ + __u32 ocd_max_easize; /* OBD_CONNECT_MAX_EASIZE */ + __u32 ocd_instance; + __u64 ocd_maxbytes; /* OBD_CONNECT_MAXBYTES */ + __u64 padding1; + __u64 padding2; + __u64 padding3; + __u64 padding4; + __u64 padding5; + __u64 padding6; + __u64 padding7; + __u64 padding8; + __u64 padding9; + __u64 paddingA; + __u64 paddingB; + __u64 paddingC; + __u64 paddingD; + __u64 paddingE; + __u64 paddingF; +}; +---- + +The 'ocd_connect_flags' field encodes the connect flags giving the +capabilities of a connection between client and target. Several of +those flags (noted in comments above and the discussion below) +actually control whether the remaining fields of 'obd_connect_data' +get used. The [[connect-flags]] flags are: + +---- +#define OBD_CONNECT_RDONLY 0x1ULL /*client has read-only access*/ +#define OBD_CONNECT_INDEX 0x2ULL /*connect specific LOV idx */ +#define OBD_CONNECT_MDS 0x4ULL /*connect from MDT to OST */ +#define OBD_CONNECT_GRANT 0x8ULL /*OSC gets grant at connect */ +#define OBD_CONNECT_SRVLOCK 0x10ULL /*server takes locks for cli */ +#define OBD_CONNECT_VERSION 0x20ULL /*Lustre versions in ocd */ +#define OBD_CONNECT_REQPORTAL 0x40ULL /*Separate non-IO req portal */ +#define OBD_CONNECT_ACL 0x80ULL /*access control lists */ +#define OBD_CONNECT_XATTR 0x100ULL /*client use extended attr */ +#define OBD_CONNECT_CROW 0x200ULL /*MDS+OST create obj on write*/ +#define OBD_CONNECT_TRUNCLOCK 0x400ULL /*locks on server for punch */ +#define OBD_CONNECT_TRANSNO 0x800ULL /*replay sends init transno */ +#define OBD_CONNECT_IBITS 0x1000ULL /*support for inodebits locks*/ +#define OBD_CONNECT_JOIN 0x2000ULL /*files can be concatenated. 
+ *We do not support JOIN FILE + *anymore, reserve this flags + *just for preventing such bit + *to be reused.*/ +#define OBD_CONNECT_ATTRFID 0x4000ULL /*Server can GetAttr By Fid*/ +#define OBD_CONNECT_NODEVOH 0x8000ULL /*No open hndl on specl nodes*/ +#define OBD_CONNECT_RMT_CLIENT 0x10000ULL /*Remote client */ +#define OBD_CONNECT_RMT_CLIENT_FORCE 0x20000ULL /*Remote client by force */ +#define OBD_CONNECT_BRW_SIZE 0x40000ULL /*Max bytes per rpc */ +#define OBD_CONNECT_QUOTA64 0x80000ULL /*Not used since 2.4 */ +#define OBD_CONNECT_MDS_CAPA 0x100000ULL /*MDS capability */ +#define OBD_CONNECT_OSS_CAPA 0x200000ULL /*OSS capability */ +#define OBD_CONNECT_CANCELSET 0x400000ULL /*Early batched cancels. */ +#define OBD_CONNECT_SOM 0x800000ULL /*Size on MDS */ +#define OBD_CONNECT_AT 0x1000000ULL /*client uses AT */ +#define OBD_CONNECT_LRU_RESIZE 0x2000000ULL /*LRU resize feature. */ +#define OBD_CONNECT_MDS_MDS 0x4000000ULL /*MDS-MDS connection */ +#define OBD_CONNECT_REAL 0x8000000ULL /*real connection */ +#define OBD_CONNECT_CHANGE_QS 0x10000000ULL /*Not used since 2.4 */ +#define OBD_CONNECT_CKSUM 0x20000000ULL /*support several cksum algos*/ +#define OBD_CONNECT_FID 0x40000000ULL /*FID is supported by server */ +#define OBD_CONNECT_VBR 0x80000000ULL /*version based recovery */ +#define OBD_CONNECT_LOV_V3 0x100000000ULL /*client supports LOV v3 EA */ +#define OBD_CONNECT_GRANT_SHRINK 0x200000000ULL /* support grant shrink */ +#define OBD_CONNECT_SKIP_ORPHAN 0x400000000ULL /* don't reuse orphan objids */ +#define OBD_CONNECT_MAX_EASIZE 0x800000000ULL /* preserved for large EA */ +#define OBD_CONNECT_FULL20 0x1000000000ULL /* it is 2.0 client */ +#define OBD_CONNECT_LAYOUTLOCK 0x2000000000ULL /* client uses layout lock */ +#define OBD_CONNECT_64BITHASH 0x4000000000ULL /* client supports 64-bits + * directory hash */ +#define OBD_CONNECT_MAXBYTES 0x8000000000ULL /* max stripe size */ +#define OBD_CONNECT_IMP_RECOV 0x10000000000ULL /* imp recovery support */ +#define OBD_CONNECT_JOBSTATS 0x20000000000ULL /* jobid in ptlrpc_body */ +#define OBD_CONNECT_UMASK 0x40000000000ULL /* create uses client umask */ +#define OBD_CONNECT_EINPROGRESS 0x80000000000ULL /* client handles -EINPROGRESS + * RPC error properly */ +#define OBD_CONNECT_GRANT_PARAM 0x100000000000ULL/* extra grant params used for + * finer space reservation */ +#define OBD_CONNECT_FLOCK_OWNER 0x200000000000ULL /* for the fixed 1.8 + * policy and 2.x server */ +#define OBD_CONNECT_LVB_TYPE 0x400000000000ULL /* variable type of LVB */ +#define OBD_CONNECT_NANOSEC_TIME 0x800000000000ULL /* nanosecond timestamps */ +#define OBD_CONNECT_LIGHTWEIGHT 0x1000000000000ULL/* lightweight connection */ +#define OBD_CONNECT_SHORTIO 0x2000000000000ULL/* short io */ +#define OBD_CONNECT_PINGLESS 0x4000000000000ULL/* pings not required */ +#define OBD_CONNECT_FLOCK_DEAD 0x8000000000000ULL/* deadlock detection */ +#define OBD_CONNECT_DISP_STRIPE 0x10000000000000ULL/* create stripe disposition*/ +#define OBD_CONNECT_OPEN_BY_FID 0x20000000000000ULL /* open by fid won't pack + name in request */ +---- + +Each flag corresponds to a particular capability that the client and +target together will honor. A client will send a message including +some subset of these capabilities during a connection request to a +specific target. It tells the server what capabilities it has. The +server then replies with the subset of those capabilities it agrees to +honor (for the given target). 
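+
+As a rough sketch (the helper below is hypothetical, not taken from
+the Lustre source), the server's side of this exchange amounts to a
+bitwise intersection of the proposed and supported capability masks:
+
+----
+/* Hypothetical sketch: the reply advertises only those capabilities
+ * that the client proposed and the server also supports. */
+static __u64 connect_flags_reply(__u64 client_proposed,
+                                 __u64 server_supported)
+{
+        return client_proposed & server_supported;
+}
+----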
+
+If the OBD_CONNECT_VERSION flag is set then the 'ocd_version' field is
+honored. The 'ocd_version' gives an encoding of the Lustre
+version. For example, version 2.7.32 would be encoded as the
+hexadecimal number 0x02073200.
+
+If the OBD_CONNECT_GRANT flag is set then the 'ocd_grant' field is
+honored. The 'ocd_grant' value in a reply (to a connection request)
+sets the client's grant.
+
+If the OBD_CONNECT_INDEX flag is set then the 'ocd_index' field is
+honored. The 'ocd_index' value is set in a reply to a connection
+request. It holds the LOV index of the target.
+
+If the OBD_CONNECT_BRW_SIZE flag is set then the 'ocd_brw_size' field
+is honored. The 'ocd_brw_size' value sets the maximum supported RPC
+size. The client proposes a value in its connection request, and the
+server's reply will either agree or further limit the size.
+
+If the OBD_CONNECT_IBITS flag is set then the 'ocd_ibits_known' field
+is honored. The 'ocd_ibits_known' value determines the handling of
+locks on inodes. See the discussion of inodes and extended attributes.
+
+If the OBD_CONNECT_GRANT_PARAM flag is set then the 'ocd_blocksize',
+'ocd_inodespace', and 'ocd_grant_extent' fields are honored. A server
+reply uses the 'ocd_blocksize' value to inform the client of the log
+base two of the size in bytes of the backend file system's blocks.
+
+A server reply uses the 'ocd_inodespace' value to inform the client of
+the log base two of the size of an inode.
+
+Under some circumstances (for example when ZFS is the back end file
+system) there may be additional overhead in handling writes for each
+extent. The server uses the 'ocd_grant_extent' value to inform the
+client of the size in bytes consumed from its grant on the server when
+creating a new file. The client uses this value in calculating how
+much dirty write cache it has and whether it has reached the limit
+established by the target's grant.
+
+If the OBD_CONNECT_TRANSNO flag is set then the 'ocd_transno' field is
+honored. A server uses the 'ocd_transno' value during recovery to
+inform the client of the transaction number at which it should begin
+replay.
+
+If the OBD_CONNECT_MDS flag is set then the 'ocd_group' field is
+honored. When an MDT connects to an OST the 'ocd_group' field informs
+the OSS of the MDT's index. Objects on that OST for that MDT will be
+in a common namespace served by that MDT.
+
+If the OBD_CONNECT_CKSUM flag is set then the 'ocd_cksum_types' field
+is honored. The client uses the 'ocd_cksum_types' field to propose to
+the server the client's available (presumably hardware assisted)
+checksum mechanisms. The server replies with the checksum types it has
+available. Finally, the client will employ the fastest of the agreed
+mechanisms.
+
+If the OBD_CONNECT_MAX_EASIZE flag is set then the 'ocd_max_easize'
+field is honored. The server uses 'ocd_max_easize' to inform the
+client about the amount of space that can be allocated in each inode
+for extended attributes. The 'ocd_max_easize' specifically refers to
+the space used for striping information. This allows the client to
+determine the maximum layout size (and hence stripe count) that can be
+stored on the MDT.
+
+The 'ocd_instance' field (alone) is not governed by an OBD_CONNECT_*
+flag. The MGS uses the 'ocd_instance' value in its reply to a
+connection request to inform the server and target of the "era" of its
+connection. The MGS initializes the era value for each server to zero
+and increments that value every time the target connects. This
+supports imperative recovery.
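+
+A hedged sketch of the checksum negotiation described above; the type
+values, the preference order, and both helpers are illustrative
+assumptions, not Lustre's actual constants or code:
+
+----
+/* Illustrative checksum negotiation; all names are assumptions. */
+#define CK_CRC32  0x1
+#define CK_ADLER  0x2
+#define CK_CRC32C 0x4
+
+/* Server side: honor only the types both sides support. */
+static __u32 cksum_agree(__u32 client_types, __u32 server_types)
+{
+        return client_types & server_types;
+}
+
+/* Client side: employ the fastest of the agreed types, assuming
+ * hardware-assisted CRC32C is fastest on this client. */
+static __u32 cksum_pick(__u32 agreed)
+{
+        if (agreed & CK_CRC32C)
+                return CK_CRC32C;
+        if (agreed & CK_ADLER)
+                return CK_ADLER;
+        return CK_CRC32;
+}
+----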
+
+If the OBD_CONNECT_MAXBYTES flag is set then the 'ocd_maxbytes' field
+is honored. An OSS uses the 'ocd_maxbytes' value to inform the client
+of the maximum OST object size for this target. A stripe on any OST
+for a multi-striped file cannot be larger than the minimum maxbytes
+value.
+
+The additional space in the 'obd_connect_data' structure is unused and
+reserved for future use.
+
+fixme: Discuss the meaning of the rest of the OBD_CONNECT_* flags.
+
+Import
+^^^^^^
+
+----
+#define IMP_STATE_HIST_LEN 16
+struct import_state_hist {
+        enum lustre_imp_state ish_state;
+        time_t                ish_time;
+};
+struct obd_import {
+        struct portals_handle     imp_handle;
+        atomic_t                  imp_refcount;
+        struct lustre_handle      imp_dlm_handle;
+        struct ptlrpc_connection *imp_connection;
+        struct ptlrpc_client     *imp_client;
+        cfs_list_t                imp_pinger_chain;
+        cfs_list_t                imp_zombie_chain;
+        cfs_list_t                imp_replay_list;
+        cfs_list_t                imp_sending_list;
+        cfs_list_t                imp_delayed_list;
+        cfs_list_t                imp_committed_list;
+        cfs_list_t               *imp_replay_cursor;
+        struct obd_device        *imp_obd;
+        struct ptlrpc_sec        *imp_sec;
+        struct mutex              imp_sec_mutex;
+        cfs_time_t                imp_sec_expire;
+        wait_queue_head_t         imp_recovery_waitq;
+        atomic_t                  imp_inflight;
+        atomic_t                  imp_unregistering;
+        atomic_t                  imp_replay_inflight;
+        atomic_t                  imp_inval_count;
+        atomic_t                  imp_timeouts;
+        enum lustre_imp_state     imp_state;
+        struct import_state_hist  imp_state_hist[IMP_STATE_HIST_LEN];
+        int                       imp_state_hist_idx;
+        int                       imp_generation;
+        __u32                     imp_conn_cnt;
+        int                       imp_last_generation_checked;
+        __u64                     imp_last_replay_transno;
+        __u64                     imp_peer_committed_transno;
+        __u64                     imp_last_transno_checked;
+        struct lustre_handle      imp_remote_handle;
+        cfs_time_t                imp_next_ping;
+        __u64                     imp_last_success_conn;
+        cfs_list_t                imp_conn_list;
+        struct obd_import_conn   *imp_conn_current;
+        spinlock_t                imp_lock;
+        /* flags */
+        unsigned long
+                imp_no_timeout:1,
+                imp_invalid:1,
+                imp_deactive:1,
+                imp_replayable:1,
+                imp_dlm_fake:1,
+                imp_server_timeout:1,
+                imp_delayed_recovery:1,
+                imp_no_lock_replay:1,
+                imp_vbr_failed:1,
+                imp_force_verify:1,
+                imp_force_next_verify:1,
+                imp_pingable:1,
+                imp_resend_replay:1,
+                imp_no_pinger_recover:1,
+                imp_need_mne_swab:1,
+                imp_force_reconnect:1,
+                imp_connect_tried:1;
+        __u32                     imp_connect_op;
+        struct obd_connect_data   imp_connect_data;
+        __u64                     imp_connect_flags_orig;
+        int                       imp_connect_error;
+        __u32                     imp_msg_magic;
+        __u32                     imp_msghdr_flags; /* adjusted based on server capability */
+        struct ptlrpc_request_pool *imp_rq_pool; /* emergency request pool */
+        struct imp_at             imp_at; /* adaptive timeout data */
+        time_t                    imp_last_reply_time; /* for health check */
+};
+----
+
+The 'imp_handle' value is the unique id for the import, and is used as
+a hash key to gain access to it. It is not used in any of the Lustre
+protocol messages, but rather is just for internal reference.
+
+The 'imp_refcount' is also for internal use. The value is incremented
+with each RPC created, and decremented as the request is freed. When
+the reference count is zero the import can be freed, as when the
+target is being disconnected.
+
+The 'imp_dlm_handle' is a reference to the LDLM export for this
+client.
+
+There can be multiple paths through the network to a given target, in
+which case there would be multiple 'obd_import_conn' items on the
+'imp_conn_list'. Each 'obd_import_conn' includes a
+'ptlrpc_connection', so 'imp_connection' points to the one that is
+actually in use.
+
+The 'imp_client' identifies the (local) portals for sending and
+receiving messages as well as the client's name. The information is
+specific to either an MDC or an OSC.
+
+The 'imp_pinger_chain' places the import on a linked list of imports
+that need periodic pings.
+
+The 'imp_zombie_chain' places the import on a list ready for being
+freed. Unused imports ('imp_refcount' is zero) are deleted
+asynchronously by a garbage collecting process.
+
+In order to support recovery the client must keep the requests that
+are in the process of being handled by the target. The target replies
+to a request as soon as the target has made its local update to
+memory. When the client receives that reply the request is put on the
+'imp_replay_list'. In the event of a failure (target crash, lost
+message) this list is then replayed for the target during the recovery
+process. When a request has been sent but has not yet received a reply
+it is placed on the 'imp_sending_list'. In the event of a failure
+those will simply be replayed after any recovery has been
+completed. Finally, there may be requests that the client is delaying
+before it sends them. This can happen if the client is in a degraded
+mode, as when it is in recovery after a failure. These requests are
+put on the 'imp_delayed_list' and not processed until recovery is
+complete and the 'imp_sending_list' has been replayed.
+
+In order to support recovery 'open' requests must be preserved even
+after they have completed. Those requests are placed on the
+'imp_committed_list', and the 'imp_replay_cursor' allows for
+accelerated access to those items.
+
+The 'imp_obd' is a reference to the details about the target device
+that is the subject of this import. There is a lot of state info in
+there along with many implementation details that are not relevant to
+the actual Lustre protocol. fixme: I'll want to go through all of the
+fields in that structure to see which, if any, need more
+documentation.
+
+The security policy and settings are kept in 'imp_sec', and
+'imp_sec_mutex' helps manage access to that info. The 'imp_sec_expire'
+setting is in support of security policies that have an expiration
+strategy.
+
+Some processes may need the import to be in a fully connected state in
+order to proceed. The 'imp_recovery_waitq' is where those threads will
+wait during recovery.
+
+The 'imp_inflight' field counts the number of in-flight requests. It
+is incremented with each request sent and decremented with each reply
+received.
+
+The client reserves buffers for the processing of requests and
+replies, and then informs LNet about those buffers. Buffers may get
+reused during subsequent processing, but then a point may come when
+the buffer is no longer going to be used. The client increments the
+'imp_unregistering' counter and informs LNet the buffer is no longer
+needed. When LNet has freed the buffer it will notify the client and
+then the 'imp_unregistering' can be decremented again.
+
+During recovery the 'imp_replay_inflight' field counts the number of
+requests from the replay list that have been sent and have not yet
+been replied to.
+
+The 'imp_inval_count' field counts how many threads are in the process
+of cleaning up this connection or waiting for cleanup to complete. The
+cleanup itself may be needed in the case there is an eviction or other
+problem (fixme: what other problem?). The cleanup may involve freeing
+allocated resources, updating internal state, running replay lists,
+and invalidating cache. Since this could take a while, there may end
+up being multiple threads waiting on the process to complete.
+
+The 'imp_timeouts' field is a counter that is incremented every time
+there is a timeout in communication with the target.
+
+The 'imp_state' tracks the state of the import. It draws from the
+enumerated set of values:
+
+.enum_lustre_imp_state
+[options="header"]
+|=====
+| state name | value
+| LUSTRE_IMP_CLOSED | 1
+| LUSTRE_IMP_NEW | 2
+| LUSTRE_IMP_DISCON | 3
+| LUSTRE_IMP_CONNECTING | 4
+| LUSTRE_IMP_REPLAY | 5
+| LUSTRE_IMP_REPLAY_LOCKS | 6
+| LUSTRE_IMP_REPLAY_WAIT | 7
+| LUSTRE_IMP_RECOVER | 8
+| LUSTRE_IMP_FULL | 9
+| LUSTRE_IMP_EVICTED | 10
+|=====
+fixme: what are the transitions between these states? The
+'imp_state_hist' array maintains a list of the last 16
+(IMP_STATE_HIST_LEN) states the import was in, along with the time it
+entered each (fixme: or is it when it left that state?). The list is
+maintained in a circular manner, so the 'imp_state_hist_idx' points to
+the entry in the list for the most recently visited state.
+
+The 'imp_generation' and 'imp_conn_cnt' fields are monotonically
+increasing counters. Every time a connection request is sent to the
+target the 'imp_conn_cnt' counter is incremented, and every time a
+reply is received for the connection request the 'imp_generation'
+counter is incremented.
+
+The 'imp_last_generation_checked' field implements an optimization.
+When a replay process has successfully traversed the replay list the
+'imp_generation' value is noted here. If the generation has not
+incremented then the replay list does not need to be traversed again.
+
+During replay the 'imp_last_replay_transno' field is set to the
+transaction number of the last request being replayed, and
+'imp_peer_committed_transno' is set to the 'pb_last_committed' value
+(of the 'ptlrpc_body') from replies if that value is higher than the
+previous 'imp_peer_committed_transno'. The 'imp_last_transno_checked'
+field implements an optimization. It is set to the
+'imp_last_replay_transno' as its replay is initiated. If
+'imp_last_transno_checked' is still 'imp_last_replay_transno' and
+'imp_generation' is still 'imp_last_generation_checked' then there
+are no additional requests ready to be removed from the replay
+list. Furthermore, 'imp_last_transno_checked' may no longer be needed,
+since the committed transactions are now maintained on a separate list.
+
+The 'imp_remote_handle' is the handle sent by the target in a
+connection reply message to uniquely identify the export for this
+target and client that is maintained on the server. This is the handle
+used in all subsequent messages to the target.
+
+There are two separate ping intervals (fixme: what are the
+values?). If there are no uncommitted messages for the target then the
+default ping interval is used to set the 'imp_next_ping' to the time
+the next ping needs to be sent. If there are uncommitted requests then
+a "short interval" is used to set the time for the next ping.
+
+The 'imp_last_success_conn' value is set to the time of the last
+successful connection. fixme: The source says it is in 64 bit
+jiffies, but does not further indicate how that value is calculated.
+
+Since there can actually be multiple connection paths for a target
+(due to failover or multihomed configurations) the import maintains a
+list of all the possible connection paths in the list pointed to by
+the 'imp_conn_list' field. The 'imp_conn_current' field points to the
+one currently in use. Compare with the 'imp_connection' field. They
+point to different structures, but each is reachable from the other.
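+
+A minimal, user-space flavored sketch of how the circular state
+history described above can be updated; the helper is illustrative
+only, and the actual implementation's time handling differs:
+
+----
+#include <time.h>
+
+/* Illustrative only: record a state transition in the circular
+ * 'imp_state_hist' array so that 'imp_state_hist_idx' always points
+ * at the most recently entered state. */
+static void import_set_state(struct obd_import *imp,
+                             enum lustre_imp_state state)
+{
+        int idx = (imp->imp_state_hist_idx + 1) % IMP_STATE_HIST_LEN;
+
+        imp->imp_state = state;
+        imp->imp_state_hist[idx].ish_state = state;
+        imp->imp_state_hist[idx].ish_time = time(NULL); /* entry time */
+        imp->imp_state_hist_idx = idx;
+}
+----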
+
+Most of the flag, state, and list information in the import needs to
+be accessed atomically. The 'imp_lock' is used to maintain the
+consistency of the import while it is manipulated by multiple threads.
+
+The various flags are documented in the source code and are largely
+obvious from those short comments, reproduced here:
+
+.import flags
+[options="header"]
+|=====
+| flag | explanation
+| imp_no_timeout | timeouts are disabled
+| imp_invalid | client has been evicted
+| imp_deactive | client administratively disabled
+| imp_replayable | try to recover the import
+| imp_dlm_fake | don't run recovery (timeout instead)
+| imp_server_timeout | use 1/2 timeout on MDSs and OSCs
+| imp_delayed_recovery | VBR: imp in delayed recovery
+| imp_no_lock_replay | VBR: if gap was found then no lock replays
+| imp_vbr_failed | recovery by versions was failed
+| imp_force_verify | force an immediate ping
+| imp_force_next_verify | force a scheduled ping
+| imp_pingable | target is pingable
+| imp_resend_replay | resend for replay
+| imp_no_pinger_recover | disable normal recovery, for test only.
+| imp_need_mne_swab | need IR MNE swab
+| imp_force_reconnect | import must be reconnected, not new connection
+| imp_connect_tried | import has tried to connect with server
+|=====
+A few additional notes are in order. The 'imp_dlm_fake' flag signifies
+that this is not a "real" import, but rather it is a "reverse" import
+in support of the LDLM. When the LDLM invokes callback operations the
+messages are initiated at the other end, so there needs to be a fake
+import to receive the replies from the operation. Prior to the
+introduction of adaptive timeouts the servers were given fixed timeout
+values that were half those used for the clients. The
+'imp_server_timeout' flag indicated that the import should use the
+half-sized timeouts, but with the introduction of adaptive timeouts
+this facility is no longer used. "VBR" is "version based recovery",
+and it introduces a new possibility for handling requests. Previously,
+if there was a gap in the transaction number sequence then the
+requests associated with the missing transaction numbers would be
+discarded. With VBR those transactions only need to be discarded if
+there is an actual dependency between the ones that were skipped and
+the most recently committed transaction number. fixme: What are the
+circumstances that would lead to setting the 'imp_force_next_verify'
+or 'imp_pingable' flags? During recovery, the client sets the
+'imp_no_pinger_recover' flag, which tells the process to proceed from
+the current value of 'imp_last_replay_transno'. The
+'imp_need_mne_swab' flag indicates a version dependent circumstance
+where swabbing was inadvertently left out of one processing step.
+
+
+Export
+^^^^^^
+
+An 'obd_export' structure for a given target is created on a server
+for each client that connects to that target. The exports for all the
+clients for a given target are managed together. The export represents
+the connection state between the client and target as well as the
+current state of any ongoing activity. Thus each pending request will
+have a reference to the export. The export is discarded if the
+connection goes away, but only after all the references to it have
+been cleaned up. The state information for each export is also
+maintained on disk. In the event of a server failure, that or another
+server can read the export data from disk to enable recovery.
+
+----
+struct obd_export {
+        struct portals_handle     exp_handle;
+        atomic_t                  exp_refcount;
+        atomic_t                  exp_rpc_count;
+        atomic_t                  exp_cb_count;
+        atomic_t                  exp_replay_count;
+        atomic_t                  exp_locks_count;
+#if LUSTRE_TRACKS_LOCK_EXP_REFS
+        cfs_list_t                exp_locks_list;
+        spinlock_t                exp_locks_list_guard;
+#endif
+        struct obd_uuid           exp_client_uuid;
+        cfs_list_t                exp_obd_chain;
+        cfs_hlist_node_t          exp_uuid_hash;
+        cfs_hlist_node_t          exp_nid_hash;
+        cfs_list_t                exp_obd_chain_timed;
+        struct obd_device        *exp_obd;
+        struct obd_import        *exp_imp_reverse;
+        struct nid_stat          *exp_nid_stats;
+        struct ptlrpc_connection *exp_connection;
+        __u32                     exp_conn_cnt;
+        cfs_hash_t               *exp_lock_hash;
+        cfs_hash_t               *exp_flock_hash;
+        cfs_list_t                exp_outstanding_replies;
+        cfs_list_t                exp_uncommitted_replies;
+        spinlock_t                exp_uncommitted_replies_lock;
+        __u64                     exp_last_committed;
+        cfs_time_t                exp_last_request_time;
+        cfs_list_t                exp_req_replay_queue;
+        spinlock_t                exp_lock;
+        struct obd_connect_data   exp_connect_data;
+        enum obd_option           exp_flags;
+        unsigned long
+                exp_failed:1,
+                exp_in_recovery:1,
+                exp_disconnected:1,
+                exp_connecting:1,
+                exp_delayed:1,
+                exp_vbr_failed:1,
+                exp_req_replay_needed:1,
+                exp_lock_replay_needed:1,
+                exp_need_sync:1,
+                exp_flvr_changed:1,
+                exp_flvr_adapt:1,
+                exp_libclient:1,
+                exp_need_mne_swab:1;
+        enum lustre_sec_part      exp_sp_peer;
+        struct sptlrpc_flavor     exp_flvr;
+        struct sptlrpc_flavor     exp_flvr_old[2];
+        cfs_time_t                exp_flvr_expire[2];
+        spinlock_t                exp_rpc_lock;
+        cfs_list_t                exp_hp_rpcs;
+        cfs_list_t                exp_reg_rpcs;
+        cfs_list_t                exp_bl_list;
+        spinlock_t                exp_bl_list_lock;
+        union {
+                struct tg_export_data     eu_target_data;
+                struct mdt_export_data    eu_mdt_data;
+                struct filter_export_data eu_filter_data;
+                struct ec_export_data     eu_ec_data;
+                struct mgs_export_data    eu_mgs_data;
+        } u;
+        struct nodemap           *exp_nodemap;
+};
+----
+
+The 'exp_handle' holds a little extra information as compared with a
+'struct lustre_handle', which is just the cookie. The cookie that the
+server generates to uniquely identify this connection gets put into
+this structure along with other information about the device in
+question. This is the cookie the *_CONNECT reply sends back to the
+client, and it is then stored in the client's import.
+
+The 'exp_refcount' gets incremented whenever some aspect of the export
+is "in use". The arrival of an otherwise unprocessed message for this
+target will increment the refcount. A reference by an LDLM lock that
+gets taken will increment the refcount. Callback invocations and
+replay also lead to incrementing the refcount. The next four fields -
+'exp_rpc_count', 'exp_cb_count', 'exp_replay_count', and
+'exp_locks_count' - all subcategorize the 'exp_refcount' for debug
+purposes. Similarly, the 'exp_locks_list' and 'exp_locks_list_guard'
+are further debug info that lists the actual locks accounted in
+'exp_locks_count'.
+
+The 'exp_client_uuid' gives the UUID of the client connected to this
+export. fixme: when and how does the UUID get generated?
+
+The server maintains all the exports for a given target on a circular
+list. Each export's place on that list is maintained in the
+'exp_obd_chain'. A common activity is to look up the export based on
+the UUID or the NID of the client, and the 'exp_uuid_hash' and
+'exp_nid_hash' fields maintain this export's place in hashes
+constructed for that purpose.
+
+Exports are also maintained on a list sorted by the last time the
+corresponding client was heard from. The 'exp_obd_chain_timed' field
+maintains the export's place on that list. When a message arrives from
+the client the time is "now", so the export gets put at the end of the
+list. Since the list is circular, the next export is then the
+oldest. If it has not been heard from within its timeout interval that
+export is marked for later eviction.
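+
+A hedged sketch of that aging scheme follows. The list primitives are
+the standard Linux kernel ones, the 'obd_exports_timed' list head on
+the 'obd_device' is an assumption, and 'EXPORT_TIMEOUT' and
+'mark_for_eviction()' are hypothetical:
+
+----
+/* Illustrative only: on each request, move the export to the tail of
+ * the timed list; the head is then the least recently heard-from
+ * client, which is the only one needing a staleness check. */
+static void export_touch(struct obd_device *obd, struct obd_export *exp,
+                         cfs_time_t now)
+{
+        exp->exp_last_request_time = now;
+        list_move_tail(&exp->exp_obd_chain_timed, &obd->obd_exports_timed);
+
+        exp = list_first_entry(&obd->obd_exports_timed,
+                               struct obd_export, exp_obd_chain_timed);
+        if (now - exp->exp_last_request_time > EXPORT_TIMEOUT)
+                mark_for_eviction(exp); /* hypothetical helper */
+}
+----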
+
+The 'exp_obd' field points to the 'obd_device' structure for the
+device that is the target of this export.
+
+In the event of a call-back the export needs to have the ability to
+initiate messages back to the client. The 'exp_imp_reverse' field
+provides a "reverse" import that manages this capability.
+
+The '/proc' stats for the export (and the target) get updated via the
+'exp_nid_stats'.
+
+The 'exp_connection' field points to the connection information for
+this export. This is the information about the actual networking
+pathway(s) that get used for communication.
+
+The 'exp_conn_cnt' field notes the connection count value from the
+client at the time of the connection. In the event that more than one
+connection request is issued before the connection is established, the
+'exp_conn_cnt' will list the highest value. If a previous connection
+attempt (with a lower value) arrives later it may be safely
+discarded. Every request lists its connection count, so non-connection
+requests with lower connection count values can also be discarded.
+Note that this does not count how many times the client has connected
+to the target. If a client is evicted the export is deleted once it
+has been cleaned up and its 'exp_refcount' reduced to zero. A new
+connection from the client will get a new export.
+
+The 'exp_lock_hash' provides access to the locks granted to the
+corresponding client for this target. If a lock cannot be granted it
+is discarded. A file system lock ("flock") is also implemented through
+the LDLM lock system, but not all LDLM locks are flocks. The ones that
+are flocks are gathered in a hash 'exp_flock_hash'. This supports
+deadlock detection.
+
+For those requests that initiate file system modifying transactions
+the request and its attendant locks need to be preserved until either
+a) the client acknowledges receiving the reply, or b) the transaction
+has been committed locally. This ensures a request can be replayed in
+the event of a failure. The reply is kept on the
+'exp_outstanding_replies' list until the LNet layer notifies the
+server that the reply has been acknowledged. A reply is kept on the
+'exp_uncommitted_replies' list until the transaction (if any) has been
+committed.
+
+The 'exp_last_committed' value keeps the transaction number of the
+last committed transaction. Every reply to a client includes this
+value as a means of early-as-possible notification of transactions
+that have been committed.
+
+The 'exp_last_request_time' field is self-explanatory.
+
+During replay a request that is waiting to be replayed is maintained
+on the list 'exp_req_replay_queue'.
+
+The 'exp_lock' spin-lock is used for access control to the export's
+flags, as well as the 'exp_outstanding_replies' list and the reverse
+import, if any.
+
+The 'exp_connect_data' field refers to an 'obd_connect_data' structure
+for the connection established between this target and the client this
+export refers to. See also the corresponding entry in the import and
+in the connect messages passed between the hosts.
+
+The 'exp_flags' field encodes three directives as follows:
+----
+enum obd_option {
+        OBD_OPT_FORCE =         0x0001,
+        OBD_OPT_FAILOVER =      0x0002,
+        OBD_OPT_ABORT_RECOV =   0x0004,
+};
+----
+fixme: Are these set for some exports as a condition of their
+existence, or do they reflect a transient state the export is passing
+through?
+
+The 'exp_failed' flag gets set whenever the target has failed for any
+reason or the export is otherwise due to be cleaned up. Once set it
+will not be unset in this export. Any subsequent connection between
+the client and the target would be governed by a new export.
+
+After a failure export data is retrieved from disk and the exports
+recreated. Exports created in this way will have their
+'exp_in_recovery' flag set. Once any outstanding requests and locks
+have been recovered for the client, then the export is recovered and
+'exp_in_recovery' can be cleared. When all the client exports for a
+given target have been recovered then the target is considered
+recovered, and when all targets have been recovered the server is
+considered recovered.
+
+A *_DISCONNECT message from the client will set the 'exp_disconnected'
+flag, as will any sort of failure of the target. Once set the export
+will be cleaned up and deleted.
+
+When a *_CONNECT message arrives the 'exp_connecting' flag is set. If
+for some reason a second *_CONNECT request arrives from the client it
+can be discarded when this flag is set.
+
+The 'exp_delayed' flag is no longer used. In older code it indicated
+that recovery had not completed in a timely fashion, but that a tardy
+recovery would still be possible, since there were no dependencies on
+the export.
+
+The 'exp_vbr_failed' flag indicates a failure during the recovery
+process. See <<recovery>> for a more detailed discussion of recovery
+and transaction replay. For a file system modifying request, the
+server composes its reply including the 'pb_pre_versions' entries in
+'ptlrpc_body', which indicate the most recent updates to the
+object. The client updates the request with the 'pb_transno' and
+'pb_pre_versions' from the reply, and keeps that request until the
+target signals that the transaction has been committed to disk. If the
+client times out without that confirmation then it will 'replay' the
+request, which now includes the 'pb_pre_versions' information. During
+a replay the target checks that the object has not been further
+modified beyond those 'pb_pre_versions'. If this check fails then the
+request is out of date, and the recovery process fails for the
+connection between this client and this target. At that point the
+'exp_vbr_failed' flag is set to indicate version based recovery
+failed. This will lead to the client being evicted and this export
+being cleaned up and deleted.
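+
+A minimal sketch of the version check just described; the helper is
+illustrative, and the real replay path involves considerably more
+state:
+
+----
+/* Illustrative only: compare the 'pb_pre_versions' carried by a
+ * replayed request against the objects' current versions on the
+ * target.  A nonzero stored version that no longer matches means the
+ * object changed after the original execution, so version based
+ * recovery must fail for this client. */
+#define PTLRPC_NUM_VERSIONS 4
+
+static int vbr_check(const __u64 *pb_pre_versions,
+                     const __u64 *current_versions)
+{
+        int i;
+
+        for (i = 0; i < PTLRPC_NUM_VERSIONS; i++)
+                if (pb_pre_versions[i] != 0 &&
+                    pb_pre_versions[i] != current_versions[i])
+                        return -1; /* replay is out of date */
+        return 0;
+}
+----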
+
+At the start of recovery both the 'exp_req_replay_needed' and
+'exp_lock_replay_needed' flags are set. As request replay is completed
+the 'exp_req_replay_needed' flag is cleared. As lock replay is
+completed the 'exp_lock_replay_needed' flag is cleared. Once both are
+cleared the 'exp_in_recovery' flag can be cleared.
+
+The 'exp_need_sync' flag supports an optimization. At mount time it is
+likely that every client (potentially thousands) will create an export
+and that export will need to be saved to disk synchronously. This can
+lead to an unusually high and poorly performing interaction with the
+disk. When the export is created the 'exp_need_sync' flag is set and
+the actual writing to disk is delayed. As transactions arrive from
+clients (in a much less coordinated fashion) the 'exp_need_sync' flag
+indicates a need to save the export as well as the transaction. At
+that point the flag is cleared (except see below).
+
+In DNE (phase I) the export for an MDT managing the connection from
+another MDT will want to always keep the 'exp_need_sync' flag set. For
+that special case such an export sets the 'exp_keep_sync' flag, which
+then prevents the 'exp_need_sync' flag from ever being cleared. This
+will no longer be needed in DNE Phase II.
+
+The 'exp_flvr_changed' and 'exp_flvr_adapt' flags along with the
+'exp_sp_peer', 'exp_flvr', 'exp_flvr_old', and 'exp_flvr_expire'
+fields are all used to manage the security settings for the
+connection. Security is discussed in the <<security>> section. (fixme:
+or will be.)
+
+The 'exp_libclient' flag indicates that the export is for a client
+based on "liblustre". This allows for simplified handling on the
+server. (fixme: how is processing simplified? It sounds like I may
+need a whole special section on liblustre.)
+
+The 'exp_need_mne_swab' flag indicates the presence of an old bug that
+affected one special case of failed swabbing. It is not part of
+current processing.
+
+As RPCs arrive they are first subjected to triage. Each request is
+placed on the 'exp_hp_rpcs' list and examined to see if it is high
+priority (fixme: what constitutes high priority? PING, truncate, bulk
+I/O, ... others?). If it is not high priority then it is moved to the
+'exp_reg_rpcs' list. The 'exp_rpc_lock' protects both lists from
+concurrent access.
+
+All arriving LDLM requests get put on the 'exp_bl_list' and access to
+that list is controlled via the 'exp_bl_list_lock'.
+
+The union provides for target specific data. The 'eu_target_data' is
+for a common core of fields for a generic target. The others are
+specific to particular target types: 'eu_mdt_data' for MDTs,
+'eu_filter_data' for OSTs, 'eu_ec_data' for an "echo client" (fixme:
+describe what an echo client is somewhere), and 'eu_mgs_data' is for
+an MGS.
+
+The 'exp_bl_lock_at' field supports adaptive timeouts, which will be
+discussed separately. (fixme: so discuss it somewhere.)
+
+Connection Count
+^^^^^^^^^^^^^^^^
+
+Each export maintains a connection count. Or is it just the management
+server?
diff --git a/data_types.txt b/data_types.txt
new file mode 100644
index 0000000..aac0f1f
--- /dev/null
+++ b/data_types.txt
@@ -0,0 +1,606 @@
+Data Structures and Defines
+---------------------------
+[[data-structs]]
+
+The following data types are used in the Lustre protocol description.
+
+Basic Data Types
+~~~~~~~~~~~~~~~~
+
+.Basic Data Types
+[options="header"]
+|=====
+| data type | size
+| __u8 | an 8-bit unsigned integer
+| __u16 | a 16-bit unsigned integer
+| __u32 | a 32-bit unsigned integer
+| __u64 | a 64-bit unsigned integer
+| __s64 | a 64-bit signed integer
+| obd_time | an __s64
+|=====
+
+
+Other Data Types
+~~~~~~~~~~~~~~~~
+
+The following topics introduce the various kinds of data that are
+represented and manipulated in Lustre messages and in representations
+of the shared state on clients and servers.
+
+Grant
+^^^^^
+[[grant]]
+
+A grant value is part of a client's state for a given target. It
+provides an upper bound on the amount of dirty cached data the client
+will allow that is destined for the target. The value is established
+by agreement between the server and the client and represents a
+guarantee by the server that the target storage has space for the
+dirty data. The client can ask for additional grant, which the server
+may provide depending on how full the target is.
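+
+A hedged sketch of the client-side accounting this implies; all names
+below are illustrative, not taken from the Lustre source:
+
+----
+/* Illustrative only: before caching a dirty write the client checks
+ * it against the grant the server has extended for this target. */
+struct grant_state {
+        __u64 gs_grant; /* bytes the server has guaranteed */
+        __u64 gs_dirty; /* dirty cached bytes destined for the target */
+};
+
+/* Returns nonzero if 'len' more bytes may be cached; otherwise the
+ * client must flush, or ask the server for more grant. */
+static int grant_allows(const struct grant_state *gs, __u64 len)
+{
+        return gs->gs_dirty + len <= gs->gs_grant;
+}
+----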
+
+LOV Index
+^^^^^^^^^
+[[lov-index]]
+
+Each target is assigned an LOV index (by the 'mkfs' command line) as
+the target is added to the file system. This value is stored on the
+MGS in order to identify the target's role in the file system.
+
+Transaction Number
+^^^^^^^^^^^^^^^^^^
+[[transno]]
+
+For each target there is a sequence of values (a strictly increasing
+series of numbers) where each operation that can modify the file
+system is assigned the next number in the series. This is the
+transaction number, and it imposes a strict serial ordering on all of
+the file system modifying operations. For file system modifying
+requests the server generates the next value in the sequence and
+informs the client of the value in the 'pb_transno' field of the
+'ptlrpc_body' of its reply to the client's request. For replies to
+requests that do not modify the file system the 'pb_transno' field in
+the 'ptlrpc_body' is just set to 0.
+
+Structured Data Types
+~~~~~~~~~~~~~~~~~~~~~
+
+Extended Attributes
+^^^^^^^^^^^^^^^^^^^
+
+I have not figured out how so-called 'eadata' buffers are handled,
+yet. I am told that this is not just for extended attributes, but is a
+generic structure.
+
+Lustre Capabilities
+^^^^^^^^^^^^^^^^^^^
+
+A 'lustre_capa' structure conveys details about the capabilities
+supported (or requested) between a client and a given target.
+
+----
+#define CAPA_HMAC_MAX_LEN 64
+struct lustre_capa {
+        struct lu_fid   lc_fid;
+        __u64           lc_opc;
+        __u64           lc_uid;
+        __u64           lc_gid;
+        __u32           lc_flags;
+        __u32           lc_keyid;
+        __u32           lc_timeout;
+        __u32           lc_expiry;
+        __u8            lc_hmac[CAPA_HMAC_MAX_LEN];
+}
+----
+
+MDT Data
+^^^^^^^^
+
+An 'mdt_body' structure holds details about a given MDT.
+
+----
+struct mdt_body {
+        struct lu_fid   fid1;
+        struct lu_fid   fid2;
+        struct lustre_handle handle;
+        __u64           valid;
+        __u64           size;
+        obd_time        mtime;
+        obd_time        atime;
+        obd_time        ctime;
+        __u64           blocks;
+        __u64           ioepoch;
+        __u64           t_state;
+        __u32           fsuid;
+        __u32           fsgid;
+        __u32           capability;
+        __u32           mode;
+        __u32           uid;
+        __u32           gid;
+        __u32           flags;
+        __u32           rdev;
+        __u32           nlink;
+        __u32           unused2;
+        __u32           suppgid;
+        __u32           eadatasize;
+        __u32           aclsize;
+        __u32           max_mdsize;
+        __u32           max_cookiesize;
+        __u32           uid_h;
+        __u32           gid_h;
+        __u32           padding_5;
+        __u64           padding_6;
+        __u64           padding_7;
+        __u64           padding_8;
+        __u64           padding_9;
+        __u64           padding_10;
+}; /* 216 */
+----
+
+MGS Configuration Reference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+----
+#define MTI_NAME_MAXLEN  64
+struct mgs_config_body {
+        char     mcb_name[MTI_NAME_MAXLEN]; /* logname */
+        __u64    mcb_offset;    /* next index of config log to request */
+        __u16    mcb_type;      /* type of log: CONFIG_T_[CONFIG|RECOVER] */
+        __u8     mcb_reserved;
+        __u8     mcb_bits;      /* bits unit size of config log */
+        __u32    mcb_units;     /* # of units for bulk transfer */
+};
+----
+
+The 'mgs_config_body' structure has information identifying to the MGS
+which Lustre file system the client is asking about.
+
+MGS Configuration Data
+^^^^^^^^^^^^^^^^^^^^^^
+
+----
+struct mgs_config_res {
+        __u64    mcr_offset;    /* index of last config log */
+        __u64    mcr_size;      /* size of the log */
+};
+----
+
+The 'mgs_config_res' structure returns information about the Lustre
+file system.
+
+Lustre Handle
+^^^^^^^^^^^^^
+
+----
+struct lustre_handle {
+        __u64 cookie;
+};
+----
+
+A Lustre handle is a reference to an import or an export. Those
+objects maintain state about the connection between a given client
+and a given target. The import is on the client and the corresponding
+export is on the server.
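+
+As a hedged illustration (the helper is hypothetical): the server
+places a cookie in the export, returns it in the *_CONNECT reply, and
+later verifies that incoming requests carry that same cookie:
+
+----
+/* Hypothetical check mirroring the description above: a request's
+ * handle must match the cookie of the export it claims to address. */
+static int lustre_handle_matches(const struct lustre_handle *h,
+                                 __u64 export_cookie)
+{
+        return h->cookie == export_cookie;
+}
+----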
+
+Lustre Message Header
+^^^^^^^^^^^^^^^^^^^^^
+[[lustre-message-header]]
+
+Every message has an initial header that informs the receiver about
+the size of the rest of the message to follow along with a few other
+details.
+
+----
+#define LUSTRE_MSG_MAGIC_V2 0x0BD00BD3
+#define MSGHDR_AT_SUPPORT 0x1
+struct lustre_msg_v2 {
+        __u32 lm_bufcount;
+        __u32 lm_secflvr;
+        __u32 lm_magic;
+        __u32 lm_repsize;
+        __u32 lm_cksum;
+        __u32 lm_flags;
+        __u32 lm_padding_2;
+        __u32 lm_padding_3;
+        __u32 lm_buflens[0];
+};
+#define lustre_msg lustre_msg_v2
+----
+
+The 'lm_bufcount' field gives the number of buffers that will follow
+the header. The header and sequence of buffers constitute one
+message. Each of the buffers is a sequence of bytes whose contents
+correspond to one of the structures described in this section. There
+will always be at least one, and no message has more than eight.
+
+The 'lm_secflvr' field gives an indication of whether any sort of
+cryptographic encoding of the subsequent buffers will be in force. The
+value is zero if there is no "crypto" and gives a code identifying the
+"flavor" of crypto if it is employed. Further, if crypto is employed
+there will only be one buffer following (i.e. 'lm_bufcount' = 1), and
+that buffer is an encoding of what would otherwise have been the
+sequence of buffers normally following the header. This document will
+defer all discussion of cryptography. A chapter is planned that will
+address it separately.
+
+The 'lm_magic' field is a "magic" value (LUSTRE_MSG_MAGIC_V2) that is
+checked in order to positively identify that the message is intended
+for the use to which it is being put. That is, we are indeed dealing
+with a Lustre message, and not, for example, corrupted memory or a bad
+pointer.
+
+The 'lm_repsize' field is an indication from the sender of an action
+request of the maximum available space that has been set aside for
+any reply to the request. A reply that attempts to use more than that
+much space will be discarded.
+
+The 'lm_cksum' field has to do with the <> settings for the
+cluster. fixme: This may not be in current use. We need to verify.
+
+The 'lm_flags' field can be set to enable adaptive timeouts support
+with the value MSGHDR_AT_SUPPORT.
+
+The 'lm_padding*' fields are reserved for future use.
+
+The array of 'lm_buflens' values has 'lm_bufcount' entries. Each
+entry corresponds to, and gives the length of, one of the buffers that
+will follow.
+
+The entire header is required to be a multiple of eight bytes
+long. Thus there may need to be an extra four bytes of padding after
+the 'lm_buflens' array if that array has an odd number of entries.
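+
+A hedged sketch of the resulting layout arithmetic, assuming (beyond
+what is stated above) that each buffer is likewise padded to an 8-byte
+boundary; the helper itself is illustrative:
+
+----
+#include <stddef.h>
+
+/* Illustrative only: total wire size of a lustre_msg_v2, padding the
+ * 'lm_buflens' array and each buffer to an 8-byte boundary. */
+static __u32 lustre_msg_size(const struct lustre_msg_v2 *m)
+{
+        __u32 size = offsetof(struct lustre_msg_v2, lm_buflens);
+        __u32 i;
+
+        size += ((m->lm_bufcount * sizeof(__u32)) + 7) & ~7U;
+        for (i = 0; i < m->lm_bufcount; i++)
+                size += (m->lm_buflens[i] + 7) & ~7U;
+        return size;
+}
+----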
+
+OBD statfs
+^^^^^^^^^^
+
+The 'obd_statfs' structure defines fields that are used for returning
+server common 'statfs' data items to a client. It augments that data
+with some Lustre-specific information, and also has space allocated
+for future use by Lustre.
+
+----
+struct obd_statfs {
+        __u64           os_type;
+        __u64           os_blocks;
+        __u64           os_bfree;
+        __u64           os_bavail;
+        __u64           os_files;
+        __u64           os_ffree;
+        __u8            os_fsid[40];
+        __u32           os_bsize;
+        __u32           os_namelen;
+        __u64           os_maxbytes;
+        __u32           os_state;       /**< obd_statfs_state OS_STATE_* flag */
+        __u32           os_fprecreated; /* objs available now to the caller */
+                                        /* used in QoS code to find preferred
+                                         * OSTs */
+        __u32           os_spare2;
+        __u32           os_spare3;
+        __u32           os_spare4;
+        __u32           os_spare5;
+        __u32           os_spare6;
+        __u32           os_spare7;
+        __u32           os_spare8;
+        __u32           os_spare9;
+};
+----
+
+Lustre Message Preamble
+^^^^^^^^^^^^^^^^^^^^^^^
+[[lustre-message-preamble]]
+
+Every Lustre message starts with both the above header and an
+additional set of fields (in its first "buffer") given by the 'struct
+ptlrpc_body_v3' structure. This preamble has information relevant to
+every message type. In particular, the Lustre message type is itself
+encoded in the 'pb_opc' Lustre operation number. The value of that op
+code determines what else will be in the message following the
+preamble.
+----
+#define PTLRPC_NUM_VERSIONS 4
+#define JOBSTATS_JOBID_SIZE 32
+struct ptlrpc_body_v3 {
+        struct lustre_handle pb_handle;
+        __u32 pb_type;
+        __u32 pb_version;
+        __u32 pb_opc;
+        __u32 pb_status;
+        __u64 pb_last_xid;
+        __u64 pb_last_seen;
+        __u64 pb_last_committed;
+        __u64 pb_transno;
+        __u32 pb_flags;
+        __u32 pb_op_flags;
+        __u32 pb_conn_cnt;
+        __u32 pb_timeout;
+        __u32 pb_service_time;
+        __u32 pb_limit;
+        __u64 pb_slv;
+        __u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
+        __u64 pb_padding[4];
+        char  pb_jobid[JOBSTATS_JOBID_SIZE];
+};
+#define ptlrpc_body ptlrpc_body_v3
+----
+In a connection request, sent by a client to a server and regarding a
+specific target, the 'pb_handle' is 0. In the reply to a connection
+request, sent by the server, the handle is a value uniquely
+identifying the target. Subsequent messages between this client and
+this server regarding this target will use this handle to gain access
+to their shared state. The handle is persistent across reconnects.
+
+The 'pb_type' is PTL_RPC_MSG_REQUEST in messages when they are
+initiated; it is PTL_RPC_MSG_REPLY in a reply; and it is
+PTL_RPC_MSG_ERR to convey that a message was received that could not
+be interpreted, that is, if it was corrupt or incomplete. The encoding
+of those type values is given by:
+----
+#define PTL_RPC_MSG_REQUEST 4711
+#define PTL_RPC_MSG_ERR     4712
+#define PTL_RPC_MSG_REPLY   4713
+----
+The error message type is only for responding to a message that failed
+to be interpreted as an actual message. Note that other errors, such
+as those that emerge from processing the actual message content, do
+not use the PTL_RPC_MSG_ERR type.
+
+The 'pb_version' identifies the version of the Lustre protocol and is
+derived from the following constants. The lower two bytes give the
+version of PtlRPC being employed in the message, and the upper two
+bytes encode the role of the host for the service being
+requested. That role is one of OBD, MDS, OST, DLM, LOG, or MGS.
+----
+#define PTLRPC_MSG_VERSION  0x00000003
+#define LUSTRE_VERSION_MASK 0xffff0000
+#define LUSTRE_OBD_VERSION  0x00010000
+#define LUSTRE_MDS_VERSION  0x00020000
+#define LUSTRE_OST_VERSION  0x00030000
+#define LUSTRE_DLM_VERSION  0x00040000
+#define LUSTRE_LOG_VERSION  0x00050000
+#define LUSTRE_MGS_VERSION  0x00060000
+----
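+
+For example (a hedged illustration using only the constants defined
+above), a request for MDS service would carry:
+
+----
+/* Composing 'pb_version' for a message requesting MDS service. */
+__u32 pb_version = LUSTRE_MDS_VERSION | PTLRPC_MSG_VERSION; /* 0x00020003 */
+
+/* The receiver can recover the role with the mask: */
+__u32 role = pb_version & LUSTRE_VERSION_MASK;              /* 0x00020000 */
+----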
+
+The 'pb_opc' value (operation code) gives the actual Lustre operation
+that is the subject of this message. For example, MDS_CONNECT is a
+Lustre operation (number 38). The following list gives the name used
+and the value for each operation.
+----
+typedef enum {
+        OST_REPLY = 0,
+        OST_GETATTR = 1,
+        OST_SETATTR = 2,
+        OST_READ = 3,
+        OST_WRITE = 4,
+        OST_CREATE = 5,
+        OST_DESTROY = 6,
+        OST_GET_INFO = 7,
+        OST_CONNECT = 8,
+        OST_DISCONNECT = 9,
+        OST_PUNCH = 10,
+        OST_OPEN = 11,
+        OST_CLOSE = 12,
+        OST_STATFS = 13,
+        OST_SYNC = 16,
+        OST_SET_INFO = 17,
+        OST_QUOTACHECK = 18,
+        OST_QUOTACTL = 19,
+        OST_QUOTA_ADJUST_QUNIT = 20,
+        MDS_GETATTR = 33,
+        MDS_GETATTR_NAME = 34,
+        MDS_CLOSE = 35,
+        MDS_REINT = 36,
+        MDS_READPAGE = 37,
+        MDS_CONNECT = 38,
+        MDS_DISCONNECT = 39,
+        MDS_GETSTATUS = 40,
+        MDS_STATFS = 41,
+        MDS_PIN = 42,
+        MDS_UNPIN = 43,
+        MDS_SYNC = 44,
+        MDS_DONE_WRITING = 45,
+        MDS_SET_INFO = 46,
+        MDS_QUOTACHECK = 47,
+        MDS_QUOTACTL = 48,
+        MDS_GETXATTR = 49,
+        MDS_SETXATTR = 50,
+        MDS_WRITEPAGE = 51,
+        MDS_IS_SUBDIR = 52,
+        MDS_GET_INFO = 53,
+        MDS_HSM_STATE_GET = 54,
+        MDS_HSM_STATE_SET = 55,
+        MDS_HSM_ACTION = 56,
+        MDS_HSM_PROGRESS = 57,
+        MDS_HSM_REQUEST = 58,
+        MDS_HSM_CT_REGISTER = 59,
+        MDS_HSM_CT_UNREGISTER = 60,
+        MDS_SWAP_LAYOUTS = 61,
+        LDLM_ENQUEUE = 101,
+        LDLM_CONVERT = 102,
+        LDLM_CANCEL = 103,
+        LDLM_BL_CALLBACK = 104,
+        LDLM_CP_CALLBACK = 105,
+        LDLM_GL_CALLBACK = 106,
+        LDLM_SET_INFO = 107,
+        MGS_CONNECT = 250,
+        MGS_DISCONNECT = 251,
+        MGS_EXCEPTION = 252,
+        MGS_TARGET_REG = 253,
+        MGS_TARGET_DEL = 254,
+        MGS_SET_INFO = 255,
+        MGS_CONFIG_READ = 256,
+        OBD_PING = 400,
+        OBD_LOG_CANCEL = 401,
+        OBD_QC_CALLBACK = 402,
+        OBD_IDX_READ = 403,
+        LLOG_ORIGIN_HANDLE_CREATE = 501,
+        LLOG_ORIGIN_HANDLE_NEXT_BLOCK = 502,
+        LLOG_ORIGIN_HANDLE_READ_HEADER = 503,
+        LLOG_ORIGIN_HANDLE_WRITE_REC = 504,
+        LLOG_ORIGIN_HANDLE_CLOSE = 505,
+        LLOG_ORIGIN_CONNECT = 506,
+        LLOG_ORIGIN_HANDLE_PREV_BLOCK = 508,
+        LLOG_ORIGIN_HANDLE_DESTROY = 509,
+        QUOTA_DQACQ = 601,
+        QUOTA_DQREL = 602,
+        SEQ_QUERY = 700,
+        SEC_CTX_INIT = 801,
+        SEC_CTX_INIT_CONT = 802,
+        SEC_CTX_FINI = 803,
+        FLD_QUERY = 900,
+        FLD_READ = 901,
+        UPDATE_OBJ = 1000,
+        LAST_OPC
+} cmd_t;
+----
+The symbols and values above identify the operations Lustre uses in
+its protocol. They are examined in detail in the
+<<lustre-operations>> section. Lustre carries out each of these
+operations via the exchange of a pair of messages: a request and a
+reply. The details of each message are specific to each
+operation. The <<lustre-messages>> chapter discusses each message and
+its contents.
+
+The 'pb_status' value in a request message is set to the 'pid' of the
+process making the request. In a reply message, a zero indicates that
+the service successfully initiated the requested operation. If for
+some reason the operation could not be initiated (e.g. "permission
+denied") the status will encode the standard Linux kernel (POSIX)
+error code (e.g. EPERM).
+
+'pb_last_xid' and 'pb_last_seen' are not used.
+
+The 'pb_last_committed' value is always zero in a request. In a reply
+it is the highest transaction number that has been committed to
+storage. The transaction numbers are maintained on a per-target basis
+and each series of transaction numbers is a strictly increasing
+sequence. This field is set in any kind of reply message, including
+pings and non-modifying transactions.
+
+The 'pb_transno' value is always zero in a new request. It is also
+zero for replies to operations that do not modify the file system. For
+replies to operations that do modify the file system it is the
+server-assigned value from the sequence of values associated with the
+given client and target.
+That transaction number is copied into the 'pb_transno' field of the
+'ptlrpc_body' of the original request. If the request has to be
+replayed it will include the transaction number.
+
+The 'pb_flags' value governs the client state machine. fixme: document
+what the states and transitions are of this state machine. Currently,
+only the bottom two bytes are used, and they encode state according to
+the following values:
+----
+#define MSG_GEN_FLAG_MASK     0x0000ffff
+#define MSG_LAST_REPLAY       0x0001
+#define MSG_RESENT            0x0002
+#define MSG_REPLAY            0x0004
+#define MSG_DELAY_REPLAY      0x0010
+#define MSG_VERSION_REPLAY    0x0020
+#define MSG_REQ_REPLAY_DONE   0x0040
+#define MSG_LOCK_REPLAY_DONE  0x0080
+----
+
+The 'pb_op_flags' value governs the client connection status state
+machine. fixme: document what the states and transitions are of this
+state machine.
+----
+#define MSG_CONNECT_RECOVERING  0x00000001
+#define MSG_CONNECT_RECONNECT   0x00000002
+#define MSG_CONNECT_REPLAYABLE  0x00000004
+#define MSG_CONNECT_LIBCLIENT   0x00000010
+#define MSG_CONNECT_INITIAL     0x00000020
+#define MSG_CONNECT_ASYNC       0x00000040
+#define MSG_CONNECT_NEXT_VER    0x00000080
+#define MSG_CONNECT_TRANSNO     0x00000100
+----
+In normal operation an initial request to connect will set
+'pb_op_flags' to MSG_CONNECT_INITIAL and MSG_CONNECT_NEXT_VER. The
+reply to that connection request (and all other, non-connect, requests
+and replies) will set 'pb_op_flags' to 0.
+
+The 'pb_conn_cnt' (connection count) value in a request message
+reports the client's "era", which is part of the client and server's
+shared state. The value of the era is initialized to one when the
+client is first connected to the MDT. Each subsequent connection
+(after an eviction) increments the era for the client. Since the
+'pb_conn_cnt' reflects the client's era at the time the message was
+composed, the server can use this value to discard late-arriving
+messages requesting operations on out-of-date shared state.
+
+The 'pb_timeout' value in a request indicates how long (in seconds)
+the requester plans to wait before timing out the operation. That is,
+the corresponding reply for this message should arrive within this
+time frame. The service may extend this time frame via an "early
+reply", which is a reply to this message that notifies the requester
+that it should extend its timeout interval by the value of the
+'pb_timeout' field in the reply. The "early reply" does not indicate
+the operation has actually been initiated. Clients maintain multiple
+request queues, called "portals", and each type of operation is
+assigned to one of these queues. There is a timeout value associated
+with each queue, and the timeout update affects all the messages
+associated with the given queue, not just the specific message that
+initiated the request. Finally, in a reply message (one that does
+indicate the operation has been initiated) the timeout value updates
+the timeout interval for the queue. Is this last point different from
+the "early reply" update?
+
+The 'pb_service_time' value is zero in a request. In a reply it
+indicates how long this particular operation actually took from the
+time it first arrived in the request queue (at the service) to the
+time the server replied. Note that the client can use this value and
+the local elapsed time for the operation to calculate network latency.
+
+The 'pb_limit' value is zero in a request. In a reply it is a value
+sent from a lock service to a client to set the maximum number of
+locks available to the client. When dynamic lock LRUs are enabled,
+this allows for managing the size of the LRU.
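+
+A hedged sketch of the latency estimate mentioned under
+'pb_service_time' above; the helper is illustrative:
+
+----
+/* Illustrative only: the client measured 'round_trip' seconds between
+ * sending the request and receiving the reply; the server reported
+ * 'pb_service_time' seconds of queueing plus processing time. */
+static __u32 estimate_net_latency(__u32 round_trip, __u32 pb_service_time)
+{
+        return round_trip > pb_service_time ?
+               round_trip - pb_service_time : 0;
+}
+----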
+
+The 'pb_limit' value is zero in a request. In a reply it is a value
+sent from a lock service to a client to set the maximum number of
+locks available to the client. When dynamic lock LRUs are enabled
+this allows for managing the size of the LRU.
+
+The 'pb_slv' value is zero in a request. On a DLM service, the "server
+lock volume" is a value that characterizes (estimates) the amount of
+traffic, or load, on that lock service. It is calculated as the
+product of the number of locks and their age. In a reply, the 'pb_slv'
+value indicates to the client the available share of the total lock
+load on the server that the client is allowed to consume. The client
+is then responsible for reducing the number (or age) of its locks to
+stay within this limit.
+
+The array of 'pb_pre_versions' values has four entries. They are
+always zero in a new request message. They are also zero in replies to
+operations that do not modify the file system. For an operation that
+does modify the file system, the reply encodes the most recent
+transaction numbers for the objects modified by this operation, and
+the 'pb_pre_versions' values are copied into the original request when
+the reply arrives. If the request needs to be replayed then the
+updated 'pb_pre_versions' values accompany the replayed request.
+
+'pb_padding' is reserved for future use.
+
+The 'pb_jobid' (string) value gives a unique identifier associated
+with the process on behalf of which this message was generated. The
+identifier is assigned to the user process by a job scheduler, if any.
+
+Object Based Disk UUID
+^^^^^^^^^^^^^^^^^^^^^^
+
+----
+#define UUID_MAX 40
+struct obd_uuid {
+ char uuid[UUID_MAX];
+};
+----
+
+OST ID
+^^^^^^
+
+----
+struct ost_id {
+ union {
+ struct ostid {
+ __u64 oi_id;
+ __u64 oi_seq;
+ } oi;
+ struct lu_fid oi_fid;
+ } LUSTRE_ANONYMOUS_UNION_NAME;
+};
+----
+
diff --git a/file_id.txt b/file_id.txt
new file mode 100644
index 0000000..34f7a3c
--- /dev/null
+++ b/file_id.txt
@@ -0,0 +1,21 @@
+Lustre File Identifier
+----------------------
+[[fids]]
+
+----
+struct lu_fid {
+ __u64 f_seq;
+ __u32 f_oid;
+ __u32 f_ver;
+};
+----
+
+File IDs ('fids') are 128-bit numbers that uniquely identify files and
+directories on the MDTs and OSTs of a Lustre file system. The fid for
+a Lustre file or directory is the fid from the corresponding MDT entry
+for the file. Each of the data objects for that file will also have a
+fid for each corresponding piece of the file on each of the
+OSTs. Encoded in the fid is the target on which that file metadata or
+file fragment resides. The map from fid to target is in the File
+Location DataBase (FLDB).
+
diff --git a/file_system_operations.txt b/file_system_operations.txt
new file mode 100644
index 0000000..ea3451d
--- /dev/null
+++ b/file_system_operations.txt
@@ -0,0 +1,282 @@
+
+File System Operations
+----------------------
+[[file-system-operations]]
+
+Lustre is a POSIX-compliant file system that provides namespace and
+data storage services to clients. It implements all the usual file
+system functionality, including creating, writing, reading, and
+removing files and directories. These file system operations are
+implemented via <<lustre-operations,Lustre Operations>>, which carry
+out communication and coordination with the servers. In this section
+we present the sequence of Lustre operations, along with their
+effects, for a variety of file system operations.
+
+Mount
+~~~~~
+
+Before any other interaction can take place between a client and a
+Lustre file system the client must 'mount' the file system, and Lustre
+services must already be in place (on the servers). A file system
+mount may be initiated at the Linux shell command line, which in turn
+invokes the 'mount()' system call.
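+
+For example, a mount of a file system named "lfs" (the name used in
+the configuration log names below) might look like the following,
+where the MGS node name is hypothetical:
+----
+mount -t lustre mgs@tcp0:/lfs /mnt/lfs
+----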
+Kernel modules for Lustre exchange a series of messages with the
+servers, beginning with messages that retrieve details about the file
+system from the management server (MGS). This provides the client with
+the identities of all the metadata servers (MDSs) and targets (MDTs)
+as well as all the object storage servers (OSSs) and targets
+(OSTs). The client then sequences through each of the targets,
+exchanging additional messages to initiate the connections with
+them. The following sections present the details of the Lustre
+operations that accomplish the file system mount.
+
+Messages Between the Client and the MGS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to be able to mount the Lustre file system the client needs
+to know the identities of the various servers and targets so that it
+can initiate connections to them. The following sequence of operations
+accomplishes this.
+
+----
+MGS_CONNECT
+LDLM_ENQUEUE (concurrent read)
+LLOG_ORIGIN_HANDLE_CREATE (filename: lfs-sptlrpc)
+LDLM_ENQUEUE (concurrent read)
+LLOG_ORIGIN_HANDLE_CREATE (filename: lfs-client)
+LLOG_ORIGIN_HANDLE_READ_HEADER
+LLOG_ORIGIN_HANDLE_NEXT_BLOCK
+LDLM_ENQUEUE (concurrent read)
+MGS_CONFIG_READ (name: lfs-cliir)
+LDLM_ENQUEUE (concurrent read)
+LLOG_ORIGIN_HANDLE_CREATE (filename: params)
+LLOG_ORIGIN_HANDLE_READ_HEADER
+----
+
+Prior to any other interaction between a client and a Lustre server
+(or between two servers) the client must establish a 'connection'. The
+connection establishes shared state between the two hosts. On the
+client this connection state information is called an 'import', and
+there is an import on the client for each target it connects to. On
+the server this connection state is referred to as an 'export', and
+again the server has an export for each client that has connected to
+it. There is a separate export for each client for each target.
+
+The client begins by carrying out the MGS_CONNECT Lustre operation,
+which establishes the connection (creates the import and the export)
+between the client and the MGS. The connect message from the client
+includes a 'handle' to uniquely identify itself (subsequent messages
+to the LDLM will refer to that client-handle). The connection data
+from the client also proposes the set of connection flags
+('ocd_connect_flags') appropriate to connecting to an MGS.
+
+.Flags for the client connection to an MGS
+[options="header"]
+|====
+| obd_connect_data->ocd_connect_flags
+| OBD_CONNECT_VERSION
+| OBD_CONNECT_AT
+| OBD_CONNECT_FULL20
+| OBD_CONNECT_IMP_RECOV
+| OBD_CONNECT_MNE_SWAB
+| OBD_CONNECT_PINGLESS
+|====
+
+The MGS's reply to the connection request will include the handle that
+the server and client will both use to identify this connection in
+subsequent messages. This is the 'connection-handle' (as opposed to
+the client-handle mentioned a moment ago). The MGS also replies with
+the same set of connection flags.
+
+Once the connection is established the client gets configuration
+information for the file system from the MGS in four stages. First,
+the two exchange messages establishing the file-system-wide security
+policy that will be followed in all subsequent communications. Second,
+the client gets a bitmap instructing it as to which among the
+configuration records on the MGS it needs. Third, reading those
+records from the MGS gives the client the list of all the servers and
+targets it will need to communicate with.
Fourth, the client reads
+cluster-wide configuration data (the sort that might be set at the
+client command line with a 'lctl conf_param' command). The following
+paragraphs go into these four stages in more detail.
+
+Each time the client is going to read information from server storage
+it needs to first acquire the appropriate lock. Since the client is
+only reading data, the locks will be 'concurrent read' locks. The
+LDLM_ENQUEUE command communicates this lock request to the MGS
+target. The request identifies the target via the connection-handle
+from the connection reply, and identifies the client (itself) with the
+client-handle from its original connection request. The MGS's reply
+grants that lock, if appropriate. If other clients were making some
+sort of modification to the MGS data then the lock exchange might
+result in a delay while the client waits. More details about the
+behavior of the <<ldlm,Lustre Distributed Lock Manager>> are in that
+section. For now, let's assume the locks are granted for each of these
+four operations. The first LLOG_ORIGIN_HANDLE_CREATE operation (the
+client is creating its own local handle, not the target's file) asks
+for the security configuration file ("lfs-sptlrpc"). The
+<<security,Security>> section discusses security, and for now let's
+assume there is nothing to be done for security. That is, subsequent
+messages will all use an "empty security flavor" and no encryption
+will take place. In this case the MGS's reply ('pb_status' == -2,
+ENOENT) indicates that there was no such file, so nothing actually
+gets read.
+
+Another LDLM_ENQUEUE and LLOG_ORIGIN_HANDLE_CREATE pair of operations
+identifies the configuration client data ("lfs-client") file, and in
+this case there is data to read. The LLOG_ORIGIN_HANDLE_CREATE reply
+identifies the actual object of interest on the MGS via the
+'llog_logid' field in the 'struct llogd_body'. The MGS stores
+configuration data in log records. A header at the beginning of
+"lfs-client" uses a bitmap to identify the log records that are
+actually needed. The header includes both which records to retrieve
+and how large those records are. The LLOG_ORIGIN_HANDLE_READ_HEADER
+request uses the 'llog_logid' to identify the desired log file, and
+the reply provides the bitmap and size information identifying the
+records that are actually needed. The LLOG_ORIGIN_HANDLE_NEXT_BLOCK
+operation retrieves the data thus identified.
+
+Knowing the specific configuration records it wants, the client then
+proceeds to retrieve them. This requires another LDLM_ENQUEUE
+operation, followed this time by the MGS_CONFIG_READ operation, which
+gets the UUIDs for the servers and targets from the configuration log
+("lfs-cliir").
+
+A final LDLM_ENQUEUE, LLOG_ORIGIN_HANDLE_CREATE, and
+LLOG_ORIGIN_HANDLE_READ_HEADER then retrieve the cluster-wide
+configuration data ("params").
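+
+The bitmap-driven retrieval described above can be summarized in a
+short sketch. The structure here is a simplified stand-in for the
+'llog_log_hdr' defined in the <<llog,Lustre Log Facility>> section
+(the bitmap capacity is illustrative), and the print statement is
+only a placeholder for the LLOG_ORIGIN_HANDLE_NEXT_BLOCK operation:
+----
+#include <stdint.h>
+#include <stdio.h>
+
+#define BITMAP_U32S 64                  /* illustrative capacity only */
+
+struct hdr_sketch {
+        uint32_t llh_count;             /* records in the log */
+        uint32_t llh_bitmap[BITMAP_U32S];
+};
+
+/* A set bit at index 'i' marks a record the client actually needs. */
+static int rec_needed(const struct hdr_sketch *h, uint32_t i)
+{
+        return (h->llh_bitmap[i / 32] >> (i % 32)) & 1;
+}
+
+static void fetch_needed_records(const struct hdr_sketch *h)
+{
+        for (uint32_t i = 0; i < h->llh_count; i++)
+                if (rec_needed(h, i))
+                        /* a real client issues NEXT_BLOCK reads here */
+                        printf("read record %u\n", i);
+}
+----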
+
+Messages Between the Client and the MDSs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After the foregoing interaction with the MGS the client has a list of
+the MDSs and MDTs in the file system. Next, the client invokes four
+Lustre operations with each MDT on the list.
+
+----
+MDS_CONNECT
+MDS_STATFS
+MDS_GETSTATUS
+MDS_GETATTR
+----
+
+The MDS_CONNECT operation establishes a connection between the client
+and a specific target (MDT) on an MDS. Thus, if an MDS has multiple
+targets, there is a separate MDS_CONNECT operation for each. This
+creates an import for the target on the client and an export for the
+client and target on the MDS. As with the connect operation for the
+MGS, the connect message from the client includes a UUID to uniquely
+identify this connection, and subsequent messages to the lock manager
+on the server will refer to that UUID. The connection data from the
+client also proposes the set of connection flags
+('ocd_connect_flags') appropriate to connecting to an MDS. The
+following are the flags always included.
+
+.Always included flags for the client connection to an MDS
+[options="header"]
+|====
+| obd_connect_data->ocd_connect_flags
+| OBD_CONNECT_RDONLY
+| OBD_CONNECT_VERSION
+| OBD_CONNECT_ACL
+| OBD_CONNECT_XATTR
+| OBD_CONNECT_IBITS
+| OBD_CONNECT_NODEVOH
+| OBD_CONNECT_ATTRFID
+| OBD_CONNECT_CANCELSET
+| OBD_CONNECT_AT
+| OBD_CONNECT_RMT_CLIENT
+| OBD_CONNECT_RMT_CLIENT_FORCE
+| OBD_CONNECT_BRW_SIZE
+| OBD_CONNECT_MDS_CAPA
+| OBD_CONNECT_OSS_CAPA
+| OBD_CONNECT_MDS_MDS
+| OBD_CONNECT_FID
+| LRU_RESIZE_CONNECT_FLAG
+| OBD_CONNECT_VBR
+| OBD_CONNECT_LOV_V3
+| OBD_CONNECT_SOM
+| OBD_CONNECT_FULL20
+| OBD_CONNECT_64BITHASH
+| OBD_CONNECT_JOBSTATS
+| OBD_CONNECT_EINPROGRESS
+| OBD_CONNECT_LIGHTWEIGHT
+| OBD_CONNECT_UMASK
+| OBD_CONNECT_LVB_TYPE
+| OBD_CONNECT_LAYOUTLOCK
+| OBD_CONNECT_PINGLESS
+| OBD_CONNECT_MAX_EASIZE
+| OBD_CONNECT_FLOCK_DEAD
+| OBD_CONNECT_DISP_STRIPE
+| OBD_CONNECT_LFSCK
+| OBD_CONNECT_OPEN_BY_FID
+| OBD_CONNECT_DIR_STRIPE
+|====
+
+.Optional flags for the client connection to an MDS
+[options="header"]
+|====
+| obd_connect_data->ocd_connect_flags
+| OBD_CONNECT_SOM
+| OBD_CONNECT_LRU_RESIZE
+| OBD_CONNECT_ACL
+| OBD_CONNECT_UMASK
+| OBD_CONNECT_RDONLY
+| OBD_CONNECT_XATTR
+| OBD_CONNECT_RMT_CLIENT_FORCE
+|====
+
+The MDS replies to the connect message with a subset of the flags
+proposed by the client, and the client notes those values in its
+import. The MDS's reply to the connection request will include a UUID
+that the server and client will both use to identify this connection
+in subsequent messages.
+
+The client next uses an MDS_STATFS operation to request 'statfs'
+information from the target, and that data is returned in the reply
+message. The actual fields closely resemble the results of a 'statfs'
+system call. See the 'obd_statfs' structure in the Data Types section.
+
+The client uses the MDS_GETSTATUS operation to request information
+about the mount point of the file system. fixme: Does MDS_GETSTATUS
+only ask about the root (so it would seem)? The server reply contains
+the 'fid' of the root directory of the file system being mounted. If
+there is a security policy, the capabilities of that security policy
+are included in the reply.
+
+The client then uses the MDS_GETATTR operation to get further
+information about the root directory of the file system. The request
+message includes the above fid. It will also include the security
+capability (if appropriate). The reply also holds the same fid, and in
+this case the 'mdt_body' has several additional fields filled
+in. These include the mtime, atime, ctime, mode, uid, and gid. It also
+includes the size of the extended attributes and the size of the ACL
+information. The reply message also includes the extended attributes
+and the ACL. From the extended attributes the client can find out
+about striping information for the root, if any.
+
+Messages Between the Client and the OSSs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Additional CONNECT messages flow between the client and each OST
+enumerated by the MGS.
+
+----
+OST_CONNECT
+----
+
+Unmount
+~~~~~~~
+
+----
+OST_DISCONNECT
+MDS_DISCONNECT
+MGS_DISCONNECT
+----
+
+Create
+~~~~~~
+
+Further discussion of the 'creat()' system call.
diff --git a/glossary.txt b/glossary.txt
new file mode 100644
index 0000000..35cb51d
--- /dev/null
+++ b/glossary.txt
@@ -0,0 +1,114 @@
+[glossary]
+Glossary
+--------
+Here are some common terms used in discussing Lustre, POSIX semantics,
+and the protocols used to implement them.
+
+[glossary]
+back end file system::
+Storage on a server in support of the object store or of the metadata
+is organized as a back-end file system. The two available file systems
+are currently ldiskfs and ZFS.
+
+Client::
+A Lustre client is a computer taking advantage of the services
+provided by MDSs and OSSs to assemble a POSIX-compliant file system
+with its namespace and data storage capabilities.
+
+Distributed Lock Manager::
+The distributed lock manager (DLM, often referred to as the Lustre
+Distributed Lock Manager, or LDLM) is the service that enforces a
+consistent (cache-coherent) view of the data objects in the file
+system.
+
+early reply::
+A message returned in response to a request message that allows the
+target to extend timeout values without otherwise indicating the
+request has progressed.
+
+export::
+The term for the per-client shared state information held by a server.
+
+extent:: A range of offsets in a file.
+
+import::
+The term for the per-target shared state information held by a client.
+
+Inodes and Extended Attributes::
+Metadata about a Lustre file is encoded in the extended attributes of
+an inode in the back-end file system on the MDS. Some of that
+information establishes the stripe layout of the file. The size of the
+stripe layout information varies among Lustre file systems. The amount
+of space reserved for layout information is as small as possible given
+the maximum stripe count possible on the file system. Clients, servers,
+and the distributed lock manager will all need to be aware of this
+size, which is communicated in the 'ocd_ibits_known' field of the
+'obd_connect_data' structure.
+
+LNet::
+A lower-level protocol employed by PtlRPC to abstract the mechanisms
+provided by various hardware-centric protocols, such as TCP or
+Infiniband.
+
+The Log Subsystem::
+
+Fixme:
+The log subsystem (LOG) is something I have no idea about right now.
+
+Logical Object Volume::
+The logical object volume (LOV) layer abstracts the targets on a server.
+
+Management Server::
+The Management Server (MGS) maintains an understanding of the other
+servers, targets, and clients in the Lustre file system. It holds their
+identities and keeps track of disconnections and reconnections in
+order to support imperative recovery.
+
+Metadata Server::
+A metadata server (MDS) is a computer responsible for running the
+Lustre kernel services in support of managing the POSIX-compliant name
+space and the indices associating files in that name space with the
+locations of their corresponding objects. As of v2.4 there can be
+multiple MDSs in a Lustre file system.
+
+Metadata Target::
+A metadata target (MDT) is the service provided by an MDS that
+mediates the management of name space indices on the underlying file
+system hardware. As of v2.4 there can be multiple MDTs on an MDS.
+
+Object Based Disk::
+Object Based Disk (OBD) is the term used for any target, MDT or OST.
+
+Object Storage Server::
+An object storage server (OSS) is a computer responsible for running
+Lustre kernel services in support of managing bulk data objects on the
+underlying storage. There can be multiple OSSs in a Lustre file
+system.
+
+Object Storage Target::
+An object storage target (OST) is the service provided by an OSS that
+mediates the placement of data objects on the specific underlying file
+system hardware. There can be multiple OSTs on a given OSS.
+
+protocol::
+An agreed-upon formalism for communicating between two entities, such
+as between two servers or between a client and a server.
+
+PtlRPC::
+The protocol (or set of protocols) implemented via RPCs that is (are)
+employed by Lustre to communicate between its clients and servers.
+
+Remote Procedure Call::
+A mechanism for implementing operations involving one computer acting
+on behalf of another (RPC).
+
+server::
+A computer that provides a service. For example, management (MGS),
+metadata (MDS), or object storage (OSS) services in support of a
+Lustre file system.
+
+target::
+Storage available to be served, such as an OST or an MDT. Also the
+service being provided.
+
+UUID:: A universally unique identifier.
diff --git a/introduction.txt b/introduction.txt
new file mode 100644
index 0000000..bf62966
--- /dev/null
+++ b/introduction.txt
@@ -0,0 +1,44 @@
+Introduction
+------------
+
+[NOTE]
+I am leaving the introductory content here for now, but it is the last
+thing that should be written. The following is just a very early
+sketch, and will be revised entirely once the rest of the content has
+begun to shape up.
+
+The Lustre parallel file system provides a global POSIX namespace for
+the computing resources of a data center. Lustre runs on Linux-based
+hosts via kernel modules, and delegates block storage management to
+the back-end servers while providing object-based storage to its
+clients. Servers are responsible for both data objects (the contents
+of actual files) and index objects (for directory information). Data
+objects are gathered on Object Storage Servers (OSSs), and index
+objects are stored on Metadata Servers (MDSs). Each back-end
+storage volume is a target with Object Storage Targets (OSTs) on OSSs,
+and Metadata Targets (MDTs) on MDSs. Clients assemble the
+data from the MDTs and OSTs to present a single coherent
+POSIX-compliant file system. The clients and servers communicate and
+coordinate among themselves via network protocols. A low-level
+protocol, LNet, abstracts the details of the underlying networking
+hardware and presents a uniform interface, originally based on Sandia
+Portals <>, to Lustre clients and servers. Lustre, in turn,
+layers its own protocol atop LNet. This document describes the Lustre
+protocol.
+
+Lustre runs across multiple hosts, coordinating the activities among
+those hosts via the exchange of messages over a network. On each host,
+Lustre is implemented via a collection of Linux processes (often
+called "threads"). This discussion will refer to a more formalized
+notion of 'processes' that abstracts some of the thread-level
+details. The description of the activities on each host comprises a
+collection of 'abstract processes'. Each abstract process may be
+thought of as a state machine, or automaton, following a fixed set of
+rules for how it consumes messages, changes state, and produces other
+messages.
We speak of the 'behavior' of a process as shorthand for the +management of its state and the rules governing what messages it can +consume and produce. Processes communicate with each other via +messages. The Lustre protocol is the collection of messages the +processes exchange along with the rules governing the behavior of +those processes. + diff --git a/ldlm.txt b/ldlm.txt new file mode 100644 index 0000000..93906ea --- /dev/null +++ b/ldlm.txt @@ -0,0 +1,159 @@ +The Lustre Distributed Lock Manager +----------------------------------- +[[ldlm]] + +The discussion of the LDLM is deferred for now. We'll get into it soon +enough. + +LDLM Structures +~~~~~~~~~~~~~~~ + +Lock Modes +^^^^^^^^^^ + +---- +typedef enum { + LCK_MINMODE = 0, + LCK_EX = 1, + LCK_PW = 2, + LCK_PR = 4, + LCK_CW = 8, + LCK_CR = 16, + LCK_NL = 32, + LCK_GROUP = 64, + LCK_COS = 128, + LCK_MAXMODE +} ldlm_mode_t; +---- + +LDLM Extent +^^^^^^^^^^^ + +---- +struct ldlm_extent { + __u64 start; + __u64 end; + __u64 gid; +}; +---- + +LDLM Flock Wire +^^^^^^^^^^^^^^^ + +---- +struct ldlm_flock_wire { + __u64 lfw_start; + __u64 lfw_end; + __u64 lfw_owner; + __u32 lfw_padding; + __u32 lfw_pid; +}; +---- + +LDLM Inode Bits +^^^^^^^^^^^^^^^ + +---- +struct ldlm_inodebits { + __u64 bits; +}; +---- + +LDLM Wire Policy Data +^^^^^^^^^^^^^^^^^^^^^ + +---- +typedef union { + struct ldlm_extent l_extent; + struct ldlm_flock_wire l_flock; + struct ldlm_inodebits l_inodebits; +} ldlm_wire_policy_data_t; +---- + +Resource Descriptor +^^^^^^^^^^^^^^^^^^^ + +---- +struct ldlm_resource_desc { + ldlm_type_t lr_type; + __u32 lr_padding; /* also fix lustre_swab_ldlm_resource_desc */ + struct ldlm_res_id lr_name; +}; +---- + +The 'ldlm_type_t' is given by one of these values: +---- +typedef enum { + LDLM_PLAIN = 10, + LDLM_EXTENT = 11, + LDLM_FLOCK = 12, + LDLM_IBITS = 13 +} ldlm_type_t; +---- + +Lock Descriptor +^^^^^^^^^^^^^^^ + +---- +struct ldlm_lock_desc { + struct ldlm_resource_desc l_resource; + ldlm_mode_t l_req_mode; + ldlm_mode_t l_granted_mode; + ldlm_wire_policy_data_t l_policy_data; +}; +---- + +Lock Request +^^^^^^^^^^^^ + +---- +#define LDLM_LOCKREQ_HANDLES 2 +struct ldlm_request { + __u32 lock_flags; + __u32 lock_count; + struct ldlm_lock_desc lock_desc; + struct lustre_handle lock_handle[LDLM_LOCKREQ_HANDLES]; +}; +---- + +Lock Reply +^^^^^^^^^^ + +---- +struct ldlm_reply { + __u32 lock_flags; + __u32 lock_padding; /* also fix lustre_swab_ldlm_reply */ + struct ldlm_lock_desc lock_desc; + struct lustre_handle lock_handle; + __u64 lock_policy_res1; + __u64 lock_policy_res2; +}; +---- + +Lock Value Block +^^^^^^^^^^^^^^^^ + +A lock value block is part of reply messages from servers when an +LDLM_ENQUEUE command has been issued. There are two varieties. Which +is chosen depends on the target. 
+
+----
+struct ost_lvb_v1 {
+ __u64 lvb_size;
+ obd_time lvb_mtime;
+ obd_time lvb_atime;
+ obd_time lvb_ctime;
+ __u64 lvb_blocks;
+};
+struct ost_lvb {
+ __u64 lvb_size;
+ obd_time lvb_mtime;
+ obd_time lvb_atime;
+ obd_time lvb_ctime;
+ __u64 lvb_blocks;
+ __u32 lvb_mtime_ns;
+ __u32 lvb_atime_ns;
+ __u32 lvb_ctime_ns;
+ __u32 lvb_padding;
+};
+----
diff --git a/llog.txt b/llog.txt
new file mode 100644
index 0000000..37df633
--- /dev/null
+++ b/llog.txt
@@ -0,0 +1,83 @@
+The Lustre Log Facility
+-----------------------
+[[llog]]
+
+LLOG Structures
+~~~~~~~~~~~~~~~
+
+LLOG Log ID
+^^^^^^^^^^^
+
+----
+struct llog_logid {
+ struct ost_id lgl_oi;
+ __u32 lgl_ogen;
+};
+----
+
+LLog Information
+^^^^^^^^^^^^^^^^
+
+----
+struct llogd_body {
+ struct llog_logid lgd_logid;
+ __u32 lgd_ctxt_idx;
+ __u32 lgd_llh_flags;
+ __u32 lgd_index;
+ __u32 lgd_saved_index;
+ __u32 lgd_len;
+ __u64 lgd_cur_offset;
+};
+----
+
+The lgd_llh_flags are:
+----
+enum llog_flag {
+ LLOG_F_ZAP_WHEN_EMPTY = 0x1,
+ LLOG_F_IS_CAT = 0x2,
+ LLOG_F_IS_PLAIN = 0x4,
+};
+----
+
+LLog Record Header
+^^^^^^^^^^^^^^^^^^
+
+----
+struct llog_rec_hdr {
+ __u32 lrh_len;
+ __u32 lrh_index;
+ __u32 lrh_type;
+ __u32 lrh_id;
+};
+----
+
+LLog Record Tail
+^^^^^^^^^^^^^^^^
+
+----
+struct llog_rec_tail {
+ __u32 lrt_len;
+ __u32 lrt_index;
+};
+----
+
+LLog Log Header Information
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+----
+struct llog_log_hdr {
+ struct llog_rec_hdr llh_hdr;
+ obd_time llh_timestamp;
+ __u32 llh_count;
+ __u32 llh_bitmap_offset;
+ __u32 llh_size;
+ __u32 llh_flags;
+ __u32 llh_cat_idx;
+ /* for a catalog the first plain slot is next to it */
+ struct obd_uuid llh_tgtuuid;
+ __u32 llh_reserved[LLOG_HEADER_SIZE/sizeof(__u32) - 23];
+ __u32 llh_bitmap[LLOG_BITMAP_BYTES/sizeof(__u32)];
+ struct llog_rec_tail llh_tail;
+};
+----
+
diff --git a/lustre_messages.txt b/lustre_messages.txt
new file mode 100644
index 0000000..ad3b858
--- /dev/null
+++ b/lustre_messages.txt
@@ -0,0 +1,477 @@
+Lustre Messages
+---------------
+[[lustre-messages]]
+
+A Lustre message is a sequence of bytes. The message begins with a
+message header and has between one and nine subsequences called
+"buffers". Each buffer has a structure (the size and meaning of the
+bytes) that corresponds to the 'struct' entities in the Data Types
+section. The header gives the number of buffers in its 'lm_bufcount'
+field. The first buffer in any message is always the message preamble
+('ptlrpc_body'). The operation code ('pb_opc' field) and the message
+type ('pb_type' field: request or reply?) in the preamble together
+specify the "format" of the message, where the format is the number
+and content of the remaining buffers. As a shorthand, it is useful to
+name each of these formats, and the following list gives all of the
+formats along with the number and content of the buffers for messages
+in that format. Note that while the combination of 'pb_opc' and
+'pb_type' uniquely determines a message's format, the reverse is not
+true. A given format may be used in many different messages.
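+
+The following sketch shows how such a message can be walked, buffer
+by buffer. The header structure is a simplified stand-in for the real
+message header in the Data Types section (only the buffer count and
+lengths are kept), and the 8-byte alignment is an assumption of this
+example:
+----
+#include <stdint.h>
+#include <stdio.h>
+
+#define MAX_BUFS 9                     /* messages carry at most nine */
+
+struct msg_sketch {
+        uint32_t lm_bufcount;          /* number of buffers that follow */
+        uint32_t lm_buflens[MAX_BUFS]; /* length of each buffer */
+};
+
+/* Buffer 0 is always the 'ptlrpc_body' preamble; the meaning of
+ * buffers 1..n-1 follows from the (pb_opc, pb_type) format. */
+static void walk_buffers(const struct msg_sketch *h)
+{
+        uint32_t offset = sizeof(*h);
+
+        for (uint32_t i = 0; i < h->lm_bufcount && i < MAX_BUFS; i++) {
+                printf("buffer %u: offset %u, length %u\n",
+                       i, offset, h->lm_buflens[i]);
+                offset += (h->lm_buflens[i] + 7) & ~7u; /* assumed */
+        }
+}
+----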
+
+The "Empty" Message
+~~~~~~~~~~~~~~~~~~~
+
+'empty'
+^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_STATFS | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'empty' message is one that consists only of the Lustre message
+preamble 'ptlrpc_body'. It occurs as either the request or the reply
+(or both) in a variety of Lustre operations.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+|====
+
+The LDLM Enqueue Client Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'ldlm_enqueue_client'
+^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LDLM_ENQUEUE | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'ldlm_enqueue_client' message format is used to acquire a lock.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| ldlm_request | details of the lock request
+|====
+
+An 'ldlm_enqueue_client' message concatenates two data elements into a
+single byte-stream. The two elements correspond to structures
+detailed in the <<ldlm,LDLM>> section.
+
+The LDLM Enqueue Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'ldlm_enqueue_server'
+^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LDLM_ENQUEUE | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'ldlm_enqueue_server' message format is used to inform a client
+about the status of its request for a lock.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| ldlm_reply | details of the lock request
+|====
+
+An 'ldlm_enqueue_server' message concatenates two data elements
+into a single byte-stream. The two elements correspond to structures
+detailed in the <<ldlm,LDLM>> section.
+
+The LDLM Enqueue Server Message with LVB
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'ldlm_enqueue_lvb_server'
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LDLM_ENQUEUE | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'ldlm_enqueue_lvb_server' message format is used to inform a client
+about the status of its request for a lock.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| ldlm_reply | details of the lock request
+| ost_lvb | lock value block
+|====
+
+An 'ldlm_enqueue_lvb_server' message concatenates three data elements
+into a single byte-stream. It closely resembles the
+'ldlm_enqueue_server' message with the addition of a lock value
+block. The three elements correspond to structures detailed in the
+<<ldlm,LDLM>> section.
+
+The LLog Origin Handle Create Client
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llog_origin_handle_create_client'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_CREATE | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'llog_origin_handle_create_client' message format is used to
+ask for the creation of a log entry object.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llogd_body | LLog description
+| string | The name of the desired log
+|====
+
+Fixme: I don't actually see where the string gets set.
+
+The LLog Service (body-only) Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llogd_body_only'
+^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_CREATE | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'llogd_body_only' message replies with information about the log.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llogd_body | LLog description
+|====
+
+
+The LLog Log (header-only) Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llog_log_hdr_only'
+^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_READ_HEADER | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'llog_log_hdr_only' message replies with header information from
+the log.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llog_log_hdr | LLog log header info
+|====
+
+
+The LLog Return Next Block from Server
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llog_origin_handle_next_block_server'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_NEXT_BLOCK | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'llog_origin_handle_next_block_server' message replies with the
+next block of data from the log.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llogd_body | LLog description
+| eadata | variable length field for extended attributes
+|====
+
+
+The MDS Getattr Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mds_getattr_server'
+^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_GETATTR | PTL_RPC_MSG_REPLY
+|====
+
+An 'mds_getattr_server' message format is used to convey MDT data along
+with information about the Lustre capabilities of that MDT.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mdt_body | Information about the MDT
+| MDT_MD | OST stripe and index info
+| ACL | ACLs for the fid
+| lustre_capa | Security capa info for fid1
+| lustre_capa | Security capa info for fid2
+|====
+
+An 'mds_getattr_server' message concatenates six data elements into
+a single byte-stream. The elements correspond to structures
+detailed in the Data Types section.
+
+The MDT Capability Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mdt_body_capa'
+^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_GETSTATUS | PTL_RPC_MSG_REPLY
+| MDS_GETATTR | PTL_RPC_MSG_REQUEST
+|====
+
+An 'mdt_body_capa' message format is used to convey MDT data along
+with information about the Lustre capabilities of that MDT.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mdt_body | Information about the MDT
+| lustre_capa | security capa info
+|====
+
+An 'mdt_body_capa' message concatenates three data elements into
+a single byte-stream. The three elements correspond to structures
+detailed in the Data Types section.
+
+The MDT "Body Only" Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mdt_body_only'
+^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_GETSTATUS | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'mdt_body_only' message format is used to convey MDT data.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mdt_body | Information about the MDT
+|====
+
+An 'mdt_body_only' message concatenates two data elements into
+a single byte-stream. The two elements correspond to structures
+detailed in the Data Types section.
+
+The MGS Config Read Client Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mgs_config_read_client'
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MGS_CONFIG_READ | PTL_RPC_MSG_REQUEST
+|====
+
+An 'mgs_config_read_client' message requests configuration data from
+the MGS.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mgs_config_body | Information about the MGS supporting the request
+|====
+
+
+The MGS Config Read Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mgs_config_read_server'
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MGS_CONFIG_READ | PTL_RPC_MSG_REPLY
+|====
+
+An 'mgs_config_read_server' message returns configuration data from
+the MGS.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mgs_config_body | Information about the MGS supporting the request
+|====
+
+
+The OBD Connect Client Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'obd_connect_client'
+^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_CONNECT | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'obd_connect_client' message format is used to initiate the
+connection from one host to a target on another host. Clients will
+connect to the MGS, to MDTs on MDSs, and to OSTs on OSSs. Furthermore,
+MDSs and OSSs will connect to the MGS, and MDSs will connect to OSTs
+on OSSs. In each case the host initiating the connection request
+sends an 'obd_connect_client' message. The reply to this message is
+the obd_connect_server message.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| obd_uuid | UUID of the target
+| obd_uuid | UUID of the client
+| lustre_handle | connection handle
+| obd_connect_data | connection data
+|====
+
+An 'obd_connect_client' message concatenates five data elements into
+a single byte-stream. The five elements correspond to structures
+detailed in the Data Types section.
+
+The connection handle sent in a client connection request message is
+unique to that connection. The server notes it, and a connection
+request with a new or different handle indicates that the client is
+attempting to make a new connection (or a delayed message about an
+old one?).
+
+The OBD Connect Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'obd_connect_server'
+^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_CONNECT | PTL_RPC_MSG_REPLY
+|====
+
+The 'obd_connect_server' message format is sent by a server in reply
+to an 'obd_connect_client' connection request message directed to a
+target on that server. MGSs, MDSs, and OSSs send these replies.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| obd_connect_data | connection data
+|====
+
+An 'obd_connect_server' message concatenates two data elements into a
+single byte-stream.
+
+The OBD Statfs Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'obd_statfs_server'
+^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_STATFS | PTL_RPC_MSG_REPLY
+|====
+
+The 'obd_statfs_server' message returns 'statfs' system call
+information to the client.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| obd_statfs | statfs system call info
+|====
+
diff --git a/lustre_operations.txt b/lustre_operations.txt
new file mode 100644
index 0000000..96ff8d3
--- /dev/null
+++ b/lustre_operations.txt
@@ -0,0 +1,342 @@
+Lustre Operations
+-----------------
+[[lustre-operations]]
+
+Lustre operations are denoted by the 'pb_opc' op-code field of the
+message preamble. Each operation is implemented as a pair of messages,
+with the 'pb_type' field set to PTL_RPC_MSG_REQUEST for requests
+initiating the operation, and PTL_RPC_MSG_REPLY for replies. Note
+that, as a general matter, the receipt by a client of the reply
+message only assures the client that the server has initiated the
+action, if any. See the discussion on <> and <> for how the client is
+given confirmation that a request has been completed.
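+
+Because 'pb_opc' and 'pb_type' together determine a message's format,
+a receiver can dispatch on the pair. The sketch below (invented
+constant values and names; the real constants live in the Lustre
+headers) uses two of the formats tabulated in
+<<lustre-messages,Lustre Messages>>:
+----
+#include <stdint.h>
+
+enum { REQUEST = 1, REPLY = 2 };  /* stand-ins for PTL_RPC_MSG_* */
+enum { OPC_MDS_GETATTR = 33, OPC_MDS_STATFS = 41 };
+
+/* Map an (opcode, type) pair to the format name used in this text. */
+static const char *format_of(uint32_t pb_opc, uint32_t pb_type)
+{
+        switch (pb_opc) {
+        case OPC_MDS_GETATTR:
+                return pb_type == REQUEST ? "mdt_body_capa"
+                                          : "mds_getattr_server";
+        case OPC_MDS_STATFS:
+                return pb_type == REQUEST ? "empty"
+                                          : "obd_statfs_server";
+        default:
+                return "unknown";
+        }
+}
+----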
+
+Command 8: OST CONNECT - Client connection to an OST
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.OST_CONNECT (8)
+[options="header"]
+|====
+| request | reply
+| obd_connect_client | obd_connect_server
+|====
+
+When a client initiates a connection to a specific target on an OSS,
+it does so by sending an 'obd_connect_client' message and awaiting the
+reply from the OSS of an 'obd_connect_server' message. From a previous
+interaction with the MGS the client knows the UUID of the target OST,
+and can fill that value into the 'obd_connect_client' message.
+
+The 'ocd_connect_flags' field is set to (fixme: what?) reflecting the
+capabilities appropriate to the client. The 'ocd_brw_size' is set to the
+largest value for the size of an RPC that the client can handle. The
+'ocd_ibits_known' and 'ocd_checksum_types' values are set to what the client
+considers appropriate. Other fields in the preamble and
+'obd_connect_data' structures are zero, as is the 'lustre_handle'
+element.
+
+Once the server receives the 'obd_connect_client' message on behalf of
+the given target it replies with an 'obd_connect_server' message. In
+that message the server sends the 'pb_handle' to uniquely
+identify the connection for subsequent communication. The client notes
+that handle in its import for the given target.
+
+fixme: Are there circumstances that could lead to the 'status'
+value in the reply being non-zero? What would lead to that and what
+error values would result?
+
+The target maintains the last committed transaction for a client in
+its export for that client. If this is the first connection, then that
+last transaction value would just be zero. If there were previous
+transactions for the client, then the transaction number for the last
+such committed transaction is put in the 'pb_last_committed' field.
+
+In a connection request the operation is not file system modifying, so
+the 'pb_transno' value will be zero in the reply as well.
+
+fixme: there is still some work to be done about how the fields are
+managed.
+
+Command 9: OST DISCONNECT - Disconnect client from an OST
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.OST_DISCONNECT (9)
+[options="header"]
+|====
+| request | reply
+| empty | empty
+|====
+
+The information exchanged in a DISCONNECT message is that normally
+conveyed in the message preamble.
+
+Command 33: MDS GETATTR - Get MDS Attributes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_GETATTR (33)
+[options="header"]
+|====
+| request | reply
+| mdt_body_capa | mds_getattr_server
+|====
+
+
+
+Command 38: MDS CONNECT - Client connection to an MDS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_CONNECT (38)
+[options="header"]
+|====
+| request | reply
+| obd_connect_client | obd_connect_server
+|====
+
+N.B. This is nearly identical to the explanation for OST_CONNECT and
+for MGS_CONNECT. We may want to simplify and/or unify the discussion
+and only call out how this one differs from a generic CONNECT
+operation.
+
+When a client initiates a connection to a specific target on an MDS,
+it does so by sending an 'obd_connect_client' message and awaiting the
+reply from the MDS of an 'obd_connect_server' message. From a previous
+interaction with the MGS the client knows the UUID of the target MDT,
+and can fill that value into the 'obd_connect_client' message.
+
+The 'ocd_connect_flags' field is set to (fixme: what?) reflecting the
+capabilities appropriate to the client. The 'ocd_brw_size' is set to the
+largest value for the size of an RPC that the client can handle. The
+'ocd_ibits_known' and 'ocd_checksum_types' values are set to what the client
+considers appropriate. Other fields in the preamble and
+'obd_connect_data' structures are zero, as is the 'lustre_handle'
+element.
+
+Once the server receives the 'obd_connect_client' message on behalf of
+the given target it replies with an 'obd_connect_server' message. In
+that message the server sends the 'pb_handle' to uniquely
+identify the connection for subsequent communication. The client notes
+that handle in its import for the given target.
+
+fixme: Are there circumstances that could lead to the 'status'
+value in the reply being non-zero? What would lead to that and what
+error values would result?
+
+The target maintains the last committed transaction for a client in
+its export for that client. If this is the first connection, then that
+last transaction value would just be zero. If there were previous
+transactions for the client, then the transaction number for the last
+such committed transaction is put in the 'pb_last_committed' field.
+
+In a connection request the operation is not file system modifying, so
+the 'pb_transno' value will be zero in the reply as well.
+
+fixme: there is still some work to be done about how the fields are
+managed.
+
+Command 39: MDS DISCONNECT - Disconnect client from an MDS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_DISCONNECT (39)
+[options="header"]
+|====
+| request | reply
+| empty | empty
+|====
+
+The information exchanged in a DISCONNECT message is that normally
+conveyed in the message preamble.
+
+Command 40: MDS GETSTATUS - get the status from a target
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The MDS_GETSTATUS command targets a specific MDT. If there are several,
+the client will need to send a separate message for each.
+
+.MDS_GETSTATUS (40)
+[options="header"]
+|====
+| request | reply
+| mdt_body_only | mdt_body_capa
+|====
+
+The 'mdt_body_only' request message is one that provides information
+about a specific MDT, identifying which (among several possible)
+target is being asked about.
+
+In the reply there is additional information about the MDT's
+capabilities.
+
+Command 41: MDS STATFS - get statfs data from the server
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_STATFS (41)
+[options="header"]
+|====
+| request | reply
+| empty | obd_statfs_server
+|====
+
+The 'empty' request message is one that only has the 'ptlrpc_body'
+data encoded. The fields have their generic values for a request from
+this client, with 'pb_opc' being set to MDS_STATFS (41).
+
+In the reply there is, in addition to the 'ptlrpc_body', data relevant
+to a 'statfs' system call.
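+
+The reply's 'obd_statfs' payload carries the usual block and inode
+counts. As a sketch of what a client does with them (the field names
+here are assumed for the example; the authoritative definition is the
+'obd_statfs' structure in the Data Types section):
+----
+#include <stdint.h>
+#include <stdio.h>
+
+struct statfs_sketch {
+        uint64_t os_blocks;     /* total blocks on the target */
+        uint64_t os_bavail;     /* blocks available to ordinary users */
+        uint32_t os_bsize;      /* block size in bytes */
+};
+
+int main(void)
+{
+        struct statfs_sketch st = { 1000000, 250000, 4096 };
+
+        /* The same arithmetic a 'df' on the client performs. */
+        printf("size %llu MB, avail %llu MB\n",
+               (unsigned long long)(st.os_blocks * st.os_bsize >> 20),
+               (unsigned long long)(st.os_bavail * st.os_bsize >> 20));
+        return 0;
+}
+----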
+
+Command 101: LDLM ENQUEUE - Enqueue a lock request
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LDLM_ENQUEUE (101)
+[options="header"]
+|====
+| request | reply
+| ldlm_enqueue_client | ldlm_enqueue_lvb_server
+|====
+
+In order to access data on disk at a target the client needs to
+establish a lock (either concurrent for reads or exclusive for
+writes). The client sends the 'ldlm_enqueue_client' message to the
+server holding the target, and the server will reply with an
+'ldlm_enqueue_server' message. (N.B. In actuality it is an
+'ldlm_enqueue_lvb_server' message with the length of the 'struct
+ost_lvb' portion set to zero, which amounts to the same thing.)
+
+Command 250: MGS CONNECT - Client connection to an MGS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MGS_CONNECT (250)
+[options="header"]
+|====
+| request | reply
+| obd_connect_client | obd_connect_server
+|====
+
+When a client initiates a connection to the MGS,
+it does so by sending an 'obd_connect_client' message and awaiting the
+reply from the MGS of an 'obd_connect_server' message. This is the
+first operation carried out by a client upon the issue of a 'mount'
+command, and the target UUID is provided on the command line.
+
+The target UUID is just "MGS", and the client UUID is set to the
+32-byte string it gets from ... where?
+
+The 'struct lustre_handle' (the fourth buffer in the message) has its
+cookie set to ... what? It is set, but where does it come from?
+
+The 'ocd_connect_flags' field is set to (fixme: what?) reflecting the
+capabilities appropriate to the client. The 'ocd_brw_size' is set to the
+largest value for the size of an RPC that the client can handle. The
+'ocd_ibits_known' and 'ocd_checksum_types' values are set to what the client
+considers appropriate. Other fields in the preamble and
+'obd_connect_data' structures are zero.
+
+Once the server receives the 'obd_connect_client' message on behalf of
+the given target it replies with an 'obd_connect_server' message. In
+that message the server sends the 'pb_handle' to uniquely
+identify the connection for subsequent communication. The client notes
+that handle in its import for the given target.
+
+fixme: Are there circumstances that could lead to the 'status'
+value in the reply being non-zero? What would lead to that and what
+error values would result?
+
+The target maintains the last committed transaction for a client in
+its export for that client. If this is the first connection, then that
+last transaction value would just be zero. If there were previous
+transactions for the client, then the transaction number for the last
+such committed transaction is put in the 'pb_last_committed' field.
+
+In a connection request the operation is not file system modifying, so
+the 'pb_transno' value will be zero in the reply as well.
+
+Command 251: MGS DISCONNECT - Disconnect client from an MGS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MGS_DISCONNECT (251)
+[options="header"]
+|====
+| request | reply
+| empty | empty
+|====
+
+N.B. The usual 'struct req_format' definition does not exist for
+MGS_DISCONNECT. It will take a more careful code review to verify that
+it also has 'empty' messages going back and forth.
+
+The information exchanged in a DISCONNECT message is that normally
+conveyed in the message preamble.
+
+Command 256: MGS CONFIG READ - Read MGS configuration info
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MGS_CONFIG_READ (256)
+[options="header"]
+|====
+| request | reply
+| mgs_config_read_client | mgs_config_read_server
+|====
+
+Command 501: LLOG ORIGIN HANDLE CREATE - Create llog handle
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LLOG_ORIGIN_HANDLE_CREATE (501)
+[options="header"]
+|====
+| request | reply
+| llog_origin_handle_create_client | llogd_body_only
+|====
+
+Command 502: LLOG ORIGIN HANDLE NEXT BLOCK - Read the next block
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LLOG_ORIGIN_HANDLE_NEXT_BLOCK (502)
+[options="header"]
+|====
+| request | reply
+| llogd_body_only | llog_origin_handle_next_block_server
+|====
+
+Command 503: LLOG ORIGIN HANDLE READ HEADER - Read handle header
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LLOG_ORIGIN_HANDLE_READ_HEADER (503)
+[options="header"]
+|====
+| request | reply
+| llogd_body_only | llog_log_hdr_only
+|====
+
diff --git a/protocol.txt b/protocol.txt
index 4b3df34..88b5087 100644
--- a/protocol.txt
+++ b/protocol.txt
@@ -13,308 +13,33 @@ v0.0, January 2015
:keywords: PtlRPC, Lustre, Protocol
-:numbered!:
[abstract]
Abstract
--------
The Lustre parallel file
-system <> provides a global POSIX <> namespace for the
-computing resources of a data center. Lustre runs on Linux-based hosts
-via kernel modules, and delegates block storage management to the
-back-end servers while providing object-based storage to its
-clients. Servers are responsible for both data objects (the contents
-of actual files) and index objects (for directory information). Data
-objects are gathered on Object Storage Servers (OSSs), and index
-objects are stored on MetaData Storage Servers (MDSs). Each back-end
-storage volume is a target with Object Storage Targets (OSTs) on OSSs,
-and MetaData Storage Targets (MDTs) on MDSs.
Clients assemble the -data from the MDT(s) and OST(s) to present a single coherent -POSIX-compliant file system. The clients and servers communicate and -coordinate among themselves via network protocols. A low-level -protocol, LNet, abstracts the details of the underlying networking -hardware and presents a uniform interface, originally based on Sandia -Portals <>, to Lustre clients and servers. Lustre, in turn, -layers its own protocol PtlRPC atop LNet. This document describes the -Lustre protocols, including how they are implemeted via PtlRPC and the -Lustre Distributed Lock Manager (based on the VAX/VMS Distributed Lock -Manager <>). This document does not describe Lustre itself in -any detail, except where doing so explains concepts that allow this -document to be self-contained. +include::introduction.txt[] -:numbered: - -Overview --------- - -'Content to be provided' - -Messages --------- - -These are the messages that traverse the network using PTLRPC. - -[NOTE] -This initial list combines some actual message names or types with the -POSIX semantic operations they are being used to implement, as well as -a few other underlying mechanisms (cf. "grant"). A subsequent -refinement will separate the various items and relate them to one -another. - -Client-MDS RPCs for POSIX namespace operations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -'Content to be provided' - -=== mount === - -'Content to be provided' - -=== unmount === - -'Content to be provided' - -=== create === - -'Content to be provided' - -=== open === - -'Content to be provided' - -=== close === - -'Content to be provided' - -=== unlink === - -'Content to be provided' - -=== mkdir === - -image:mkdir1.png[mkdir] - -=== rmdir === - -'Content to be provided' - -=== rename === - -'Content to be provided' - -=== link === - -'Content to be provided' - -=== symlink === - -'Content to be provided' - -=== getattr === - -'Content to be provided' - -=== setattr === - -'Content to be provided' - -=== statfs === - -'Content to be provided' - -=== ... === - -'Content to be provided' - - -Client-MDS RPCs for internal state management -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -'Content to be provided' - -=== connect === - -'Content to be provided' - -=== disconnect === - -'Content to be provided' - -=== FLD === - -'Content to be provided' - -=== SEQ === - -'Content to be provided' - -=== PING === +include::data_types.txt[] -'Content to be provided' +include::connection.txt[] -=== LDLM === +include::timeouts.txt[] -'Content to be provided' +include::file_id.txt[] -=== ... === +include::ldlm.txt[] -'Content to be provided' +include::llog.txt[] -Client-OSS RPCs for IO Operations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +include::recovery.txt[] -'Content to be provided' +include::security.txt[] -=== read === +include::lustre_messages.txt[] -'Content to be provided' +include::lustre_operations.txt[] -=== write === - -'Content to be provided' - -=== truncate === - -'Content to be provided' - -=== setattr === - -'Content to be provided' - -=== grant === - -'Content to be provided' - -=== ... === - -'Content to be provided' - -MDS-OSS RPCs for internal state management -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -'Content to be provided' - -=== object precreation === - -'Content to be provided' - -=== orphan recovery === - -'Content to be provided' - -=== UID/GID change === - -'Content to be provided' - -=== unlink === - -'Content to be provided' - -=== ... 
=== - -'Content to be provided' - -MDS-OSS RPCs for quota management -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -'Content to be provided' - - -MDS-OSS OUT RPCs for distributed updates -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -'Content to be provided' - -=== DNE1 remote directories === - -'Content to be provided' - -=== DNE2 striped directories === - -'Content to be provided' - -=== LFSCK2/3 verification and repair === - -'Content to be provided' - -Message Flows -------------- - - Each file operation (in Lustre) generates a set of messages in a - particular sequence. There is one sequence for any particular - concrete operation, but under varying circumstances the same file - operation may generate a different sequence. - -State Machines --------------- - - For each File operation, the collection of possible sequences of - messages is governed by a state machine. +include::file_system_operations.txt[] :numbered!: -[glossary] -Glossary --------- -Here are some common terms used in discussing Lustre, POSIX semantics, -and the prtocols used to implement them. - -[glossary] -Object Storage Server:: - An object storage server (OSS) is a computer responsible for - running Lustre kernel services in support of managing bulk data - objects on the underlying storage. There can be multiple OSSs in a - Lustre file system. - -MetaData Server:: - A metadata server (MDS) is a computer responsible for running the - Lustre kernel services in support of managing the POSIX-compliant - name space and the indices associating files in that name space with - the locations of their corresponding objects. As of v2.4 there can - be multiple MDSs in a Lustre file system. - -Object Storage Target:: - An object storage target (OST) is the service provided by an OSS - that mediates the placement of data objects on the specific - underlying file system hardware. There can be multiple OSTs on a - given OSS. - -MetaData Target:: - A metadata target (MDT) is the service provided by an MDS that - mediates the management of name space indices on the underlying file - system hardware. As of v2.4 there can be multiple MDTs on an MDS. - -server:: - A computer providing a service, such as an OSS or an MDS - -target:: - Storage available to be served, such as an OST or an MDT. Also the - service being provided. - -protocol:: - An agreed upon formalism for communicating between two entities, - such as between two servers or between a client and a server. - -client:: - A computer taking advantage of a service provided by a server, such - as a Lustre client using MDS(s) and OSS(s) to assemble a - POSIX-compliant file system with its namespace and data storage - capabilities. - -PtlRPC:: - The protocol (or set of protocols) implemented via RPCs that is - (are) employed by Lustre to communicate between its clients and - servers. - -Remote Procedure Call:: - A mechanism for implementing operations involving one computer - acting on the behalf of another (RPC). - -LNet:: - A lower level protocol employed by PtlRPC to abstract the mechanisms - provided by various hardware centric protocols, such as TCP or - Infiniband. - -[appendix] -Concepts --------- - -'Content to be provided' +include::glossary.txt[] [appendix] License diff --git a/recovery.txt b/recovery.txt new file mode 100644 index 0000000..e18edfb --- /dev/null +++ b/recovery.txt @@ -0,0 +1,7 @@ +Recovery +-------- +[[recovery]] + +The discussion of recovery will be developed when the "normal" +operations are fairly complete. 
+
+
diff --git a/security.txt b/security.txt
new file mode 100644
index 0000000..d4a02f8
--- /dev/null
+++ b/security.txt
@@ -0,0 +1,8 @@
+Security
+--------
+[[security]]
+
+The discussion of security is deferred until later. It is a side issue
+to most of what is being documented here. It is seldom employed, and
+may have older and buggy code. At least I think that's what Mike was
+telling me.
diff --git a/timeouts.txt b/timeouts.txt
new file mode 100644
index 0000000..db3c3cd
--- /dev/null
+++ b/timeouts.txt
@@ -0,0 +1,25 @@
+
+Timeouts
+--------
+
+Timeouts in Lustre are handled by grouping connections into classes;
+the timeout appropriate for a client's connections to OSTs, for
+example, is a common property of all the OST connections on that
+client. That timeout may differ from the timeout for other connections
+or replies. A full list of the classes and how they are managed will
+wait for now.
+
+Every request message indicates a timeout value and every reply answers
+with the value the service will honor. Initial connection requests
+propose a value for the timeout, and subsequent requests and replies
+pass that value back and forth as part of the message header
+('pb_timeout').
+
+Service Times
+-------------
+
+Closely related to the timeouts in Lustre are the service times that
+are expected and observed for each connection class. A request will
+always list the service time as 0 in the message header
+('pb_service_time'). The reply lists the time the server actually
+took to send the reply.
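+
+A sketch of the client-side bookkeeping implied by the above (the
+names are invented for the example): the request's deadline starts
+from the proposed 'pb_timeout', and an early reply extends the
+timeout interval by the 'pb_timeout' carried in that reply.
+----
+#include <stdint.h>
+
+struct req_timer {
+        uint64_t deadline;      /* absolute time to give up, seconds */
+};
+
+static void on_send(struct req_timer *t, uint64_t now,
+                    uint32_t pb_timeout)
+{
+        t->deadline = now + pb_timeout;
+}
+
+/* An early reply tells the requester to extend its timeout interval
+ * by the pb_timeout value found in the reply. */
+static void on_early_reply(struct req_timer *t, uint32_t reply_timeout)
+{
+        t->deadline += reply_timeout;
+}
+----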