FIGURES = figures/mkdir1.png
+TEXT = protocol.txt \
+ introduction.txt \
+ data_types.txt \
+ connection.txt \
+ timeouts.txt \
+ file_id.txt \
+ ldlm.txt \
+ llog.txt \
+ recovery.txt \
+ security.txt \
+ lustre_messages.txt \
+ lustre_operations.txt \
+ file_system_operations.txt \
+ glossary.txt
.SUFFIXES : .gnuplot .gv .pdf .png
all: protocol.html protocol.pdf
.PHONY: check
-check: protocol.txt
+check: $(TEXT)
@echo "Are there lines with trailing white space?"
-	build/whitespace.sh $<
+	build/whitespace.sh $^
-protocol.html: $(FIGURES) protocol.txt
+protocol.html: $(FIGURES) $(TEXT)
asciidoc protocol.txt
-protocol.pdf: $(FIGURES) protocol.txt
+protocol.pdf: $(FIGURES) $(TEXT)
a2x -f pdf --fop protocol.txt
.gv.png:
--- /dev/null
+Connections Between Lustre Entities
+-----------------------------------
+[[connection]]
+
+The Lustre protocol is connection-based in that any two communicating
+entities maintain shared, coordinated state information. The most
+common example of two such entities is a client and a target on some
+server. The target is identified by name to the client through an
+interaction with the management server. The client then 'connects' to
+the given target on the indicated server by sending the appropriate
+version of the *_CONNECT message (MGS_CONNECT, MDS_CONNECT, or
+OST_CONNECT - collectively *_CONNECT) and receiving back the
+corresponding *_CONNECT reply. The server creates an 'export' for the
+connection between the target and the client, and the export holds the
+server state information for that connection. When the client gets the
+reply it creates an 'import', and the import holds the client state
+information for that connection. Note that if a server has N targets
+and M clients have connected to them, the server will have N x M
+exports and each client will have N imports.
+
+There are also connections between the servers: Each MDS and OSS has a
+connection to the MGS, where the MDS (respectively the OSS) plays the
+role of the client in the above discussion. That is, the MDS initiates
+the connection and has an import for the MGS, while the MGS has an
+export for each MDS. Each MDS connects to each OST, with an import on
+the MDS and an export on the OSS. This connection supports requests
+from the MDS to the OST for 'statfs' information such as size and
+access time values. Each OSS also connects to the first MDS to get
+access to auxiliary services, with an import on the OSS and an export
+on the first MDS. The auxiliary services are: the FID Location
+Database (FLDB), the quota master service, and the sequence
+controller.
+
+Finally, for some communications the roles of message initiation and
+message reply are reversed. This is the case, for instance, with
+call-back operations. In that case the entity which would normally
+have an import has, instead, a 'reverse-export' and the
+other end of the connection maintains a 'reverse-import'. The
+reverse-import uses the same structure as a regular import, and the
+reverse-export uses the same structure as a regular export.
+
+Connection Structures
+~~~~~~~~~~~~~~~~~~~~~
+
+Connect Data
+^^^^^^^^^^^^
+
+An 'obd_connect_data' structure accompanies every connect operation in
+both the request message and in the reply message.
+
+----
+struct obd_connect_data {
+ __u64 ocd_connect_flags;
+ __u32 ocd_version; /* OBD_CONNECT_VERSION */
+ __u32 ocd_grant; /* OBD_CONNECT_GRANT */
+ __u32 ocd_index; /* OBD_CONNECT_INDEX */
+ __u32 ocd_brw_size; /* OBD_CONNECT_BRW_SIZE */
+ __u64 ocd_ibits_known; /* OBD_CONNECT_IBITS */
+ __u8 ocd_blocksize; /* OBD_CONNECT_GRANT_PARAM */
+ __u8 ocd_inodespace; /* OBD_CONNECT_GRANT_PARAM */
+ __u16 ocd_grant_extent; /* OBD_CONNECT_GRANT_PARAM */
+ __u32 ocd_unused;
+ __u64 ocd_transno; /* OBD_CONNECT_TRANSNO */
+ __u32 ocd_group; /* OBD_CONNECT_MDS */
+ __u32 ocd_cksum_types; /* OBD_CONNECT_CKSUM */
+ __u32 ocd_max_easize; /* OBD_CONNECT_MAX_EASIZE */
+ __u32 ocd_instance;
+ __u64 ocd_maxbytes; /* OBD_CONNECT_MAXBYTES */
+ __u64 padding1;
+ __u64 padding2;
+ __u64 padding3;
+ __u64 padding4;
+ __u64 padding5;
+ __u64 padding6;
+ __u64 padding7;
+ __u64 padding8;
+ __u64 padding9;
+ __u64 paddingA;
+ __u64 paddingB;
+ __u64 paddingC;
+ __u64 paddingD;
+ __u64 paddingE;
+ __u64 paddingF;
+};
+----
+
+The 'ocd_connect_flags' field encodes the connect flags giving the
+capabilities of a connection between client and target. Several of
+those flags (noted in comments above and the discussion below)
+actually control whether the remaining fields of 'obd_connect_data'
+get used. The [[connect-flags]] flags are:
+
+----
+#define OBD_CONNECT_RDONLY 0x1ULL /*client has read-only access*/
+#define OBD_CONNECT_INDEX 0x2ULL /*connect specific LOV idx */
+#define OBD_CONNECT_MDS 0x4ULL /*connect from MDT to OST */
+#define OBD_CONNECT_GRANT 0x8ULL /*OSC gets grant at connect */
+#define OBD_CONNECT_SRVLOCK 0x10ULL /*server takes locks for cli */
+#define OBD_CONNECT_VERSION 0x20ULL /*Lustre versions in ocd */
+#define OBD_CONNECT_REQPORTAL 0x40ULL /*Separate non-IO req portal */
+#define OBD_CONNECT_ACL 0x80ULL /*access control lists */
+#define OBD_CONNECT_XATTR 0x100ULL /*client use extended attr */
+#define OBD_CONNECT_CROW 0x200ULL /*MDS+OST create obj on write*/
+#define OBD_CONNECT_TRUNCLOCK 0x400ULL /*locks on server for punch */
+#define OBD_CONNECT_TRANSNO 0x800ULL /*replay sends init transno */
+#define OBD_CONNECT_IBITS 0x1000ULL /*support for inodebits locks*/
+#define OBD_CONNECT_JOIN 0x2000ULL /*files can be concatenated.
+ *We do not support JOIN FILE
+ *anymore, reserve this flags
+ *just for preventing such bit
+ *to be reused.*/
+#define OBD_CONNECT_ATTRFID 0x4000ULL /*Server can GetAttr By Fid*/
+#define OBD_CONNECT_NODEVOH 0x8000ULL /*No open hndl on specl nodes*/
+#define OBD_CONNECT_RMT_CLIENT 0x10000ULL /*Remote client */
+#define OBD_CONNECT_RMT_CLIENT_FORCE 0x20000ULL /*Remote client by force */
+#define OBD_CONNECT_BRW_SIZE 0x40000ULL /*Max bytes per rpc */
+#define OBD_CONNECT_QUOTA64 0x80000ULL /*Not used since 2.4 */
+#define OBD_CONNECT_MDS_CAPA 0x100000ULL /*MDS capability */
+#define OBD_CONNECT_OSS_CAPA 0x200000ULL /*OSS capability */
+#define OBD_CONNECT_CANCELSET 0x400000ULL /*Early batched cancels. */
+#define OBD_CONNECT_SOM 0x800000ULL /*Size on MDS */
+#define OBD_CONNECT_AT 0x1000000ULL /*client uses AT */
+#define OBD_CONNECT_LRU_RESIZE 0x2000000ULL /*LRU resize feature. */
+#define OBD_CONNECT_MDS_MDS 0x4000000ULL /*MDS-MDS connection */
+#define OBD_CONNECT_REAL 0x8000000ULL /*real connection */
+#define OBD_CONNECT_CHANGE_QS 0x10000000ULL /*Not used since 2.4 */
+#define OBD_CONNECT_CKSUM 0x20000000ULL /*support several cksum algos*/
+#define OBD_CONNECT_FID 0x40000000ULL /*FID is supported by server */
+#define OBD_CONNECT_VBR 0x80000000ULL /*version based recovery */
+#define OBD_CONNECT_LOV_V3 0x100000000ULL /*client supports LOV v3 EA */
+#define OBD_CONNECT_GRANT_SHRINK 0x200000000ULL /* support grant shrink */
+#define OBD_CONNECT_SKIP_ORPHAN 0x400000000ULL /* don't reuse orphan objids */
+#define OBD_CONNECT_MAX_EASIZE 0x800000000ULL /* preserved for large EA */
+#define OBD_CONNECT_FULL20 0x1000000000ULL /* it is 2.0 client */
+#define OBD_CONNECT_LAYOUTLOCK 0x2000000000ULL /* client uses layout lock */
+#define OBD_CONNECT_64BITHASH 0x4000000000ULL /* client supports 64-bits
+ * directory hash */
+#define OBD_CONNECT_MAXBYTES 0x8000000000ULL /* max stripe size */
+#define OBD_CONNECT_IMP_RECOV 0x10000000000ULL /* imp recovery support */
+#define OBD_CONNECT_JOBSTATS 0x20000000000ULL /* jobid in ptlrpc_body */
+#define OBD_CONNECT_UMASK 0x40000000000ULL /* create uses client umask */
+#define OBD_CONNECT_EINPROGRESS 0x80000000000ULL /* client handles -EINPROGRESS
+ * RPC error properly */
+#define OBD_CONNECT_GRANT_PARAM 0x100000000000ULL/* extra grant params used for
+ * finer space reservation */
+#define OBD_CONNECT_FLOCK_OWNER 0x200000000000ULL /* for the fixed 1.8
+ * policy and 2.x server */
+#define OBD_CONNECT_LVB_TYPE 0x400000000000ULL /* variable type of LVB */
+#define OBD_CONNECT_NANOSEC_TIME 0x800000000000ULL /* nanosecond timestamps */
+#define OBD_CONNECT_LIGHTWEIGHT 0x1000000000000ULL/* lightweight connection */
+#define OBD_CONNECT_SHORTIO 0x2000000000000ULL/* short io */
+#define OBD_CONNECT_PINGLESS 0x4000000000000ULL/* pings not required */
+#define OBD_CONNECT_FLOCK_DEAD 0x8000000000000ULL/* deadlock detection */
+#define OBD_CONNECT_DISP_STRIPE 0x10000000000000ULL/* create stripe disposition*/
+#define OBD_CONNECT_OPEN_BY_FID 0x20000000000000ULL /* open by fid won't pack
+ name in request */
+----
+
+Each flag corresponds to a particular capability that the client and
+target together will honor. A client will send a message including
+some subset of these capabilities during a connection request to a
+specific target. It tells the server what capabilities it has. The
+server then replies with the subset of those capabilities it agrees to
+honor (for the given target).
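+
+By way of illustration, the agreed feature set can be computed as the
+bitwise intersection of what the client proposes and what the server
+supports. This is only a sketch; 'server_supported_flags' is a
+hypothetical name, not a field from the sources.
+
+----
+/* Sketch: the server replies with the subset of the client's
+ * proposed connect flags that it also supports. */
+__u64 negotiate_connect_flags(__u64 client_flags,
+                              __u64 server_supported_flags)
+{
+        return client_flags & server_supported_flags;
+}
+----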
+
+If the OBD_CONNECT_VERSION flag is set then the 'ocd_version' field is
+honored. The 'ocd_version' gives an encoding of the Lustre
+version. For example, Version 2.7.32 would be the hexadecimal number
+0x02073200.
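+
+A one-byte-per-component packing along these lines produces that
+encoding (a sketch modeled on the 'OBD_OCD_VERSION' macro in the
+Lustre sources):
+
+----
+/* Pack one version component per byte, most significant first
+ * (modeled on OBD_OCD_VERSION() in the Lustre tree). */
+#define OCD_VERSION(major, minor, patch, fix) \
+        (((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))
+----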
+
+If the OBD_CONNECT_GRANT flag is set then the 'ocd_grant' field is
+honored. The 'ocd_grant' value in a reply (to a connection request)
+sets the client's grant.
+
+If the OBD_CONNECT_INDEX flag is set then the 'ocd_index' field is
+honored. The 'ocd_index' value is set in a reply to a connection
+request. It holds the LOV index of the target.
+
+If the OBD_CONNECT_BRW_SIZE flag is set then the 'ocd_brw_size' field
+is honored. The 'ocd_brw_size' value sets the size of the maximum
+supported RPC. The client proposes a value in its connection request,
+and the server's reply will either agree or further limit the size.
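+
+The server side of that negotiation can only ever shrink the proposed
+value, never enlarge it. A minimal sketch, with hypothetical names:
+
+----
+/* Sketch: clamp the client's proposed maximum RPC size to what the
+ * server is willing to support. */
+__u32 negotiate_brw_size(__u32 client_proposed, __u32 server_max)
+{
+        return client_proposed < server_max ? client_proposed : server_max;
+}
+----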
+
+If the OBD_CONNECT_IBITS flag is set then the 'ocd_ibits_known' field
+is honored. The 'ocd_ibits_known' value determines the handling of
+locks on inodes. See the discussion of inodes and extended attributes.
+
+If the OBD_CONNECT_GRANT_PARAM flag is set then the 'ocd_blocksize',
+'ocd_inodespace', and 'ocd_grant_extent' fields are honored. A server
+reply uses the 'ocd_blocksize' value to inform the client of the log
+base two of the size in bytes of the backend file system's blocks.
+
+A server reply uses the 'ocd_inodespace' value to inform the client of
+the log base two of the size of an inode.
+
+Under some circumstances (for example when ZFS is the back end file
+system) there may be additional overhead in handling writes for each
+extent. The server uses the 'ocd_grant_extent' value to inform the
+client of the size in bytes consumed from its grant on the server when
+creating a new file. The client uses this value in calculating how
+much dirty write cache it has and whether it has reached the limit
+established by the target's grant.
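+
+The following sketch shows how a client might charge cached writes
+against its grant using these three fields. It is illustrative only
+and assumes a simple per-extent overhead model; the names are
+hypothetical.
+
+----
+/* Sketch: estimate the grant consumed by 'bytes_dirty' bytes of
+ * cached writes spread over 'nr_extents' extents. */
+__u64 grant_consumed(__u64 bytes_dirty, __u32 nr_extents,
+                     __u8 ocd_blocksize, __u16 ocd_grant_extent)
+{
+        __u64 bsize = 1ULL << ocd_blocksize;   /* backend block size */
+        __u64 blocks = (bytes_dirty + bsize - 1) / bsize; /* round up */
+
+        /* whole blocks consumed, plus a fixed overhead per extent */
+        return blocks * bsize + (__u64)nr_extents * ocd_grant_extent;
+}
+----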
+
+If the OBD_CONNECT_TRANSNO flag is set then the 'ocd_transno' field is
+honored. A server uses the 'ocd_transno' value during recovery to
+inform the client of the transaction number at which it should begin
+replay.
+
+If the OBD_CONNECT_MDS flag is set then the 'ocd_group' field is
+honored. When an MDT connects to an OST the 'ocd_group' field informs
+the OSS of the MDT's index. Objects on that OST for that MDT will be
+in a common namespace served by that MDT.
+
+If the OBD_CONNECT_CKSUM flag is set then the 'ocd_cksum_types' field
+is honored. The client uses the 'ocd_cksum_types' field to propose
+to the server the client's available (presumably hardware assisted)
+checksum mechanisms. The server replies with the checksum types it has
+available. Finally, the client will employ the fastest of the agreed
+mechanisms.
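+
+A sketch of that negotiation follows. The flag values are modeled on
+the OBD_CKSUM_* defines in the sources, and the preference order shown
+is illustrative.
+
+----
+#define CKSUM_CRC32  0x1
+#define CKSUM_ADLER  0x2
+#define CKSUM_CRC32C 0x4        /* commonly hardware assisted */
+
+/* Sketch: intersect the two sides' checksum types, then pick the
+ * fastest algorithm both sides agreed to. */
+__u32 pick_cksum_type(__u32 client_types, __u32 server_types)
+{
+        __u32 common = client_types & server_types;
+
+        if (common & CKSUM_CRC32C)
+                return CKSUM_CRC32C;
+        if (common & CKSUM_ADLER)
+                return CKSUM_ADLER;
+        if (common & CKSUM_CRC32)
+                return CKSUM_CRC32;
+        return 0;               /* no agreed checksum type */
+}
+----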
+
+If the OBD_CONNECT_MAX_EASIZE flag is set then the 'ocd_max_easize'
+field is honored. The server uses 'ocd_max_easize' to inform the
+client about the amount of space that can be allocated in each inode
+for extended attributes. The 'ocd_max_easize' specifically refers to
+the space used for striping information. This allows the client to
+determine the maximum layout size (and hence stripe count) that can be
+stored on the MDT.
+
+The 'ocd_instance' field (alone) is not governed by an OBD_CONNECT_*
+flag. The MGS uses the 'ocd_instance' value in its reply to a
+connection request to inform the server and target of the "era" of its
+connection. The MGS initializes the era value for each server to zero
+and increments that value every time the target connects. This
+supports imperative recovery.
+
+If the OBD_CONNECT_MAXBYTES flag is set then the 'ocd_maxbytes' field
+is honored. An OSS uses the 'ocd_maxbytes' value to inform the client
+of the maximum OST object size for this target. A stripe on any OST
+for a multi-striped file cannot be larger than the minimum of the
+'ocd_maxbytes' values across the file's OSTs.
+
+The additional space in the 'obd_connect_data' structure is unused and
+reserved for future use.
+
+fixme: Discuss the meaning of the rest of the OBD_CONNECT_* flags.
+
+Import
+^^^^^^
+
+----
+#define IMP_STATE_HIST_LEN 16
+struct import_state_hist {
+ enum lustre_imp_state ish_state;
+ time_t ish_time;
+};
+struct obd_import {
+ struct portals_handle imp_handle;
+ atomic_t imp_refcount;
+ struct lustre_handle imp_dlm_handle;
+ struct ptlrpc_connection *imp_connection;
+ struct ptlrpc_client *imp_client;
+ cfs_list_t imp_pinger_chain;
+ cfs_list_t imp_zombie_chain;
+ cfs_list_t imp_replay_list;
+ cfs_list_t imp_sending_list;
+ cfs_list_t imp_delayed_list;
+ cfs_list_t imp_committed_list;
+ cfs_list_t *imp_replay_cursor;
+ struct obd_device *imp_obd;
+ struct ptlrpc_sec *imp_sec;
+ struct mutex imp_sec_mutex;
+ cfs_time_t imp_sec_expire;
+ wait_queue_head_t imp_recovery_waitq;
+ atomic_t imp_inflight;
+ atomic_t imp_unregistering;
+ atomic_t imp_replay_inflight;
+ atomic_t imp_inval_count;
+ atomic_t imp_timeouts;
+ enum lustre_imp_state imp_state;
+ struct import_state_hist imp_state_hist[IMP_STATE_HIST_LEN];
+ int imp_state_hist_idx;
+ int imp_generation;
+ __u32 imp_conn_cnt;
+ int imp_last_generation_checked;
+ __u64 imp_last_replay_transno;
+ __u64 imp_peer_committed_transno;
+ __u64 imp_last_transno_checked;
+ struct lustre_handle imp_remote_handle;
+ cfs_time_t imp_next_ping;
+ __u64 imp_last_success_conn;
+ cfs_list_t imp_conn_list;
+ struct obd_import_conn *imp_conn_current;
+ spinlock_t imp_lock;
+ /* flags */
+ unsigned long
+ imp_no_timeout:1,
+ imp_invalid:1,
+ imp_deactive:1,
+ imp_replayable:1,
+ imp_dlm_fake:1,
+ imp_server_timeout:1,
+ imp_delayed_recovery:1,
+ imp_no_lock_replay:1,
+ imp_vbr_failed:1,
+ imp_force_verify:1,
+ imp_force_next_verify:1,
+ imp_pingable:1,
+ imp_resend_replay:1,
+ imp_no_pinger_recover:1,
+ imp_need_mne_swab:1,
+ imp_force_reconnect:1,
+ imp_connect_tried:1;
+ __u32 imp_connect_op;
+ struct obd_connect_data imp_connect_data;
+ __u64 imp_connect_flags_orig;
+ int imp_connect_error;
+ __u32 imp_msg_magic;
+ __u32 imp_msghdr_flags; /* adjusted based on server capability */
+ struct ptlrpc_request_pool *imp_rq_pool; /* emergency request pool */
+ struct imp_at imp_at; /* adaptive timeout data */
+ time_t imp_last_reply_time; /* for health check */
+};
+----
+
+The 'imp_handle' value is the unique id for the import, and is used as
+a hash key to gain access to it. It is not used in any of the Lustre
+protocol messages, but rather is just for internal reference.
+
+The 'imp_refcount' is also for internal use. The value is incremented
+with each RPC created, and decremented as the request is freed. When
+the reference count is zero the import can be freed, as when the
+target is being disconnected.
+
+The 'imp_dlm_handle' is a reference to the LDLM export for this
+client.
+
+There can be multiple paths through the network to a given
+target, in which case there would be multiple 'obd_import_conn' items
+on the 'imp_conn_list'. Each 'obd_import_conn' includes a
+'ptlrpc_connection', so 'imp_connection' points to the one that is
+actually in use.
+
+The 'imp_client' identifies the (local) portals for sending and
+receiving messages as well as the client's name. The information is
+specific to either an MDC or an OSC.
+
+The 'imp_pinger_chain' places the import on a linked list of imports
+that need periodic pings.
+
+The 'imp_zombie_chain' places the import on a list ready for being
+freed. Unused imports ('imp_refcount' is zero) are deleted
+asynchronously by a garbage collecting process.
+
+In order to support recovery the client must keep requests that are in
+the process of being handled by the target. The target replies to a
+request as soon as the target has made its local update to
+memory. When the client receives that reply the request is put on the
+'imp_replay_list'. In the event of a failure (target crash, lost
+message) this list is then replayed for the target during the recovery
+process. When a request has been sent but has not yet received a reply
+it is placed on the 'imp_sending_list'. In the event of a failure
+those will simply be replayed after any recovery has been
+completed. Finally, there may be requests that the client is delaying
+before it sends them. This can happen if the client is in a degraded
+mode, as when it is in recovery after a failure. These requests are
+put on the 'imp_delayed_list' and not processed until recovery is
+complete and the 'imp_sending_list' has been replayed.
+
+In order to support recovery 'open' requests must be preserved even
+after they have completed. Those requests are placed on the
+'imp_committed_list' and the 'imp_replay_cursor' allows for
+accelerated access to those items.
+
+The 'imp_obd' is a reference to the details about the target device
+that is the subject of this import. There is a lot of state info in
+there along with many implementation details that are not relevant to
+the actual Lustre protocol. fixme: I'll want to go through all of the
+fields in that structure to see which, if any need more
+documentation.
+
+The security policy and settings are kept in 'imp_sec', and
+'imp_sec_mutex' helps manage access to that info. The 'imp_sec_expire'
+setting is in support of security policies that have an expiration
+strategy.
+
+Some processes may need the import to be in a fully connected state in
+order to proceed. The 'imp_recovery_waitq' is where those threads will
+wait during recovery.
+
+The 'imp_inflight' field counts the number of in-flight requests. It
+is incremented with each request sent and decremented with each reply
+received.
+
+The client reserves buffers for the processing of requests and
+replies, and then informs LNet about those buffers. Buffers may get
+reused during subsequent processing, but then a point may come when
+the buffer is no longer going to be used. The client increments the
+'imp_unregistering' counter and informs LNet the buffer is no longer
+needed. When LNet has freed the buffer it will notify the client and
+then the 'imp_unregistering' can be decremented again.
+
+During recovery the 'imp_replay_inflight' counts the number of requests
+from the replay list that have been sent and have not been replied to.
+
+The 'imp_inval_count' field counts how many threads are in the process
+of cleaning up this connection or waiting for cleanup to complete. The
+cleanup itself may be needed in the case there is an eviction or other
+problem (fixme what other problem?). The cleanup may involve freeing
+allocated resources, updating internal state, running replay lists,
+and invalidating cache. Since it could take a while there may end up
+being multiple threads waiting on this process to complete.
+
+The 'imp_timeouts' field is a counter that is incremented every time
+there is a timeout in communication with the target.
+
+The 'imp_state' tracks the state of the import. It draws from the
+enumerated set of values:
+
+.enum_lustre_imp_state
+[options="header"]
+|=====
+| state name | value
+| LUSTRE_IMP_CLOSED | 1
+| LUSTRE_IMP_NEW | 2
+| LUSTRE_IMP_DISCON | 3
+| LUSTRE_IMP_CONNECTING | 4
+| LUSTRE_IMP_REPLAY | 5
+| LUSTRE_IMP_REPLAY_LOCKS | 6
+| LUSTRE_IMP_REPLAY_WAIT | 7
+| LUSTRE_IMP_RECOVER | 8
+| LUSTRE_IMP_FULL | 9
+| LUSTRE_IMP_EVICTED | 10
+|=====
+fixme: what are the transitions between these states?
+
+The 'imp_state_hist' array maintains a list of the last 16
+(IMP_STATE_HIST_LEN) states the import was in, along with the time it
+entered each (fixme: or is it when it left that state?). The list is
+maintained in a circular manner, so the 'imp_state_hist_idx' points to
+the entry in the list for the most recently visited state.
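+
+A sketch of the bookkeeping implied above, recording the entry time
+for each state (per the fixme, it may in fact be the exit time):
+
+----
+/* Sketch: record a state transition in the circular history. */
+void import_set_state(struct obd_import *imp,
+                      enum lustre_imp_state state)
+{
+        /* advance the cursor so it points at the newest entry */
+        int idx = (imp->imp_state_hist_idx + 1) % IMP_STATE_HIST_LEN;
+
+        imp->imp_state_hist[idx].ish_state = state;
+        imp->imp_state_hist[idx].ish_time = time(NULL);
+        imp->imp_state_hist_idx = idx;
+        imp->imp_state = state;
+}
+----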
+
+The 'imp_generation' and 'imp_conn_cnt' fields are monotonically
+increasing counters. Every time a connection request is sent to the
+target the 'imp_conn_cnt' counter is incremented, and every time a
+reply is received for the connection request the 'imp_generation'
+counter is incremented.
+
+The 'imp_last_generation_checked' implements an optimization. When a
+replay process has successfully traversed the reply list the
+'imp_generation' value is noted here. If the generation has not
+incremented then the replay list does not need to be traversed again.
+
+During replay the 'imp_last_replay_transno' is set to the transaction
+number of the last request being replayed, and
+'imp_peer_committed_transno' is set to the 'pb_last_committed' value
+(of the 'ptlrpc_body') from replies if that value is higher than the
+previous 'imp_peer_committed_transno'. The 'imp_last_transno_checked'
+field implements an optimization. It is set to the
+'imp_last_replay_transno' as its replay is initiated. If
+'imp_last_transno_checked' is still 'imp_last_replay_transno' and
+'imp_generation' is still 'imp_last_generation_checked' then there
+are no additional requests ready to be removed from the replay
+list. Furthermore, 'imp_last_transno_checked' may no longer be needed,
+since the committed transactions are now maintained on a separate list.
+
+The 'imp_remote_handle' is the handle sent by the target in a
+connection reply message to uniquely identify the export for this
+target and client that is maintained on the server. This is the handle
+used in all subsequent messages to the target.
+
+There are two separate ping intervals (fixme: what are the
+values?). If there are no uncommitted messages for the target then the
+default ping interval is used to set the 'imp_next_ping' to the time
+the next ping needs to be sent. If there are uncommitted requests then
+a "short interval" is used to set the time for the next ping.
+
+The 'imp_last_success_conn' value is set to the time of the last
+successful connection. fixme: The source says it is in 64 bit
+jiffies, but does not further indicate how that value is calculated.
+
+Since there can actually be multiple connection paths for a target
+(due to failover or multihomed configurations) the import maintains a
+list of all the possible connection paths in the list pointed to by
+the 'imp_conn_list' field. The 'imp_conn_current' points to the one
+currently in use. Compare with the 'imp_connection' field. They point
+to different structures, but each is reachable from the other.
+
+Most of the flag, state, and list information in the import needs to
+be accessed atomically. The 'imp_lock' is used to maintain the
+consistency of the import while it is manipulated by multiple threads.
+
+The various flags are documented in the source code and are largely
+obvious from those short comments, reproduced here:
+
+.import flags
+[options="header"]
+|=====
+| flag | explanation
+| imp_no_timeout | timeouts are disabled
+| imp_invalid | client has been evicted
+| imp_deactive | client administratively disabled
+| imp_replayable | try to recover the import
+| imp_dlm_fake | don't run recovery (timeout instead)
+| imp_server_timeout | use 1/2 timeout on MDSs and OSCs
+| imp_delayed_recovery | VBR: imp in delayed recovery
+| imp_no_lock_replay | VBR: if gap was found then no lock replays
+| imp_vbr_failed | recovery by versions was failed
+| imp_force_verify | force an immediate ping
+| imp_force_next_verify | force a scheduled ping
+| imp_pingable | target is pingable
+| imp_resend_replay | resend for replay
+| imp_no_pinger_recover | disable normal recovery, for test only.
+| imp_need_mne_swab | need IR MNE swab
+| imp_force_reconnect | import must be reconnected, not new connection
+| imp_connect_tried | import has tried to connect with server
+|=====
+
+A few additional notes are in order. The 'imp_dlm_fake' flag signifies
+that this is not a "real" import, but rather it is a "reverse" import
+in support of the LDLM. When the LDLM invokes callback operations the
+messages are initiated at the other end, so there needs to be a fake
+import to receive the replies from the operation.
+
+Prior to the introduction of adaptive timeouts the servers were given
+fixed timeout values that were half those used for the clients. The
+'imp_server_timeout' flag indicated that the import should use the
+half-sized timeouts, but with the introduction of adaptive timeouts
+this facility is no longer used.
+
+"VBR" is "version based recovery", and it introduces a new possibility
+for handling requests. Previously, if there were a gap in the
+transaction number sequence then the requests associated with the
+missing transaction numbers would be discarded. With VBR those
+transactions only need to be discarded if there is an actual
+dependency between the ones that were skipped and the latest committed
+transaction. fixme: What are the circumstances that would lead to
+setting the 'imp_force_next_verify' or 'imp_pingable' flags?
+
+During recovery, the client sets the 'imp_no_pinger_recover' flag,
+which tells the process to proceed from the current value of
+'imp_last_replay_transno'. The 'imp_need_mne_swab' flag indicates a
+version dependent circumstance where swabbing was inadvertently left
+out of one processing step.
+
+
+Export
+^^^^^^
+
+An 'obd_export' structure for a given target is created on a server
+for each client that connects to that target. The exports for all the
+clients for a given target are managed together. The export represents
+the connection state between the client and target as well as the
+current state of any ongoing activity. Thus each pending request will
+have a reference to the export. The export is discarded if the
+connection goes away, but only after all the references to it have
+been cleaned up. The state information for each export is also
+maintained on disk. In the event of a server failure, that or another
+server can read the export data from disk to enable recovery.
+
+----
+struct obd_export {
+ struct portals_handle exp_handle;
+ atomic_t exp_refcount;
+ atomic_t exp_rpc_count;
+ atomic_t exp_cb_count;
+ atomic_t exp_replay_count;
+ atomic_t exp_locks_count;
+#if LUSTRE_TRACKS_LOCK_EXP_REFS
+ cfs_list_t exp_locks_list;
+ spinlock_t exp_locks_list_guard;
+#endif
+ struct obd_uuid exp_client_uuid;
+ cfs_list_t exp_obd_chain;
+ cfs_hlist_node_t exp_uuid_hash;
+ cfs_hlist_node_t exp_nid_hash;
+ cfs_list_t exp_obd_chain_timed;
+ struct obd_device *exp_obd;
+ struct obd_import *exp_imp_reverse;
+ struct nid_stat *exp_nid_stats;
+ struct ptlrpc_connection *exp_connection;
+ __u32 exp_conn_cnt;
+ cfs_hash_t *exp_lock_hash;
+ cfs_hash_t *exp_flock_hash;
+ cfs_list_t exp_outstanding_replies;
+ cfs_list_t exp_uncommitted_replies;
+ spinlock_t exp_uncommitted_replies_lock;
+ __u64 exp_last_committed;
+ cfs_time_t exp_last_request_time;
+ cfs_list_t exp_req_replay_queue;
+ spinlock_t exp_lock;
+ struct obd_connect_data exp_connect_data;
+ enum obd_option exp_flags;
+ unsigned long
+ exp_failed:1,
+ exp_in_recovery:1,
+ exp_disconnected:1,
+ exp_connecting:1,
+ exp_delayed:1,
+ exp_vbr_failed:1,
+ exp_req_replay_needed:1,
+ exp_lock_replay_needed:1,
+ exp_need_sync:1,
+ exp_flvr_changed:1,
+ exp_flvr_adapt:1,
+ exp_libclient:1,
+ exp_need_mne_swab:1;
+ enum lustre_sec_part exp_sp_peer;
+ struct sptlrpc_flavor exp_flvr;
+ struct sptlrpc_flavor exp_flvr_old[2];
+ cfs_time_t exp_flvr_expire[2];
+ spinlock_t exp_rpc_lock;
+ cfs_list_t exp_hp_rpcs;
+ cfs_list_t exp_reg_rpcs;
+ cfs_list_t exp_bl_list;
+ spinlock_t exp_bl_list_lock;
+ union {
+ struct tg_export_data eu_target_data;
+ struct mdt_export_data eu_mdt_data;
+ struct filter_export_data eu_filter_data;
+ struct ec_export_data eu_ec_data;
+ struct mgs_export_data eu_mgs_data;
+ } u;
+ struct nodemap *exp_nodemap;
+};
+----
+
+The 'exp_handle' holds a little extra information as compared with a
+'struct lustre_handle', which is just the cookie. The cookie that the
+server generates to uniquely identify this connection gets put into
+this structure along with additional information about the device in
+question. This is the cookie the *_CONNECT reply sends back to the
+client, where it is then stored in the client's import.
+
+The 'exp_refcount' gets incremented whenever some aspect of the export
+is "in use". The arrival of an otherwise unprocessed message for this
+target will increment the refcount. A reference by an LDLM lock that
+gets taken will increment the refcount. Callback invocations and
+replay also lead to incrementing the refcount. The next four fields -
+'exp_rpc_count', 'exp_cb_count', 'exp_replay_count', and
+'exp_locks_count' - all subcategorize the 'exp_refcount' for debug
+purposes. Similarly, the 'exp_locks_list' and 'exp_locks_list_guard'
+are further debug info that lists the actual locks accounted in
+'exp_locks_count'.
+
+The 'exp_client_uuid' gives the UUID of the client connected to this
+export. Fixme: when and how does the UUID get generated?
+
+The server maintains all the exports for a given target on a circular
+list. Each export's place on that list is maintained in the
+'exp_obd_chain'. A common activity is to look up the export based on
+the UUID or the nid of the client, and the 'exp_uuid_hash' and
+'exp_nid_hash' fields maintain this export's place in hashes
+constructed for that purpose.
+
+Exports are also maintained on a list sorted by the last time the
+corresponding client was heard from. The 'exp_obd_chain_timed' field
+maintains the export's place on that list. When a message arrives from
+the client the time is "now" so the export gets put at the end of the
+list. Since it is circular, the next export is then the oldest. If the
+corresponding client has not been heard from within its timeout
+interval that export is marked for later eviction.
+
+The 'exp_obd' points to the 'obd_device' structure for the device that
+is the target of this export.
+
+In the event of a call-back the export needs to have the ability to
+initiate messages back to the client. The 'exp_imp_reverse' provides a
+"reverse" import that manages this capability.
+
+The '/proc' stats for the export (and the target) get updated via the
+'exp_nid_stats'.
+
+The 'exp_connection' points to the connection information for this
+export. This is the information about the actual networking pathway(s)
+that get used for communication.
+
+The 'exp_conn_cnt' notes the connection count value from the client at
+the time of the connection. In the event that more than one connection
+request is issued before the connection is established then the
+'exp_conn_cnt' will list the highest value. If a previous connection
+attempt (with a lower value) arrives later it may be safely
+discarded. Every request lists its connection count, so non-connection
+requests with lower connection count values can also be discarded.
+Note that this does not count how many times the client has connected
+to the target. If a client is evicted the export is deleted once it
+has been cleaned up and its 'exp_refcount' reduced to zero. A new
+connection from the client will get a new export.
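+
+A sketch of the staleness test this makes possible (illustrative
+only):
+
+----
+/* Sketch: a request carrying a connection count lower than the
+ * export's current one belongs to an out-of-date connection attempt
+ * and can safely be discarded. */
+int request_is_stale(const struct obd_export *exp, __u32 pb_conn_cnt)
+{
+        return pb_conn_cnt < exp->exp_conn_cnt;
+}
+----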
+
+The 'exp_lock_hash' provides access to the locks granted to the
+corresponding client for this target. If a lock cannot be granted it
+is discarded. A file system lock ("flock") is also implemented through
+the LDLM lock system, but not all LDLM locks are flocks. The ones that
+are flocks are gathered in a hash 'exp_flock_hash'. This supports
+deadlock detection.
+
+For those requests that initiate file system modifying transactions
+the request and its attendant locks need to be preserved until either
+a) the client acknowledges receiving the reply, or b) the transaction
+has been committed locally. This ensures a request can be replayed in
+the event of a failure. The reply is kept on the
+'exp_outstanding_replies' list until the LNet layer notifies the
+server that the reply has been acknowledged. A reply is kept on the
+'exp_uncommitted_replies' list until the transaction (if any) has been
+committed.
+
+The 'exp_last_committed' value keeps the transaction number of the
+last committed transaction. Every reply to a client includes this
+value as a means of early-as-possible notification of transactions that
+have been committed.
+
+The 'exp_last_request_time' is self-explanatory.
+
+During recovery a request that is waiting for replay is maintained on
+the 'exp_req_replay_queue' list.
+
+The 'exp_lock' spin-lock is used for access control to the export's
+flags, as well as the 'exp_outstanding_replies' list and the reverse
+import, if any.
+
+The 'exp_connect_data' refers to an 'obd_connect_data' structure for
+the connection established between this target and the client this
+export refers to. See also the corresponding entry in the import and
+in the connect messages passed between the hosts.
+
+The 'exp_flags' field encodes three directives as follows:
+----
+enum obd_option {
+ OBD_OPT_FORCE = 0x0001,
+ OBD_OPT_FAILOVER = 0x0002,
+ OBD_OPT_ABORT_RECOV = 0x0004,
+};
+----
+fixme: Are these set for some exports as a condition of their
+existence, or do they reflect a transient state the export is passing
+through?
+
+The 'exp_failed' flag gets set whenever the target has failed for any
+reason or the export is otherwise due to be cleaned up. Once set it
+will not be unset in this export. Any subsequent connection between
+the client and the target would be governed by a new export.
+
+After a failure export data is retrieved from disk and the exports
+recreated. Exports created in this way will have their
+'exp_in_recovery' flag set. Once any outstanding requests and locks
+have been recovered for the client, then the export is recovered and
+'exp_in_recovery' can be cleared. When all the client exports for a
+given target have been recovered then the target is considered
+recovered, and when all targets have been recovered the server is
+considered recovered.
+
+A *_DISCONNECT message from the client will set the 'exp_disconnected'
+flag, as will any sort of failure of the target. Once set the export
+will be cleaned up and deleted.
+
+When a *_CONNECT message arrives the 'exp_connecting' flag is set. If
+for some reason a second *_CONNECT request arrives from the client it can
+be discarded when this flag is set.
+
+The 'exp_delayed' flag is no longer used. In older code it indicated
+that recovery had not completed in a timely fashion, but that a tardy
+recovery would still be possible, since there were no dependencies on
+the export.
+
+The 'exp_vbr_failed' flag indicates a failure during the recovery
+process. See <<recovery>> for a more detailed discussion of recovery
+and transaction replay. For a file system modifying request, the
+server composes its reply including the 'pb_pre_versions' entries in
+'ptlrpc_body', which indicate the most recent updates to the
+object. The client updates the request with the 'pb_transno' and
+'pb_pre_versions' from the reply, and keeps that request until the
+target signals that the transaction has been committed to disk. If the
+client times out without that confirmation then it will 'replay' the
+request, which now includes the 'pb_pre_versions' information. During
+a replay the target checks that the object has not been further
+modified beyond those 'pb_pre_versions'. If this check fails then the
+request is out of date, and the recovery process fails for the
+connection between this client and this target. At that point the
+'exp_vbr_failed' flag is set to indicate version based recovery
+failed. This will lead to the client being evicted and this export
+being cleaned up and deleted.
+
+At the start of recovery both the 'exp_req_replay_needed' and
+'exp_lock_replay_needed' flags are set. As request replay is completed
+the 'exp_req_replay_needed' flag is cleared. As lock replay is
+completed the 'exp_lock_replay_needed' flag is cleared. Once both are
+cleared the 'exp_in_recovery' flag can be cleared.
+
+The 'exp_need_sync' supports an optimization. At mount time it is
+likely that every client (potentially thousands) will create an export
+and that export will need to be saved to disk synchronously. This can
+lead to an unusually high and poorly performing interaction with the
+disk. When the export is created the 'exp_need_sync' flag is set and
+the actual writing to disk is delayed. As transactions arrive from
+clients (in a much less coordinated fashion) the 'exp_need_sync' flag
+indicates a need to save the export as well as the transaction. At
+that point the flag is cleared (but see below).
+
+In DNE (phase I) the export for an MDT managing the connection from
+another MDT will want to always keep the 'exp_need_sync' flag set. For
+that special case such an export sets the 'exp_keep_sync', which then
+prevents the 'exp_need_sync' flag from ever being cleared. This will
+no longer be needed in DNE Phase II.
+
+The 'exp_flvr_changed' and 'exp_flvr_adapt' flags along with
+'exp_sp_peer', 'exp_flvr', 'exp_flvr_old', and 'exp_flvr_expire'
+fields are all used to manage the security settings for the
+connection. Security is discussed in the <<security>> section. (fixme:
+or will be.)
+
+The 'exp_libclient' flag indicates that the export is for a client
+based on "liblustre". This allows for simplified handling on the
+server. (fixme: how is processing simplified? It sounds like I may
+need a whole special section on liblustre.)
+
+The 'exp_need_mne_swab' flag indicates the presence of an old bug that
+affected one special case of failed swabbing. It is not part of
+current processing.
+
+As RPCs arrive they are first subjected to triage. Each request is
+placed on the 'exp_hp_rpcs' list and examined to see if it is high
+priority (fixme: what constitutes high priority? PING, truncate, bulk
+I/O, ... others?). If it is not high priority then it is moved to the
+'exp_reg_rpcs' list. The 'exp_rpc_lock' protects both lists from
+concurrent access.
+
+All arriving LDLM requests get put on the 'exp_bl_list' and access to
+that list is controlled via the 'exp_bl_list_lock'.
+
+The union provides for target specific data. The 'eu_target_data' is
+for a common core of fields for a generic target. The others are
+specific to particular target types: 'eu_mdt_data' for MDTs,
+'eu_filter_data' for OSTs, 'eu_ec_data' for an "echo client" (fixme:
+describe what an echo client is somewhere), and 'eu_mgs_data' is for
+an MGS.
+
+The 'exp_bl_lock_at' field supports adaptive timeouts which will be
+discussed separately. (fixme: so discuss it somewhere.)
+
+Connection Count
+^^^^^^^^^^^^^^^^
+
+Each export maintains a connection count. fixme: Or is it just the
+management server?
--- /dev/null
+Data Structures and Defines
+---------------------------
+[[data-structs]]
+
+The following data types are used in the Lustre protocol description.
+
+Basic Data Types
+~~~~~~~~~~~~~~~~
+
+.Basic Data Types
+[options="header"]
+|=====
+| data types | size
+| __u8 | an 8-bit unsigned integer
+| __u16 | a 16-bit unsigned integer
+| __u32 | a 32-bit unsigned integer
+| __u64 | a 64-bit unsigned integer
+| __s64 | a 64-bit signed integer
+| obd_time | an __s64
+|=====
+
+
+Other Data Types
+~~~~~~~~~~~~~~~~
+
+The following topics introduce the various kinds of data that are
+represented and manipulated in Lustre messages and representations of
+the shared state on clients and servers.
+
+Grant
+^^^^^
+[[grant]]
+
+A grant value is part of a client's state for a given target. It
+provides an upper bound to the amount of dirty cache data the client
+will allow that is destined for the target. The value is established
+by agreement between the server and the client and represents a
+guarantee by the server that the target storage has space for the
+dirty data. The client can ask for additional grant, which the server
+may provide depending on how full the target is.
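+
+A minimal sketch of the client-side test this implies, with
+hypothetical names:
+
+----
+/* Sketch: a new dirty page may be cached only if the total dirty
+ * data stays within the grant for this target. */
+int can_cache_dirty(__u64 cur_dirty, __u64 new_bytes, __u64 grant)
+{
+        return cur_dirty + new_bytes <= grant;
+}
+----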
+
+LOV Index
+^^^^^^^^^
+[[lov-index]]
+
+Each target is assigned an LOV index (by the 'mkfs' command line) as
+the target is added to the file system. This value is stored in the
+MGS in order to identify its role in the file system.
+
+Transaction Number
+^^^^^^^^^^^^^^^^^^
+[[transno]]
+
+For each target there is a sequence of values (a strictly increasing
+series of numbers) where each operation that can modify the file
+system is assigned the next number in the series. This is the
+transaction number, and it imposes a strict serial ordering to all of
+the file system modifying operations. For file system modifying
+requests the server generates the next value in the sequence and
+informs the client of the value in the 'pb_transno' field of the
+'ptlrpc_body' of its reply to the client's request. For replies to
+requests that do not modify the file system the 'pb_transno' field in
+the 'ptlrpc_body' is just set to 0.
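+
+A sketch of the server-side assignment rule just described (not the
+actual server code):
+
+----
+/* Sketch: per-target transaction number assignment. */
+__u64 assign_transno(__u64 *last_transno, int modifies_fs)
+{
+        if (!modifies_fs)
+                return 0;          /* non-modifying replies carry 0 */
+        return ++(*last_transno);  /* strictly increasing per target */
+}
+----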
+
+Structured Data Types
+~~~~~~~~~~~~~~~~~~~~~
+
+Extended Attributes
+^^^^^^^^^^^^^^^^^^^
+
+I have not figured out how the so-called 'eadata' buffers are handled
+yet. I am told that this is not just for extended attributes, but is a
+generic structure.
+
+Lustre Capabilities
+^^^^^^^^^^^^^^^^^^^
+
+A 'lustre_capa' structure conveys details about the capabilities
+supported (or requested) between a client and a given target.
+
+----
+#define CAPA_HMAC_MAX_LEN 64
+struct lustre_capa {
+ struct lu_fid lc_fid;
+ __u64 lc_opc;
+ __u64 lc_uid;
+ __u64 lc_gid;
+ __u32 lc_flags;
+ __u32 lc_keyid;
+ __u32 lc_timeout;
+ __u32 lc_expiry;
+ __u8 lc_hmac[CAPA_HMAC_MAX_LEN];
+};
+----
+
+MDT Data
+^^^^^^^^
+
+An 'mdt_body' structure holds details about a given MDT.
+
+----
+struct mdt_body {
+ struct lu_fid fid1;
+ struct lu_fid fid2;
+ struct lustre_handle handle;
+ __u64 valid;
+ __u64 size;
+ obd_time mtime;
+ obd_time atime;
+ obd_time ctime;
+ __u64 blocks;
+ __u64 ioepoch;
+ __u64 t_state;
+ __u32 fsuid;
+ __u32 fsgid;
+ __u32 capability;
+ __u32 mode;
+ __u32 uid;
+ __u32 gid;
+ __u32 flags;
+ __u32 rdev;
+ __u32 nlink;
+ __u32 unused2;
+ __u32 suppgid;
+ __u32 eadatasize;
+ __u32 aclsize;
+ __u32 max_mdsize;
+ __u32 max_cookiesize;
+ __u32 uid_h;
+ __u32 gid_h;
+ __u32 padding_5;
+ __u64 padding_6;
+ __u64 padding_7;
+ __u64 padding_8;
+ __u64 padding_9;
+ __u64 padding_10;
+}; /* 216 */
+----
+
+MGS Configuration Reference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+----
+#define MTI_NAME_MAXLEN 64
+struct mgs_config_body {
+ char mcb_name[MTI_NAME_MAXLEN]; /* logname */
+ __u64 mcb_offset; /* next index of config log to request */
+ __u16 mcb_type; /* type of log: CONFIG_T_[CONFIG|RECOVER] */
+ __u8 mcb_reserved;
+ __u8 mcb_bits; /* bits unit size of config log */
+ __u32 mcb_units; /* # of units for bulk transfer */
+};
+----
+
+The 'mgs_config_body' structure has information identifying to the MGS
+which Lustre file system the client is asking about.
+
+MGS Configuration Data
+^^^^^^^^^^^^^^^^^^^^^^
+
+----
+struct mgs_config_res {
+ __u64 mcr_offset; /* index of last config log */
+ __u64 mcr_size; /* size of the log */
+};
+----
+
+The 'mgs_config_res' structure returns information about the Lustre
+file system.
+
+Lustre Handle
+^^^^^^^^^^^^^
+
+----
+struct lustre_handle {
+ __u64 cookie;
+};
+----
+
+A Lustre handle is a reference to an import or an export. Those
+objects maintain state about the connection between a given client
+and a given target. The import is on the client and the corresponding
+export is on the server.
+
+Lustre Message Header
+^^^^^^^^^^^^^^^^^^^^^
+[[lustre-message-header]]
+
+Every message has an initial header that informs the receiver about
+the size of the rest of the message to follow along with a few other
+details.
+
+----
+#define LUSTRE_MSG_MAGIC_V2 0x0BD00BD3
+#define MSGHDR_AT_SUPPORT 0x1
+struct lustre_msg_v2 {
+ __u32 lm_bufcount;
+ __u32 lm_secflvr;
+ __u32 lm_magic;
+ __u32 lm_repsize;
+ __u32 lm_cksum;
+ __u32 lm_flags;
+ __u32 lm_padding_2;
+ __u32 lm_padding_3;
+ __u32 lm_buflens[0];
+};
+#define lustre_msg lustre_msg_v2
+----
+
+The 'lm_bufcount' field gives the number of buffers that will follow
+the header. The header and sequence of buffers constitutes one
+message. Each of the buffers is a sequence of bytes whose contents
+corresponds to one of the structures described in this section. There
+will always be at least one, and no message has more than eight.
+
+The 'lm_secflvr' field gives an indication of whether any sort of
+cryptographic encoding of the subsequent buffers will be in force. The
+value is zero if there is no "crypto" and gives a code identifying the
+"flavor" of crypto if it is employed. Further, if crypto is employed
+there will only be one buffer following (i.e. 'lm_bufcount' = 1), and
+that buffer is an encoding of what would otherwise have been the
+sequence of buffers normally following the header. This document will
+defer all discussion of cryptography. A chapter is planned that will
+address it separately.
+
+The 'lm_magic' field is a "magic" value (LUSTRE_MSG_MAGIC_V2) that is
+checked in order to positively identify that the message is intended
+for the use to which it is being put. That is, we are indeed dealing
+with a Lustre message, and not, for example, corrupted memory or a bad
+pointer.
+
+The 'lm_repsize' field is an indication from the sender of an action
+request of the maximum available space that has been set aside for
+any reply to the request. A reply that attempts to use more than that
+much space will be discarded.
+
+The 'lm_cksum' has to do with the <<security>> settings for the
+cluster. Fixme: This may not be in current use. We need to verify.
+
+The 'lm_flags' field can be set to enable adaptive timeouts support
+with the value MSGHDR_AT_SUPPORT.
+
+The 'lm_padding*' fields are reserved for future use.
+
+The array of 'lm_buflens' values has 'lm_bufcount' entries. Each
+entry corresponds to, and gives the length of, one of the buffers that
+will follow.
+
+The entire header is required to be a multiple of eight bytes
+long. Thus there may need to be an extra four bytes of padding after
+the 'lm_buflens' array if that array has an odd number of entries.
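+
+Putting the header rules together, the total on-wire size of a message
+can be computed as in the following sketch, which assumes (as in the
+Lustre sources) that each buffer is also padded out to a multiple of
+eight bytes:
+
+----
+#define LUSTRE_MSG_ALIGN(x) (((x) + 7) & ~7)
+
+/* Sketch: header (including the lm_buflens array) plus each buffer,
+ * all rounded up to 8-byte boundaries. */
+__u32 lustre_msg_size(const struct lustre_msg_v2 *msg)
+{
+        __u32 size = LUSTRE_MSG_ALIGN(sizeof(*msg) +
+                                      msg->lm_bufcount * sizeof(__u32));
+        __u32 i;
+
+        for (i = 0; i < msg->lm_bufcount; i++)
+                size += LUSTRE_MSG_ALIGN(msg->lm_buflens[i]);
+        return size;
+}
+----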
+
+OBD statfs
+^^^^^^^^^^
+
+The 'obd_statfs' structure defines fields that are used for returning
+the server's common 'statfs' data items to a client. It augments that data
+with some Lustre-specific information, and also has space allocated
+for future use by Lustre.
+
+----
+struct obd_statfs {
+ __u64 os_type;
+ __u64 os_blocks;
+ __u64 os_bfree;
+ __u64 os_bavail;
+ __u64 os_files;
+ __u64 os_ffree;
+ __u8 os_fsid[40];
+ __u32 os_bsize;
+ __u32 os_namelen;
+ __u64 os_maxbytes;
+ __u32 os_state; /**< obd_statfs_state OS_STATE_* flag */
+ __u32 os_fprecreated; /* objs available now to the caller */
+ /* used in QoS code to find preferred
+ * OSTs */
+ __u32 os_spare2;
+ __u32 os_spare3;
+ __u32 os_spare4;
+ __u32 os_spare5;
+ __u32 os_spare6;
+ __u32 os_spare7;
+ __u32 os_spare8;
+ __u32 os_spare9;
+};
+----
+
+Lustre Message Preamble
+^^^^^^^^^^^^^^^^^^^^^^^
+[[lustre-message-preamble]]
+
+Every Lustre message starts with both the above header and an
+additional set of fields (in its first "buffer") given by the 'struct
+ptlrpc_body_v3' structure. This preamble has information
+relevant to every message type. In particular, the Lustre message type
+is itself encoded in the 'pb_opc' Lustre operation number. The value
+of that op code determines what else will be in the message following
+the preamble.
+----
+#define PTLRPC_NUM_VERSIONS 4
+#define JOBSTATS_JOBID_SIZE 32
+struct ptlrpc_body_v3 {
+ struct lustre_handle pb_handle;
+ __u32 pb_type;
+ __u32 pb_version;
+ __u32 pb_opc;
+ __u32 pb_status;
+ __u64 pb_last_xid;
+ __u64 pb_last_seen;
+ __u64 pb_last_committed;
+ __u64 pb_transno;
+ __u32 pb_flags;
+ __u32 pb_op_flags;
+ __u32 pb_conn_cnt;
+ __u32 pb_timeout;
+ __u32 pb_service_time;
+ __u32 pb_limit;
+ __u64 pb_slv;
+ __u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
+ __u64 pb_padding[4];
+ char pb_jobid[JOBSTATS_JOBID_SIZE];
+};
+#define ptlrpc_body ptlrpc_body_v3
+----
+In a connection request, sent by a client to server and regarding a
+specific target, the 'pb_handle' is 0. In the reply to a connection
+request, sent by the server, the handle is a value uniquely
+identifying the target. Subsequent messages between this client and
+this server regarding this target will use this handle to gain
+access to their shared state. The handle is persistent across
+reconnects.
+
+The 'pb_type' is PTL_RPC_MSG_REQUEST in messages when they are
+initiated, it is PTL_RPC_MSG_REPLY in a reply, and it is
+PTL_RPC_MSG_ERR to convey that a message was received that could not
+be interpreted, that is, if it was corrupt or incomplete. The encoding
+of those type values is given by:
+----
+#define PTL_RPC_MSG_REQUEST 4711
+#define PTL_RPC_MSG_ERR 4712
+#define PTL_RPC_MSG_REPLY 4713
+----
+The error message type is only for responding to a message that failed
+to be interpreted as an actual message. Note that other errors, such
+as those that emerge from processing the actual message content, do
+not use the PTL_RPC_MSG_ERR type.
+
+The 'pb_version' identifies the version of the Lustre protocol and is
+derived from the following constants. The lower two bytes give the
+version of PtlRPC being employed in the message, and the upper two
+bytes encode the role of the host for the service being
+requested. That role is one of OBD, MDS, OST, DLM, LOG, or MGS.
+----
+#define PTLRPC_MSG_VERSION 0x00000003
+#define LUSTRE_VERSION_MASK 0xffff0000
+#define LUSTRE_OBD_VERSION 0x00010000
+#define LUSTRE_MDS_VERSION 0x00020000
+#define LUSTRE_OST_VERSION 0x00030000
+#define LUSTRE_DLM_VERSION 0x00040000
+#define LUSTRE_LOG_VERSION 0x00050000
+#define LUSTRE_MGS_VERSION 0x00060000
+----
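+
+A sketch of how these constants combine and how a service might check
+an incoming message (illustrative only):
+
+----
+/* Sketch: the low two bytes carry the PtlRPC version, the high two
+ * bytes the service role. */
+__u32 make_pb_version(__u32 service_version) /* e.g. LUSTRE_MDS_VERSION */
+{
+        return service_version | PTLRPC_MSG_VERSION;
+}
+
+int version_matches(__u32 pb_version, __u32 service_version)
+{
+        return (pb_version & LUSTRE_VERSION_MASK) == service_version;
+}
+----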
+
+The 'pb_opc' value (operation code) gives the actual Lustre operation
+that is the subject of this message. For example, MDS_CONNECT is a
+Lustre operation (number 38). The following list gives the name used
+and the value for each operation.
+----
+typedef enum {
+ OST_REPLY = 0,
+ OST_GETATTR = 1,
+ OST_SETATTR = 2,
+ OST_READ = 3,
+ OST_WRITE = 4,
+ OST_CREATE = 5,
+ OST_DESTROY = 6,
+ OST_GET_INFO = 7,
+ OST_CONNECT = 8,
+ OST_DISCONNECT = 9,
+ OST_PUNCH = 10,
+ OST_OPEN = 11,
+ OST_CLOSE = 12,
+ OST_STATFS = 13,
+ OST_SYNC = 16,
+ OST_SET_INFO = 17,
+ OST_QUOTACHECK = 18,
+ OST_QUOTACTL = 19,
+ OST_QUOTA_ADJUST_QUNIT = 20,
+ MDS_GETATTR = 33,
+ MDS_GETATTR_NAME = 34,
+ MDS_CLOSE = 35,
+ MDS_REINT = 36,
+ MDS_READPAGE = 37,
+ MDS_CONNECT = 38,
+ MDS_DISCONNECT = 39,
+ MDS_GETSTATUS = 40,
+ MDS_STATFS = 41,
+ MDS_PIN = 42,
+ MDS_UNPIN = 43,
+ MDS_SYNC = 44,
+ MDS_DONE_WRITING = 45,
+ MDS_SET_INFO = 46,
+ MDS_QUOTACHECK = 47,
+ MDS_QUOTACTL = 48,
+ MDS_GETXATTR = 49,
+ MDS_SETXATTR = 50,
+ MDS_WRITEPAGE = 51,
+ MDS_IS_SUBDIR = 52,
+ MDS_GET_INFO = 53,
+ MDS_HSM_STATE_GET = 54,
+ MDS_HSM_STATE_SET = 55,
+ MDS_HSM_ACTION = 56,
+ MDS_HSM_PROGRESS = 57,
+ MDS_HSM_REQUEST = 58,
+ MDS_HSM_CT_REGISTER = 59,
+ MDS_HSM_CT_UNREGISTER = 60,
+ MDS_SWAP_LAYOUTS = 61,
+ LDLM_ENQUEUE = 101,
+ LDLM_CONVERT = 102,
+ LDLM_CANCEL = 103,
+ LDLM_BL_CALLBACK = 104,
+ LDLM_CP_CALLBACK = 105,
+ LDLM_GL_CALLBACK = 106,
+ LDLM_SET_INFO = 107,
+ MGS_CONNECT = 250,
+ MGS_DISCONNECT = 251,
+ MGS_EXCEPTION = 252,
+ MGS_TARGET_REG = 253,
+ MGS_TARGET_DEL = 254,
+ MGS_SET_INFO = 255,
+ MGS_CONFIG_READ = 256,
+ OBD_PING = 400,
+ OBD_LOG_CANCEL = 401,
+ OBD_QC_CALLBACK = 402,
+ OBD_IDX_READ = 403,
+ LLOG_ORIGIN_HANDLE_CREATE = 501,
+ LLOG_ORIGIN_HANDLE_NEXT_BLOCK = 502,
+ LLOG_ORIGIN_HANDLE_READ_HEADER = 503,
+ LLOG_ORIGIN_HANDLE_WRITE_REC = 504,
+ LLOG_ORIGIN_HANDLE_CLOSE = 505,
+ LLOG_ORIGIN_CONNECT = 506,
+ LLOG_ORIGIN_HANDLE_PREV_BLOCK = 508,
+ LLOG_ORIGIN_HANDLE_DESTROY = 509,
+ QUOTA_DQACQ = 601,
+ QUOTA_DQREL = 602,
+ SEQ_QUERY = 700,
+ SEC_CTX_INIT = 801,
+ SEC_CTX_INIT_CONT = 802,
+ SEC_CTX_FINI = 803,
+ FLD_QUERY = 900,
+ FLD_READ = 901,
+ UPDATE_OBJ = 1000,
+ LAST_OPC
+} cmd_t;
+----
+The symbols and values above identify the operations Lustre uses in
+its protocol. They are examined in detail in the
+<<lustre-operations,Lustre Operations>> section. Lustre carries out
+each of these operations via the exchange of a pair of messages: a
+request and a reply. The details of each message are specific to each
+operation. The <<lustre-messages,Lustre Messages>> chapter discusses
+each message and its contents.
+
+The 'pb_status' value in a request message is set to the 'pid' of the
+process making the request. In a reply message, a zero indicates that
+the service successfully initiated the requested operation. If for
+some reason the operation could not be initiated (e.g. "permission
+denied") the status will encode the standard Linux kernel (POSIX)
+error code (e.g. EPERM).
+
+'pb_last_xid' and 'pb_last_seen' are not used.
+
+The 'pb_last_committed' value is always zero in a request. In a reply
+it is the highest transaction number that has been committed to
+storage. The transaction numbers are maintained on a per-target basis
+and each series of transaction numbers is a strictly increasing
+sequence. This field is set in any kind of reply message including
+pings and non-modifying transactions.
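+
+This piggybacked value lets the client prune its saved requests, as in
+the following sketch (compare the 'imp_replay_list' discussion in the
+Import section):
+
+----
+/* Sketch: a saved request whose transaction number the server
+ * reports as committed will never need to be replayed. */
+int can_drop_saved_request(__u64 req_transno, __u64 pb_last_committed)
+{
+        return req_transno != 0 && req_transno <= pb_last_committed;
+}
+----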
+
+The 'pb_transno' value is always zero in a new request. It is also
+zero for replies to operations that do not modify the file system. For
+replies to operations that do modify the file system it is the
+server-assigned value from the sequence of values associated with the
+given client and target. That transaction number is copied into the
+'pb_transno' field of the 'ptlrpc_body' of the original request. If the
+request has to be replayed it will include the transaction number.
+
+The 'pb_flags' value governs the client state machine. Fixme: document
+what the states and transitions are of this state machine. Currently,
+only the bottom two bytes are used, and they encode state according to
+the following values:
+----
+#define MSG_GEN_FLAG_MASK 0x0000ffff
+#define MSG_LAST_REPLAY 0x0001
+#define MSG_RESENT 0x0002
+#define MSG_REPLAY 0x0004
+#define MSG_DELAY_REPLAY 0x0010
+#define MSG_VERSION_REPLAY 0x0020
+#define MSG_REQ_REPLAY_DONE 0x0040
+#define MSG_LOCK_REPLAY_DONE 0x0080
+----
+
+The 'pb_op_flags' value governs the client connection status state
+machine. Fixme: document what the states and transitions are of this
+state machine.
+----
+#define MSG_CONNECT_RECOVERING 0x00000001
+#define MSG_CONNECT_RECONNECT 0x00000002
+#define MSG_CONNECT_REPLAYABLE 0x00000004
+#define MSG_CONNECT_LIBCLIENT 0x00000010
+#define MSG_CONNECT_INITIAL 0x00000020
+#define MSG_CONNECT_ASYNC 0x00000040
+#define MSG_CONNECT_NEXT_VER 0x00000080
+#define MSG_CONNECT_TRANSNO 0x00000100
+----
+In normal operation an initial request to connect will set
+'pb_op_flags' to MSG_CONNECT_INITIAL and MSG_CONNECT_NEXT_VER. The
+reply to that connection request (and all other, non-connect, requests
+and replies) will set 'pb_op_flags' to 0.
+
+The 'pb_conn_cnt' (connection count) value in a request message
+reports the client's "era", which is part of the client and server's
+shared state. The value of the era is initialized to one when it is
+first connected to the MDT. Each subsequent connection (after an
+eviction) increments the era for the client. Since the 'pb_conn_cnt'
+reflects the client's era at the time the message was composed the
+server can use this value to discard late-arriving messages requesting
+operations on out-of-date shared state.
+
+The 'pb_timeout' value in a request indicates how long (in seconds)
+the requester plans to wait before timing out the operation. That is,
+the corresponding reply for this message should arrive within this
+time frame. The service may extend this time frame via an "early
+reply", which is a reply to this message that notifies the requester
+that it should extend its timeout interval by the value of the
+'pb_timeout' field in the reply. The "early reply" does not indicate
+the operation has actually been initiated. Clients maintain multiple
+request queues, called "portals", and each type of operation is
+assigned to one of these queues. There is a timeout value associated
+with each queue, and the timeout update affects all the messages
+associated with the given queue, not just the specific message that
+initiated the request. Finally, in a reply message (one that does
+indicate the operation has been initiated) the timeout value updates
+the timeout interval for the queue. fixme: Is this last point
+different from the "early reply" update?
+
+The 'pb_service_time' value is zero in a request. In a reply it
+indicates how long this particular operation actually took from the
+time it first arrived in the request queue (at the service) to the
+time the server replied. Note that the client can use this value and
+the local elapsed time for the operation to calculate network latency.
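+
+A sketch of the latency estimate this makes possible (variable names
+are illustrative):
+----
+/* The network round trip is what remains after subtracting the
+ * server's processing time from the client's observed elapsed time. */
+latency = observed_round_trip - reply->pb_service_time;
+----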
+
+The 'pb_limit' value is zero in a request. In a reply it is a value
+sent from a lock service to a client to set the maximum number of
+locks available to the client. When dynamic lock LRUs are enabled
+this allows for managing the size of the LRU.
+
+The 'pb_slv' value is zero in a request. On a DLM service, the "server
+lock volume" is a value that characterizes (estimates) the amount of
+traffic, or load, on that lock service. It is calculated as the
+product of the number of locks and their age. In a reply, the 'pb_slv'
+value indicates to the client the available share of the total lock
+load on the server that the client is allowed to consume. The client
+is then responsible for reducing the number (or age) of its locks to
+stay within this limit.
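+
+A sketch, under the stated definition, of how such a value could be
+computed (names illustrative):
+----
+/* SLV is characterized as the product of the number of locks and
+ * their age. */
+__u64 slv = nr_granted_locks * average_lock_age_seconds;
+----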
+
+The array of 'pb_pre_versions' values has four entries. They are
+always zero in a new request message. They are also zero in replies to
+operations that do not modify the file system. For an operation that
+does modify the file system, the reply encodes the most recent
+transaction numbers for the objects modified by this operation, and
+the 'pb_pre_versions' values are copied into the original request when
+the reply arrives. If the request needs to be replayed then the
+updated 'pb_pre_versions' values accompany the replayed request.
+
+'pb_padding' is reserved for future use.
+
+The 'pb_jobid' (string) value gives a unique identifier associated
+with the process on behalf of which this message was generated. The
+identifier is assigned to the user process by a job scheduler, if any.
+
+Object Based Disk UUID
+^^^^^^^^^^^^^^^^^^^^^^
+
+----
+#define UUID_MAX 40
+struct obd_uuid {
+ char uuid[UUID_MAX];
+};
+----
+
+OST ID
+^^^^^^
+
+----
+struct ost_id {
+ union {
+ struct ostid {
+ __u64 oi_id;
+ __u64 oi_seq;
+ } oi;
+ struct lu_fid oi_fid;
+ } LUSTRE_ANONYMOUS_UNION_NAME;
+};
+----
+
--- /dev/null
+Lustre File Identifier
+----------------------
+[[fids]]
+
+----
+struct lu_fid {
+ __u64 f_seq;
+ __u32 f_oid;
+ __u32 f_ver;
+};
+----
+
+File IDs ('fids') are 128-bit numbers that uniquely identify files and
+directories on the MDTs and OSTs of a Lustre file system. The fid for
+a Lustre file or directory is the fid from the corresponding MDT entry
+for the file. Each of the data objects for that file will also have a
+fid for each corresponding piece of the file on each of the
+OSTs. Encoded in the fid is the target on which that file metadata or
+file fragment resides. The map from fid to target is in the File
+Location DataBase (FLDB).
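+
+Fids are conventionally rendered as a '[seq:oid:ver]' triple of hex
+values. The following is a minimal, self-contained sketch of printing
+a fid in that style; the struct here merely mirrors 'lu_fid' for
+illustration, and the kernel uses its own printing macros.
+----
+#include <stdio.h>
+#include <stdint.h>
+
+struct lu_fid_example {   /* mirrors struct lu_fid above */
+        uint64_t f_seq;   /* sequence, locates the target via FLDB */
+        uint32_t f_oid;   /* object id within the sequence */
+        uint32_t f_ver;   /* version, normally zero */
+};
+
+static void print_fid(const struct lu_fid_example *fid)
+{
+        /* prints, e.g., [0x200000007:0x1:0x0] */
+        printf("[0x%llx:0x%x:0x%x]\n",
+               (unsigned long long)fid->f_seq, fid->f_oid, fid->f_ver);
+}
+----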
+
--- /dev/null
+
+File System Operations
+----------------------
+[[file-system-operations]]
+
+Lustre is a POSIX-compliant file system that provides namespace and
+data storage services to clients. It implements all the usual file
+system functionality including creating, writing, reading, and
+removing files and directories. These file system operations are
+implemented via <<lustre-operations,Lustre operations>>, which carry
+out communication and coordination with the servers. In this section
+we present the sequences of Lustre operations, along with their
+effects, for a variety of file system operations.
+
+Mount
+~~~~~
+
+Before any other interaction can take place between a client and a
+Lustre file system the client must 'mount' the file system, and Lustre
+services must already be in place (on the servers). A file system
+mount may be initiated at the Linux shell command line, which in turn
+invokes the 'mount()' system call. Kernel modules for Lustre
+exchange a series of messages with the servers, beginning with
+messages that retrieve details about the file system from the
+management server (MGS). This provides the client with the identities
+of all the metadata servers (MDSs) and targets (MDTs) as well as all
+the object storage servers (OSSs) and targets (OSTs). The client then
+sequences through each of the targets exchanging additional messages
+to initiate the connections with them. The following sections present
+the details of the Lustre operations that accomplish the file system
+mount.
+
+Messages Between the Client and the MGS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to be able to mount the Lustre file system the client needs
+to know the identities of the various servers and targets so that it
+can initiate connections to them. The following sequence of operations
+accomplishes this.
+
+----
+MGS_CONNECT
+LDLM_ENQUEUE (concurrent read)
+LLOG_ORIGIN_HANDLE_CREATE (filename: lfs-sptlrpc)
+LDLM_ENQUEUE (concurrent read)
+LLOG_ORIGIN_HANDLE_CREATE (filename: lfs-client)
+LLOG_ORIGIN_HANDLE_READ_HEADER
+LLOG_ORIGIN_HANDLE_NEXT_BLOCK
+LDLM_ENQUEUE (concurrent read)
+MGS_CONFIG_READ (name: lfs-cliir)
+LDLM_ENQUEUE (concurrent read)
+LLOG_ORIGIN_HANDLE_CREATE (filename: params)
+LLOG_ORIGIN_HANDLE_READ_HEADER
+----
+
+Prior to any other interaction between a client and a Lustre server
+(or between two servers) the client must establish a 'connection'. The
+connection establishes shared state between the two hosts. On the
+client this connection state information is called an 'import', and
+there is an import on the client for each target it connects to. On
+the server this connection state is referred to as an 'export', and
+again the server has an export for each client that has connected to
+it. There is a separate export for each client for each target.
+
+The client begins by carrying out the MGS_CONNECT Lustre operation,
+which establishes the connection (creates the import and the export)
+between the client and the MGS. The connect message from the client
+includes a 'handle' to uniquely identify itself (subsequent messages
+to the LDLM will refer to that client-handle). The connection data
+from the client also proposes the set of <<connect-flags,connection
+flags>> appropriate to connecting to an MGS.
+
+.Flags for the client connection to an MGS
+[options="header"]
+|====
+| obd_connect_data->ocd_connect_flags
+| OBD_CONNECT_VERSION
+| OBD_CONNECT_AT
+| OBD_CONNECT_FULL20
+| OBD_CONNECT_IMP_RECOV
+| OBD_CONNECT_MNE_SWAB
+| OBD_CONNECT_PINGLESS
+|====
+
+The MGS's reply to the connection request will include the handle that
+the server and client will both use to identify this connection in
+subsequent messages. This is the 'connection-handle' (as opposed to
+the client-handle mentioned a moment ago). The MGS also replies with
+the same set of connection flags.
+
+Once the connection is established the client gets configuration
+information for the file system from the MGS in four stages. First,
+the two exchange messages establishing the file system wide security
+policy that will be followed in all subsequent communications. Second,
+the client gets a bitmap instructing it as to which among the
+configuration records on the MGS it needs. Third, reading those
+records from the MGS gives the client the list of all the servers and
+targets it will need to communicate with. Fourth, the client reads
+cluster wide configuration data (the sort that might be set at the
+client command line with a 'lctl conf_param' command). The following
+paragraphs go into these four stages in more detail.
+
+Each time the client is going to read information from server storage
+it needs to first acquire the appropriate lock. Since the client is
+only reading data, the locks will be 'concurrent read' locks. The
+LDLM_ENQUEUE command communicates this lock request to the MGS
+target. The request identifies the target via the connection-handle
+from the connection reply, and identifies the client (itself) with the
+client-handle from its original connection request. The MGS's reply
+grants that lock, if appropriate. If other clients were making some
+sort of modification to the MGS data then the lock exchange might
+result in a delay while the client waits. More details about the
+behavior of the <<ldlm,Distributed Lock Manager>> are in that
+section. For now, let's assume the locks are granted for each of these
+four operations. The first LLOG_ORIGIN_HANDLE_CREATE operation (the
+client is creating its own local handle, not the target's file) asks
+for the security configuration file ("lfs-sptlrpc"). <<security>>
+discusses security, and for now let's assume there is nothing to be
+done for security. That is, subsequent messages will all use an "empty
+security flavor" and no encryption will take place. In this case the
+MGS's reply ('pb_status' == -2, ENOENT) indicates that there is no
+such file, so nothing actually gets read.
+
+Another LDLM_ENQUEUE and LLOG_ORIGIN_HANDLE_CREATE pair of operations
+identifies the configuration client data ("lfs-client") file, and in
+this case there is data to read. The LLOG_ORIGIN_HANDLE_CREATE reply
+identifies the actual object of interest on the MGS via the
+'llog_logid' field in the 'struct llogd_body'. The MGS stores
+configuration data in log records. A header at the beginning of
+"lfs-client" uses a bitmap to identify the log records that are
+actually needed. The header includes both which records to retrieve
+and how large those records are. The LLOG_ORIGIN_HANDLE_READ_HEADER
+request uses the 'llog_logid' to identify the desired log file, and
+the reply provides the bitmap and size information identifying the
+records that are actually needed. The
+LLOG_ORIGIN_HANDLE_NEXT_BLOCK operation retrieves the data thus
+identified.
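+
+As a sketch of how such a bitmap might be consumed ('llh_bitmap' is
+the field in 'struct llog_log_hdr' shown later; the loop itself is
+illustrative):
+----
+/* Fetch only the records whose bits are set in the header bitmap. */
+for (i = 0; i < nr_records; i++)
+        if (hdr->llh_bitmap[i / 32] & (1u << (i % 32)))
+                fetch_record(i);
+----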
+
+Knowing the specific configuration records it wants, the client then
+proceeds to retrieve them. This requires another LDLM_ENQUEUE
+operation, followed this time by the MGS_CONFIG_READ operation, which
+gets the UUIDs for the servers and targets from the configuration log
+("lfs-cliir").
+
+A final LDLM_ENQUEUE, LLOG_ORIGIN_HANDLE_CREATE, and
+LLOG_ORIGIN_HANDLE_READ_HEADER then retrieve the cluster wide
+configuration data ("params").
+
+Messages Between the Client and the MDSs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After the foregoing interaction with the MGS the client has a list of
+the MDSs and MDTs in the file system. Next, the client invokes four
+Lustre operations with each MDT on the list.
+
+----
+MDS_CONNECT
+MDS_STATFS
+MDS_GETSTATUS
+MDS_GETATTR
+----
+
+The MDS_CONNECT operation establishes a connection between the client
+and a specific target (MDT) on an MDS. Thus, if an MDS has multiple
+targets, there is a separate MDS_CONNECT operation for each. This
+creates an import for the target on the client and an export for the
+client and target on the MDS. As with the connect operation for the
+MGS, the connect message from the client includes a UUID to uniquely
+identify this connection, and subsequent messages to the lock manager
+on the server will refer to that UUID. The connection data from the
+client also proposes the set of <<connect-flags,connection flags>>
+appropriate to connecting to an MDS. The following are the flags
+always included.
+
+.Always included flags for the client connection to an MDS
+[options="header"]
+|====
+| obd_connect_data->ocd_connect_flags
+| OBD_CONNECT_RDONLY
+| OBD_CONNECT_VERSION
+| OBD_CONNECT_ACL
+| OBD_CONNECT_XATTR
+| OBD_CONNECT_IBITS
+| OBD_CONNECT_NODEVOH
+| OBD_CONNECT_ATTRFID
+| OBD_CONNECT_CANCELSET
+| OBD_CONNECT_AT
+| OBD_CONNECT_RMT_CLIENT
+| OBD_CONNECT_RMT_CLIENT_FORCE
+| OBD_CONNECT_BRW_SIZE
+| OBD_CONNECT_MDS_CAPA
+| OBD_CONNECT_OSS_CAPA
+| OBD_CONNECT_MDS_MDS
+| OBD_CONNECT_FID
+| LRU_RESIZE_CONNECT_FLAG
+| OBD_CONNECT_VBR
+| OBD_CONNECT_LOV_V3
+| OBD_CONNECT_SOM
+| OBD_CONNECT_FULL20
+| OBD_CONNECT_64BITHASH
+| OBD_CONNECT_JOBSTATS
+| OBD_CONNECT_EINPROGRESS
+| OBD_CONNECT_LIGHTWEIGHT
+| OBD_CONNECT_UMASK
+| OBD_CONNECT_LVB_TYPE
+| OBD_CONNECT_LAYOUTLOCK
+| OBD_CONNECT_PINGLESS
+| OBD_CONNECT_MAX_EASIZE
+| OBD_CONNECT_FLOCK_DEAD
+| OBD_CONNECT_DISP_STRIPE
+| OBD_CONNECT_LFSCK
+| OBD_CONNECT_OPEN_BY_FID
+| OBD_CONNECT_DIR_STRIPE
+|====
+
+.Optional flags for the client connection to an MDS
+[options="header"]
+|====
+| obd_connect_data->ocd_connect_flags
+| OBD_CONNECT_SOM
+| OBD_CONNECT_LRU_RESIZE
+| OBD_CONNECT_ACL
+| OBD_CONNECT_UMASK
+| OBD_CONNECT_RDONLY
+| OBD_CONNECT_XATTR
+| OBD_CONNECT_RMT_CLIENT_FORCE
+|====
+
+The MDS replies to the connect message with a subset of the flags
+proposed by the client, and the client notes those values in its
+import. The MDS's reply to the connection request will include a UUID
+that the server and client will both use to identify this connection
+in subsequent messages.
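+
+A sketch of the negotiation from the client's point of view (only a
+few flags are shown, and the variable names are illustrative):
+----
+/* Propose a set of capabilities, then keep only those the server
+ * echoed back in its reply. */
+__u64 proposed = OBD_CONNECT_VERSION | OBD_CONNECT_IBITS |
+                 OBD_CONNECT_AT | OBD_CONNECT_FULL20;
+__u64 agreed = proposed & reply_ocd->ocd_connect_flags;
+----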
+
+The client next uses an MDS_STATFS operation to request 'statfs'
+information from the target, and that data is returned in the reply
+message. The actual fields closely resemble the results of a 'statfs'
+system call. See the 'obd_statfs' structure in the <<data-structs,Data
+Structures and Defines Section>>.
+
+The client uses the MDS_GETSTATUS operation to request information
+about the mount point of the file system. fixme: Does MDS_GETSTATUS
+only ask about the root (so it would seem)? The server reply contains
+the 'fid' of the root directory of the file system being mounted. If
+there is a security policy the capabilities of that security policy
+are included in the reply.
+
+The client then uses the MDS_GETATTR operation to get further
+information about the root directory of the file system. The request
+message includes the above fid. It will also include the security
+capability (if appropriate). The reply also holds the same fid, and in
+this case the 'mdt_body' has several additional fields filled
+in. These include the mtime, atime, ctime, mode, uid, and gid. It also
+includes the size of the extended attributes and the size of the ACL
+information. The reply message also includes the extended attributes
+and the ACL. From the extended attributes the client can find out
+about striping information for the root, if any.
+
+Messages Between the Client and the OSSs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Additional CONNECT messages flow between the client and each OST
+enumerated by the MGS.
+
+----
+OST_CONNECT
+----
+
+Unmount
+~~~~~~~
+
+----
+OST_DISCONNECT
+MDS_DISCONNECT
+MGS_DISCONNECT
+----
+
+Create
+~~~~~~
+
+Further discussion of the 'creat()' system call.
--- /dev/null
+[glossary]
+Glossary
+--------
+Here are some common terms used in discussing Lustre, POSIX semantics,
+and the protocols used to implement them.
+
+[glossary]
+back end file system::
+Storage on a server in support of the object store or of metadata is
+stored on the back end file system. The two available file systems are
+currently ldiskfs and ZFS.
+
+Client::
+A Lustre client is a computer taking advantage of the services
+provided by MDSs and OSSs to assemble a POSIX-compliant file system
+with its namespace and data storage capabilities.
+
+Distributed Lock Manager::
+The distributed lock manager (DLM, often referred to as the Lustre
+Distributed Lock Manager, or LDLM) is the service that enforces a
+consistent (cache-coherent) view of the data objects in the file
+system.
+
+early reply::
+A message returned in response to a request message that allows the
+target to extend timeout values without otherwise indicating the
+request has progressed.
+
+export::
+The term for the per-client shared state information held by a server.
+
+extent:: A range of offsets in a file.
+
+import::
+The term for the per-target shared state information held by a client.
+
+Inodes and Extended Attributes::
+Metadata about a Lustre file is encoded in the extended attributes of
+an inode in the back-end file system on the MDS. Some of that
+information establishes the stripe layout of the file. The size of the
+stripe layout information varies among Lustre file systems. The amount
+of space reserved for layout information is as small as possible given
+the maximum stripe count possible on the file system. Clients, servers,
+and the distributed lock manager will all need to be aware of this
+size, which is communicated in the 'ocd_ibits_known' field of the
+'obd_connect_data' structure.
+
+LNet::
+A lower level protocol employed by PtlRPC to abstract the mechanisms
+provided by various hardware centric protocols, such as TCP or
+Infiniband.
+
+The Log Subsystem::
+Fixme: The log subsystem (LLOG) has not yet been documented here.
+
+Logical Object Volume::
+The logical object volume (LOV) layer abstracts the targets on a server.
+
+Management Server::
+The Management Server (MGS) maintains an understanding of the other
+servers, targets, and clients in the Lustre file system. It holds their
+identities and keeps track of disconnections and reconnections in
+order to support imperative recovery.
+
+Metadata Server::
+A metadata server (MDS) is a computer responsible for running the
+Lustre kernel services in support of managing the POSIX-compliant name
+space and the indices associating files in that name space with the
+locations of their corresponding objects. As of v2.4 there can be
+multiple MDSs in a Lustre file system.
+
+Metadata Target::
+A metadata target (MDT) is the service provided by an MDS that
+mediates the management of name space indices on the underlying file
+system hardware. As of v2.4 there can be multiple MDTs on an MDS.
+
+Object Based Disk::
+Object Based Disk (OBD) is the term used for any target, MDT or OST.
+
+Object Storage Server::
+An object storage server (OSS) is a computer responsible for running
+Lustre kernel services in support of managing bulk data objects on the
+underlying storage. There can be multiple OSSs in a Lustre file
+system.
+
+Object Storage Target::
+An object storage target (OST) is the service provided by an OSS that
+mediates the placement of data objects on the specific underlying file
+system hardware. There can be multiple OSTs on a given OSS.
+
+protocol::
+An agreed upon formalism for communicating between two entities, such
+as between two servers or between a client and a server.
+
+PtlRPC::
+The protocol (or set of protocols) implemented via RPCs that is (are)
+employed by Lustre to communicate between its clients and servers.
+
+Remote Procedure Call::
+A mechanism for implementing operations involving one computer acting
+on the behalf of another (RPC).
+
+server::
+A computer that provides a service. For example, management (MGS),
+Metadata (MDS), or object storage (OSS) services in support of a
+Lustre file system.
+
+target::
+Storage available to be served, such as an OST or an MDT. Also the
+service being provided.
+
+UUID:: A universally unique identifier.
--- /dev/null
+Introduction
+------------
+
+[NOTE]
+I am leaving the introductory content here for now, but it is the last
+thing that should be written. The following is just a very early
+sketch, and will be revised entirely once the rest of the content has
+begun to shape up.
+
+The Lustre parallel file system provides a global POSIX namespace for
+the computing resources of a data center. Lustre runs on Linux-based
+hosts via kernel modules, and delegates block storage management to
+the back-end servers while providing object-based storage to its
+clients. Servers are responsible for both data objects (the contents
+of actual files) and index objects (for directory information). Data
+objects are gathered on Object Storage Servers (OSSs), and index
+objects are stored on Metadata Servers (MDSs). Each back-end
+storage volume is a target with Object Storage Targets (OSTs) on OSSs,
+and Metadata Targets (MDTs) on MDSs. Clients assemble the
+data from the MDTs and OSTs to present a single coherent
+POSIX-compliant file system. The clients and servers communicate and
+coordinate among themselves via network protocols. A low-level
+protocol, LNet, abstracts the details of the underlying networking
+hardware and presents a uniform interface, originally based on Sandia
+Portals <<PORTALS>>, to Lustre clients and servers. Lustre, in turn,
+layers its own protocol atop LNet. This document describes the Lustre
+protocol.
+
+Lustre runs across multiple hosts, coordinating the activities among
+those hosts via the exchange of messages over a network. On each host,
+Lustre is implemented via a collection of Linux processes (often
+called "threads"). This discussion will refer to a more formalized
+notion of 'processes' that abstract some of the thread-level
+details. The description of the activities on each host comprise a
+collection of 'abstract processes'. Each abstract process may be
+thought of as a state machine, or automaton, following a fixed set of
+rules for how it consumes messages, changes state, and produces other
+messages. We speak of the 'behavior' of a process as shorthand for the
+management of its state and the rules governing what messages it can
+consume and produce. Processes communicate with each other via
+messages. The Lustre protocol is the collection of messages the
+processes exchange along with the rules governing the behavior of
+those processes.
+
--- /dev/null
+The Lustre Distributed Lock Manager
+-----------------------------------
+[[ldlm]]
+
+The discussion of the LDLM is deferred for now. We'll get into it soon
+enough.
+
+LDLM Structures
+~~~~~~~~~~~~~~~
+
+Lock Modes
+^^^^^^^^^^
+
+----
+typedef enum {
+ LCK_MINMODE = 0,
+ LCK_EX = 1,
+ LCK_PW = 2,
+ LCK_PR = 4,
+ LCK_CW = 8,
+ LCK_CR = 16,
+ LCK_NL = 32,
+ LCK_GROUP = 64,
+ LCK_COS = 128,
+ LCK_MAXMODE
+} ldlm_mode_t;
+----
+
+LDLM Extent
+^^^^^^^^^^^
+
+----
+struct ldlm_extent {
+ __u64 start;
+ __u64 end;
+ __u64 gid;
+};
+----
+
+LDLM Flock Wire
+^^^^^^^^^^^^^^^
+
+----
+struct ldlm_flock_wire {
+ __u64 lfw_start;
+ __u64 lfw_end;
+ __u64 lfw_owner;
+ __u32 lfw_padding;
+ __u32 lfw_pid;
+};
+----
+
+LDLM Inode Bits
+^^^^^^^^^^^^^^^
+
+----
+struct ldlm_inodebits {
+ __u64 bits;
+};
+----
+
+LDLM Wire Policy Data
+^^^^^^^^^^^^^^^^^^^^^
+
+----
+typedef union {
+ struct ldlm_extent l_extent;
+ struct ldlm_flock_wire l_flock;
+ struct ldlm_inodebits l_inodebits;
+} ldlm_wire_policy_data_t;
+----
+
+Resource Descriptor
+^^^^^^^^^^^^^^^^^^^
+
+----
+struct ldlm_resource_desc {
+ ldlm_type_t lr_type;
+ __u32 lr_padding; /* also fix lustre_swab_ldlm_resource_desc */
+ struct ldlm_res_id lr_name;
+};
+----
+
+The 'ldlm_type_t' is given by one of these values:
+----
+typedef enum {
+ LDLM_PLAIN = 10,
+ LDLM_EXTENT = 11,
+ LDLM_FLOCK = 12,
+ LDLM_IBITS = 13
+} ldlm_type_t;
+----
+
+Lock Descriptor
+^^^^^^^^^^^^^^^
+
+----
+struct ldlm_lock_desc {
+ struct ldlm_resource_desc l_resource;
+ ldlm_mode_t l_req_mode;
+ ldlm_mode_t l_granted_mode;
+ ldlm_wire_policy_data_t l_policy_data;
+};
+----
+
+Lock Request
+^^^^^^^^^^^^
+
+----
+#define LDLM_LOCKREQ_HANDLES 2
+struct ldlm_request {
+ __u32 lock_flags;
+ __u32 lock_count;
+ struct ldlm_lock_desc lock_desc;
+ struct lustre_handle lock_handle[LDLM_LOCKREQ_HANDLES];
+};
+----
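+
+As an illustration of how the pieces fit together, a client asking
+for a concurrent-read inodebits lock (the kind used during mount)
+might fill the request like this (sketch only):
+----
+/* A concurrent-read (LCK_CR) lock on inode bits. */
+struct ldlm_request req = { 0 };
+req.lock_desc.l_resource.lr_type = LDLM_IBITS;
+req.lock_desc.l_req_mode = LCK_CR;
+req.lock_desc.l_policy_data.l_inodebits.bits = 0; /* bits of interest */
+----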
+
+Lock Reply
+^^^^^^^^^^
+
+----
+struct ldlm_reply {
+ __u32 lock_flags;
+ __u32 lock_padding; /* also fix lustre_swab_ldlm_reply */
+ struct ldlm_lock_desc lock_desc;
+ struct lustre_handle lock_handle;
+ __u64 lock_policy_res1;
+ __u64 lock_policy_res2;
+};
+----
+
+Lock Value Block
+^^^^^^^^^^^^^^^^
+
+A lock value block is part of reply messages from servers when an
+LDLM_ENQUEUE command has been issued. There are two varieties. Which
+is chosen depends on the target.
+
+----
+struct ost_lvb_v1 {
+ __u64 lvb_size;
+ obd_time lvb_mtime;
+ obd_time lvb_atime;
+ obd_time lvb_ctime;
+ __u64 lvb_blocks;
+};
+struct ost_lvb {
+ __u64 lvb_size;
+ obd_time lvb_mtime;
+ obd_time lvb_atime;
+ obd_time lvb_ctime;
+ __u64 lvb_blocks;
+ __u32 lvb_mtime_ns;
+ __u32 lvb_atime_ns;
+ __u32 lvb_ctime_ns;
+ __u32 lvb_padding;
+};
+----
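+
+One way a receiver could tell the two varieties apart is by the size
+of the buffer it was handed (a sketch, not the actual decoding logic):
+----
+/* Choose the LVB variant by buffer length. */
+if (lvb_len >= sizeof(struct ost_lvb)) {
+        /* full variant, with nanosecond timestamp fields */
+} else if (lvb_len == sizeof(struct ost_lvb_v1)) {
+        /* v1 variant, second-resolution timestamps only */
+}
+----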
--- /dev/null
+The Lustre Log Facility
+-----------------------
+[[llog]]
+
+LLOG Structures
+~~~~~~~~~~~~~~~
+
+LLOG Log ID
+^^^^^^^^^^^
+
+----
+struct llog_logid {
+ struct ost_id lgl_oi;
+ __u32 lgl_ogen;
+};
+----
+
+LLog Information
+^^^^^^^^^^^^^^^^
+
+----
+struct llogd_body {
+ struct llog_logid lgd_logid;
+ __u32 lgd_ctxt_idx;
+ __u32 lgd_llh_flags;
+ __u32 lgd_index;
+ __u32 lgd_saved_index;
+ __u32 lgd_len;
+ __u64 lgd_cur_offset;
+};
+----
+
+The lgd_llh_flags are:
+----
+enum llog_flag {
+ LLOG_F_ZAP_WHEN_EMPTY = 0x1,
+ LLOG_F_IS_CAT = 0x2,
+ LLOG_F_IS_PLAIN = 0x4,
+};
+----
+
+LLog Record Header
+^^^^^^^^^^^^^^^^^^
+
+----
+struct llog_rec_hdr {
+ __u32 lrh_len;
+ __u32 lrh_index;
+ __u32 lrh_type;
+ __u32 lrh_id;
+};
+----
+
+LLog Record Tail
+^^^^^^^^^^^^^^^^
+
+----
+struct llog_rec_tail {
+ __u32 lrt_len;
+ __u32 lrt_index;
+};
+----
+
+LLog Log Header Information
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+----
+struct llog_log_hdr {
+ struct llog_rec_hdr llh_hdr;
+ obd_time llh_timestamp;
+ __u32 llh_count;
+ __u32 llh_bitmap_offset;
+ __u32 llh_size;
+ __u32 llh_flags;
+ __u32 llh_cat_idx;
+ /* for a catalog the first plain slot is next to it */
+ struct obd_uuid llh_tgtuuid;
+ __u32 llh_reserved[LLOG_HEADER_SIZE/sizeof(__u32) - 23];
+ __u32 llh_bitmap[LLOG_BITMAP_BYTES/sizeof(__u32)];
+ struct llog_rec_tail llh_tail;
+};
+----
+
--- /dev/null
+Lustre Messages
+---------------
+[[lustre-messages]]
+
+A Lustre message is a sequence of bytes. The message begins with a
+<<lustre-message-header,Lustre Message Header>> and has between one
+and nine subsequences called "buffers". Each buffer has a structure
+(the size and meaning of the bytes) that corresponds to the 'struct'
+entities in the <<data-structs,Data Structures and Defines
+Section>>. The header gives the number of buffers in its 'lm_bufcount'
+field. The first buffer in any message is always the
+<<lustre-message-preamble,Lustre Message Preamble>>. The operation
+code ('pb_opc' field) and the message type ('pb_type' field: request
+or reply?) in the preamble together specify the "format" of the
+message, where the format is the number and content of the remaining
+buffers. As a shorthand, it is useful to name each of these formats,
+and the following list gives all of the formats along with the number
+and content of the buffers for messages in that format. Note that
+while the combination of 'pb_opc' and 'pb_type' uniquely determines a
+message's format, the reverse is not true. A given format may be used
+in many different messages.
+
+The "Empty" Message
+~~~~~~~~~~~~~~~~~~~
+
+'empty'
+^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_STATFS | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'empty' message is one that consists only of the Lustre message
+preamble 'ptlrpc_body'. It occurs as either the request or the reply
+(or both) in a variety of Lustre operations.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+|====
+
+The LDLM Enqueue Client Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'ldlm_enqueue_client'
+^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LDLM_ENQUEUE | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'ldlm_enqueue_client' message format is used to acquire a lock.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| ldlm_request | details of the lock request
+|====
+
+An 'ldlm_enqueue_client' message concatenates two data elements into a
+single byte-stream. The two elements correspond to structures
+detailed in the <<data-structs,Data Structures and Defines Section>>.
+
+The LDLM Enqueue Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'ldlm_enqueue_server'
+^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LDLM_ENQUEUE | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'ldlm_enqueue_server' message format is used to inform a client
+about the status of its request for a lock.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| ldlm_reply | details of the lock request
+|====
+
+An 'ldlm_enqueue_server' message concatenates two data elements
+into a single byte-stream. The two elements correspond to structures
+detailed in the <<data-structs,Data Structures and Defines Section>>.
+
+The LDLM Enqueue Server Message with LVB
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'ldlm_enqueue_lvb_server'
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LDLM_ENQUEUE | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'ldlm_enqueue_lvb_server' message format is used to inform a client
+about the status of its request for a lock.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| ldlm_reply | details of the lock request
+| ost_lvb | lock value block
+|====
+
+An 'ldlm_enqueue_lvb_server' message concatenates three data elements
+into a single byte-stream. It closely resembles the
+'ldlm_enqueue_server' message with the addition of a lock value
+block. The three elements correspond to structures detailed in the
+<<data-structs,Data Structures and Defines Section>>.
+
+The LLog Origin Handle Create Client
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llog_origin_handle_create_client'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_CREATE | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'llog_origin_handle_create_client' message format is used to
+ask for the creation of a log entry object.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llogd_body | LLog description
+| string | The name of the desired log
+|====
+
+Fixme: I don't actually see where the string gets set.
+
+The LLog Service (body-only) Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llogd_body_only'
+^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_CREATE | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'llogd_body_only' message replies with information about the log.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llogd_body | LLog description
+|====
+
+
+The LLog Log (header-only) Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llog_log_hdr_only'
+^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_READ_HEADER | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'llog_log_hdr_only' message replies with header information from
+the log.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llog_log_hdr | LLog log header info
+|====
+
+
+The LLog Return Next Block from Server
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'llog_origin_handle_next_block_server'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| LLOG_ORIGIN_HANDLE_NEXT_BLOCK | PTL_RPC_MSG_REPLY
+|====
+
+
+An 'llog_origin_handle_next_block_server' message replies with the
+next block of data from the log.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| llogd_body | LLog description
+| eadata | variable length field for extended attributes
+|====
+
+
+The MDS Getattr Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mds_getattr_server'
+^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_GETATTR | PTL_RPC_MSG_REPLY
+|====
+
+An 'mds_getattr_server' message format is used to convey MDT data along
+with information about the Lustre capabilities of that MDT.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mdt_body | Information about the MDT
+| MDT_MD | OST stripe and index info
+| ACL | ACLs for the fid
+| lustre_capa | Security capa info for fid1
+| lustre_capa | Security capa info for fid2
+|====
+
+An 'mds_getattr_server' message concatenates six data elements into
+a single byte-stream. The six elements correspond to structures
+detailed in the <<data-structs,Data Structures and Defines Section>>.
+
+The MDT Capability Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mdt_body_capa'
+^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_GETSTATUS | PTL_RPC_MSG_REPLY
+| MDS_GETATTR | PTL_RPC_MSG_REQUEST
+|====
+
+An 'mdt_body_capa' message format is used to convey MDT data along
+with information about the Lustre capabilities of that MDT.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mdt_body | Information about the MDT
+| lustre_capa | security capa info
+|====
+
+An 'mdt_body_capa' message concatenates three data elements into
+a single byte-stream. The three elements correspond to structures
+detailed in the <<data-structs,Data Structures and Defines Section>>.
+
+The MDT "Body Only" Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mdt_body_only'
+^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_GETSTATUS | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'mdt_body_only' message format is used to convey MDT data.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mdt_body | Information about the MDT
+|====
+
+An 'mdt_body_only' message concatenates two data elements into
+a single byte-stream. The two elements correspond to structures
+detailed in the <<data-structs,Data Structures and Defines Section>>.
+
+The MGS Config Read Client Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mgs_config_read_client'
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MGS_CONFIG_READ | PTL_RPC_MSG_REQUEST
+|====
+
+An 'mgs_config_read_client' message requests configuration data from
+the MGS.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mgs_config_body | Information about the MGS supporting the request
+|====
+
+
+The MGS Config Read Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'mgs_config_read_server'
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MGS_CONFIG_READ | PTL_RPC_MSG_REPLY
+|====
+
+An 'mgs_config_read_server' message returns configuration data from
+the MGS.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| mgs_config_body | Information about the MGS supporting the request
+|====
+
+
+The OBD Connect Client Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'obd_connect_client'
+^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_CONNECT | PTL_RPC_MSG_REQUEST
+|====
+
+
+An 'obd_connect_client' message format is used to initiate the
+connection from one host to a target on another host. Clients will
+connect to the MGS, to MDTs on MDSs, and to OSTs on OSSs. Furthermore,
+MDSs and OSSs will connect to the MGS, and MDSs will connect to OSTs
+on OSSs. In each case the host initiating the connection request
+sends an 'obd_connect_client' message. The reply to this message is
+the 'obd_connect_server' message.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| obd_uuid | UUID of the target
+| obd_uuid | UUID of the client
+| lustre_handle | connection handle
+| obd_connect_data | connection data
+|====
+
+An 'obd_connect_client' message concatenates five data elements into
+a single byte-stream. The five elements correspond to structures
+detailed in the <<data-structs,Data Structures and Defines Section>>.
+
+The connection handle sent in a client connection request message is
+unique to that connection. The server notes it, and a connection
+request with a new or different handle indicates that the client is
+attempting to make a new connection (or is sending a delayed message
+about an old one?).
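+
+A sketch of the comparison this implies ('cookie' is the sole field
+of 'struct lustre_handle'; the export-side name is illustrative):
+----
+/* A handle that differs from the one recorded in the export marks
+ * a new connection attempt. */
+if (request_handle->cookie != export_handle.cookie)
+        handle_reconnect();
+----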
+
+The OBD Connect Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'obd_connect_server'
+^^^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_CONNECT | PTL_RPC_MSG_REPLY
+|====
+
+The 'obd_connect_server' message format is sent by a server in reply
+to an 'obd_connect_client' connection request directed to a target on
+that server. MGSs, MDSs, and OSSs send these replies.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| obd_connect_data | connection data
+|====
+
+An 'obd_connect_server' message concatenates two data elements into a
+single byte-stream.
+
+The OBD Statfs Server Message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+'obd_statfs_server'
+^^^^^^^^^^^^^^^^^^^
+
+.example
+[options="header"]
+|====
+| pb_opc | pb_type
+| MDS_STATFS | PTL_RPC_MSG_REPLY
+|====
+
+The 'obd_statfs_server' message returns 'statfs' system call
+information to the client.
+
+.format
+[options="header"]
+|====
+| structure | meaning
+| ptlrpc_body | message preamble
+| obd_statfs | statfs system call info
+|====
+
--- /dev/null
+Lustre Operations
+-----------------
+[[lustre-operations]]
+
+Lustre operations are denoted by the 'pb_opc' op-code field of the
+message preamble. Each operation is implemented as a pair of messages,
+with the 'pb_type' field set to PTL_RPC_MSG_REQUEST for requests
+initiating the operation, and PTL_RPC_MSG_REPLY for replies. Note that
+as a general matter, the receipt by a client of the reply message only
+assures the client that the server has initiated the action, if
+any. See the discussion on <<transno,transaction numbers>>
+and <<recovery>> for how the client is given confirmation that a
+request has been completed.
+
+Command 8: OST CONNECT - Client connection to an OST
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.OST_CONNECT (8)
+[options="header"]
+|====
+| request | reply
+| obd_connect_client | obd_connect_server
+|====
+
+When a client initiates a connection to a specific target on an OSS,
+it does so by sending an 'obd_connect_client' message and awaiting the
+reply from the OSS of an 'obd_connect_server' message. From a previous
+interaction with the MGS the client knows the UUID of the target OST,
+and can fill that value into the 'obd_connect_client' message.
+
+The 'ocd_connect_flags' field is set to (fixme: what?) reflecting the
+capabilities appropriate to the client. The 'ocd_brw_size' is set to the
+largest value for the size of an RPC that the client can handle. The
+'ocd_ibits_known' and 'ocd_checksum_types' values are set to what the client
+considers appropriate. Other fields in the preamble and
+'obd_connect_data' structures are zero, as is the 'lustre_handle'
+element.
+
+Once the server receives the 'obd_connect_client' message on behalf of
+the given target it replies with an 'obd_connect_server' message. In
+that message the server sends the 'pb_handle' to uniquely
+identify the connection for subsequent communication. The client notes
+that handle in its import for the given target.
+
+fixme: Are there circumstances that could lead to the 'status'
+value in the reply being non-zero? What would lead to that and what
+error values would result?
+
+The target maintains the last committed transaction for a client in
+its export for that client. If this is the first connection, then that
+last transaction value would just be zero. If there were previous
+transactions for the client, then the transaction number for the last
+such committed transaction is put in the 'pb_last_committed' field.
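+
+A sketch of what a client can do with 'pb_last_committed' when a
+reply arrives (names illustrative):
+----
+/* Requests at or below the last committed transaction number no
+ * longer need to be preserved for replay. */
+if (saved->pb_transno != 0 &&
+    saved->pb_transno <= reply->pb_last_committed)
+        discard_saved_request(saved);
+----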
+
+In a connection request the operation is not file system modifying, so
+the 'pb_transno' value will be zero in the reply as well.
+
+fixme: there is still some work to be done about how the fields are
+managed.
+
+Command 9: OST DISCONNECT - Disconnect client from an OST
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.OST_DISCONNECT (9)
+[options="header"]
+|====
+| request | reply
+| empty | empty
+|====
+
+The information exchanged in a DISCONNECT message is that normally
+conveyed in the message preamble.
+
+Command 33: MDS GETATTR - Get MDS Attributes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_GETATTR (33)
+[options="header"]
+|====
+| request | reply
+| mdt_body_capa | mds_getattr_server
+|====
+
+
+
+Command 38: MDS CONNECT - Client connection to an MDS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_CONNECT (38)
+[options="header"]
+|====
+| request | reply
+| obd_connect_client | obd_connect_server
+|====
+
+N.B. This is nearly identical to the explanation for OST_CONNECT and
+for MGS_CONNECT. We may want to simplify and/or unify the discussion
+and only call out how this one differs from a generic CONNECT
+operation.
+
+When a client initiates a connection to a specific target on an MDS,
+it does so by sending an 'obd_connect_client' message and awaiting the
+reply from the MDS of an 'obd_connect_server' message. From a previous
+interaction with the MGS the client knows the UUID of the target MDT,
+and can fill that value into the 'obd_connect_client' message.
+
+The 'ocd_connect_flags' field is set to (fixme: what?) reflecting the
+capabilities appropriate to the client. The 'ocd_brw_size' is set to the
+largest value for the size of an RPC that the client can handle. The
+'ocd_ibits_known' and 'ocd_checksum_types' values are set to what the client
+considers appropriate. Other fields in the preamble and
+'obd_connect_data' structures are zero, as is the 'lustre_handle'
+element.
+
+Once the server receives the 'obd_connect_client' message on behalf of
+the given target it replies with an 'obd_connect_server' message. In
+that message the server sends the 'pb_handle' to uniquely
+identify the connection for subsequent communication. The client notes
+that handle in its import for the given target.
+
+fixme: Are there circumstances that could lead to the 'status'
+value in the reply being non-zero? What would lead to that and what
+error values would result?
+
+The target maintains the last committed transaction for a client in
+its export for that client. If this is the first connection, then that
+last transaction value would just be zero. If there were previous
+transactions for the client, then the transaction number for the last
+such committed transaction is put in the 'pb_last_committed' field.
+
+In a connection request the operation is not file system modifying, so
+the 'pb_transno' value will be zero in the reply as well.
+
+fixme: there is still some work to be done about how the fields are
+managed.
+
+Command 39: MDS DISCONNECT - Disconnect client from an MDS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_DISCONNECT (39)
+[options="header"]
+|====
+| request | reply
+| empty | empty
+|====
+
+The information exchanged in a DISCONNECT message is that normally
+conveyed in the message preamble.
+
+Command 40: MDS GETSTATUS - get the status from a target
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The MDS_GETSTATUS command targets a specific MDT. If there are several,
+the client will need to send a separate message for each.
+
+.MDS_GETSTATUS (40)
+[options="header"]
+|====
+| request | reply
+| mdt_body_only | mdt_body_capa
+|====
+
+The 'mdt_body_only' request message is one that provides information
+about a specific MDT, identifying which (among several possible)
+target is being asked about.
+
+In the reply there is additional information about the MDT's
+capabilities.
+
+Command 41: MDS STATFS - get statfs data from the server
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MDS_STATFS (41)
+[options="header"]
+|====
+| request | reply
+| empty | obd_statfs_server
+|====
+
+The 'empty' request message is one that only has the 'ptlrpc_body'
+data encoded. The fields have their generic values for a request from
+this client, with 'pb_opc' being set to MDS_STATFS (41).
+
+In the reply there is, in addition to the 'ptlrpc_body', data relevant
+to a 'statfs' system call.
+
+Command 101: LDLM ENQUEUE - Enqueue a lock request
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LDLM_ENQUEUE (101)
+[options="header"]
+|====
+| request | reply
+| ldlm_enqueue_client | ldlm_enqueue_lvb_server
+|====
+
+In order to access data on disk at a target the client needs to
+establish a lock (either concurrent for reads or exclusive for
+writes). The client sends the 'ldlm_enqueue_client' message to the
+server holding the target, and the server will reply with an
+'ldlm_enqueue_server' message. (N.B. In actuality it is an
+'ldlm_enqueue_lvb_server' message with the length of the 'struct
+ost_lvb' portion set to zero, which amounts to the same thing.)
+
+Command 250: MGS CONNECT - Client connection to an MGS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MGS_CONNECT (250)
+[options="header"]
+|====
+| request | reply
+| obd_connect_client | obd_connect_server
+|====
+
+When a client initiates a connection to the MGS,
+it does so by sending an 'obd_connect_client' message and awaiting the
+reply from the MGS of an 'obd_connect_server' message. This is the
+first operation carried out by a client upon the issue of a 'mount'
+command, and the target UUID is provided on the command line.
+
+The target UUID is just "MGS", and the client UUID is set to the
+32-byte string it gets from ... where?
+
+The 'struct lustre_handle' (the fourth buffer in the message) has its
+cookie set to ... what? It is set, but where does it come from?
+
+The 'ocd_connect_flags' field is set to (fixme: what?) reflecting the
+capabilities appropriate to the client. The 'ocd_brw_size' is set to the
+largest value for the size of an RPC that the client can handle. The
+'ocd_ibits_known' and 'ocd_checksum_types' values are set to what the client
+considers appropriate. Other fields in the preamble and
+'obd_connect_data' structures are zero.
+
+Once the server receives the 'obd_connect_client' message on behalf of
+the given target it replies with an 'obd_connect_server' message. In
+that message the server sends the 'pb_handle' to uniquely
+identify the connection for subsequent communication. The client notes
+that handle in its import for the given target.
+
+fixme: Are there circumstances that could lead to the 'status'
+value in the reply being non-zero? What would lead to that and what
+error values would result?
+
+The target maintains the last committed transaction for a client in
+its export for that client. If this is the first connection, then that
+last transaction value would just be zero. If there were previous
+transactions for the client, then the transaction number for the last
+such committed transaction is put in the 'pb_last_committed' field.
+
+In a connection request the operation is not file system modifying, so
+the 'pb_transno' value will be zero in the reply as well.
+
+Command 251: MGS DISCONNECT - Disconnect client from an MGS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MGS_DISCONNECT (251)
+[options="header"]
+|====
+| request | reply
+| empty | empty
+|====
+
+N.B. The usual 'struct req_format' definition does not exist for
+MGS_DISCONNECT. It will take a more careful code review to verify that
+it also has 'empty' messages going back and forth.
+
+The information exchanged in a DISCONNECT message is that normally
+conveyed in the message preamble.
+
+Command 256: MGS CONFIG READ - Read MGS configuration info
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.MGS_CONFIG_READ (256)
+[options="header"]
+|====
+| request | reply
+| mgs_config_read_client | mgs_config_read_server
+|====
+
+Command 501: LLOG ORIGIN HANDLE CREATE - Create llog handle
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LLOG_ORIGIN_HANDLE_CREATE (501)
+[options="header"]
+|====
+| request | reply
+| llog_origin_handle_create_client | llogd_body_only
+|====
+
+Command 502: LLOG ORIGIN HANDLE NEXT BLOCK - Read the next block
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LLOG_ORIGIN_HANDLE_NEXT_BLOCK (502)
+[options="header"]
+|====
+| request | reply
+| llogd_body_only | llog_origin_handle_next_block_server
+|====
+
+Command 503: LLOG ORIGIN HANDLE READ HEADER - Read handle header
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.LLOG_ORIGIN_HANDLE_READ_HEADER (503)
+[options="header"]
+|====
+| request | reply
+| llogd_body_only | llog_log_hdr_only
+|====
+
+LLOG_ORIGIN_HANDLE_NEXT_BLOCK
:keywords: PtlRPC, Lustre, Protocol
-:numbered!: [abstract] Abstract -------- The Lustre parallel file
-system <<lustre>> provides a global POSIX <<POSIX>> namespace for the
-computing resources of a data center. Lustre runs on Linux-based hosts
-via kernel modules, and delegates block storage management to the
-back-end servers while providing object-based storage to its
-clients. Servers are responsible for both data objects (the contents
-of actual files) and index objects (for directory information). Data
-objects are gathered on Object Storage Servers (OSSs), and index
-objects are stored on MetaData Storage Servers (MDSs). Each back-end
-storage volume is a target with Object Storage Targets (OSTs) on OSSs,
-and MetaData Storage Targets (MDTs) on MDSs. Clients assemble the
-data from the MDT(s) and OST(s) to present a single coherent
-POSIX-compliant file system. The clients and servers communicate and
-coordinate among themselves via network protocols. A low-level
-protocol, LNet, abstracts the details of the underlying networking
-hardware and presents a uniform interface, originally based on Sandia
-Portals <<PORTALS>>, to Lustre clients and servers. Lustre, in turn,
-layers its own protocol PtlRPC atop LNet. This document describes the
-Lustre protocols, including how they are implemeted via PtlRPC and the
-Lustre Distributed Lock Manager (based on the VAX/VMS Distributed Lock
-Manager <<VAX_DLM>>). This document does not describe Lustre itself in
-any detail, except where doing so explains concepts that allow this
-document to be self-contained.
+include::introduction.txt[]
-:numbered:
-
-Overview
---------
-
-'Content to be provided'
-
-Messages
---------
-
-These are the messages that traverse the network using PTLRPC.
-
-[NOTE]
-This initial list combines some actual message names or types with the
-POSIX semantic operations they are being used to implement, as well as
-a few other underlying mechanisms (cf. "grant"). A subsequent
-refinement will separate the various items and relate them to one
-another.
-
-Client-MDS RPCs for POSIX namespace operations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-'Content to be provided'
-
-=== mount ===
-
-'Content to be provided'
-
-=== unmount ===
-
-'Content to be provided'
-
-=== create ===
-
-'Content to be provided'
-
-=== open ===
-
-'Content to be provided'
-
-=== close ===
-
-'Content to be provided'
-
-=== unlink ===
-
-'Content to be provided'
-
-=== mkdir ===
-
-image:mkdir1.png[mkdir]
-
-=== rmdir ===
-
-'Content to be provided'
-
-=== rename ===
-
-'Content to be provided'
-
-=== link ===
-
-'Content to be provided'
-
-=== symlink ===
-
-'Content to be provided'
-
-=== getattr ===
-
-'Content to be provided'
-
-=== setattr ===
-
-'Content to be provided'
-
-=== statfs ===
-
-'Content to be provided'
-
-=== ... ===
-
-'Content to be provided'
-
-
-Client-MDS RPCs for internal state management
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-'Content to be provided'
-
-=== connect ===
-
-'Content to be provided'
-
-=== disconnect ===
-
-'Content to be provided'
-
-=== FLD ===
-
-'Content to be provided'
-
-=== SEQ ===
-
-'Content to be provided'
-
-=== PING ===
+include::data_types.txt[]
-'Content to be provided'
+include::connection.txt[]
-=== LDLM ===
+include::timeouts.txt[]
-'Content to be provided'
+include::file_id.txt[]
-=== ... ===
+include::ldlm.txt[]
-'Content to be provided'
+include::llog.txt[]
-Client-OSS RPCs for IO Operations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+include::recovery.txt[]
-'Content to be provided'
+include::security.txt[]
-=== read ===
+include::lustre_messages.txt[]
-'Content to be provided'
+include::lustre_operations.txt[]
-=== write ===
-
-'Content to be provided'
-
-=== truncate ===
-
-'Content to be provided'
-
-=== setattr ===
-
-'Content to be provided'
-
-=== grant ===
-
-'Content to be provided'
-
-=== ... ===
-
-'Content to be provided'
-
-MDS-OSS RPCs for internal state management
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-'Content to be provided'
-
-=== object precreation ===
-
-'Content to be provided'
-
-=== orphan recovery ===
-
-'Content to be provided'
-
-=== UID/GID change ===
-
-'Content to be provided'
-
-=== unlink ===
-
-'Content to be provided'
-
-=== ... ===
-
-'Content to be provided'
-
-MDS-OSS RPCs for quota management
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-'Content to be provided'
-
-
-MDS-OSS OUT RPCs for distributed updates
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-'Content to be provided'
-
-=== DNE1 remote directories ===
-
-'Content to be provided'
-
-=== DNE2 striped directories ===
-
-'Content to be provided'
-
-=== LFSCK2/3 verification and repair ===
-
-'Content to be provided'
-
-Message Flows
--------------
-
- Each file operation (in Lustre) generates a set of messages in a
- particular sequence. There is one sequence for any particular
- concrete operation, but under varying circumstances the same file
- operation may generate a different sequence.
-
-State Machines
---------------
-
- For each File operation, the collection of possible sequences of
- messages is governed by a state machine.
+include::file_system_operations.txt[]
:numbered!:
-[glossary]
-Glossary
---------
-Here are some common terms used in discussing Lustre, POSIX semantics,
-and the prtocols used to implement them.
-
-[glossary]
-Object Storage Server::
- An object storage server (OSS) is a computer responsible for
- running Lustre kernel services in support of managing bulk data
- objects on the underlying storage. There can be multiple OSSs in a
- Lustre file system.
-
-MetaData Server::
- A metadata server (MDS) is a computer responsible for running the
- Lustre kernel services in support of managing the POSIX-compliant
- name space and the indices associating files in that name space with
- the locations of their corresponding objects. As of v2.4 there can
- be multiple MDSs in a Lustre file system.
-
-Object Storage Target::
- An object storage target (OST) is the service provided by an OSS
- that mediates the placement of data objects on the specific
- underlying file system hardware. There can be multiple OSTs on a
- given OSS.
-
-MetaData Target::
- A metadata target (MDT) is the service provided by an MDS that
- mediates the management of name space indices on the underlying file
- system hardware. As of v2.4 there can be multiple MDTs on an MDS.
-
-server::
- A computer providing a service, such as an OSS or an MDS
-
-target::
- Storage available to be served, such as an OST or an MDT. Also the
- service being provided.
-
-protocol::
- An agreed upon formalism for communicating between two entities,
- such as between two servers or between a client and a server.
-
-client::
- A computer taking advantage of a service provided by a server, such
- as a Lustre client using MDS(s) and OSS(s) to assemble a
- POSIX-compliant file system with its namespace and data storage
- capabilities.
-
-PtlRPC::
- The protocol (or set of protocols) implemented via RPCs that is
- (are) employed by Lustre to communicate between its clients and
- servers.
-
-Remote Procedure Call::
- A mechanism for implementing operations involving one computer
- acting on the behalf of another (RPC).
-
-LNet::
- A lower level protocol employed by PtlRPC to abstract the mechanisms
- provided by various hardware centric protocols, such as TCP or
- Infiniband.
-
-[appendix]
-Concepts
---------
-
-'Content to be provided'
+include::glossary.txt[]
[appendix]
License
--- /dev/null
+Recovery
+--------
+[[recovery]]
+
+The discussion of recovery will be developed when the "normal"
+operations are fairly complete.
+
--- /dev/null
+Security
+--------
+[[security]]
+
+The discussion of security is deferred until later. It is a side issue
+to most of what is being documented here. It is seldom employed, and
+the code may be older and less well maintained. Fixme: confirm this
+characterization.
--- /dev/null
+
+Timeouts
+--------
+
+Timeouts in Lustre are handled by grouping connections into classes;
+the timeout appropriate for a client's connection to an OST, for
+example, is a property common to all of that client's OST
+connections. That timeout may differ from the timeout for other
+classes of connections or replies. A full list of the classes and how
+they are managed will wait for now.
+
+Every request message indicates a timeout value and every reply answers
+with the value the service will honor. Initial connection requests
+propose a value for the timeout, and subsequent requests and replies
+pass that value back and forth as part of the message header
+('pb_timeout').
+
+Service Times
+~~~~~~~~~~~~~
+
+Closely related to the timeouts in Lustre are the service times that
+are expected and observed for each connection class. A request will
+always list the service time as 0 in the message header
+('pb_service_time'). The reply lists the time the server actually
+took to send the reply.
+