--- /dev/null
+Lustre runs across multiple hosts, coordinating the activities among
+those hosts via the exchange of messages over a network. On each host,
+Lustre is implemented via a collection of threads. This discussion
+will abstract some of the thread-level details in order to describe
+the activities on each host as a collection of processes. Each process
+may be thought of as a state machine, or automaton, following a fixed
+set of rules for how it consumes messages, changes state, and produces
+other messages; that is, its behavior. Processes communicate with each
+other on a host via shared memory and with processes on other hosts
+via messages. The Lustre protocol is the collection of messages the
+processes exchange along with the rules governing the behavior of
+those processes.
+
+In order to understand the Lustre protocol it is helpful to begin with
+a description of the messages being exchanged. Lustre uses a particular
+format for its messages called PtlRPC. A PtlRPC message is a sequence
+of bytes in a particular order and with specific meaning associated
+with bytes in the message. The message (sequence of bytes) is
+delivered to a lower level communication mechanism called LNet in
+order to be transported from one host to another. This document will
+not discuss LNet beyond identifying it as a transport layer that
+abstracts any underlying details of the actual networking hardware.
+
+The following discussion is intended to be self-contained, in that
+additional external documents are not necessary in order for one to
+understand (and indeed implement) the behaviors and messages
+described. Nevertheless, for the interested there will be occasional
+references directly into the Lustre code-base where one may see the
+protocol as it is realized in one particular implementation, that
+being Lustre-2.6.92-0 as pulled from the git repository for Lustre on
+January 26th, 2015. The sole exception to the rule that this document
+is self-contained is that the discussion will not be burdened by the
+actual numerical values for hard-coded implementation details like
+"magic" value numbers or flags and their fields. References to the
+source code will be provided as needed for a prospective (otherwise)
+black-box implementer to build a compatible implementation. This
+document will confine itself to the symbolic values.
+
+The structure of a PtlRPC message
+=================================
+
+A PtlRPC message is a sequence of bytes. It can vary in length and has
+additional structure, but its simplest expression is just a byte
+array. The bytes of a message can be divided into an initial "header"
+and one or more "buffers" that follow the header. The header at the
+beginning of a message can be further divided into a sequence of eight
+4-byte "fields" (32-bit unsigned integers; cf.
+lustre/include/lustre/lustre_idl.h: "struct lustre_msg_v2") followed
+by a variable-length array of additional 4-byte entries. The fields,
+in order and using names abstracted from the sources, are:
+
+header
+------
+1) buffcount - The number of buffers that will follow the header. The
+ form and content of these buffers is discussed below.
+2) secflvr - An indication of whether any sort of cryptographic
+ encoding of the subsequent buffers will be in force. The value is
+ zero if there is no "crypto" and gives a code identifying the
+ "flavor" of crypto if it is employed. Further, if crypto is
+ employed there will only be one buffer following (i.e. buffcount =
+ 1), and that buffer is an encoding of what would otherwise have
+ been the sequence of buffers normally following the header. This
+ document will defer all discussion of cryptography. An addendum is
+ planned that will address it separately.
+3) magic - PtlRPC messages include a "magic" value
+ (ibid. "LUSTRE_MSG_MAGIC_V2") that is checked in order to
+ positively identify that the message is intended for the use to
+ which it is being put. That is, we are indeed dealing with a PtlRPC
+ message, and not, for example, corrupted memory or a bad pointer.
+4) repsize - An indication from the sender of an action request of the
+ maximum available space that has been set aside for any reply to
+ the request. A reply that attempts to use more than that much
+ space will be discarded. Question: How does the receiver know, at
+ the time of receipt, what the repsize value was from the request
+ the reply is in reply to?
+5) cksum - The checksum (a 32-bit CRC) of the header, including any
+ padding (see below) but not the additional buffers.
+6) flags - One of two values (ibid. "LUSTRE_MSG_MAGIC_V1" and
+ "LUSTRE_MSG_MAGIC_V2"). Question: What do these values indicate?
+7) padding - This field and the next are two 4-byte fields used to
+ ensure that the following array is aligned on a 16-byte boundary.
+8) padding - The second 4-byte padding field.
+9) bufflens[] - An array of 4-byte unsigned integers with 'buffcount'
+ entries. Each entry corresponds to, and gives the length of, one
+ of the buffers that will follow and that constitute the remainder
+ of the message.
+10) padding - The first of the buffers following the header must be
+ aligned on a 16-byte boundary. Since the length of the 'bufflens'
+ array is in increments of four bytes we may need up to twelve
+ additional bytes of padding before the first buffer.
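+
+As a minimal sketch of the arithmetic in items 9 and 10, the header
+length can be computed from 'buffcount' alone. The function name is
+hypothetical, and the 16-byte alignment is taken from the description
+above rather than checked against the sources:

```python
FIXED_FIELDS = 8   # the eight 4-byte fields, items 1 through 8
FIELD_SIZE = 4     # every header entry is a 4-byte unsigned integer
ALIGNMENT = 16     # assumption: alignment as stated in item 10


def header_size(buffcount):
    """Return (unpadded, padded) header length in bytes."""
    # Fixed fields plus one 'bufflens' entry per buffer.
    unpadded = FIELD_SIZE * (FIXED_FIELDS + buffcount)
    # Round up to the next multiple of ALIGNMENT.
    padded = (unpadded + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT
    return unpadded, padded
```

+With one buffer the unpadded header is 36 bytes, so twelve bytes of
+padding are needed to reach 48. With four buffers the header is
+already 48 bytes long and no padding is added.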
+
+The 'buffcount' field gives the number of buffers that follow. The
+length of the i^{th} buffer is given by the field 'bufflens[i]', and
+the buffers themselves follow immediately and in order. As mentioned
+above, the 'secflvr' field will be zero unless some sort of
+cryptographic encoding is employed, and the interpretation of
+encrypted PtlRPC messages is left to another document.
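+
+The layout described so far can be sketched as a small packer and
+parser. This is an illustration under stated assumptions, not the
+Lustre implementation: the function names are hypothetical, the
+integers are assumed to be little-endian, the buffers are assumed to
+follow one another with no padding between them, and 'cksum' is left
+at zero rather than computed.

```python
import struct

HEADER_FIELDS = ("buffcount", "secflvr", "magic", "repsize",
                 "cksum", "flags", "padding1", "padding2")
FIXED = struct.Struct("<8I")  # assumption: little-endian 32-bit fields


def round_up16(n):
    """Round n up to the next multiple of 16."""
    return (n + 15) & ~15


def pack_message(buffers, magic=0, repsize=0):
    """Build header + buffers; secflvr, cksum and flags are left at 0."""
    fixed = FIXED.pack(len(buffers), 0, magic, repsize, 0, 0, 0, 0)
    lens = struct.pack("<%dI" % len(buffers), *[len(b) for b in buffers])
    header = fixed + lens
    # Pad the header so the first buffer starts on a 16-byte boundary.
    header += b"\0" * (round_up16(len(header)) - len(header))
    return header + b"".join(buffers)


def parse_message(data):
    """Split a message into (fields, bufflens, buffers)."""
    fields = dict(zip(HEADER_FIELDS, FIXED.unpack_from(data, 0)))
    count = fields["buffcount"]
    bufflens = struct.unpack_from("<%dI" % count, data, FIXED.size)
    offset = round_up16(FIXED.size + 4 * count)
    buffers = []
    for length in bufflens:
        if offset + length > len(data):
            raise ValueError("message shorter than bufflens claims")
        buffers.append(data[offset:offset + length])
        offset += length
    return fields, list(bufflens), buffers
```

+Packing the two buffers b"abc" and b"hello" gives a 48-byte header
+followed by 8 bytes of buffer data, and parsing that message returns
+the same two buffers.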
+
+Each buffer has additional structure imposed on it, and the first
+buffer always has the following format (ibid. "struct ptlrpc_body_v3")
+with fields:
+1) handle - A 64-bit value to uniquely determine shared state between
+ a sender and a receiver. When a communication is initiated, as in a
+ "connect" message (from a client to a server), the value will be
+ 0. A reply (from the server back to the client) to this message
+ will contain a value (a "cookie") to identify the shared
+ state information (the "export") for the client that is maintained
+ on the server. The client will then associate this cookie with the
+ shared state information (the "import") that it maintains about
+ the server. Subsequent messages between this client and this server
+ will refer to the same shared state by using this cookie as the
+ handle in this field.
+2) type - One of the three message types (ibid.)
+ "PTL_RPC_MSG_REQUEST", "PTL_RPC_MSG_ERR", or
+ "PTL_RPC_MSG_REPLY". As one might expect, "request" and "reply" are
+ the two usual message types, one for initiating an exchange and
+ the other for completing it. The "err" message type is only for
+ responding to a PtlRPC message that failed to be interpreted as an
+ actual message. That is, "err" does not reflect any kind of
+ error in processing a PtlRPC message once it has been decoded into
+ its constituent components; it is used only if and when that
+ decoding fails.
+3) version - This field encodes (ibid.) the "PTLRPC_MSG_VERSION" value
+ in combination ('or'ed) with one of the Lustre version symbols:
+ LUSTRE_OBD_VERSION
+ LUSTRE_MDS_VERSION
+ LUSTRE_OST_VERSION
+ LUSTRE_DLM_VERSION
+ LUSTRE_LOG_VERSION
+ LUSTRE_MGS_VERSION
+ Question: What exactly is the significance of these?
+4) opc - Gives the actual operation that is the subject of this
+ PtlRPC. There is a long list of such "op codes". Documenting the
+ semantics of each of them is one of the core purposes of this
+ document. For reference (ibid.) they are detailed elsewhere.
+
+ If you look at all the instances in the source code defined in
+ *_cmd_t enumerations you get a list of 73 items. If you look
+ in the req_formats struct in layout.c you will see a list of 94
+ items. They have 44 items in common. Question: What is the
+ connection between the two, if any?
+
+ There are 95 distinct patterns of PtlRPC structures (grep for
+ "static const struct req_msg_field *" in
+ lustre/ptlrpc/layout.c). There are 94 named dialogs where each
+ dialog consists of two of the foregoing PtlRPC structure
+ patterns. The pair of patterns is in the form of a call and
+ response pair, though there is also the option for having no
+ response or even for having neither a call nor a response. In those
+ cases the special PtlRPC structure pattern is referred to as
+ "empty".
+
+5) status -
+6) last_xid -
+7) last_seen -
+8) last_committed -
+9) transno -
+10) flags -
+11) op_flags -
+12) conn_cnt -
+13) timeout -
+14) service_time -
+15) limit -
+16) slv -
+17) pre_versions[PTLRPC_NUM_VERSIONS] -
+18) padding[4] -
+19) jobid[LUSTRE_JOBID_SIZE] -
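+
+The handle exchange described in item 1 can be sketched as a toy
+model. Everything here is hypothetical: the class and method names are
+invented for illustration, no real messages are sent, and a random
+nonzero 64-bit number stands in for the cookie, whose actual
+generation is not described in this document.

```python
import secrets


class Server:
    def __init__(self):
        self.exports = {}  # cookie -> per-client state (the "export")

    def connect(self, client_name, handle):
        """Handle a "connect" request, which carries handle 0."""
        if handle != 0:
            raise ValueError("a connect request carries handle 0")
        cookie = 0
        # Assumption: any unused nonzero 64-bit value serves as a cookie.
        while cookie == 0 or cookie in self.exports:
            cookie = secrets.randbits(64)
        self.exports[cookie] = {"client": client_name, "requests": 0}
        return cookie  # carried back in the handle field of the reply

    def request(self, handle):
        """Look up the shared state named by a later request's handle."""
        export = self.exports[handle]
        export["requests"] += 1
        return export


class Client:
    def __init__(self, name):
        self.name = name
        self.imports = {}  # server name -> cookie (the "import")

    def connect(self, server_name, server):
        cookie = server.connect(self.name, 0)
        self.imports[server_name] = cookie
        return cookie
```

+After a client connects, every later request it sends carries the
+cookie it was given, and the server uses that cookie to find the
+export belonging to that client.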