Lustre runs across multiple hosts and coordinates activity among those hosts by exchanging messages over a network. On each host, Lustre is implemented as a collection of threads. This discussion abstracts away some of the thread-level details in order to describe the activity on each host as a collection of processes. Each process may be thought of as a state machine, or automaton, following a fixed set of rules for how it consumes messages, changes state, and produces other messages; that is, its behavior. Processes communicate with each other on a host via shared memory and with processes on other hosts via messages. The Lustre protocol is the collection of messages the processes exchange, along with the rules governing the behavior of those processes.

To understand the Lustre protocol it is helpful to begin with a description of the messages being exchanged. Lustre uses a particular format for its messages called PtlRPC. A PtlRPC message is a sequence of bytes in a particular order, with specific meaning associated with the bytes in the message. The message (sequence of bytes) is delivered to a lower-level communication mechanism called LNet in order to be transported from one host to another. This document will not discuss LNet beyond identifying it as a transport layer that abstracts the details of the actual networking hardware.

The following discussion is intended to be self-contained, in that no additional external documents are needed in order to understand (and indeed implement) the behaviors and messages described. Nevertheless, for the interested reader there will be occasional references directly into the Lustre code base, where one may see the protocol as it is realized in one particular implementation, that being Lustre-2.6.92-0 as pulled from the git repository for Lustre on January 26th, 2015.
The sole exception to the rule that this document is self-contained is that the discussion will not be burdened with the actual numerical values of hard-coded implementation details such as "magic" values or flags and their fields. This document confines itself to the symbolic values; references to the source code are provided as needed so that a prospective (otherwise black-box) implementer can build a compatible implementation.

The structure of a PtlRPC message
=================================

A PtlRPC message is a sequence of bytes. It can vary in length and has additional structure, but its simplest expression is just a byte array. The bytes of a message can be divided into an initial "header" and one or more "buffers" that follow the header.

The header at the beginning of a message can be further divided into a sequence of eight 4-byte "fields" (32-bit unsigned integers) followed by a variable-length sequence of additional 4-byte entries organized as an array (cf. lustre/include/lustre/lustre_idl.h: "struct lustre_msg_v2"). The fields, in order and using names abstracted from the sources, are:

header
------

1) bufcount - The number of buffers that will follow the header. The form and content of these buffers is discussed below.

2) secflvr - An indication of whether any sort of cryptographic encoding of the subsequent buffers is in force. The value is zero if there is no "crypto"; otherwise it is a code identifying the "flavor" of crypto employed. Further, if crypto is employed there will be only one buffer following the header (i.e. bufcount = 1), and that buffer is an encoding of what would otherwise have been the sequence of buffers following the header. This document defers all discussion of cryptography. An addendum is planned that will address it separately.

3) magic - PtlRPC messages include a "magic" value (ibid. "LUSTRE_MSG_MAGIC_V2") that is checked in order to positively identify the message as being intended for the use to which it is being put. That is, we are indeed dealing with a PtlRPC message and not, for example, corrupted memory or a bad pointer.

4) repsize - An indication, from the sender of a request, of the maximum space that has been set aside for any reply to the request. A reply that attempts to use more than that much space will be discarded. Question: When a reply arrives, how does the receiver know the repsize value of the request that the reply answers?

5) cksum - A 32-bit CRC checksum of the header, including any padding (see below) but not the buffers that follow.

6) flags - One of two values (ibid. "LUSTRE_MSG_MAGIC_V1" and "LUSTRE_MSG_MAGIC_V2"). Question: What do these values indicate? This is still to be determined.

7) padding - This field and the next are two 4-byte fields used to ensure that the following array is aligned on a 16-byte boundary.

8) padding - The second 4-byte padding field.

9) buflens[] - An array of 4-byte unsigned integers with 'bufcount' entries. Each entry corresponds to, and gives the length of, one of the buffers that follow and that constitute the remainder of the message.

10) padding - The first of the buffers following the header must be aligned on a 16-byte boundary. Since the length of the 'buflens' array is a multiple of four bytes, up to twelve additional bytes of padding may be needed before the first buffer.

The 'bufcount' field gives the number of buffers that follow. The length of the i-th buffer is given by 'buflens[i]', and the buffers themselves follow immediately and in order. As mentioned above, the 'secflvr' field will be zero unless some sort of cryptographic encoding is employed, and the interpretation of encrypted PtlRPC messages is left to another document.

Each buffer has additional structure imposed on it, and the first buffer always has the following format (ibid. "struct ptlrpc_body_v3") with fields:

1) handle - A 64-bit value that uniquely identifies the shared state between a sender and a receiver. When a communication is initiated, as in a "connect" message (from a client to a server), the value will be 0. The reply (from the server back to the client) to this message will contain a value (a "cookie") identifying the shared state information (the "export") that the server maintains for the client. The client will then associate this cookie with the shared state information (the "import") that it maintains about the server. Subsequent messages between this client and this server will refer to the same shared state by using this cookie as the handle in this field.

2) type - One of the three message types (ibid.) "PTL_RPC_MSG_REQUEST", "PTL_RPC_MSG_ERR", or "PTL_RPC_MSG_REPLY". As one might expect, "request" and "reply" are the two usual message types, one for initiating an exchange and the other for completing it. The "err" message type is only for responding to a PtlRPC message that could not be interpreted as an actual message. That is, "err" does not reflect an error in processing a PtlRPC message once it has been decoded into its constituent components, but only a failure of that decoding.

3) version - This field encodes (ibid.) the "PTLRPC_MSG_VERSION" value in combination ('or'ed) with one of the Lustre version symbols:

    LUSTRE_OBD_VERSION
    LUSTRE_MDS_VERSION
    LUSTRE_OST_VERSION
    LUSTRE_DLM_VERSION
    LUSTRE_LOG_VERSION
    LUSTRE_MGS_VERSION

Question: What exactly is the significance of these?

4) opc - Gives the actual operation that is the subject of this PtlRPC. There is a long list of such "op codes", and documenting the semantics of each of them is one of the core purposes of this document. For reference (ibid.) they are detailed elsewhere. If you look at all the instances in the source code defined in *_cmd_t enumerations you get a list of 73 items.
If you look in the req_formats struct in layout.c you will see a list of 94 items. The two lists have 44 items in common. Question: What is the connection between the two, if any?

There are 95 distinct patterns of PtlRPC structures (grep for "static const struct req_msg_field *" in lustre/ptlrpc/layout.c). There are 94 named dialogs, where each dialog consists of two of the foregoing PtlRPC structure patterns. The two patterns form a call and response pair, though a dialog may also have no response, or neither a call nor a response. In those cases the special PtlRPC structure pattern is referred to as "empty".

5) status -

6) last_xid -

7) last_seen -

8) last_committed -

9) transno -

10) flags -

11) op_flags -

12) conn_cnt -

13) timeout -

14) service_time -

15) limit -

16) slv -

17) pre_versions[PTLRPC_NUM_VERSIONS] -

18) padding[4] -

19) jobid[LUSTRE_JOBID_SIZE] -
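
As an illustration of the header layout described earlier (eight 4-byte fields, then the 'buflens' array, then padding up to the first buffer), the following Python sketch unpacks a header and splits off the buffers. It is only a sketch under stated assumptions and is not taken from the Lustre sources: the names parse_header, split_buffers and round_up are hypothetical, the byte order is assumed to be little-endian because the text above does not specify one, and the alignment is a parameter that defaults to the 16-byte figure given in this document and should be checked against the sources. Buffers are taken to follow one another with no padding in between, as described above. The checksum and "magic" value are unpacked but not verified.

```python
import struct
from typing import List, NamedTuple, Tuple

FIXED_FIELDS = 8   # eight 4-byte fields precede the buflens array
FIELD_SIZE = 4     # every header entry is a 32-bit unsigned integer


class Header(NamedTuple):
    bufcount: int
    secflvr: int
    magic: int
    repsize: int
    cksum: int
    flags: int
    buflens: Tuple[int, ...]
    first_buffer_offset: int


def round_up(n: int, align: int) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def parse_header(msg: bytes, byte_order: str = "<", align: int = 16) -> Header:
    """Unpack the fixed fields and the buflens array (hypothetical helper).

    byte_order and align are assumptions, not values confirmed by the text.
    """
    fixed_size = FIXED_FIELDS * FIELD_SIZE
    if len(msg) < fixed_size:
        raise ValueError("message is shorter than the fixed header fields")
    (bufcount, secflvr, magic, repsize,
     cksum, flags, _pad1, _pad2) = struct.unpack_from(byte_order + "8I", msg, 0)

    array_end = fixed_size + bufcount * FIELD_SIZE
    if len(msg) < array_end:
        raise ValueError("message is shorter than its buflens array")
    buflens = struct.unpack_from(f"{byte_order}{bufcount}I", msg, fixed_size)

    # Padding after the array brings the first buffer to an aligned offset.
    return Header(bufcount, secflvr, magic, repsize, cksum, flags,
                  buflens, round_up(array_end, align))


def split_buffers(msg: bytes, hdr: Header) -> List[bytes]:
    """Slice out the buffers, assumed to follow one another without gaps."""
    buffers = []
    offset = hdr.first_buffer_offset
    for length in hdr.buflens:
        end = offset + length
        if end > len(msg):
            raise ValueError("buffer extends past the end of the message")
        buffers.append(msg[offset:end])
        offset = end
    return buffers
```

With two buffers, the fixed fields and the array occupy 32 + 8 = 40 bytes, so under the 16-byte assumption the first buffer would start at offset 48 after eight bytes of padding. Keeping the alignment and byte order as parameters makes it easy to adjust the sketch once those details have been confirmed against the sources.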