Lustre runs across multiple hosts, coordinating the activities among
those hosts via the exchange of messages over a network. On each host,
Lustre is implemented via a collection of threads. This discussion
will abstract some of the thread-level details in order to describe
the activities on each host as a collection of processes. Each process
may be thought of as a state machine, or automaton, following a fixed
set of rules for how it consumes messages, changes state, and produces
other messages; that is, its behavior. Processes communicate with each
other on a host via shared memory and with processes on other hosts
via messages. The Lustre protocol is the collection of messages the
processes exchange along with the rules governing the behavior of
those processes.

In order to understand the Lustre protocol it is helpful to begin with
a description of the messages being exchanged. Lustre uses a particular
format for its messages called PtlRPC. A PtlRPC message is a sequence
of bytes in a particular order, with a specific meaning associated
with each group of bytes in the message. The message (sequence of
bytes) is delivered to a lower-level communication mechanism called
LNet in order to be transported from one host to another. This
document will not discuss LNet beyond identifying it as a transport
layer that abstracts any underlying details of the actual networking
hardware.

The following discussion is intended to be self-contained, in that
additional external documents are not necessary in order for one to
understand (and indeed implement) the behaviors and messages
described. Nevertheless, for the interested there will be occasional
references directly into the Lustre code base, where one may see the
protocol as it is realized in one particular implementation, that
being Lustre-2.6.92-0 as pulled from the git repository for Lustre on
January 26th, 2015. The sole exception to the rule that this document
is self-contained is that the discussion will not be burdened by the
actual numerical values for hard-coded implementation details like
"magic" value numbers or flags and their fields. This document will
confine itself to the symbolic values. References to the source code
will be provided as needed for a prospective (otherwise black-box)
implementer to build a compatible implementation.

The structure of a PtlRPC message
=================================

A PtlRPC message is a sequence of bytes. It can vary in length and has
additional structure, but its simplest expression is just a byte
array. The bytes of a message can be divided into an initial "header"
and one or more "buffers" that follow the header. The header at the
beginning of a message can be further divided into a sequence of eight
4-byte "fields" (32-bit unsigned integers) followed by a
variable-length sequence of additional 4-byte entries organized as an
array (cf. lustre/include/lustre/lustre_idl.h: "struct
lustre_msg_v2"). The fields, in order and using names abstracted from
the source, are:

1) buffcount - The number of buffers that will follow the header. The
   form and content of these buffers is discussed below.
2) secflvr - An indication of whether any sort of cryptographic
   encoding of the subsequent buffers will be in force. The value is
   zero if there is no "crypto" and gives a code identifying the
   "flavor" of crypto if it is employed. Further, if crypto is
   employed there will only be one buffer following (i.e. buffcount =
   1), and that buffer is an encoding of what would otherwise have
   been the sequence of buffers normally following the header. This
   document will defer all discussion of cryptography. An addendum is
   planned that will address it separately.
3) magic - PtlRPC messages include a "magic" value
   (ibid. "LUSTRE_MSG_MAGIC_V2") that is checked in order to
   positively identify that the message is intended for the use to
   which it is being put. That is, we are indeed dealing with a PtlRPC
   message, and not, for example, corrupted memory or a bad pointer.
4) repsize - An indication from the sender of an action request of the
   maximum available space that has been set aside for any reply to
   the request. A reply that attempts to use more than that much
   space will be discarded. Question: How does the receiver know, at
   the time of receipt, what the repsize value was from the request
   that the reply answers?
5) cksum - The checksum (32-bit CRC) of the header, including any
   padding (see below) but not the additional buffers.
6) flags - One of two values (ibid. "LUSTRE_MSG_MAGIC_V1" and
   "LUSTRE_MSG_MAGIC_V2"). Question: What do these values indicate?
7) padding - This field and the next are two 4-byte fields used to
   assure that the following array is aligned on a 16-byte boundary.
8) padding - The second 4-byte padding field.
9) bufflens[] - An array of 4-byte unsigned integers with 'buffcount'
   entries. Each entry corresponds to, and gives the length of, one
   of the buffers that will follow and that constitute the remainder
   of the message.
10) padding - The first of the buffers following the header must be
    aligned on a 16-byte boundary. Since the length of the 'bufflens'
    array is in increments of four bytes we may need up to twelve
    additional bytes of padding before the first buffer.

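The header layout in the list above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
Lustre implementation: the little-endian byte order, the function and
constant names, and the placeholder magic number are all hypothetical,
and the 16-byte alignment is taken from the description above rather
than verified against the source.

```python
import struct

# Alignment as stated in the field list above; treat it as an assumption
# to be checked against the source code.
HEADER_ALIGN = 16

# Eight 32-bit unsigned fields. Little-endian ("<") is an assumption;
# the text does not specify a byte order.
FIXED_FIELDS = struct.Struct("<8I")

# Placeholder only. The real value is LUSTRE_MSG_MAGIC_V2 in the source.
PLACEHOLDER_MAGIC = 0x12345678


def pad_len(n, align=HEADER_ALIGN):
    """Return the number of bytes needed to round n up to a multiple of align."""
    return -n % align


def pack_header(bufflens, magic, secflvr=0, repsize=0, cksum=0, flags=0):
    """Pack fields 1-8, the bufflens[] array, and the trailing padding."""
    fixed = FIXED_FIELDS.pack(
        len(bufflens),  # 1) buffcount
        secflvr,        # 2) secflvr (zero means no crypto)
        magic,          # 3) magic
        repsize,        # 4) repsize
        cksum,          # 5) cksum (left for the caller to compute)
        flags,          # 6) flags
        0,              # 7) padding
        0,              # 8) padding
    )
    array = struct.pack("<%dI" % len(bufflens), *bufflens)  # 9) bufflens[]
    header = fixed + array
    return header + b"\x00" * pad_len(len(header))          # 10) padding
```

With a single buffer the unpadded header is 32 + 4 = 36 bytes, so
pack_header([10], PLACEHOLDER_MAGIC) carries the full twelve bytes of
padding and is 48 bytes long. With four buffers the unpadded header is
already 48 bytes and no padding is added.
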
The 'buffcount' field gives the number of buffers that follow. The
length of the i-th buffer is given by the entry 'bufflens[i]', and
the buffers themselves follow immediately and in order. As mentioned
above, the 'secflvr' field will be zero unless some sort of
cryptographic encoding is employed, and the interpretation of
encrypted PtlRPC messages is left to another document.

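As an illustration of how a receiver might walk this layout, the
following sketch splits a raw message into its buffers. It rests on
several assumptions that are not established by the text: a
little-endian byte order, a 16-byte header alignment as described in
the field list, buffers packed back to back with no padding between
them, and a caller-supplied magic value. The names split_message,
FIXED, and ALIGN are hypothetical.

```python
import struct

ALIGN = 16                    # header alignment as described in the text
FIXED = struct.Struct("<8I")  # eight fixed fields; byte order is an assumption


def round_up(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def split_message(msg, expected_magic):
    """Return the list of buffers carried by an unencrypted message."""
    if len(msg) < FIXED.size:
        raise ValueError("message shorter than the fixed header fields")
    (buffcount, secflvr, magic, _repsize,
     _cksum, _flags, _pad1, _pad2) = FIXED.unpack_from(msg, 0)
    if magic != expected_magic:
        raise ValueError("bad magic: not a PtlRPC message")
    if secflvr != 0:
        raise NotImplementedError("encoded messages are out of scope here")

    lens_end = FIXED.size + 4 * buffcount
    if len(msg) < lens_end:
        raise ValueError("message too short for its bufflens[] array")
    bufflens = struct.unpack_from("<%dI" % buffcount, msg, FIXED.size)

    # The first buffer starts at the next aligned offset after bufflens[].
    offset = round_up(lens_end)
    buffers = []
    for length in bufflens:
        if offset + length > len(msg):
            raise ValueError("message too short for its buffers")
        buffers.append(msg[offset:offset + length])
        offset += length  # assumption: no padding between buffers
    return buffers
```

The magic check comes first so that corrupted memory is rejected before
any of the length fields are trusted, which matches the purpose the
text gives for the 'magic' field.
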
Each buffer has additional structure imposed on it, and the first
buffer always has the following format (ibid. "struct ptlrpc_body_v3"):

1) handle - A 64-bit value to uniquely determine shared state between
   a sender and a receiver. When a communication is initiated, as in a
   "connect" message (from a client to a server), the value will be
   0. A reply (from the server back to the client) to this message
   will contain a value (a "cookie") to identify the shared
   state information (the "export") for the client that is maintained
   on the server. The client will then associate this cookie with the
   shared state information (the "import") that it maintains about
   the server. Subsequent messages between this client and this server
   will refer to the same shared state by using this cookie as the
   handle in this field.
2) type - One of the three message types (ibid.)
   "PTL_RPC_MSG_REQUEST", "PTL_RPC_MSG_ERR", or
   "PTL_RPC_MSG_REPLY". As one might expect, "request" and "reply" are
   the two usual message types, one for initiating an exchange and
   the other for completing it. The "err" message type is only for
   responding to a PtlRPC message that failed to be interpreted as an
   actual message. That is, "err" does not reflect any kind of
   error in processing a PtlRPC message once it has been decoded into
   its constituent components, but only if and when that decoding
   fails.
3) version - This field encodes (ibid.) the "PTLRPC_MSG_VERSION" value
   in combination ('or'ed) with one of the Lustre version symbols:

   Question: What exactly is the significance of these?
4) opc - Gives the actual operation that is the subject of this
   PtlRPC. There is a long list of such "op codes". Documenting the
   semantics of each of them is one of the core purposes of this
   document. For reference (ibid.) they are detailed elsewhere.

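The exchange described under the 'handle' field above (a zero handle on
"connect", a cookie in the reply, and the same cookie on every later
message) can be sketched as a toy model. Everything here is
hypothetical and for illustration only: the Server and Client classes,
their method names, the dictionary-based export and import records,
and the use of a random 64-bit number as the cookie. The text does not
say how the real cookie is generated.

```python
import secrets


class Server:
    """Toy server that keeps one "export" record per connected client."""

    def __init__(self):
        self.exports = {}  # cookie -> shared state for one client

    def handle_message(self, handle, client_name):
        if handle == 0:
            # A "connect": allocate a fresh non-zero 64-bit cookie.
            cookie = 0
            while cookie == 0 or cookie in self.exports:
                cookie = secrets.randbits(64)
            self.exports[cookie] = {"client": client_name, "requests": 0}
            return cookie
        # Any later message must carry a cookie the server handed out.
        export = self.exports[handle]  # KeyError if the handle is unknown
        export["requests"] += 1
        return handle


class Client:
    """Toy client that keeps one "import" record per server."""

    def __init__(self, name):
        self.name = name
        self.imports = {}  # server name -> shared state about that server

    def connect(self, server_name, server):
        cookie = server.handle_message(0, self.name)
        self.imports[server_name] = {"handle": cookie}

    def request(self, server_name, server):
        handle = self.imports[server_name]["handle"]
        return server.handle_message(handle, self.name)
```

After client.connect("mds", server), every client.request("mds",
server) presents the stored cookie, so the server finds the same
export record each time without any other identification of the
client.
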
If you look at all the instances in the source code defined in
*_cmd_t enumerations you get the above list of 73 items. If you look
in the req_formats struct in layout.c you will see a list of 94
items. They have 44 items in common. Let's figure out the
connection between the two, if any.

There are 95 distinct patterns of PtlRPC structures (grep for
"static const struct req_msg_field *" in
lustre/ptlrpc/layout.c). There are 94 named dialogs, where each
dialog consists of two of the foregoing PtlRPC structure
patterns. The pair of patterns is in the form of a call and
response pair, though there is also the option of having no
response or even of having neither a call nor a response. In those
cases the special PtlRPC structure pattern is referred to as

17) pre_versions[PTLRPC_NUM_VERSIONS] -

19) jobid[LUSTRE_JOBID_SIZE] -