1 Data Structures and Defines
2 ---------------------------
5 The following data types are used in the Lustre protocol description.
14 | __u8 | an 8-bit unsigned integer
15 | __u16 | a 16-bit unsigned integer
16 | __u32 | a 32-bit unsigned integer
17 | __u64 | a 64-bit unsigned integer
18 | __s64 | a 64-bit signed integer
26 The following topics introduce the various kinds of data that are
27 represented and manipulated in Lustre messages and representations of
28 the shared state on clients and servers.
34 A grant value is part of a client's state for a given target. It
35 provides an upper bound on the amount of dirty cache data the client
36 will allow that is destined for the target. The value is established
37 by agreement between the server and the client and represents a
38 guarantee by the server that the target storage has space for the
39 dirty data. The client can ask for additional grant, which the server
40 may provide depending on how full the target is.
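
As an illustration of how a grant bounds dirty data, the following is a minimal sketch and not Lustre's actual client code. The 'grant_state' structure and the 'grant_try_dirty' function are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-target client state; not a real Lustre structure. */
struct grant_state {
    uint64_t granted; /* bytes the server has guaranteed space for */
    uint64_t dirty;   /* bytes of dirty cache destined for the target */
};

/*
 * Account for 'len' more dirty bytes if they fit under the grant.
 * Returns false, leaving the state unchanged, when the client would
 * first have to flush data or ask the server for additional grant.
 */
static bool grant_try_dirty(struct grant_state *gs, uint64_t len)
{
    assert(gs->dirty <= gs->granted);
    if (len > gs->granted - gs->dirty)
        return false;
    gs->dirty += len;
    return true;
}
```

A caller would check the return value before caching a write. A 'false' result is the point at which a real client would need to write back dirty data or request more grant.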
46 Each target is assigned an LOV index (by the 'mkfs' command line) as
47 the target is added to the file system. This value is stored in the
48 MGS in order to identify its role in the file system.
54 For each target there is a sequence of values (a strictly increasing
55 series of numbers) where each operation that can modify the file
56 system is assigned the next number in the series. This is the
57 transaction number, and it imposes a strict serial ordering on all of
58 the file system modifying operations. For file system modifying
59 requests the server generates the next value in the sequence and
60 informs the client of the value in the 'pb_transno' field of the
61 'ptlrpc_body' of its reply to the client's request. For replies to
62 requests that do not modify the file system the 'pb_transno' field in
63 the 'ptlrpc_body' is just set to 0.
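
The rule above can be sketched as follows. This is only an illustration under stated assumptions: 'target_state' and 'reply_transno' are hypothetical names, and a real server must also persist and recover the counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-target server state for this sketch. */
struct target_state {
    uint64_t last_transno; /* last transaction number handed out */
};

/*
 * Value to place in 'pb_transno' of a reply: the next number in the
 * target's strictly increasing sequence for a modifying operation,
 * and 0 for an operation that does not modify the file system.
 */
static uint64_t reply_transno(struct target_state *t, bool modifies_fs)
{
    if (!modifies_fs)
        return 0;
    return ++t->last_transno;
}
```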
65 Miscellaneous Structured Data Types
66 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
71 I have not figured out how so-called 'eadata' buffers are handled,
72 yet. I am told that this is not just for extended attributes, but is a
78 A 'lustre_capa' structure conveys details about the capabilities
79 supported (or requested) between a client and a given target. I am
80 told that this is deprecated in later versions of Lustre.
83 #define CAPA_HMAC_MAX_LEN 64
93 __u8 lc_hmac[CAPA_HMAC_MAX_LEN];
97 include::lustre_file_ids.txt[]
100 MGS Configuration Reference
101 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
104 #define MTI_NAME_MAXLEN 64
105 struct mgs_config_body {
106 char mcb_name[MTI_NAME_MAXLEN]; /* logname */
107 __u64 mcb_offset; /* next index of config log to request */
108 __u16 mcb_type; /* type of log: CONFIG_T_[CONFIG|RECOVER] */
110 __u8 mcb_bits; /* bits unit size of config log */
111 __u32 mcb_units; /* # of units for bulk transfer */
115 The 'mgs_config_body' structure has information identifying to the MGS
116 which Lustre file system the client is asking about.
118 MGS Configuration Data
119 ^^^^^^^^^^^^^^^^^^^^^^
122 struct mgs_config_res {
123 __u64 mcr_offset; /* index of last config log */
124 __u64 mcr_size; /* size of the log */
128 The 'mgs_config_res' structure returns information about the Lustre
131 include::lustre_handle.txt[]
133 Lustre Message Header
134 ^^^^^^^^^^^^^^^^^^^^^
135 [[lustre-message-header]]
137 Every message has an initial header that informs the receiver about
138 the size of the rest of the message to follow along with a few other
142 #define LUSTRE_MSG_MAGIC_V2 0x0BD00BD3
143 #define MSGHDR_AT_SUPPORT 0x1
144 struct lustre_msg_v2 {
155 #define lustre_msg lustre_msg_v2
158 The 'lm_bufcount' field gives the number of buffers that will follow
159 the header. The header and sequence of buffers constitutes one
160 message. Each of the buffers is a sequence of bytes whose contents
161 corresponds to one of the structures described in this section. There
162 will always be at least one, and no message has more than eight.
164 The 'lm_secflvr' field gives an indication of whether any sort of
165 cryptographic encoding of the subsequent buffers will be in force. The
166 value is zero if there is no "crypto" and gives a code identifying the
167 "flavor" of crypto if it is employed. Further, if crypto is employed
168 there will only be one buffer following (i.e. 'lm_bufcount' = 1), and that
169 buffer is an encoding of what would otherwise have been the sequence
170 of buffers normally following the header. This document will defer all
171 discussion of cryptography. A chapter is planned that will address it
174 The 'lm_magic' field is a "magic" value (LUSTRE_MSG_MAGIC_V2) that is
175 checked in order to positively identify that the message is intended
176 for the use to which it is being put. That is, we are indeed dealing
177 with a Lustre message, and not, for example, corrupted memory or a bad
180 The 'lm_repsize' field is an indication from the sender of an action
181 request of the maximum available space that has been set aside for
182 any reply to the request. A reply that attempts to use more than that
183 much space will be discarded.
185 The 'lm_cksum' has to do with the <<security>> settings for the
186 cluster. Fixme: This may not be in current use. We need to verify.
188 The 'lm_flags' field can be set to enable adaptive timeouts support
189 with the value MSGHDR_AT_SUPPORT.
191 The 'lm_padding*' fields are reserved for future use.
193 The array of 'lm_buflens' values has 'lm_bufcount' entries. Each
194 entry corresponds to, and gives the length of, one of the buffers that
197 The entire header is required to be a multiple of eight bytes
198 long. Thus there may need to be an extra four bytes of padding after the
199 'lm_buflens' array if that array has an odd number of entries.
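
The padding rule can be made concrete with a small calculation. The sketch below assumes that the fixed fields preceding the length array occupy eight 32-bit words (32 bytes). The names 'HDR_FIXED_BYTES' and 'lustre_hdr_size' are hypothetical and used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: eight 32-bit fixed fields precede the length array. */
#define HDR_FIXED_BYTES 32u

/*
 * Size in bytes of the header for a message with 'bufcount' buffers:
 * the fixed fields plus one 32-bit length per buffer, rounded up to
 * a multiple of eight bytes.
 */
static size_t lustre_hdr_size(uint32_t bufcount)
{
    /* The text states at least one and at most eight buffers. */
    assert(bufcount >= 1 && bufcount <= 8);
    size_t raw = HDR_FIXED_BYTES + (size_t)bufcount * sizeof(uint32_t);
    return (raw + 7u) & ~(size_t)7u;
}
```

Under that assumption, one buffer and two buffers both yield a 40-byte header: the single-buffer case carries four bytes of padding, the two-buffer case none.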
204 The 'obd_statfs' structure defines fields that are used for returning
205 common server 'statfs' data items to a client. It augments that data
206 with some Lustre-specific information, and also has space allocated
207 for future use by Lustre.
221 __u32 os_state; /**< obd_statfs_state OS_STATE_* flag */
222 __u32 os_fprecreated; /* objs available now to the caller */
223 /* used in QoS code to find preferred
236 Lustre Message Preamble
237 ^^^^^^^^^^^^^^^^^^^^^^^
238 [[lustre-message-preamble]]
240 Every Lustre message starts with both the above header and an
241 additional set of fields (in its first "buffer") given by the 'struct
242 ptlrpc_body_v3' structure. This preamble has information
243 relevant to every message type. In particular, the Lustre message type
244 is itself encoded in the 'pb_opc' Lustre operation number. The value
245 of that op code determines what else will be in the message following
248 #define PTLRPC_NUM_VERSIONS 4
249 #define JOBSTATS_JOBID_SIZE 32
250 struct ptlrpc_body_v3 {
251 struct lustre_handle pb_handle;
258 __u64 pb_last_committed;
264 __u32 pb_service_time;
267 __u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
269 char pb_jobid[JOBSTATS_JOBID_SIZE];
271 #define ptlrpc_body ptlrpc_body_v3
273 In a connection request, sent by a client to server and regarding a
274 specific target, the 'pb_handle' is 0. In the reply to a connection
275 request, sent by the server, the handle is a value uniquely
276 identifying the target. Subsequent messages between this client and
277 this server regarding this target will use this handle to gain
278 access to their shared state. The handle is persistent across
281 The 'pb_type' is PTL_RPC_MSG_REQUEST in messages when they are
282 initiated, it is PTL_RPC_MSG_REPLY in a reply, and it is
283 PTL_RPC_MSG_ERR to convey that a message was received that could not
284 be interpreted, that is, if it was corrupt or incomplete. The encoding
285 of those type values is given by:
287 #define PTL_RPC_MSG_REQUEST 4711
288 #define PTL_RPC_MSG_ERR 4712
289 #define PTL_RPC_MSG_REPLY 4713
291 The error message type is only for responding to a message that failed
292 to be interpreted as an actual message. Note that other errors, such
293 as those that emerge from processing the actual message content, do
294 not use the PTL_RPC_MSG_ERR type.
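
A receiver might classify the 'pb_type' value along the following lines. This is a sketch only: the helper names 'pb_type_is_known' and 'pb_type_name' are hypothetical and do not appear in Lustre.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR     4712
#define PTL_RPC_MSG_REPLY   4713

/* True if 'pb_type' is one of the three defined message types. */
static bool pb_type_is_known(uint32_t pb_type)
{
    return pb_type == PTL_RPC_MSG_REQUEST ||
           pb_type == PTL_RPC_MSG_ERR ||
           pb_type == PTL_RPC_MSG_REPLY;
}

/* Human-readable label for logging; "unknown" for anything else. */
static const char *pb_type_name(uint32_t pb_type)
{
    switch (pb_type) {
    case PTL_RPC_MSG_REQUEST: return "request";
    case PTL_RPC_MSG_ERR:     return "error";
    case PTL_RPC_MSG_REPLY:   return "reply";
    default:                  return "unknown";
    }
}
```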
296 The 'pb_version' identifies the version of the Lustre protocol and is
297 derived from the following constants. The lower two bytes give the
298 version of PtlRPC being employed in the message, and the upper two
299 bytes encode the role of the host for the service being
300 requested. That role is one of OBD, MDS, OST, DLM, LOG, or MGS.
302 #define PTLRPC_MSG_VERSION 0x00000003
303 #define LUSTRE_VERSION_MASK 0xffff0000
304 #define LUSTRE_OBD_VERSION 0x00010000
305 #define LUSTRE_MDS_VERSION 0x00020000
306 #define LUSTRE_OST_VERSION 0x00030000
307 #define LUSTRE_DLM_VERSION 0x00040000
308 #define LUSTRE_LOG_VERSION 0x00050000
309 #define LUSTRE_MGS_VERSION 0x00060000
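
Given the constants above, a 'pb_version' value can be split into its two parts with the mask. The following is a minimal sketch; the helper names 'pb_version_role' and 'pb_version_ptlrpc' are hypothetical.

```c
#include <stdint.h>

#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_MDS_VERSION  0x00020000

/* Upper two bytes: the role of the service (OBD, MDS, OST, ...). */
static uint32_t pb_version_role(uint32_t pb_version)
{
    return pb_version & LUSTRE_VERSION_MASK;
}

/* Lower two bytes: the PtlRPC version used by the message. */
static uint32_t pb_version_ptlrpc(uint32_t pb_version)
{
    return pb_version & ~LUSTRE_VERSION_MASK;
}
```

For example, a value built as `LUSTRE_MDS_VERSION | PTLRPC_MSG_VERSION` (0x00020003) decodes to the MDS role and PtlRPC version 3.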
312 The 'pb_opc' value (operation code) gives the actual Lustre operation
313 that is the subject of this message. For example, MDS_CONNECT is a
314 Lustre operation (number 38). The following list gives the name used
315 and the value for each operation.
336 OST_QUOTA_ADJUST_QUNIT = 20,
338 MDS_GETATTR_NAME = 34,
349 MDS_DONE_WRITING = 45,
358 MDS_HSM_STATE_GET = 54,
359 MDS_HSM_STATE_SET = 55,
361 MDS_HSM_PROGRESS = 57,
362 MDS_HSM_REQUEST = 58,
363 MDS_HSM_CT_REGISTER = 59,
364 MDS_HSM_CT_UNREGISTER = 60,
365 MDS_SWAP_LAYOUTS = 61,
369 LDLM_BL_CALLBACK = 104,
370 LDLM_CP_CALLBACK = 105,
371 LDLM_GL_CALLBACK = 106,
374 MGS_DISCONNECT = 251,
376 MGS_TARGET_REG = 253,
377 MGS_TARGET_DEL = 254,
379 MGS_CONFIG_READ = 256,
381 OBD_LOG_CANCEL = 401,
382 OBD_QC_CALLBACK = 402,
384 LLOG_ORIGIN_HANDLE_CREATE = 501,
385 LLOG_ORIGIN_HANDLE_NEXT_BLOCK = 502,
386 LLOG_ORIGIN_HANDLE_READ_HEADER = 503,
387 LLOG_ORIGIN_HANDLE_WRITE_REC = 504,
388 LLOG_ORIGIN_HANDLE_CLOSE = 505,
389 LLOG_ORIGIN_CONNECT = 506,
390 LLOG_ORIGIN_HANDLE_PREV_BLOCK = 508,
391 LLOG_ORIGIN_HANDLE_DESTROY = 509,
396 SEC_CTX_INIT_CONT = 802,
404 The symbols and values above identify the operations Lustre uses in
405 its protocol. They are examined in detail in the
406 <<lustre-operations,Lustre Operations>> section. Lustre carries out
407 each of these operations via the exchange of a pair of messages: a
408 request and a reply. The details of each message are specific to each
409 operation. The <<lustre-messages,Lustre Messages>> chapter discusses
410 each message and its contents.
412 The 'pb_status' value in a request message is set to the 'pid' of the
413 process making the request. In a reply message, a zero indicates that
414 the service successfully initiated the requested operation. If for
415 some reason the operation could not be initiated (e.g. "permission
416 denied") the status will encode the standard Linux kernel (POSIX)
417 error code (e.g. EACCES).
419 'pb_last_xid' and 'pb_last_seen' are not used.
421 The 'pb_last_committed' value is always zero in a request. In a reply
422 it is the highest transaction number that has been committed to
423 storage. The transaction numbers are maintained on a per-target basis
424 and each series of transaction numbers is a strictly increasing
425 sequence. This field is set in any kind of reply message including
426 pings and non-modifying transactions.
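
Because each target's transaction numbers are strictly increasing, a client can compare a request's transaction number with the latest 'pb_last_committed' it has seen to tell whether that request's changes have reached storage. The sketch below illustrates the comparison. The function name 'transno_is_committed' is hypothetical, and what a client does with the result is outside the scope of this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a modifying operation with the given transaction number is
 * covered by 'last_committed'. A transaction number of 0 means the
 * operation did not modify the file system, so there is nothing to
 * commit.
 */
static bool transno_is_committed(uint64_t transno, uint64_t last_committed)
{
    return transno != 0 && transno <= last_committed;
}
```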
428 The 'pb_transno' value is always zero in a new request. It is also zero
429 for replies to operations that do not modify the file system. For
430 replies to operations that do modify the file system it is the
431 server-assigned value from the sequence of values associated with the
432 given client and target. That transaction number is copied into the
433 'pb_transno' field of the 'ptlrpc_body' of the original request. If the
434 request has to be replayed it will include the transaction number.
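
The copy-back step can be sketched as follows. The 'saved_request' structure and the 'record_reply_transno' function are hypothetical stand-ins for whatever the client keeps in order to replay a request. Only the handling of the transaction number is shown.

```c
#include <stdint.h>

/* Hypothetical record of a request the client retains for replay. */
struct saved_request {
    uint64_t pb_transno; /* 0 until a reply assigns a number */
};

/*
 * When a reply arrives, copy its server-assigned transaction number
 * into the saved request so that a later replay carries it. Replies
 * to non-modifying operations carry 0 and leave the request as is.
 */
static void record_reply_transno(struct saved_request *req,
                                 uint64_t reply_transno)
{
    if (reply_transno != 0)
        req->pb_transno = reply_transno;
}
```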
436 The 'pb_flags' value governs the client state machine. Fixme: document
437 what the states and transitions are of this state machine. Currently,
438 only the bottom two bytes are used, and they encode state according to
439 the following values:
441 #define MSG_GEN_FLAG_MASK 0x0000ffff
442 #define MSG_LAST_REPLAY 0x0001
443 #define MSG_RESENT 0x0002
444 #define MSG_REPLAY 0x0004
445 #define MSG_DELAY_REPLAY 0x0010
446 #define MSG_VERSION_REPLAY 0x0020
447 #define MSG_REQ_REPLAY_DONE 0x0040
448 #define MSG_LOCK_REPLAY_DONE 0x0080
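
Testing for one of these flags amounts to masking off the unused upper bytes and checking the bit. The sketch below repeats three of the constants to stay self-contained. The helper name 'pb_flag_is_set' is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSG_GEN_FLAG_MASK 0x0000ffff
#define MSG_RESENT        0x0002
#define MSG_REPLAY        0x0004

/* True if 'flag' is set within the two bytes of 'pb_flags' in use. */
static bool pb_flag_is_set(uint32_t pb_flags, uint32_t flag)
{
    return ((pb_flags & MSG_GEN_FLAG_MASK) & flag) != 0;
}
```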
451 The 'pb_op_flags' value governs the client connection status state
452 machine. Fixme: document what the states and transitions are of this
455 #define MSG_CONNECT_RECOVERING 0x00000001
456 #define MSG_CONNECT_RECONNECT 0x00000002
457 #define MSG_CONNECT_REPLAYABLE 0x00000004
458 #define MSG_CONNECT_LIBCLIENT 0x00000010
459 #define MSG_CONNECT_INITIAL 0x00000020
460 #define MSG_CONNECT_ASYNC 0x00000040
461 #define MSG_CONNECT_NEXT_VER 0x00000080
462 #define MSG_CONNECT_TRANSNO 0x00000100
464 In normal operation an initial request to connect will set
465 'pb_op_flags' to MSG_CONNECT_INITIAL and MSG_CONNECT_NEXT_VER. The
466 reply to that connection request (and all other, non-connect, requests
467 and replies) will set 'pb_op_flags' to 0.
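
The normal-operation rule just described can be expressed as a small helper. This is a sketch, not Lustre code: the function name 'connect_op_flags' is hypothetical, and recovery or reconnect cases are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSG_CONNECT_INITIAL  0x00000020
#define MSG_CONNECT_NEXT_VER 0x00000080

/*
 * 'pb_op_flags' for a message in normal operation: both flags for an
 * initial connect request, 0 for its reply and for every other
 * request or reply.
 */
static uint32_t connect_op_flags(bool initial_connect_request)
{
    if (initial_connect_request)
        return MSG_CONNECT_INITIAL | MSG_CONNECT_NEXT_VER;
    return 0;
}
```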
469 The 'pb_conn_cnt' (connection count) value in a request message
470 reports the client's "era", which is part of the client and server's
471 shared state. The value of the era is initialized to one when the
472 client first connects to the MDT. Each subsequent connection (after an
473 eviction) increments the era for the client. Since the 'pb_conn_cnt'
474 reflects the client's era at the time the message was composed the
475 server can use this value to discard late-arriving messages requesting
476 operations on out-of-date shared state.
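
The staleness check this enables is a simple comparison. In the sketch below, 'current_era' stands for the era the server currently holds for the client. Both that name and 'request_is_stale' are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a request was composed in an earlier era than the one the
 * server currently records for this client, in which case it refers
 * to out-of-date shared state and can be discarded.
 */
static bool request_is_stale(uint32_t pb_conn_cnt, uint32_t current_era)
{
    return pb_conn_cnt < current_era;
}
```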
478 The 'pb_timeout' value in a request indicates how long (in seconds)
479 the requester plans to wait before timing out the operation. That is,
480 the corresponding reply for this message should arrive within this
481 time frame. The service may extend this time frame via an "early
482 reply", which is a reply to this message that notifies the requester
483 that it should extend its timeout interval by the value of the
484 'pb_timeout' field in the reply. The "early reply" does not indicate
485 the operation has actually been initiated. Clients maintain multiple
486 request queues, called "portals", and each type of operation is
487 assigned to one of these queues. There is a timeout value associated
488 with each queue, and the timeout update affects all the messages
489 associated with the given queue, not just the specific message that
490 initiated the request. Finally, in a reply message (one that does
491 indicate the operation has been initiated) the timeout value updates
492 the timeout interval for the queue. Is this last point different from
493 the "early reply" update?
495 The 'pb_service_time' value is zero in a request. In a reply it
496 indicates how long this particular operation actually took from the
497 time it first arrived in the request queue (at the service) to the
498 time the server replied. Note that the client can use this value and
499 the local elapsed time for the operation to calculate network latency.
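
The latency estimate mentioned above is the locally measured elapsed time minus the reported service time. The sketch below assumes both values are expressed in the same units. The function name 'estimate_net_latency' is hypothetical.

```c
#include <stdint.h>

/*
 * Estimate the time a request and its reply spent on the network:
 * the elapsed time measured by the client minus 'pb_service_time'
 * from the reply. Clamped at zero in case clock granularity makes
 * the service time appear larger than the elapsed time.
 */
static uint32_t estimate_net_latency(uint32_t elapsed, uint32_t service_time)
{
    return elapsed > service_time ? elapsed - service_time : 0;
}
```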
501 The 'pb_limit' value is zero in a request. In a reply it is a value
502 sent from a lock service to a client to set the maximum number of
503 locks available to the client. When dynamic lock LRUs are enabled
504 this allows for managing the size of the LRU.
506 The 'pb_slv' value is zero in a request. On a DLM service, the "server
507 lock volume" is a value that characterizes (estimates) the amount of
508 traffic, or load, on that lock service. It is calculated as the
509 product of the number of locks and their age. In a reply, the 'pb_slv'
510 value indicates to the client the available share of the total lock
511 load on the server that the client is allowed to consume. The client
512 is then responsible for reducing its number (or age) of locks to
513 stay within this limit.
515 The array of 'pb_pre_versions' values has four entries. They are
516 always zero in a new request message. They are also zero in replies to
517 operations that do not modify the file system. For an operation that
518 does modify the file system, the reply encodes the most recent
519 transaction numbers for the objects modified by this operation, and
520 the 'pb_pre_versions' values are copied into the original request when
521 the reply arrives. If the request needs to be replayed then the
522 updated 'pb_pre_versions' values accompany the replayed request.
524 'pb_padding' is reserved for future use.
526 The 'pb_jobid' (string) value gives a unique identifier associated
527 with the process on behalf of which this message was generated. The
528 identifier is assigned to the user process by a job scheduler, if any.
530 Object Based Disk UUID
531 ^^^^^^^^^^^^^^^^^^^^^^
550 struct lu_fid oi_fid;
551 } LUSTRE_ANONYMOUS_UNION_NAME;
555 include::mdt_structs.txt[]
557 include::mds_reint_structs.txt[]
559 include::ost_setattr_structs.txt[]