1 Ptlrpc_body - The Lustre RPC Descriptor
2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every Lustre message starts with both the above header and an
additional set of fields (in its first "buffer") given by the 'struct
ptlrpc_body_v3' structure. This preamble has information
relevant to every RPC type. In particular, the RPC type is itself
encoded in the 'pb_opc' Lustre operation number. The value of that
opcode, as well as whether it is an RPC 'request' or 'reply',
determines what else will be in the message following the preamble.

#define PTLRPC_NUM_VERSIONS 4
#define JOBSTATS_JOBID_SIZE 32
struct lustre_handle pb_handle;
__u64 pb_last_committed;
__u32 pb_service_time;
__u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
char pb_jobid[JOBSTATS_JOBID_SIZE];
In a connection request, sent by a client to a server and regarding a
specific target, the 'pb_handle' is 0. In the reply to a connection
request, sent by the target, the handle is a value uniquely
identifying the target. Subsequent messages between this client and
this target will use this handle to gain access to their shared
state. The handle is persistent across client reconnects to the same
instance of the server, but if the client unmounts the filesystem or
is evicted then it must re-connect as a new client.
The 'pb_type' is PTL_RPC_MSG_REQUEST in a message that initiates an
RPC, PTL_RPC_MSG_REPLY in a reply, and PTL_RPC_MSG_ERR in a reply
conveying that a message was received that could not be interpreted,
that is, if it was corrupt or incomplete. The encoding of those type
values is given by:

#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR     4712
#define PTL_RPC_MSG_REPLY   4713
The 'pb_type' = PTL_RPC_MSG_ERR is only for message handling errors.
This may be a message that failed to be interpreted as an actual
message, or it may indicate some more general problem with the
connection between the client and the target. Note that other errors,
such as those that emerge from processing the actual message content,
do not use the PTL_RPC_MSG_ERR type.
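As a minimal sketch of how a receiver might branch on 'pb_type', the
following C fragment maps the three values above onto a small
classification. Only the three constants come from the text; the
'msg_kind' enum and the 'classify_pb_type' function are hypothetical
names introduced for illustration and are not part of Lustre.

```c
#include <stdint.h>

/* Values quoted from the protocol description above. */
#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR     4712
#define PTL_RPC_MSG_REPLY   4713

/* Hypothetical classification, used only by this sketch. */
enum msg_kind {
    MSG_KIND_REQUEST,
    MSG_KIND_REPLY,
    MSG_KIND_HANDLING_ERROR, /* message could not be interpreted */
    MSG_KIND_UNKNOWN         /* pb_type is none of the three values */
};

static enum msg_kind classify_pb_type(uint32_t pb_type)
{
    switch (pb_type) {
    case PTL_RPC_MSG_REQUEST:
        return MSG_KIND_REQUEST;
    case PTL_RPC_MSG_REPLY:
        return MSG_KIND_REPLY;
    case PTL_RPC_MSG_ERR:
        return MSG_KIND_HANDLING_ERROR;
    default:
        return MSG_KIND_UNKNOWN;
    }
}
```

Errors raised while processing the content of a valid message would
still arrive as MSG_KIND_REPLY in this sketch, with the error carried
separately in 'pb_status'.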
The 'pb_status' field provides an error return code, if any, for the
RPC. When 'pb_type' = PTL_RPC_MSG_ERR the 'pb_status' will also be
set to one of the following message handling errors:

[options="header"]
|====
| pb_type         | pb_status | condition
| PTL_RPC_MSG_ERR | -ENOMEM   | No memory for reply buffer
| PTL_RPC_MSG_ERR | -ENOTSUPP | Invalid opcode
| PTL_RPC_MSG_ERR | -EINVAL   | Bad magic or version
| PTL_RPC_MSG_ERR | -EPROTO   | Request is malformed or cannot be
processed in current context
|====
A PTL_RPC_MSG_ERR message does not need to allocate memory, so it
should normally be sent as a reply even if there is not enough memory
to allocate the normal reply buffer, unless the underlying network
transport itself cannot allocate memory to send it. (fixme: and what
In most cases there is a reply with 'pb_type' = PTL_RPC_MSG_REPLY,
indicating that the request was processed, but it may still have
'pb_status' set to a non-zero value to indicate that the request
encountered an error during processing (see below). This may indicate
something very specific to the particular RPC, but it may also be a
very general sort of error. Those that are specific to particular RPCs
will be documented with the respective RPCs, and those that are more
generic are listed here:

[options="header"]
|====
| pb_type | pb_status | meaning
| PTL_RPC_MSG_REPLY | -ENOTCONN | Client is not connected to the
target, typically meaning the server was restarted or the client was
evicted, and the client needs to reconnect.
| PTL_RPC_MSG_REPLY | -EINPROGRESS | The request cannot be processed
currently due to some other factor, such as during initial mount, a
delay contacting the quota master during a write, or LFSCK rebuilding
the OI table, but the client should continue to retry after a delay
until interrupted or successful. This avoids blocking the server
threads with client requests that cannot currently be processed, but
other requests might be processed in the
| PTL_RPC_MSG_REPLY | -ESHUTDOWN | The server is being stopped and
no new connections are allowed.
|====
The significance of -ENOTCONN is discussed more fully in
<<connection>>, but a brief comment may be useful here. The networking
layers supporting the exchange of RPCs can be in good working order
when 'pb_status' = -ENOTCONN is returned in an RPC reply message. The
connection referred to by that status is the Lustre connection. That
connection is part of the shared state between Lustre clients and
servers that gets established via *_CONNECT RPCs,
and can be lost due to an 'eviction' in the face of temporary connection
failure or in case of an unresponsive client (from the server's point
of view). So, even when that Lustre connection is lost (or has not been
established, yet), RPC messages such as *_CONNECT can be exchanged.
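The generic reply statuses described above suggest a small dispatch
on the client side. The sketch below is one possible reading, not
Lustre's implementation: the 'reply_action' enum and
'decide_reply_action' function are hypothetical, and the errno
constants are taken from the host's '<errno.h>', which is assumed
here to match the Linux values used on the wire.

```c
#include <errno.h>
#include <stdint.h>

#define PTL_RPC_MSG_REPLY 4713

/* Hypothetical client-side outcomes, for illustration only. */
enum reply_action {
    ACTION_DONE,              /* pb_status == 0 */
    ACTION_RETRY_AFTER_DELAY, /* -EINPROGRESS */
    ACTION_RECONNECT,         /* -ENOTCONN: Lustre connection is gone */
    ACTION_SERVER_STOPPING,   /* -ESHUTDOWN */
    ACTION_RPC_FAILED,        /* any other error from processing the RPC */
    ACTION_MESSAGE_REJECTED   /* PTL_RPC_MSG_ERR or an unexpected pb_type */
};

static enum reply_action decide_reply_action(uint32_t pb_type,
                                             int32_t pb_status)
{
    /* Anything other than a normal reply is a message handling error. */
    if (pb_type != PTL_RPC_MSG_REPLY)
        return ACTION_MESSAGE_REJECTED;

    switch (pb_status) {
    case 0:
        return ACTION_DONE;
    case -ENOTCONN:
        return ACTION_RECONNECT;
    case -EINPROGRESS:
        return ACTION_RETRY_AFTER_DELAY;
    case -ESHUTDOWN:
        return ACTION_SERVER_STOPPING;
    default:
        return ACTION_RPC_FAILED;
    }
}
```

Note that ACTION_RECONNECT concerns only the Lustre-level connection;
the network itself may be working normally when it is returned.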
The 'pb_version' identifies the version of the Lustre protocol and is
derived from the following constants. The lower two bytes give the
version of PtlRPC being employed in the message, and the upper two
bytes encode the role of the host for the service being
requested. That role is one of OBD, MDS, OST, DLM, LOG, or MGS.

#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_OBD_VERSION  0x00010000
#define LUSTRE_MDS_VERSION  0x00020000
#define LUSTRE_OST_VERSION  0x00030000
#define LUSTRE_DLM_VERSION  0x00040000
#define LUSTRE_LOG_VERSION  0x00050000
#define LUSTRE_MGS_VERSION  0x00060000
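To make the two-halves layout concrete, the following sketch composes
and splits a 'pb_version' value. It assumes the field is simply the
bitwise OR of a role constant and PTLRPC_MSG_VERSION, which is what
the description above implies; the three helper functions are
hypothetical names, not Lustre functions.

```c
#include <stdint.h>

#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_MDS_VERSION  0x00020000

/* Combine a role constant (upper two bytes) with the PtlRPC version. */
static uint32_t make_pb_version(uint32_t role)
{
    return (role & LUSTRE_VERSION_MASK) | PTLRPC_MSG_VERSION;
}

/* Upper two bytes: the role of the service being requested. */
static uint32_t pb_version_role(uint32_t pb_version)
{
    return pb_version & LUSTRE_VERSION_MASK;
}

/* Lower two bytes: the PtlRPC version used by the message. */
static uint32_t pb_version_ptlrpc(uint32_t pb_version)
{
    return pb_version & ~(uint32_t)LUSTRE_VERSION_MASK;
}
```

Under that assumption a request to an MDS service would carry
0x00020003 in 'pb_version'.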
The 'pb_opc' value (operation code) gives the actual Lustre operation
that is the subject of this message. For example, MDS_CONNECT is a
Lustre operation (number 38). The following list gives the name used
and the value for each operation.

<<ost-setattr-rpc,OST_SETATTR>> = 2,
<<ost-connect-rpc,OST_CONNECT>> = 8,
<<ost-connect-rpc,OST_DISCONNECT>> = 9,
OST_QUOTA_ADJUST_QUNIT = 20,
<<mds-getattr-rpc,MDS_GETATTR>> = 33,
MDS_GETATTR_NAME = 34,
<<mds-connect-rpc,MDS_CONNECT>> = 38,
<<mds-disconnect,MDS_DISCONNECT>> = 39,
<<mds-getstatus-rpc,MDS_GETSTATUS>> = 40,
<<mds-statfs-rpc,MDS_STATFS>> = 41,
MDS_DONE_WRITING = 45,
<<mds-getxattr-rpc,MDS_GETXATTR>> = 49,
MDS_HSM_STATE_GET = 54,
MDS_HSM_STATE_SET = 55,
MDS_HSM_PROGRESS = 57,
MDS_HSM_REQUEST = 58,
MDS_HSM_CT_REGISTER = 59,
MDS_HSM_CT_UNREGISTER = 60,
MDS_SWAP_LAYOUTS = 61,
<<ldlm-enqueue-rpc,LDLM_ENQUEUE>> = 101,
<<ldlm-cancel-rpc,LDLM_CANCEL>> = 103,
<<ldlm_bl_callback-rpc,LDLM_BL_CALLBACK>> = 104,
<<ldlm-cp-callback-rpc,LDLM_CP_CALLBACK>> = 105,
LDLM_GL_CALLBACK = 106,
<<mgs-connect-rpc,MGS_CONNECT>> = 250,
<<mgs-disconnect-rpc,MGS_DISCONNECT>> = 251,
MGS_TARGET_REG = 253,
MGS_TARGET_DEL = 254,
<<mgs-config-read-rpc,MGS_CONFIG_READ>> = 256,
OBD_LOG_CANCEL = 401,
OBD_QC_CALLBACK = 402,
<<llog-origin-handle-create-rpc,LLOG_ORIGIN_HANDLE_CREATE>> = 501,
<<llog-origin-handle-next-block,LLOG_ORIGIN_HANDLE_NEXT_BLOCK>> = 502,
<<llog-origin-handle-read-header,LLOG_ORIGIN_HANDLE_READ_HEADER>> = 503,
LLOG_ORIGIN_HANDLE_WRITE_REC = 504,
LLOG_ORIGIN_HANDLE_CLOSE = 505,
LLOG_ORIGIN_CONNECT = 506,
LLOG_ORIGIN_HANDLE_PREV_BLOCK = 508,
LLOG_ORIGIN_HANDLE_DESTROY = 509,
SEC_CTX_INIT_CONT = 802,
The symbols and values above identify the operations Lustre uses in
its protocol. They are examined in detail in the
<<lustre-prcs,Lustre RPCs>> section. Lustre carries out
each of these operations via the exchange of a pair of messages: a
The 'pb_status' field was already mentioned above in conjunction with
the 'pb_type' field in replies. In a request message 'pb_status' is
set to the 'pid' of the process making the request. In a reply
message, a zero indicates that the service successfully initiated the
requested operation. When an error is being reported the value will
encode a standard Linux kernel (POSIX) error code as initially
defined for the i386/x86_64 architecture. The 'pb_status' value is
returned as a negative number, so for example, a permissions error
would be indicated as -EPERM.

'pb_last_xid' and 'pb_last_seen' are not used.
The 'pb_last_committed' value is always zero in a request. In a reply
it is the highest transaction number that has been committed to
storage. The transaction numbers are maintained on a per-target basis
and each series of transaction numbers is a strictly increasing
sequence for modifications originating from any client. This field is
set in any kind of reply message including pings and non-modifying
transactions. If 'pb_last_committed' is larger than, or equal to, the
transaction number of any of the client's uncommitted requests (see
'pb_transno' below) then the server is confirming those requests have
been committed to stable storage. At that point the client may free
those request structures as they will no longer be necessary for
replay during recovery.
The 'pb_transno' value is always zero in a new request. It is also
zero for replies to operations that do not modify the file system. For
replies to operations that do modify the file system it is the
target-unique, server-assigned transaction number for the client
request. The 'pb_transno' assigned to each modifying request is in
strictly increasing order, but may not be sequential for a single
client, and the client may receive replies in a different order than
they were processed by the server. Upon receipt of the reply, the
client copies this transaction number from 'pb_transno' of the reply
to 'pb_transno' of the saved request. If 'pb_transno' is larger than
'pb_last_committed' (above) then the request has only been processed at
the target and is not yet committed to stable storage. The client
must save the request for later replay to the server in case the
target fails before the modification can be committed to disk. If the
request has to be replayed it will include the transaction number.
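The interplay of 'pb_transno' and 'pb_last_committed' can be
summarized in a few lines of client-side bookkeeping. This is a
sketch under the rules stated above, not the Lustre client code: the
'saved_request' structure and both functions are hypothetical and
keep only the one field the comparison needs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a request the client keeps for replay. */
struct saved_request {
    uint64_t transno; /* copied from pb_transno of the reply */
};

/*
 * Record the reply's transaction number in the saved request.
 * Returns true if the request must be retained for possible replay,
 * i.e. it modified the file system and is not yet committed.
 */
static bool record_reply(struct saved_request *req,
                         uint64_t reply_transno,
                         uint64_t reply_last_committed)
{
    req->transno = reply_transno;
    return reply_transno != 0 && reply_transno > reply_last_committed;
}

/*
 * Any later reply (including a ping) carries pb_last_committed.
 * Once it reaches the saved transno the request may be freed.
 */
static bool can_free_request(const struct saved_request *req,
                             uint64_t last_committed)
{
    return req->transno <= last_committed;
}
```

A request with a transaction number of zero did not modify the file
system, so in this sketch it is never retained.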
The 'pb_flags' value governs the client state machine. Fixme: document
what the states and transitions are of this state machine. Currently,
only the bottom two bytes are used, and they encode state according to
the following values:

#define MSG_GEN_FLAG_MASK    0x0000ffff
#define MSG_LAST_REPLAY      0x0001
#define MSG_RESENT           0x0002
#define MSG_REPLAY           0x0004
#define MSG_DELAY_REPLAY     0x0010
#define MSG_VERSION_REPLAY   0x0020
#define MSG_REQ_REPLAY_DONE  0x0040
#define MSG_LOCK_REPLAY_DONE 0x0080
MSG_LAST_REPLAY is currently unused. It had been used to indicate that
this is the last RPC request to be replayed by this client during
recovery. MSG_LAST_REPLAY has been replaced by MSG_REQ_REPLAY_DONE
and MSG_LOCK_REPLAY_DONE.

MSG_RESENT is set when this RPC request is being resent because no

MSG_REPLAY indicates this RPC request is being replayed after the
client received a reply but before the operation was committed to
storage. The 'pb_transno' field holds the server-assigned transaction
number.

MSG_DELAY_REPLAY is currently unused.

MSG_VERSION_REPLAY indicates that a replayed request has
pb_pre_versions[] filled with the prior object versions and can be
used with Version Based Recovery.

MSG_LOCK_REPLAY_DONE indicates the client has completed lock replay,
and is ready to finish recovery.
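Because the flags are independent bits, a receiver would typically
mask 'pb_flags' and test individual bits. The helpers below are a
hypothetical sketch of that pattern using the constants listed above;
none of the function names come from Lustre.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSG_GEN_FLAG_MASK  0x0000ffff
#define MSG_RESENT         0x0002
#define MSG_REPLAY         0x0004
#define MSG_VERSION_REPLAY 0x0020

/* Keep only the bottom two bytes, the part currently in use. */
static uint32_t general_flags(uint32_t pb_flags)
{
    return pb_flags & MSG_GEN_FLAG_MASK;
}

static bool is_resent(uint32_t pb_flags)
{
    return (general_flags(pb_flags) & MSG_RESENT) != 0;
}

static bool is_replay(uint32_t pb_flags)
{
    return (general_flags(pb_flags) & MSG_REPLAY) != 0;
}

/* A replay whose pb_pre_versions[] can be used for Version Based Recovery. */
static bool is_version_replay(uint32_t pb_flags)
{
    uint32_t wanted = MSG_REPLAY | MSG_VERSION_REPLAY;

    return (general_flags(pb_flags) & wanted) == wanted;
}
```

Requiring MSG_REPLAY alongside MSG_VERSION_REPLAY in the last helper
is an assumption drawn from the wording "a replayed request" above.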
The 'pb_op_flags' values are specific to a particular 'pb_opc', but
are currently only used by the *_CONNECT RPCs. The 'pb_op_flags' value
for connect operations governs the client connection status state

#define MSG_CONNECT_RECOVERING 0x00000001
#define MSG_CONNECT_RECONNECT  0x00000002
#define MSG_CONNECT_REPLAYABLE 0x00000004
#define MSG_CONNECT_LIBCLIENT  0x00000010
#define MSG_CONNECT_INITIAL    0x00000020
#define MSG_CONNECT_ASYNC      0x00000040
#define MSG_CONNECT_NEXT_VER   0x00000080
#define MSG_CONNECT_TRANSNO    0x00000100
MSG_CONNECT_RECOVERING indicates the server is in recovery.

MSG_CONNECT_RECONNECT indicates the client is reconnecting after
non-responsiveness from the server.

MSG_CONNECT_REPLAYABLE indicates the server connection supports RPC
replay (only OSTs support non-recoverable connections, but that is not

MSG_CONNECT_LIBCLIENT is for a 'liblustre' client. It is

The client sends MSG_CONNECT_INITIAL the first time the client is
connecting to the server. MSG_CONNECT_INITIAL connections are not
allowed during server recovery.

MSG_CONNECT_ASYNC is currently unused.
MSG_CONNECT_NEXT_VER indicates that the client can understand the next
higher protocol version in addition to the currently used protocol. The
server can reply to the connect with that higher RPC version if
it is supported, otherwise it will reply with the
same RPC version as the request. This allows RPC protocol versions to
be negotiated during a transition period (e.g. an upgrade from
LUSTRE_MSG_MAGIC_V1 to LUSTRE_MSG_MAGIC_V2).
In normal operation an initial request to connect will set
'pb_op_flags' to MSG_CONNECT_INITIAL (in some earlier versions
MSG_CONNECT_NEXT_VER was mistakenly included, though it did no
harm). The reply to that connection request (and all other,
non-connect, requests and replies) will set 'pb_op_flags' to 0.
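Two of the rules above lend themselves to a short sketch: an initial
connect sets only MSG_CONNECT_INITIAL, and a server in recovery does
not accept initial connections. The function names and the
'server_in_recovery' parameter are hypothetical; how a real server
tracks its recovery state is not described here.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSG_CONNECT_RECONNECT 0x00000002
#define MSG_CONNECT_INITIAL   0x00000020

/* pb_op_flags for a client's very first connect to a target. */
static uint32_t initial_connect_op_flags(void)
{
    return MSG_CONNECT_INITIAL;
}

/*
 * Server-side admission check: initial connections are refused while
 * the server is recovering; other connect requests pass this check.
 */
static bool connect_allowed(uint32_t pb_op_flags, bool server_in_recovery)
{
    bool is_initial = (pb_op_flags & MSG_CONNECT_INITIAL) != 0;

    return !(server_in_recovery && is_initial);
}
```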
The 'pb_conn_cnt' (connection count) value in a request message
reports the client's "era", which is part of the client and server's
shared state. The value of the era is initialized to one when the
client first connects to the MDT. Each subsequent connection (after an
eviction) increments the era for the client. Since the 'pb_conn_cnt'
reflects the client's era at the time the message was composed the
server can use this value to discard late-arriving messages requesting
operations on out-of-date shared state.
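A server-side staleness check based on the era could look like the
sketch below. It assumes the server remembers the era of the client's
most recent connection and treats any request composed in an earlier
era as stale; the function name and that exact comparison are
assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * msg_conn_cnt:     pb_conn_cnt carried by the incoming request.
 * current_conn_cnt: era of the client's latest connection, as
 *                   remembered by the server (hypothetical state).
 */
static bool is_stale_request(uint32_t msg_conn_cnt,
                             uint32_t current_conn_cnt)
{
    return msg_conn_cnt < current_conn_cnt;
}
```

For example, a request composed in era 1 that arrives after the
client has reconnected into era 2 would be discarded.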
The 'pb_timeout' value in a request indicates how long (in seconds)
the requester plans to wait before timing out the operation. That is,
the corresponding reply for this message should arrive within this
time frame. The service may extend this time frame via an "early
reply", which is a reply to this message that notifies the requester
that it should extend its timeout interval by the value of the
'pb_timeout' field in the reply. The "early reply" does not indicate
the operation has actually been initiated. Clients maintain multiple
request queues, called "portals", and each type of operation is
assigned to one of these queues. There is a timeout value associated
with each queue, and the timeout update affects all the messages
associated with the given queue, not just the specific message that
initiated the request. Finally, in a reply message (one that does
indicate the operation has been initiated) the timeout value updates
the timeout interval for the queue. Is this last point different from
the "early reply" update?
The 'pb_service_time' value is zero in a request. In a reply it
indicates how long this particular operation actually took from the
time it first arrived in the request queue (at the service) to the
time the server replied. Note that the client can use this value and
the local elapsed time for the operation to calculate network latency.
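The latency calculation mentioned above amounts to a subtraction. The
sketch below assumes both quantities are expressed in the same unit
(seconds is assumed here, by analogy with 'pb_timeout') and that the
result is the combined request and reply transit time; the function
name is hypothetical.

```c
#include <stdint.h>

/*
 * elapsed:         time the client measured between sending the
 *                  request and receiving the reply.
 * pb_service_time: time the server reports having spent on it.
 * Returns the estimated time spent on the network in both
 * directions, clamped at zero in case of clock granularity effects.
 */
static uint32_t estimate_network_latency(uint32_t elapsed,
                                         uint32_t pb_service_time)
{
    if (elapsed <= pb_service_time)
        return 0;
    return elapsed - pb_service_time;
}
```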
The 'pb_limit' value is zero in a request. In a reply it is a value
sent from a lock service to a client to set the maximum number of
locks available to the client. When dynamic lock LRUs are enabled
this allows for managing the size of the LRU.
The 'pb_slv' value is zero in a request. On a DLM service, the "server
lock volume" is a value that characterizes (estimates) the amount of
traffic, or load, on that lock service. It is calculated as the
product of the number of locks and their age. In a reply, the 'pb_slv'
value indicates to the client the available share of the total lock
load on the server that the client is allowed to consume. The client
is then responsible for reducing its number (or age) of locks to
stay within this limit.
The array of 'pb_pre_versions' values has four entries. They are
always zero in a new request message. They are also zero in replies to
operations that do not modify the file system. For an operation that
does modify the file system, the reply encodes the most recent
transaction numbers for the objects modified by this operation, and
the 'pb_pre_versions' values are copied into the original request when
the reply arrives. If the request needs to be replayed then the
updated 'pb_pre_versions' values accompany the replayed request.
'pb_padding' is reserved for future use.

The 'pb_jobid' (string) value gives a unique identifier associated
with the process on behalf of which this message was generated. The
identifier is assigned to the user process by a job scheduler, if any.
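Since 'pb_jobid' is a fixed-size character array, a sender has to fit
the scheduler's identifier into JOBSTATS_JOBID_SIZE bytes. The sketch
below zero-fills the field and truncates over-long identifiers so
that the result is always NUL-terminated. Both the truncation policy
and the 'set_jobid' function are assumptions made for illustration,
not a description of how Lustre fills the field.

```c
#include <stddef.h>
#include <string.h>

#define JOBSTATS_JOBID_SIZE 32

/*
 * Copy a job identifier into a pb_jobid-sized buffer.
 * The buffer is cleared first, and at most JOBSTATS_JOBID_SIZE - 1
 * characters are copied, leaving room for the terminating NUL.
 */
static void set_jobid(char dst[JOBSTATS_JOBID_SIZE], const char *jobid)
{
    size_t i;

    memset(dst, 0, JOBSTATS_JOBID_SIZE);
    for (i = 0; i < JOBSTATS_JOBID_SIZE - 1 && jobid[i] != '\0'; i++)
        dst[i] = jobid[i];
}
```

When no job scheduler is in use, passing an empty string leaves the
field entirely zeroed in this sketch.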