Ptlrpc_body - The Lustre RPC Descriptor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[[struct-ptlrpc-body]]

Every Lustre message starts with both the above header and an additional set of fields (in its first "buffer") given by the 'struct ptlrpc_body_v3' structure. This preamble has information relevant to every RPC type. In particular, the RPC type is itself encoded in the 'pb_opc' Lustre operation number. The value of that opcode, as well as whether it is an RPC 'request' or 'reply', determines what else will be in the message following the preamble.

----
#define PTLRPC_NUM_VERSIONS 4
#define JOBSTATS_JOBID_SIZE 32
struct ptlrpc_body {
        struct lustre_handle pb_handle;
        __u32 pb_type;
        __u32 pb_version;
        __u32 pb_opc;
        __u32 pb_status;
        __u64 pb_last_xid;
        __u64 pb_last_seen;
        __u64 pb_last_committed;
        __u64 pb_transno;
        __u32 pb_flags;
        __u32 pb_op_flags;
        __u32 pb_conn_cnt;
        __u32 pb_timeout;
        __u32 pb_service_time;
        __u32 pb_limit;
        __u64 pb_slv;
        __u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
        __u64 pb_padding[4];
        char pb_jobid[JOBSTATS_JOBID_SIZE];
};
----

In a connection request, sent by a client to a server and regarding a specific target, the 'pb_handle' is 0. In the reply to a connection request, sent by the target, the handle is a value uniquely identifying the target. Subsequent messages between this client and this target will use this handle to gain access to their shared state. The handle is persistent across client reconnects to the same instance of the server, but if the client unmounts the filesystem or is evicted then it must re-connect as a new client.

The 'pb_type' is PTL_RPC_MSG_REQUEST in messages when they are initiated, it is PTL_RPC_MSG_REPLY in a reply, and it is PTL_RPC_MSG_ERR in a reply to convey that a message was received that could not be interpreted, that is, if it was corrupt or incomplete.
The encoding of those type values is given by:

----
#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR 4712
#define PTL_RPC_MSG_REPLY 4713
----

The 'pb_type' = PTL_RPC_MSG_ERR is only for message handling errors. This may be a message that failed to be interpreted as an actual message, or it may indicate some more general problem with the connection between the client and the target. Note that other errors, such as those that emerge from processing the actual message content, do not use the PTL_RPC_MSG_ERR type.

The 'pb_status' field provides an error return code, if any, for the RPC. When 'pb_type' = PTL_RPC_MSG_ERR the 'pb_status' will also be set to one of the following message handling errors:

.format
[options="header"]
|====
| pb_type | pb_status | condition
| PTL_RPC_MSG_ERR | -ENOMEM | No memory for reply buffer
| PTL_RPC_MSG_ERR | -ENOTSUPP | Invalid opcode
| PTL_RPC_MSG_ERR | -EINVAL | Bad magic or version
| PTL_RPC_MSG_ERR | -EPROTO | Request is malformed or cannot be processed in current context
|====

A PTL_RPC_MSG_ERR message does not need to allocate memory, so it should normally be sent as a reply even if there is not enough memory to allocate the normal reply buffer, unless the underlying network transport itself cannot allocate memory to send it. (fixme: and what happens then?)

In most cases there is a reply with 'pb_type' = PTL_RPC_MSG_REPLY, indicating that the request was processed, but it may still have 'pb_status' set to a non-zero value to indicate that the request encountered an error during processing (see below). This may indicate something very specific to the particular RPC, but it may also be a very general sort of error.
Those that are specific to particular RPCs will be documented with the respective RPCs, and those that are more generic are listed here:

.format
[options="header"]
|====
| pb_type | pb_status | meaning
| PTL_RPC_MSG_REPLY | -ENOTCONN | Client is not connected to the target, typically meaning the server was restarted or the client was evicted, and the client needs to reconnect.
| PTL_RPC_MSG_REPLY | -EINPROGRESS | The request cannot be processed currently due to some other factor, such as during initial mount, a delay contacting the quota master during a write, or LFSCK rebuilding the OI table, but the client should continue to retry after a delay until interrupted or successful. This avoids blocking the server threads with client requests that cannot currently be processed, but other requests might be processed in the meantime.
| PTL_RPC_MSG_REPLY | -ESHUTDOWN | The server is being stopped and no new connections are allowed.
|====

The significance of -ENOTCONN is discussed more fully in <>, but a brief comment may be useful here. The networking layers supporting the exchange of RPCs can be in good working order when 'pb_status' = -ENOTCONN is returned in an RPC reply message. The connection referred to by that status is the Lustre connection. That connection is part of the shared state between Lustre clients and servers that gets established via MDS_CONNECT and OST_CONNECT RPCs, and can be lost due to an 'eviction'. So, even when that Lustre connection is lost (or has not been established yet), RPC messages can be exchanged.

The 'pb_version' identifies the version of the Lustre protocol and is derived from the following constants. The lower two bytes give the version of PtlRPC being employed in the message, and the upper two bytes encode the role of the host for the service being requested. That role is one of OBD, MDS, OST, DLM, LOG, or MGS.

----
#define PTLRPC_MSG_VERSION 0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_OBD_VERSION 0x00010000
#define LUSTRE_MDS_VERSION 0x00020000
#define LUSTRE_OST_VERSION 0x00030000
#define LUSTRE_DLM_VERSION 0x00040000
#define LUSTRE_LOG_VERSION 0x00050000
#define LUSTRE_MGS_VERSION 0x00060000
----

The 'pb_opc' value (operation code) gives the actual Lustre operation that is the subject of this message. For example, MDS_CONNECT is a Lustre operation (number 38). The following list gives the name used and the value for each operation.

----
typedef enum {
        OST_REPLY = 0,
        OST_GETATTR = 1,
        <> = 2,
        OST_READ = 3,
        OST_WRITE = 4,
        OST_CREATE = 5,
        OST_DESTROY = 6,
        OST_GET_INFO = 7,
        <> = 8,
        <> = 9,
        OST_PUNCH = 10,
        OST_OPEN = 11,
        OST_CLOSE = 12,
        OST_STATFS = 13,
        OST_SYNC = 16,
        OST_SET_INFO = 17,
        OST_QUOTACHECK = 18,
        OST_QUOTACTL = 19,
        OST_QUOTA_ADJUST_QUNIT = 20,
        <> = 33,
        MDS_GETATTR_NAME = 34,
        MDS_CLOSE = 35,
        MDS_REINT = 36,
        MDS_READPAGE = 37,
        <> = 38,
        <> = 39,
        <> = 40,
        > = 41,
        MDS_PIN = 42,
        MDS_UNPIN = 43,
        MDS_SYNC = 44,
        MDS_DONE_WRITING = 45,
        MDS_SET_INFO = 46,
        MDS_QUOTACHECK = 47,
        MDS_QUOTACTL = 48,
        <> = 49,
        MDS_SETXATTR = 50,
        MDS_WRITEPAGE = 51,
        MDS_IS_SUBDIR = 52,
        MDS_GET_INFO = 53,
        MDS_HSM_STATE_GET = 54,
        MDS_HSM_STATE_SET = 55,
        MDS_HSM_ACTION = 56,
        MDS_HSM_PROGRESS = 57,
        MDS_HSM_REQUEST = 58,
        MDS_HSM_CT_REGISTER = 59,
        MDS_HSM_CT_UNREGISTER = 60,
        MDS_SWAP_LAYOUTS = 61,
        <> = 101,
        LDLM_CONVERT = 102,
        > = 103,
        > = 104,
        > = 105,
        LDLM_GL_CALLBACK = 106,
        LDLM_SET_INFO = 107,
        <> = 250,
        <> = 251,
        MGS_EXCEPTION = 252,
        MGS_TARGET_REG = 253,
        MGS_TARGET_DEL = 254,
        MGS_SET_INFO = 255,
        <> = 256,
        OBD_PING = 400,
        OBD_LOG_CANCEL = 401,
        OBD_QC_CALLBACK = 402,
        OBD_IDX_READ = 403,
        <> = 501,
        <> = 502,
        <> = 503,
        LLOG_ORIGIN_HANDLE_WRITE_REC = 504,
        LLOG_ORIGIN_HANDLE_CLOSE = 505,
        LLOG_ORIGIN_CONNECT = 506,
        LLOG_ORIGIN_HANDLE_PREV_BLOCK = 508,
        LLOG_ORIGIN_HANDLE_DESTROY = 509,
        QUOTA_DQACQ = 601,
        QUOTA_DQREL = 602,
        SEQ_QUERY = 700,
        SEC_CTX_INIT = 801,
        SEC_CTX_INIT_CONT = 802,
        SEC_CTX_FINI = 803,
        FLD_QUERY = 900,
        FLD_READ = 901,
        UPDATE_OBJ = 1000
} cmd_t;
----

The symbols and values above identify the operations Lustre uses in its protocol. They are examined in detail in the <> section. Lustre carries out each of these operations via the exchange of a pair of messages: a request and a reply. The details of each message are specific to each operation. The <> chapter discusses each message and its contents.

The 'pb_status' field was already mentioned above in conjunction with the 'pb_type' field in replies. In a request message 'pb_status' is set to the 'pid' of the process making the request. In a reply message, a zero indicates that the service successfully initiated the requested operation. When an error is being reported the value will encode a standard Linux kernel (POSIX) error code as initially defined for the i386/x86_64 architecture. The 'pb_status' value is returned as a negative number, so for example, a permissions error would be indicated as -EPERM.

'pb_last_xid' and 'pb_last_seen' are not used.

The 'pb_last_committed' value is always zero in a request. In a reply it is the highest transaction number that has been committed to storage. The transaction numbers are maintained on a per-target basis and each series of transaction numbers is a strictly increasing sequence for modifications originating from any client. This field is set in any kind of reply message including pings and non-modifying transactions. If 'pb_last_committed' is larger than, or equal to, the transaction number of any of the client's uncommitted requests (see 'pb_transno' below) then the server is confirming those requests have been committed to stable storage. At that point the client will free the request structures.

The 'pb_transno' value is always zero in a new request. It is also zero for replies to operations that do not modify the file system.
For replies to operations that do modify the file system it is the target-unique, server-assigned transaction number for the client request. The 'pb_transno' assigned to each modifying request is in strictly increasing order, but may not be sequential for a single client, and the client may receive replies in a different order than they were processed by the server. Upon receipt of the reply, the client copies this transaction number from 'pb_transno' of the reply to 'pb_transno' of the saved request. If 'pb_transno' is larger than 'pb_last_committed' (above) then the request has only been processed at the target and is not yet committed to stable storage. The client must save the request for later resend to the server in case the target fails before the modification can be committed to disk. If the request has to be replayed it will include the transaction number.

The 'pb_flags' value governs the client state machine. Fixme: document what the states and transitions are of this state machine. Currently, only the bottom two bytes are used, and they encode state according to the following values:

----
#define MSG_GEN_FLAG_MASK 0x0000ffff
#define MSG_LAST_REPLAY 0x0001
#define MSG_RESENT 0x0002
#define MSG_REPLAY 0x0004
#define MSG_DELAY_REPLAY 0x0010
#define MSG_VERSION_REPLAY 0x0020
#define MSG_REQ_REPLAY_DONE 0x0040
#define MSG_LOCK_REPLAY_DONE 0x0080
----

MSG_LAST_REPLAY is currently unused. It had been used to indicate that this is the last RPC request to be replayed by this client during recovery. MSG_LAST_REPLAY has been replaced by MSG_REQ_REPLAY_DONE and MSG_LOCK_REPLAY_DONE.

MSG_RESENT is set when this RPC request is being resent because no reply was received.

MSG_REPLAY indicates this RPC request is being replayed after the client received a reply but before it was committed to storage. The 'pb_transno' field holds the server-assigned transaction number.

MSG_DELAY_REPLAY is currently unused.

MSG_VERSION_REPLAY indicates that a replayed request has 'pb_pre_versions[]' filled with the prior object versions and can be used with Version Based Recovery.

MSG_LOCK_REPLAY_DONE indicates the client has completed lock replay, and is ready to finish recovery.

The 'pb_op_flags' values are specific to a particular 'pb_opc', but are currently only used by the *_CONNECT RPCs. The 'pb_op_flags' value for connect operations governs the client connection status state machine.

----
#define MSG_CONNECT_RECOVERING 0x00000001
#define MSG_CONNECT_RECONNECT 0x00000002
#define MSG_CONNECT_REPLAYABLE 0x00000004
#define MSG_CONNECT_LIBCLIENT 0x00000010
#define MSG_CONNECT_INITIAL 0x00000020
#define MSG_CONNECT_ASYNC 0x00000040
#define MSG_CONNECT_NEXT_VER 0x00000080
#define MSG_CONNECT_TRANSNO 0x00000100
----

MSG_CONNECT_RECOVERING indicates the server is in recovery.

MSG_CONNECT_RECONNECT indicates the client is reconnecting after non-responsiveness from the server.

MSG_CONNECT_REPLAYABLE indicates the server connection supports RPC replay (only OSTs support non-recoverable connections, but that is not the default).

MSG_CONNECT_LIBCLIENT is for a 'liblustre' client. It is currently unused.

The client sends MSG_CONNECT_INITIAL the first time it connects to the server. MSG_CONNECT_INITIAL connections are not allowed during server recovery.

MSG_CONNECT_ASYNC is currently unused.

MSG_CONNECT_NEXT_VER indicates that the client can understand the next higher protocol version, and the server can reply to the connect with that RPC version if it is supported, otherwise it will reply with the same RPC version as the request. This allows RPC protocol versions to be negotiated during a transition period (e.g. an upgrade of the RPC protocol from LUSTRE_MSG_MAGIC_V1 to LUSTRE_MSG_MAGIC_V2).

In normal operation an initial request to connect will set 'pb_op_flags' to MSG_CONNECT_INITIAL (in some earlier versions MSG_CONNECT_NEXT_VER was mistakenly included, though it did no harm).
The reply to that connection request (and all other, non-connect, requests and replies) will set 'pb_op_flags' to 0.

The 'pb_conn_cnt' (connection count) value in a request message reports the client's "era", which is part of the client and server's shared state. The value of the era is initialized to one when the client is first connected to the MDT. Each subsequent connection (after an eviction) increments the era for the client. Since the 'pb_conn_cnt' reflects the client's era at the time the message was composed, the server can use this value to discard late-arriving messages requesting operations on out-of-date shared state.

The 'pb_timeout' value in a request indicates how long (in seconds) the requester plans to wait before timing out the operation. That is, the corresponding reply for this message should arrive within this time frame. The service may extend this time frame via an "early reply", which is a reply to this message that notifies the requester that it should extend its timeout interval by the value of the 'pb_timeout' field in the reply. The "early reply" does not indicate the operation has actually been initiated. Clients maintain multiple request queues, called "portals", and each type of operation is assigned to one of these queues. There is a timeout value associated with each queue, and the timeout update affects all the messages associated with the given queue, not just the specific message that initiated the request. Finally, in a reply message (one that does indicate the operation has been initiated) the timeout value updates the timeout interval for the queue. Is this last point different from the "early reply" update?

The 'pb_service_time' value is zero in a request. In a reply it indicates how long this particular operation actually took from the time it first arrived in the request queue (at the service) to the time the server replied.
Note that the client can use this value and the local elapsed time for the operation to calculate network latency.

The 'pb_limit' value is zero in a request. In a reply it is a value sent from a lock service to a client to set the maximum number of locks available to the client. When dynamic lock LRUs are enabled this allows for managing the size of the LRU.

The 'pb_slv' value is zero in a request. On a DLM service, the "server lock volume" is a value that characterizes (estimates) the amount of traffic, or load, on that lock service. It is calculated as the product of the number of locks and their age. In a reply, the 'pb_slv' value indicates to the client the available share of the total lock load on the server that the client is allowed to consume. The client is then responsible for reducing its number (or age) of locks to stay within this limit.

The array of 'pb_pre_versions' values has four entries. They are always zero in a new request message. They are also zero in replies to operations that do not modify the file system. For an operation that does modify the file system, the reply encodes the most recent transaction numbers for the objects modified by this operation, and the 'pb_pre_versions' values are copied into the original request when the reply arrives. If the request needs to be replayed then the updated 'pb_pre_versions' values accompany the replayed request.

'pb_padding' is reserved for future use.

The 'pb_jobid' (string) value gives a unique identifier associated with the process on behalf of which this message was generated. The identifier is assigned to the user process by a job scheduler, if any.
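
Several of the field rules in this section reduce to simple arithmetic on the preamble values: splitting 'pb_version' into its role and PtlRPC version, comparing a saved request's 'pb_transno' with a reply's 'pb_last_committed', and subtracting 'pb_service_time' from the locally measured elapsed time. The following is a minimal user-space sketch of those rules, not code from the Lustre source. The function names (`pb_version_make`, `pb_role`, `pb_ptlrpc_version`, `request_is_committed`, `estimate_network_latency`) are hypothetical, the `<stdint.h>` types stand in for the kernel `__u32`/`__u64` types, and the latency helper assumes both times are expressed in the same unit.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from the definitions quoted in this section. */
#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_MDS_VERSION  0x00020000

/* Hypothetical helper: combine a role constant (upper two bytes) with
 * the PtlRPC message version (lower two bytes). */
uint32_t pb_version_make(uint32_t role)
{
	/* A role constant must not carry bits in the lower two bytes. */
	assert((role & ~(uint32_t)LUSTRE_VERSION_MASK) == 0);
	return role | PTLRPC_MSG_VERSION;
}

/* Hypothetical helper: extract the role from the upper two bytes. */
uint32_t pb_role(uint32_t pb_version)
{
	return pb_version & (uint32_t)LUSTRE_VERSION_MASK;
}

/* Hypothetical helper: extract the PtlRPC version from the lower two bytes. */
uint32_t pb_ptlrpc_version(uint32_t pb_version)
{
	return pb_version & ~(uint32_t)LUSTRE_VERSION_MASK;
}

/* Hypothetical helper: returns 1 when the reply's pb_last_committed has
 * reached the saved request's pb_transno, so the saved request no longer
 * needs to be kept for replay; returns 0 otherwise. */
int request_is_committed(uint64_t req_transno, uint64_t last_committed)
{
	return req_transno <= last_committed;
}

/* Hypothetical helper: estimate network latency as the locally measured
 * elapsed time minus the server-reported pb_service_time. Assumes both
 * values use the same unit; clamps at zero if the clocks disagree. */
uint32_t estimate_network_latency(uint32_t elapsed, uint32_t service_time)
{
	return elapsed > service_time ? elapsed - service_time : 0;
}
```

For instance, under these assumptions `pb_role(pb_version_make(LUSTRE_MDS_VERSION))` yields `LUSTRE_MDS_VERSION`, and a saved request with transaction number 8 is still uncommitted when a reply carries a 'pb_last_committed' of 7.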