Ptlrpc_body - The Lustre RPC Descriptor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[[struct-ptlrpc-body]]

Every Lustre message starts with a header (see <>) describing a few generic details about the RPC, the most important of which are the number and sizes of the additional "buffers" that follow. The buffers organize subsections of the RPC, and each comprises a set of fields. The fields of a buffer are presented here as 'struct' definitions because that is how they are actually organized, and because it provides a convenient way to refer to groups of fields that appear together in many different RPCs.

The first buffer in every RPC is given by the 'ptlrpc_body' structure, also known as the RPC descriptor. This descriptor identifies the type of the RPC via a Lustre operation code. The value of that 'opcode', together with whether the message is an RPC 'request' or 'reply', determines what other buffers follow the descriptor in the message.

[source,c]
----
struct ptlrpc_body {
    struct lustre_handle pb_handle;
    __u32 pb_type;
    __u32 pb_version;
    __u32 pb_opc;
    __u32 pb_status;
    __u64 pb_last_xid;
    __u64 pb_last_seen;
    __u64 pb_last_committed;
    __u64 pb_transno;
    __u32 pb_flags;
    __u32 pb_op_flags;
    __u32 pb_conn_cnt;
    __u32 pb_timeout;
    __u32 pb_service_time;
    __u32 pb_limit;
    __u64 pb_slv;
    __u64 pb_pre_versions[4];
    __u64 pb_padding[4];
    char  pb_jobid[32];
};
----

include::struct_lustre_handle.txt[]

The 'pb_handle' field of the RPC descriptor is a 'lustre_handle'. In a connection request RPC (MGS_CONNECT, MDS_CONNECT, or OST_CONNECT), sent by a client to a target, 'pb_handle' is 0. In the reply to that connection request 'pb_handle' is set to a new 'lustre_handle' that uniquely identifies the target (indeed, it uniquely identifies the 'export', since each client gets a unique 'lustre_handle' for a given target). Subsequent request messages sent by this client to this target set 'pb_handle' to that 'lustre_handle' in order to gain access to their shared state.
In subsequent RPC reply messages (after the *_CONNECT reply) the 'pb_handle' field is 0. The 'lustre_handle' is persistent across client reconnects to the same instance of the server, but if the client unmounts the filesystem or is evicted then it must reconnect as a new client, with a new 'lustre_handle'.

The 'pb_type' is PTL_RPC_MSG_REQUEST in a message when it is initiated, PTL_RPC_MSG_REPLY in a reply, and PTL_RPC_MSG_ERR in a reply conveying that a message was received that could not be interpreted, that is, that it was corrupt or incomplete. The encoding of those type values is given by:

[source,c]
----
#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR     4712
#define PTL_RPC_MSG_REPLY   4713
----

The 'pb_type' = PTL_RPC_MSG_ERR is only for message handling errors. This may be a message that failed to be interpreted as an actual message, or it may indicate some more general problem with the connection between the client and the target. Note that other errors, such as those that emerge from processing the actual message content, do not use the PTL_RPC_MSG_ERR type.

The 'pb_status' field provides an error return code for the RPC. When 'pb_type' = PTL_RPC_MSG_ERR the 'pb_status' will also be set to one of the following message handling errors:

.'pb_type' = PTL_RPC_MSG_ERR status return codes
[options="header"]
|====
| pb_status | condition
| -ENOMEM   | No memory for reply buffer
| -ENOTSUPP | Invalid opcode
| -EINVAL   | Bad magic or version
| -EPROTO   | Request is malformed or cannot be processed in current context
|====

A PTL_RPC_MSG_ERR message does not need to allocate memory, so it should normally be sent as a reply even if there is not enough memory to allocate the normal reply buffer.

In most cases there is a reply with 'pb_type' = PTL_RPC_MSG_REPLY, indicating that the request was processed, but it may still have 'pb_status' set to a non-zero value to indicate that the request encountered an error during processing (see below).
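
A receiver's first-level handling of the three message types can be sketched as follows. This is a minimal illustration and not code from Lustre: the 'msg_class' enum and 'classify_msg()' are hypothetical names, 'uint32_t' stands in for '__u32', and the sketch assumes that 'pb_status' is reinterpreted as a signed value when it is read.

```c
#include <stdint.h>

/* Type values as given in this section. */
#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR     4712
#define PTL_RPC_MSG_REPLY   4713

/* Hypothetical classification, used only in this sketch. */
enum msg_class {
    MSG_CLASS_REQUEST,        /* a newly initiated request */
    MSG_CLASS_HANDLING_ERROR, /* the request could not be interpreted */
    MSG_CLASS_REPLY_OK,       /* request processed, pb_status is zero */
    MSG_CLASS_REPLY_FAILED,   /* request processed, pb_status is non-zero */
    MSG_CLASS_UNKNOWN         /* pb_type is none of the three values */
};

enum msg_class classify_msg(uint32_t pb_type, uint32_t pb_status)
{
    switch (pb_type) {
    case PTL_RPC_MSG_REQUEST:
        return MSG_CLASS_REQUEST;
    case PTL_RPC_MSG_ERR:
        /* Message handling error; pb_status says which one. */
        return MSG_CLASS_HANDLING_ERROR;
    case PTL_RPC_MSG_REPLY:
        /* Assumption: the unsigned field carries a signed error code. */
        return (int32_t)pb_status == 0 ? MSG_CLASS_REPLY_OK
                                       : MSG_CLASS_REPLY_FAILED;
    default:
        return MSG_CLASS_UNKNOWN;
    }
}
```

Under these assumptions, a PTL_RPC_MSG_REPLY carrying a non-zero 'pb_status' is classified as a failed but processed request, while any PTL_RPC_MSG_ERR is a handling error regardless of which status value it carries.
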
A non-zero 'pb_status' in such a reply may indicate something very specific to the particular RPC, but it may also be a very general sort of error. Those that are specific to particular RPCs will be documented with the respective RPCs, and those that are more generic are listed here:

.'pb_type' = PTL_RPC_MSG_REPLY status return codes
[options="header"]
|====
| pb_status | meaning
| -ENOTCONN | Client is not connected to the target, typically meaning the server was restarted or the client was evicted, and the client needs to reconnect.
| -EINPROGRESS | The request cannot be processed currently due to some other factor, such as during initial mount, a delay contacting the quota master during a write, or LFSCK rebuilding the OI table, but the client should continue to retry after a delay until interrupted or successful. This avoids blocking the server threads with client requests that cannot currently be processed, so that other requests can be processed in the meantime.
| -ESHUTDOWN | The server is being stopped and no new connections are allowed.
|====

A PtlRPC reply message with 'pb_status' = -ENOTCONN can be returned for any type of PtlRPC request message other than MGS_CONNECT, MDS_CONNECT, or OST_CONNECT (those RPCs establish connections). -ENOTCONN indicates that the server, the recipient of the request message, does not have a valid connection for the client. This can come about if a client has been evicted, or if, after a server reboots, the client failed to reestablish its connection during recovery. Establishing a connection is discussed more fully in <>, eviction in <>, and recovery in <>.

The 'pb_version' identifies the version of the Lustre protocol and is derived from the following constants. The lower two bytes give the version of PtlRPC being employed in the message, and the upper two bytes encode the role of the host for the service being requested. That role is one of OBD, MDS, OST, DLM, LOG, or MGS.

[source,c]
----
#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_OBD_VERSION  0x00010000
#define LUSTRE_MDS_VERSION  0x00020000
#define LUSTRE_OST_VERSION  0x00030000
#define LUSTRE_DLM_VERSION  0x00040000
#define LUSTRE_LOG_VERSION  0x00050000
#define LUSTRE_MGS_VERSION  0x00060000
----

The 'pb_opc' value (operation code) gives the actual Lustre operation that is the subject of this message. For example, MDS_CONNECT is a Lustre operation (number 38). The following table gives the name used and the value for each operation.

.Lustre Operation Codes ("opcodes")
[options="header"]
|====
| name | 'pb_opc'
| OST_REPLY | 0
| OST_GETATTR | 1
| <> | 2
| OST_READ | 3
| OST_WRITE | 4
| OST_CREATE | 5
| OST_DESTROY | 6
| OST_GET_INFO | 7
| <> | 8
| <> | 9
| OST_PUNCH | 10
| OST_OPEN | 11
| OST_CLOSE | 12
| OST_STATFS | 13
| OST_SYNC | 16
| OST_SET_INFO | 17
| OST_QUOTACHECK | 18
| OST_QUOTACTL | 19
| OST_QUOTA_ADJUST_QUNIT | 20
| |
| <> | 33
| MDS_GETATTR_NAME | 34
| MDS_CLOSE | 35
| MDS_REINT | 36
| MDS_READPAGE | 37
| MDS_CONNECT | 38
| <> | 39
| <> | 40
| <> | 41
| MDS_PIN | 42
| MDS_UNPIN | 43
| MDS_SYNC | 44
| MDS_DONE_WRITING | 45
| MDS_SET_INFO | 46
| MDS_QUOTACHECK | 47
| MDS_QUOTACTL | 48
| <> | 49
| MDS_SETXATTR | 50
| MDS_WRITEPAGE | 51
| MDS_IS_SUBDIR | 52
| MDS_GET_INFO | 53
| MDS_HSM_STATE_GET | 54
| MDS_HSM_STATE_SET | 55
| MDS_HSM_ACTION | 56
| MDS_HSM_PROGRESS | 57
| MDS_HSM_REQUEST | 58
| MDS_HSM_CT_REGISTER | 59
| MDS_HSM_CT_UNREGISTER | 60
| MDS_SWAP_LAYOUTS | 61
| |
| <> | 101
| LDLM_CONVERT | 102
| <> | 103
| <> | 104
| <> | 105
| LDLM_GL_CALLBACK | 106
| LDLM_SET_INFO | 107
| |
| <> | 250
| <> | 251
| MGS_EXCEPTION | 252
| MGS_TARGET_REG | 253
| MGS_TARGET_DEL | 254
| MGS_SET_INFO | 255
| <> | 256
| |
| OBD_PING | 400
| OBD_LOG_CANCEL | 401
| OBD_QC_CALLBACK | 402
| OBD_IDX_READ | 403
| |
| <> | 501
| <> | 502
| <> | 503
| LLOG_ORIGIN_HANDLE_WRITE_REC | 504
| LLOG_ORIGIN_HANDLE_CLOSE | 505
| LLOG_ORIGIN_CONNECT | 506
| LLOG_ORIGIN_HANDLE_PREV_BLOCK | 508
| LLOG_ORIGIN_HANDLE_DESTROY | 509
| |
| QUOTA_DQACQ | 601
| QUOTA_DQREL | 602
| |
| SEQ_QUERY | 700
| |
| SEC_CTX_INIT | 801
| SEC_CTX_INIT_CONT | 802
| SEC_CTX_FINI | 803
| |
| FLD_QUERY | 900
| FLD_READ | 901
| |
| UPDATE_OBJ | 1000
|====

The 'pb_status' field was already mentioned above in conjunction with the 'pb_type' field in replies. In a request message 'pb_status' is set to the 'pid' of the process making the request. In a reply message, a zero indicates that the service successfully initiated the requested operation. When an error is being reported the value encodes a standard Linux kernel (POSIX) error code as initially defined for the i386/x86_64 architecture. The 'pb_status' value is returned as a negative number. For example, a permissions error would be indicated as -EPERM.

'pb_last_xid' and 'pb_last_seen' are not used.

The 'pb_last_committed' value is always zero in a request. In a reply it is the highest transaction number that has been committed to storage. See <>. This field is set in any kind of reply message, including pings and non-modifying transactions. If 'pb_last_committed' is larger than, or equal to, the transaction number of any of the client's uncommitted requests (see 'pb_transno' below) then the server is confirming that those requests have been committed to stable storage. At that point the client may free those request structures, as they will no longer be necessary for replay during recovery.

The 'pb_transno' value is zero in a new request. It is also zero for replies to operations that do not modify the file system. For replies to operations that do modify the file system it is the target-unique, server-assigned transaction number for the client request. See <>. Upon receipt of the reply, the client copies this transaction number from 'pb_transno' of the reply to 'pb_transno' of the saved request.
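
The interplay of 'pb_transno' and 'pb_last_committed' described in this section can be sketched from the client's side as follows. This is an illustrative sketch only: 'struct saved_req', 'note_reply()', and 'release_committed()' are hypothetical names rather than Lustre functions, 'uint64_t' stands in for '__u64', and a flat array stands in for whatever structure a real client uses to track its saved requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical client-side record of a request saved for possible replay. */
struct saved_req {
    uint64_t pb_transno; /* copied from the reply; 0 if non-modifying */
    bool     in_use;     /* true while the request must be kept */
};

/*
 * Handle the reply to one request: copy the server-assigned transaction
 * number into the saved request and decide whether it must be retained.
 * Returns true if the request has to be kept for replay.
 */
bool note_reply(struct saved_req *req, uint64_t reply_transno,
                uint64_t reply_last_committed)
{
    assert(req != NULL);
    req->pb_transno = reply_transno;
    /*
     * A zero transno means the operation did not modify the file system.
     * A transno at or below last_committed is already on stable storage.
     */
    req->in_use = reply_transno != 0 &&
                  reply_transno > reply_last_committed;
    return req->in_use;
}

/*
 * Apply the pb_last_committed value of any reply: every saved request
 * whose transno is less than or equal to it may be freed.
 * Returns the number of requests released.
 */
size_t release_committed(struct saved_req *reqs, size_t n,
                         uint64_t last_committed)
{
    size_t freed = 0;

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].in_use && reqs[i].pb_transno <= last_committed) {
            reqs[i].in_use = false;
            freed++;
        }
    }
    return freed;
}
```

Because 'pb_last_committed' is set in every kind of reply, a sketch like this would call 'release_committed()' for each reply it receives, including replies to pings.
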
If 'pb_transno' is larger than 'pb_last_committed' (above) then the request has been processed at the target but is not yet committed to stable storage. The client must save the request for later resend to the server in case the target fails before the modification can be committed to disk. If the request has to be replayed it will include the transaction number.

The 'pb_flags' value governs the client state machine. Currently, only the least significant two bytes are used. The field encodes state according to the following values:

[source,c]
----
#define MSG_GEN_FLAG_MASK    0x0000ffff
#define MSG_LAST_REPLAY      0x0001
#define MSG_RESENT           0x0002
#define MSG_REPLAY           0x0004
#define MSG_DELAY_REPLAY     0x0010
#define MSG_VERSION_REPLAY   0x0020
#define MSG_REQ_REPLAY_DONE  0x0040
#define MSG_LOCK_REPLAY_DONE 0x0080
----

MSG_LAST_REPLAY is currently unused. It had been used to indicate that this is the last RPC request to be replayed by this client during recovery. MSG_LAST_REPLAY has been replaced by MSG_REQ_REPLAY_DONE and MSG_LOCK_REPLAY_DONE.

MSG_RESENT is set when this RPC request is being resent because no reply was received.

MSG_REPLAY indicates this RPC request is being replayed after the client received a reply but before the operation was committed to storage. The 'pb_transno' field holds the server-assigned transaction number.

MSG_DELAY_REPLAY is currently unused.

MSG_VERSION_REPLAY indicates that a replayed request has the 'pb_pre_versions' array filled with the prior object versions and can be used with Version Based Recovery.

MSG_LOCK_REPLAY_DONE indicates the client has completed lock replay, and is ready to finish recovery.

The 'pb_op_flags' values are specific to a particular 'pb_opc', but are currently only used by the *_CONNECT RPCs. The 'pb_op_flags' value for connect operations governs the client connection status state machine.

[source,c]
----
#define MSG_CONNECT_RECOVERING 0x00000001
#define MSG_CONNECT_RECONNECT  0x00000002
#define MSG_CONNECT_REPLAYABLE 0x00000004
#define MSG_CONNECT_LIBCLIENT  0x00000010
#define MSG_CONNECT_INITIAL    0x00000020
#define MSG_CONNECT_ASYNC      0x00000040
#define MSG_CONNECT_NEXT_VER   0x00000080
#define MSG_CONNECT_TRANSNO    0x00000100
----

MSG_CONNECT_RECOVERING indicates the server is in recovery.

MSG_CONNECT_RECONNECT indicates the client is reconnecting after non-responsiveness from the server.

MSG_CONNECT_REPLAYABLE indicates the server connection supports RPC replay (only OSTs support non-recoverable connections, but that is not the default).

MSG_CONNECT_LIBCLIENT is for a 'liblustre' client. It is currently unused.

The client sends MSG_CONNECT_INITIAL the first time it connects to the server. MSG_CONNECT_INITIAL connections are not allowed during server recovery.

MSG_CONNECT_ASYNC is currently unused.

MSG_CONNECT_NEXT_VER indicates that the client can understand the next higher protocol version in addition to the currently used protocol. The server can reply to the connect with that higher RPC version if it is supported; otherwise it replies with the same RPC version as the request. This allows RPC protocol versions to be negotiated during a transition period (e.g. an upgrade from LUSTRE_MSG_MAGIC_V1 to LUSTRE_MSG_MAGIC_V2).

In normal operation an initial request to connect will set 'pb_op_flags' to MSG_CONNECT_INITIAL (in some earlier versions MSG_CONNECT_NEXT_VER was mistakenly included, though it did no harm). The reply to that connection request (and all other, non-connect, requests and replies) will set 'pb_op_flags' to 0.

The 'pb_conn_cnt' (connection count) value in a request message reports the client's "era", which is part of the client and server's shared state. The value of the era is initialized to 1 when the client first connects to the target.
Each subsequent connection (after an eviction) increments the era for the client. Since the 'pb_conn_cnt' reflects the client's era at the time the message was composed, the server can use this value to discard late-arriving messages requesting operations on out-of-date shared state.

The 'pb_timeout' value in a request indicates how long (in seconds) the requester plans to wait before timing out the operation. That is, the corresponding reply for this message should arrive within this time frame. The service may extend this time frame via an "early reply", which is a reply to this message that notifies the requester that it should extend its timeout interval by the value of the 'pb_timeout' field in the reply. The "early reply" does not indicate the operation has actually been initiated. Clients maintain multiple request queues, called "portals", and each type of operation is assigned to one of these queues. There is a timeout value associated with each queue, and the timeout update affects all the messages associated with the given queue, not just the specific message that initiated the request. Finally, in a reply message (one that does indicate the operation has been initiated) the timeout value updates the timeout interval for the queue.

The 'pb_service_time' value is zero in a request. In a reply it indicates how long this particular operation actually took from the time it first arrived in the request queue (at the service) to the time the server replied. Note that the client can use this value and the local elapsed time for the operation to calculate network latency.

The 'pb_limit' value is zero in a request. In a reply it is a value sent from a lock service to a client to set the maximum number of locks available to the client. When dynamic lock LRUs are enabled this allows for managing the size of the LRU.

The 'pb_slv' value is zero in a request.
On a DLM service, the "server lock volume" is a value that characterizes (estimates) the amount of traffic, or load, on that lock service. It is calculated as the product of the number of locks and their age. In a reply, the 'pb_slv' value indicates to the client the available share of the total lock load on the server that the client is allowed to consume. The client is then responsible for reducing the number (or age) of its locks to stay within this limit.

The array of 'pb_pre_versions' values has four entries. They are always zero in a new request message. They are also zero in replies to operations that do not modify the file system. For an operation that does modify the file system, the reply encodes the most recent transaction numbers for the objects modified by this operation, and the 'pb_pre_versions' values are copied into the original request when the reply arrives. If the request needs to be replayed then the updated 'pb_pre_versions' values accompany the replayed request.

'pb_padding' is reserved for future use.

The 'pb_jobid' (string) value gives a unique identifier associated with the process on behalf of which this message was generated. The identifier is assigned to the user process by a job scheduler, if any.
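
Pulling the per-field request rules of this section together, the following sketch fills in a descriptor for an initial MDS_CONNECT request. It is illustrative only: the helpers 'init_request()' and 'init_mds_connect_request()' are hypothetical names, 'uint32_t' and 'uint64_t' stand in for '__u32' and '__u64', and 'struct lustre_handle' is assumed to hold a single 64-bit cookie because its definition is included from another file. Byte order, the enclosing message header, and 'pb_jobid' handling are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Constants as given in this section. */
#define PTL_RPC_MSG_REQUEST 4711
#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_MDS_VERSION  0x00020000
#define MSG_CONNECT_INITIAL 0x00000020
#define MDS_CONNECT         38

/* Assumed layout; the real definition lives in an included file. */
struct lustre_handle {
    uint64_t cookie;
};

struct ptlrpc_body {
    struct lustre_handle pb_handle;
    uint32_t pb_type;
    uint32_t pb_version;
    uint32_t pb_opc;
    uint32_t pb_status;
    uint64_t pb_last_xid;
    uint64_t pb_last_seen;
    uint64_t pb_last_committed;
    uint64_t pb_transno;
    uint32_t pb_flags;
    uint32_t pb_op_flags;
    uint32_t pb_conn_cnt;
    uint32_t pb_timeout;
    uint32_t pb_service_time;
    uint32_t pb_limit;
    uint64_t pb_slv;
    uint64_t pb_pre_versions[4];
    uint64_t pb_padding[4];
    char     pb_jobid[32];
};

/* Hypothetical helper: fill in the fields common to any new request. */
void init_request(struct ptlrpc_body *pb, uint32_t role_version,
                  uint32_t opc, uint32_t pid, uint32_t conn_cnt,
                  uint32_t timeout_sec)
{
    assert(pb != NULL);
    /* Fields described as zero in a new request stay zero. */
    memset(pb, 0, sizeof(*pb));
    pb->pb_type = PTL_RPC_MSG_REQUEST;
    /* Upper two bytes: role; lower two bytes: PtlRPC version. */
    pb->pb_version = (role_version & LUSTRE_VERSION_MASK) |
                     PTLRPC_MSG_VERSION;
    pb->pb_opc = opc;
    pb->pb_status = pid;          /* a request carries the caller's pid */
    pb->pb_conn_cnt = conn_cnt;   /* the client's current era */
    pb->pb_timeout = timeout_sec; /* how long the requester will wait */
}

/* Hypothetical helper: an initial connection request to an MDS target. */
void init_mds_connect_request(struct ptlrpc_body *pb, uint32_t pid,
                              uint32_t conn_cnt, uint32_t timeout_sec)
{
    init_request(pb, LUSTRE_MDS_VERSION, MDS_CONNECT, pid, conn_cnt,
                 timeout_sec);
    /* pb_handle stays 0 in a connection request. */
    pb->pb_op_flags = MSG_CONNECT_INITIAL;
}
```

A request sent after the connection is established would differ mainly in 'pb_handle', which would carry the 'lustre_handle' returned in the connect reply, and in 'pb_op_flags', which would be 0.
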