^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[[struct-ptlrpc-body]]
-Every Lustre message starts with both the above header and an
-additional set of fields (in its first "buffer") given by the 'struct
-ptlrpc_body_v3' structure. This preamble has information information
-relevant to every RPC type. In particular, the RPC type is itself
-encoded in the 'pb_opc' Lustre operation number. The value of that
-opcode, as well as whether it is an RPC 'request' or 'reply',
-determines what else will be in the message following the preamble.
+Every Lustre message starts with a header (not shown - see
+<<struct-lustre-msg>>) describing a few generic details about the RPC,
+the most important of which are the number and sizes of additional
+"buffers" that will follow. The buffers just organize subsections of
+the RPC, and each comprises a set of fields. The fields of a given buffer
+are presented together as 'struct' definitions in order to show how they
+are organized, which also provides a convenient way to refer to
+groups of fields that appear together in many different RPCs. The
+first buffer in every RPC is given by the 'ptlrpc_body' structure,
+also known as the RPC descriptor. This descriptor identifies the type
+of the RPC via a Lustre operation code. The value of that 'opcode', as
+well as whether it is an RPC 'request' or 'reply', determines what
+other buffers will be in the message following the descriptor.
[source,c]
----
-#define PTLRPC_NUM_VERSIONS 4
-#define JOBSTATS_JOBID_SIZE 32
struct ptlrpc_body {
struct lustre_handle pb_handle;
__u32 pb_type;
__u32 pb_service_time;
__u32 pb_limit;
__u64 pb_slv;
- __u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
+ __u64 pb_pre_versions[4];
__u64 pb_padding[4];
- char pb_jobid[JOBSTATS_JOBID_SIZE];
+ char pb_jobid[32];
};
----
include::struct_lustre_handle.txt[]
-In a connection request, sent by a client to a server and regarding a
-specific target, the 'pb_handle' is 0. In the reply to a connection
-request, sent by the target, the handle is a value uniquely
-identifying the target. Subsequent messages between this client and
-this target will use this handle to to gain access to their shared
-state. The handle is persistent across client reconnects to the same
-instance of the server, but if the client unmounts the filesystem or
-is evicted then it must re-connect as a new client.
+The 'pb_handle' field of the RPC descriptor is a 'lustre_handle'. In a
+connection request RPC (MGS_CONNECT, MDS_CONNECT, or OST_CONNECT),
+sent by a client to a target, 'pb_handle' is 0. In the reply to that
+connection request 'pb_handle' is set to a new 'lustre_handle' that
+uniquely identifies the target (indeed, it uniquely identifies the
+'export', since each client gets a unique 'lustre_handle' for a given
+target). Subsequent request messages sent from this client to this target
+will use that 'lustre_handle' (the 'pb_handle' field will be set to
+that value) to gain access to their shared state. In subsequent RPC
+reply messages (after the *_CONNECT reply) the 'pb_handle' field is
+0. The 'lustre_handle' is persistent across client reconnects to the
+same instance of the server, but if the client unmounts the filesystem
+or is evicted then it must re-connect as a new client, with a new
+'lustre_handle'.
The 'pb_type' is PTL_RPC_MSG_REQUEST in messages when they are
initiated, it is PTL_RPC_MSG_REPLY in a reply, and it is
such as those that emerge from processing the actual message content,
do not use the PTL_RPC_MSG_ERR type.
-The 'pb_status' field provides an error return code, if any, for the
-RPC. When 'pb_type' = PTL_RPC_MSG_ERR the 'pb_status' will also be
-set to one of the following message handling errors:
+The 'pb_status' field provides an error return code for the RPC. When
+'pb_type' = PTL_RPC_MSG_ERR the 'pb_status' will also be set to one of
+the following message handling errors:
-.format
+.'pb_type' = PTL_RPC_MSG_ERR status return codes
[options="header"]
|====
-| pb_type | pb_status | if
-| PTL_RPC_MSG_ERR | -ENOMEM | No memory for reply buffer
-| PTL_RPC_MSG_ERR | -ENOTSUPP | Invalid opcode
-| PTL_RPC_MSG_ERR | -EINVAL | Bad magic or version
-| PTL_RPC_MSG_ERR | -EPROTO | Request is malformed or cannot be
- processed in current context
+| pb_status | if
+| -ENOMEM | No memory for reply buffer
+| -ENOTSUPP | Invalid opcode
+| -EINVAL | Bad magic or version
+| -EPROTO | Request is malformed or cannot be
+ processed in current context
|====
A PTL_RPC_MSG_ERR message does not need to allocate memory, so it
should normally be sent as a reply even if there is not enough memory
-to allocate the normal reply buffer, unless the underlying network
-transport itself cannot allocate memory to send it. (fixme: and what
-happens then?)
+to allocate the normal reply buffer.
In most cases there is a reply with 'pb_type' = PTL_RPC_MSG_REPLY,
indicating that the request was processed, but it may still have
will be documented with the respective RPCs, and those that are more
generic are listed here:
-.format
+.'pb_type' = PTL_RPC_MSG_REPLY status return codes
[options="header"]
|====
-| pb_type | pb_status | meaning
-| PTL_RPC_MSG_REPLY | -ENOTCONN | Client is not connected to the
- target, typically meaning the
- server was restarted or the
- client was evicted, and the
- client needs to reconnect.
-| PTL_RPC_MSG_REPLY | -EINPROGRESS | The request cannot be processed
- currently due to some other
- factor, such as during initial
- mount, a delay contacting the
- quota master during a write, or
- LFSCK rebuilding the OI table,
- but the client should continue
- to retry after a delay until
- interrupted or successful. This
- avoids blocking the server
- threads with client requests
- that cannot currently be
- processed, but other requests
- might be processed in the
- meantime.
-| PTL_RPC_MSG_REPLY | -ESHUTDOWN | The server is being stopped and
- no new connections are allowed.
+| pb_status | meaning
+| -ENOTCONN | Client is not connected to the
+ target, typically meaning the
+ server was restarted or the
+ client was evicted, and the
+ client needs to reconnect.
+| -EINPROGRESS | The request cannot be processed
+ currently due to some other
+ factor, such as during initial
+ mount, a delay contacting the
+ quota master during a write, or
+ LFSCK rebuilding the OI table,
+ but the client should continue
+ to retry after a delay until
+ interrupted or successful. This
+ avoids blocking the server
+ threads with client requests
+ that cannot currently be
+ processed, but other requests
+ might be processed in the
+ meantime.
+| -ESHUTDOWN | The server is being stopped and
+ no new connections are allowed.
|====
-The significance of -ENOTCONN is discussed more fully in
-<<connection>>, but a brief comment may be useful here. The networking
-layers supporting the exchange of RPCs can be in good working order
-when 'pb_status' = -ENOTCONN is returned in an RPC reply message. The
-connection refered to by that status is the Lustre connection. That
-connection is part of the shared state between Lustre clients and
-servers that gets established via *_CONNECT RPCs,
-and can be lost due to an 'eviction' in the face of temporary connection
-failure or in case of an unresponsive client (from the server's point
-of view). So, even when that Lusre connection is lost (or has not been
-established, yet), RPC messages such as *_CONNECT can be exchanged.
+A PtlRPC reply message with 'pb_status' = -ENOTCONN can be returned
+for any type of PtlRPC request message other than one of MGS_CONNECT,
+MDS_CONNECT, or OST_CONNECT (those RPCs establish connections).
+-ENOTCONN indicates that the server, the recipient of the request
+message, does not have a valid connection for the client. This can
+come about if a client has been evicted, or if, after a server
+reboots, the client failed to reestablish its connection during
+recovery. Establishing a connection is discussed more fully in
+<<connection>>, eviction in <<eviction>>, and recovery in
+<<recovery>>.
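The generic reply status codes listed above can be summarized as a small decision function. This is only a sketch: the 'sketch_' names are hypothetical and are not Lustre symbols, and treating a 'pb_status' of 0 as success and -ESHUTDOWN as a plain failure are assumptions made for illustration.

```c
#include <errno.h>

/* Hypothetical client-side actions; these names are not part of Lustre. */
enum sketch_action {
	SKETCH_DONE,        /* assumed: pb_status of 0 means success */
	SKETCH_RECONNECT,   /* -ENOTCONN: the client needs to reconnect */
	SKETCH_RETRY_LATER, /* -EINPROGRESS: retry after a delay */
	SKETCH_FAIL,        /* any other status, including -ESHUTDOWN */
};

/* Map the pb_status of a PTL_RPC_MSG_REPLY message to an action. */
static enum sketch_action sketch_reply_action(int pb_status)
{
	switch (pb_status) {
	case 0:
		return SKETCH_DONE;
	case -ENOTCONN:
		return SKETCH_RECONNECT;
	case -EINPROGRESS:
		return SKETCH_RETRY_LATER;
	default:
		return SKETCH_FAIL;
	}
}
```

A caller would pass the 'pb_status' value taken from the reply's RPC descriptor and act on the result, for example by starting a new *_CONNECT exchange when SKETCH_RECONNECT is returned.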
The 'pb_version' identifies the version of the Lustre protocol and is
derived from the following constants. The lower two bytes give the
Lustre operation (number 38). The following list gives the name used
and the value for each operation.
-[source,c]
-----
-typedef enum {
- OST_REPLY = 0,
- OST_GETATTR = 1,
- <<ost-setattr-rpc,OST_SETATTR>> = 2,
- OST_READ = 3,
- OST_WRITE = 4,
- OST_CREATE = 5,
- OST_DESTROY = 6,
- OST_GET_INFO = 7,
- <<ost-connect-rpc,OST_CONNECT>> = 8,
- <<ost-connect-rpc,OST_DISCONNECT>> = 9,
- OST_PUNCH = 10,
- OST_OPEN = 11,
- OST_CLOSE = 12,
- OST_STATFS = 13,
- OST_SYNC = 16,
- OST_SET_INFO = 17,
- OST_QUOTACHECK = 18,
- OST_QUOTACTL = 19,
- OST_QUOTA_ADJUST_QUNIT = 20,
-
- <<mds-getattr-rpc,MDS_GETATTR>> = 33,
- MDS_GETATTR_NAME = 34,
- MDS_CLOSE = 35,
- MDS_REINT = 36,
- MDS_READPAGE = 37,
- <<mds-connect-rpc,MDS_CONNECT>> = 38,
- <<mds-disconnect,MDS_DISCONNECT>> = 39,
- <<mds-getstatus-rpc,MDS_GETSTATUS>> = 40,
- <mds-statfs-rpc,MDS_STATFS>> = 41,
- MDS_PIN = 42,
- MDS_UNPIN = 43,
- MDS_SYNC = 44,
- MDS_DONE_WRITING = 45,
- MDS_SET_INFO = 46,
- MDS_QUOTACHECK = 47,
- MDS_QUOTACTL = 48,
- <<mds-getxattr-rpc,MDS_GETXATTR>> = 49,
- MDS_SETXATTR = 50,
- MDS_WRITEPAGE = 51,
- MDS_IS_SUBDIR = 52,
- MDS_GET_INFO = 53,
- MDS_HSM_STATE_GET = 54,
- MDS_HSM_STATE_SET = 55,
- MDS_HSM_ACTION = 56,
- MDS_HSM_PROGRESS = 57,
- MDS_HSM_REQUEST = 58,
- MDS_HSM_CT_REGISTER = 59,
- MDS_HSM_CT_UNREGISTER = 60,
- MDS_SWAP_LAYOUTS = 61,
-
- <<ldlm-enqueue-rpc,LDLM_ENQUEUE>> = 101,
- LDLM_CONVERT = 102,
- <ldlm-cancel-rpc,LDLM_CANCEL>> = 103,
- <ldlm_bl_callback-rpc,LDLM_BL_CALLBACK>> = 104,
- <ldlm-cp-callback-rpc,LDLM_CP_CALLBACK>> = 105,
- LDLM_GL_CALLBACK = 106,
- LDLM_SET_INFO = 107,
-
- <<mgs-connect-rpc,MGS_CONNECT>> = 250,
- <<mgs-disconnect-rpc,MGS_DISCONNECT>> = 251,
- MGS_EXCEPTION = 252,
- MGS_TARGET_REG = 253,
- MGS_TARGET_DEL = 254,
- MGS_SET_INFO = 255,
- <<mgs-config-read-rpc,MGS_CONFIG_READ>> = 256,
-
- OBD_PING = 400,
- OBD_LOG_CANCEL = 401,
- OBD_QC_CALLBACK = 402,
- OBD_IDX_READ = 403,
-
- <<llog-origin-handle-create-rpc,LLOG_ORIGIN_HANDLE_CREATE>> = 501,
- <<llog-origin-handle-next-block,LLOG_ORIGIN_HANDLE_NEXT_BLOCK>> = 502,
- <<llog-origin-handle-read-header,LLOG_ORIGIN_HANDLE_READ_HEADER>> = 503,
- LLOG_ORIGIN_HANDLE_WRITE_REC = 504,
- LLOG_ORIGIN_HANDLE_CLOSE = 505,
- LLOG_ORIGIN_CONNECT = 506,
- LLOG_ORIGIN_HANDLE_PREV_BLOCK = 508,
- LLOG_ORIGIN_HANDLE_DESTROY = 509,
-
- QUOTA_DQACQ = 601,
- QUOTA_DQREL = 602,
-
- SEQ_QUERY = 700,
-
- SEC_CTX_INIT = 801,
- SEC_CTX_INIT_CONT = 802,
- SEC_CTX_FINI = 803,
-
- FLD_QUERY = 900,
- FLD_READ = 901,
-
- UPDATE_OBJ = 1000
-} cmd_t;
-----
-The symbols and values above identify the operations Lustre uses in
-its protocol. They are examined in detail in the
-<<lustre-prcs,Lustre RPCs>> section. Lustre carries out
-each of these operations via the exchange of a pair of messages: a
-request and a reply.
+.Lustre Operation Codes ("opcodes")
+[options="header"]
+|====
+| name | 'pb_opc'
+| OST_REPLY | 0
+| OST_GETATTR | 1
+| <<ost-setattr-rpc>> | 2
+| OST_READ | 3
+| OST_WRITE | 4
+| OST_CREATE | 5
+| OST_DESTROY | 6
+| OST_GET_INFO | 7
+| <<ost-connect-rpc>> | 8
+| <<ost-disconnect-rpc>> | 9
+| OST_PUNCH | 10
+| OST_OPEN | 11
+| OST_CLOSE | 12
+| OST_STATFS | 13
+| OST_SYNC | 16
+| OST_SET_INFO | 17
+| OST_QUOTACHECK | 18
+| OST_QUOTACTL | 19
+| OST_QUOTA_ADJUST_QUNIT | 20
+| |
+| <<mds-getattr-rpc>> | 33
+| MDS_GETATTR_NAME | 34
+| MDS_CLOSE | 35
+| MDS_REINT | 36
+| MDS_READPAGE | 37
+| <<mds-connect-rpc>> | 38
+| <<mds-disconnect-rpc>> | 39
+| <<mds-getstatus-rpc>> | 40
+| <<mds-statfs-rpc>> | 41
+| MDS_PIN | 42
+| MDS_UNPIN | 43
+| MDS_SYNC | 44
+| MDS_DONE_WRITING | 45
+| MDS_SET_INFO | 46
+| MDS_QUOTACHECK | 47
+| MDS_QUOTACTL | 48
+| <<mds-getxattr-rpc>> | 49
+| MDS_SETXATTR | 50
+| MDS_WRITEPAGE | 51
+| MDS_IS_SUBDIR | 52
+| MDS_GET_INFO | 53
+| MDS_HSM_STATE_GET | 54
+| MDS_HSM_STATE_SET | 55
+| MDS_HSM_ACTION | 56
+| MDS_HSM_PROGRESS | 57
+| MDS_HSM_REQUEST | 58
+| MDS_HSM_CT_REGISTER | 59
+| MDS_HSM_CT_UNREGISTER | 60
+| MDS_SWAP_LAYOUTS | 61
+| |
+| <<ldlm-enqueue-rpc>> | 101
+| LDLM_CONVERT | 102
+| <<ldlm-cancel-rpc>> | 103
+| <<ldlm-bl-callback-rpc>> | 104
+| <<ldlm-cp-callback-rpc>> | 105
+| LDLM_GL_CALLBACK | 106
+| LDLM_SET_INFO | 107
+| |
+| <<mgs-connect-rpc>> | 250
+| <<mgs-disconnect-rpc>> | 251
+| MGS_EXCEPTION | 252
+| MGS_TARGET_REG | 253
+| MGS_TARGET_DEL | 254
+| MGS_SET_INFO | 255
+| <<mgs-config-read-rpc>> | 256
+| |
+| OBD_PING | 400
+| OBD_LOG_CANCEL | 401
+| OBD_QC_CALLBACK | 402
+| OBD_IDX_READ | 403
+| |
+| <<llog-origin-handle-create-rpc>> | 501
+| <<llog-origin-handle-next-block-rpc>> | 502
+| <<llog-origin-handle-read-header-rpc>> | 503
+| LLOG_ORIGIN_HANDLE_WRITE_REC | 504
+| LLOG_ORIGIN_HANDLE_CLOSE | 505
+| LLOG_ORIGIN_CONNECT | 506
+| LLOG_ORIGIN_HANDLE_PREV_BLOCK | 508
+| LLOG_ORIGIN_HANDLE_DESTROY | 509
+| |
+| QUOTA_DQACQ | 601
+| QUOTA_DQREL | 602
+| |
+| SEQ_QUERY | 700
+| |
+| SEC_CTX_INIT | 801
+| SEC_CTX_INIT_CONT | 802
+| SEC_CTX_FINI | 803
+| |
+| FLD_QUERY | 900
+| FLD_READ | 901
+| |
+| UPDATE_OBJ | 1000
+|====
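As an illustration of how the opcode values might be used, the following sketch tests whether a 'pb_opc' value is one of the three connection requests (MGS_CONNECT, MDS_CONNECT, or OST_CONNECT). Only the three numeric values are taken from the table above; the enum and function names are hypothetical and do not come from the Lustre source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the opcode table; the names are hypothetical. */
enum sketch_connect_opc {
	SKETCH_OST_CONNECT = 8,
	SKETCH_MDS_CONNECT = 38,
	SKETCH_MGS_CONNECT = 250,
};

/* Return true if pb_opc names one of the *_CONNECT operations. */
static bool sketch_is_connect_opc(uint32_t pb_opc)
{
	switch (pb_opc) {
	case SKETCH_OST_CONNECT:
	case SKETCH_MDS_CONNECT:
	case SKETCH_MGS_CONNECT:
		return true;
	default:
		return false;
	}
}
```

Such a check could be used, for instance, to decide whether a request is expected to carry a 'pb_handle' of 0.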
The 'pb_status' field was already mentioned above in conjunction with
the 'pb_type' field in replies. In a request message 'pb_status' is
requested operation. When an error is being reported the value will
encode a standard Linux kernel (POSIX) error code as initially
defined for the i386/x86_64 architecture. The 'pb_status' value is
-returned as a negative number, so for example, a permissions error
+returned as a negative number. For example, a permissions error
would be indicated as -EPERM.
'pb_last_xid' and 'pb_last_seen' are not used.
The 'pb_last_committed' value is always zero in a request. In a reply
it is the highest transaction number that has been committed to
-storage. The transaction numbers are maintained on a per-target basis
-and each series of transaction numbers is a strictly increasing
-sequence for modifications originating from any client. This field is
-set in any kind of reply message including pings and non-modifying
-transactions. If 'pb_last_committed' is larger than, or equal to, any
-of the client's uncommitted requests (see 'pb_transno' below) then the
-server is confirming those requests have been committed to stable
-storage. At that point the client may free those request structures
-as they will no longer be necessary for replay during recovery.
-
-The 'pb_transno' value is always zero in a new request. It is also
-zero for replies to operations that do not modify the file system. For
-replies to operations that do modify the file system it is the
-target-unique, server-assigned transaction number for the client
-request. The 'pb_transno' assigned to each modifying request is in
-strictly increasing order, but may not be sequential for a single
-client, and the client may receive replies in a different order than
-they were processed by the server.Upon receipt of the reply, the
-client copies this transaction number from 'pb_transno' of the reply
-to 'pb_transno' of the saved request. If 'pb_transno' is larger than
-'pb_last_commited' (above) then the request has only been processed at
-the target and is not yet committed to stable storage. The client
-must save the request for later resend to the server in case the
-target fails before the modification can be committed to disk.If the
-request has to be replayed it will include the transaction number.
-
-The 'pb_flags' value governs the client state machine. Fixme: document
-what the states and transitions are of this state machine. Currently,
-only the bottom two bytes are used, and they encode state according to
-the following values:
+storage. See <<transno>>. This field is set in any kind of reply
+message including pings and non-modifying transactions. If
+'pb_last_committed' is larger than, or equal to, any of the client's
+uncommitted requests (see 'pb_transno' below) then the server is
+confirming those requests have been committed to stable storage. At
+that point the client may free those request structures as they will
+no longer be necessary for replay during recovery.
+
+The 'pb_transno' value is zero in a new request. It is also zero for
+replies to operations that do not modify the file system. For replies
+to operations that do modify the file system it is the target-unique,
+server-assigned transaction number for the client request. See
+<<transno>>. Upon receipt of the reply, the client copies this
+transaction number from 'pb_transno' of the reply to 'pb_transno' of
+the saved request. If 'pb_transno' is larger than 'pb_last_committed'
+(above) then the request has been processed at the target but is not
+yet committed to stable storage. The client must save the request for
+later resend to the server in case the target fails before the
+modification can be committed to disk. If the request has to be
+replayed it will include the transaction number.
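The relationship between 'pb_transno' and 'pb_last_committed' described above can be sketched as a single comparison. The structure and function below are hypothetical client-side bookkeeping, not Lustre code. They assume the client has already copied 'pb_transno' from the reply into its saved request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record a client keeps for each request it has sent. */
struct sketch_saved_req {
	uint64_t transno; /* copied from pb_transno of the reply */
};

/*
 * Return true if the saved request must be kept for possible replay,
 * given the pb_last_committed value from the most recent reply.
 */
static bool sketch_must_keep(const struct sketch_saved_req *req,
			     uint64_t pb_last_committed)
{
	/* A transno of zero means the operation did not modify the
	 * file system, so there is nothing to replay. */
	if (req->transno == 0)
		return false;

	/* Larger than pb_last_committed: processed at the target but
	 * not yet committed to stable storage. */
	return req->transno > pb_last_committed;
}
```

When the function returns false for a request, the client may free that request structure, since it is no longer needed for replay during recovery.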
+
+The 'pb_flags' value governs the client state machine. Currently, only
+the least significant two bytes are used. The field encodes state
+according to the following values:
[source,c]
----
MSG_DELAY_REPLAY is currently unused.
-MSG_VERSION_REPLAY indicates that a replayed request has
-pb_pre_versions[] filled with the prior object versions and can be
-used with Version Based Recovery.
+MSG_VERSION_REPLAY indicates that a replayed request has the
+'pb_pre_versions' array filled with the prior object versions and can
+be used with Version Based Recovery.
MSG_LOCK_REPLAY_DONE indicates the client has completed lock replay,
and is ready to finish recovery.
The 'pb_op_flags' values are specific to a particular 'pb_opc', but
-are currently only used by the *_CONNECT RPCs.The 'pb_op_flags' value
+are currently only used by the *_CONNECT RPCs. The 'pb_op_flags' value
for connect operations governs the client connection status state
machine.
#define MSG_CONNECT_TRANSNO 0x00000100
----
-MGS_CONNECT_RECOVERING indicate the server is in recovery
+MSG_CONNECT_RECOVERING indicates the server is in recovery
-MGS_CONNECT_RECONNECT indicates the client is reconnecting after
+MSG_CONNECT_RECONNECT indicates the client is reconnecting after
non-responsiveness from the server.
-MGS_CONNECT_REPLAYABLE indicates the server connection supports RPC
+MSG_CONNECT_REPLAYABLE indicates the server connection supports RPC
replay (only OSTs support non-recoverable connections, but that is not
the default).
-The MGS_CONNECT_LIBCLIENT is for the a 'liblustre' client. It is
+MSG_CONNECT_LIBCLIENT is for a 'liblustre' client. It is
currently unused.
-The client sends MGS_CONNECT_INITIAL the first time the client is
+The client sends MSG_CONNECT_INITIAL the first time the client is
connecting to the server. MSG_CONNECT_INITIAL connections are not
allowed during server recovery.
-MGS_CONNECT_ASYNC is currently unused.
+MSG_CONNECT_ASYNC is currently unused.
MSG_CONNECT_NEXT_VER indicates that the client can understand the next
higher protocol version in addition to the currently used protocol, and
The 'pb_conn_cnt' (connection count) value in a request message
reports the client's "era", which is part of the client and server's
-shared state. The value of the era is initialized to one when it is
-first connected to the MDT. Each subsequent connection (after an
+shared state. The value of the era is initialized to 1 when it is
+first connected to the target. Each subsequent connection (after an
eviction) increments the era for the client. Since the 'pb_conn_cnt'
reflects the client's era at the time the message was composed the
server can use this value to discard late-arriving messages requesting
associated with the given queue, not just the specific message that
initiated the request. Finally, in a reply message (one that does
indicate the operation has been initiated) the timeout value updates
-the timeout interval for the queue. Is this last point different from
-the "early reply" update?
+the timeout interval for the queue.
The 'pb_service_time' value is zero in a request. In a reply it
indicates how long this particular operation actually took from the