Ptlrpc_body - The Lustre RPC Descriptor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[[struct-ptlrpc-body]]

Every Lustre message starts with both the above header and an
additional set of fields (in its first "buffer") given by the 'struct
ptlrpc_body_v3' structure. This preamble carries information
relevant to every RPC type. In particular, the RPC type is itself
encoded in the 'pb_opc' Lustre operation number. The value of that
opcode, as well as whether it is an RPC 'request' or 'reply',
determines what else will be in the message following the preamble.
----
#define PTLRPC_NUM_VERSIONS     4
#define JOBSTATS_JOBID_SIZE     32
struct ptlrpc_body {
    struct lustre_handle pb_handle;
    __u32 pb_type;
    __u32 pb_version;
    __u32 pb_opc;
    __u32 pb_status;
    __u64 pb_last_xid;
    __u64 pb_last_seen;
    __u64 pb_last_committed;
    __u64 pb_transno;
    __u32 pb_flags;
    __u32 pb_op_flags;
    __u32 pb_conn_cnt;
    __u32 pb_timeout;
    __u32 pb_service_time;
    __u32 pb_limit;
    __u64 pb_slv;
    __u64 pb_pre_versions[PTLRPC_NUM_VERSIONS];
    __u64 pb_padding[4];
    char  pb_jobid[JOBSTATS_JOBID_SIZE];
};
----
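
The layout above can be checked mechanically. The following is a minimal sketch, not the Lustre header itself: it mirrors the listing with fixed-width types in place of '__u32' and '__u64', and it assumes that 'struct lustre_handle' is a single opaque 64-bit cookie. Under that assumption the fields pack without padding to 184 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define PTLRPC_NUM_VERSIONS     4
#define JOBSTATS_JOBID_SIZE     32

/* Assumption: the handle is one opaque 64-bit cookie. */
struct lustre_handle {
    uint64_t cookie;
};

/* Same field order as the listing above; uint32_t/uint64_t stand in
 * for the kernel's __u32/__u64. */
struct ptlrpc_body {
    struct lustre_handle pb_handle;
    uint32_t pb_type;
    uint32_t pb_version;
    uint32_t pb_opc;
    uint32_t pb_status;
    uint64_t pb_last_xid;
    uint64_t pb_last_seen;
    uint64_t pb_last_committed;
    uint64_t pb_transno;
    uint32_t pb_flags;
    uint32_t pb_op_flags;
    uint32_t pb_conn_cnt;
    uint32_t pb_timeout;
    uint32_t pb_service_time;
    uint32_t pb_limit;
    uint64_t pb_slv;
    uint64_t pb_pre_versions[PTLRPC_NUM_VERSIONS];
    uint64_t pb_padding[4];
    char     pb_jobid[JOBSTATS_JOBID_SIZE];
};

/* Compile-time checks (C11): every 64-bit field lands on an 8-byte
 * boundary, so the compiler inserts no padding. */
_Static_assert(offsetof(struct ptlrpc_body, pb_last_xid) == 24,
               "pb_last_xid follows four 32-bit fields");
_Static_assert(offsetof(struct ptlrpc_body, pb_slv) == 80,
               "pb_slv follows six 32-bit fields");
_Static_assert(sizeof(struct ptlrpc_body) == 184,
               "preamble size under the assumed handle layout");
```

If the handle were wider than 64 bits, every offset after 'pb_handle' would shift by the difference.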

In a connection request, sent by a client to a server and regarding a
specific target, the 'pb_handle' is 0. In the reply to a connection
request, sent by the target, the handle is a value uniquely
identifying the target. Subsequent messages between this client and
this target will use this handle to gain access to their shared
state. The handle is persistent across client reconnects to the same
instance of the server, but if the client unmounts the filesystem or
is evicted then it must re-connect as a new client.

The 'pb_type' is PTL_RPC_MSG_REQUEST in a message that initiates an
exchange, PTL_RPC_MSG_REPLY in a reply, and PTL_RPC_MSG_ERR in a reply
conveying that a received message could not be interpreted, that is,
that it was corrupt or incomplete. The encoding of those type values
is given by:
----
#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR     4712
#define PTL_RPC_MSG_REPLY   4713
----
The 'pb_type' = PTL_RPC_MSG_ERR is only for message handling errors.
This may be a message that failed to be interpreted as an actual
message, or it may indicate some more general problem with the
connection between the client and the target. Note that other errors,
such as those that emerge from processing the actual message content,
do not use the PTL_RPC_MSG_ERR type.
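
As an illustration of the three-way distinction, a receiver might first sort a message by 'pb_type' before looking at anything else. The enum and function names below are hypothetical and only the three constants come from the text.

```c
#include <stdint.h>

#define PTL_RPC_MSG_REQUEST 4711
#define PTL_RPC_MSG_ERR     4712
#define PTL_RPC_MSG_REPLY   4713

/* Hypothetical classification used only in this sketch. */
enum pb_kind {
    KIND_REQUEST,        /* message initiating an exchange */
    KIND_REPLY,          /* request was processed; pb_status may still be non-zero */
    KIND_HANDLING_ERROR, /* the message itself could not be handled */
    KIND_UNKNOWN         /* none of the three defined values */
};

static enum pb_kind classify_pb_type(uint32_t pb_type)
{
    switch (pb_type) {
    case PTL_RPC_MSG_REQUEST:
        return KIND_REQUEST;
    case PTL_RPC_MSG_REPLY:
        return KIND_REPLY;
    case PTL_RPC_MSG_ERR:
        return KIND_HANDLING_ERROR;
    default:
        return KIND_UNKNOWN;
    }
}
```

Errors raised while processing the message content arrive as KIND_REPLY with a non-zero 'pb_status', never as KIND_HANDLING_ERROR.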

The 'pb_status' field provides an error return code, if any, for the
RPC. When 'pb_type' = PTL_RPC_MSG_ERR the 'pb_status' will also be
set to one of the following message handling errors:

.format
[options="header"]
|====
| pb_type          | pb_status  | condition
| PTL_RPC_MSG_ERR  | -ENOMEM    | No memory for reply buffer
| PTL_RPC_MSG_ERR  | -ENOTSUPP  | Invalid opcode
| PTL_RPC_MSG_ERR  | -EINVAL    | Bad magic or version
| PTL_RPC_MSG_ERR  | -EPROTO    | Request is malformed or cannot be
                                  processed in current context
|====

A PTL_RPC_MSG_ERR message does not need to allocate memory, so it
should normally be sent as a reply even if there is not enough memory
to allocate the normal reply buffer, unless the underlying network
transport itself cannot allocate memory to send it. (fixme: and what
happens then?)

In most cases there is a reply with 'pb_type' = PTL_RPC_MSG_REPLY,
indicating that the request was processed, but it may still have
'pb_status' set to a non-zero value to indicate that the request
encountered an error during processing (see below).  This may indicate
something very specific to the particular RPC, but it may also be a
very general sort of error. Those that are specific to particular RPCs
will be documented with the respective RPCs, and those that are more
generic are listed here:

.format
[options="header"]
|====
| pb_type            | pb_status    | meaning
| PTL_RPC_MSG_REPLY  | -ENOTCONN    | Client is not connected to the
                                      target, typically meaning the
				      server was restarted or the
				      client was evicted, and the
				      client needs to reconnect.
| PTL_RPC_MSG_REPLY  | -EINPROGRESS | The request cannot be processed
                                      currently due to some other
				      factor, such as during initial
				      mount, a delay contacting the
				      quota master during a write, or
				      LFSCK rebuilding the OI table,
				      but the client should continue
				      to retry after a delay until
				      interrupted or successful.  This
				      avoids blocking the server
				      threads with client requests
				      that cannot currently be
				      processed, but other requests
				      might be processed in the
				      meantime.
| PTL_RPC_MSG_REPLY  | -ESHUTDOWN   | The server is being stopped and
                                      no new connections are allowed.
|====
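
The generic statuses in the table suggest a small dispatch on the client side. The sketch below is illustrative only: the action names are hypothetical, and the mapping restates the table rather than any actual client code. It uses the userspace 'errno.h' constants, which on Linux match the kernel values for these three codes.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical client reactions, named for this sketch only. */
enum reply_action {
    ACTION_NONE,          /* status 0: nothing to do */
    ACTION_RECONNECT,     /* -ENOTCONN: server restarted or client evicted */
    ACTION_RETRY_LATER,   /* -EINPROGRESS: retry after a delay */
    ACTION_SERVER_STOPPING, /* -ESHUTDOWN: no new connections allowed */
    ACTION_RPC_SPECIFIC   /* anything else: see the individual RPC */
};

/* 'status' is pb_status from a PTL_RPC_MSG_REPLY, read as a signed value. */
static enum reply_action action_for_status(int32_t status)
{
    switch (status) {
    case 0:
        return ACTION_NONE;
    case -ENOTCONN:
        return ACTION_RECONNECT;
    case -EINPROGRESS:
        return ACTION_RETRY_LATER;
    case -ESHUTDOWN:
        return ACTION_SERVER_STOPPING;
    default:
        return ACTION_RPC_SPECIFIC;
    }
}
```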

The significance of -ENOTCONN is discussed more fully in
<<connection>>, but a brief comment may be useful here. The networking
layers supporting the exchange of RPCs can be in good working order
when 'pb_status' = -ENOTCONN is returned in an RPC reply message. The
connection referred to by that status is the Lustre connection. That
connection is part of the shared state between Lustre clients and
servers that gets established via MDS_CONNECT and OST_CONNECT RPCs,
and can be lost due to an 'eviction'. So, even when that Lustre
connection is lost (or has not been established, yet), RPC messages
can be exchanged.

The 'pb_version' identifies the version of the Lustre protocol and is
derived from the following constants. The lower two bytes give the
version of PtlRPC being employed in the message, and the upper two
bytes encode the role of the host for the service being
requested. That role is one of OBD, MDS, OST, DLM, LOG, or MGS.
----
#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_OBD_VERSION  0x00010000
#define LUSTRE_MDS_VERSION  0x00020000
#define LUSTRE_OST_VERSION  0x00030000
#define LUSTRE_DLM_VERSION  0x00040000
#define LUSTRE_LOG_VERSION  0x00050000
#define LUSTRE_MGS_VERSION  0x00060000
----
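
Splitting a 'pb_version' value into its two halves follows directly from the mask. The helper names below are hypothetical and the constants are the ones listed above.

```c
#include <stdint.h>

#define PTLRPC_MSG_VERSION  0x00000003
#define LUSTRE_VERSION_MASK 0xffff0000
#define LUSTRE_OBD_VERSION  0x00010000
#define LUSTRE_MDS_VERSION  0x00020000
#define LUSTRE_OST_VERSION  0x00030000
#define LUSTRE_DLM_VERSION  0x00040000
#define LUSTRE_LOG_VERSION  0x00050000
#define LUSTRE_MGS_VERSION  0x00060000

/* Upper two bytes: the role of the service being requested. */
static uint32_t pb_version_role(uint32_t pb_version)
{
    return pb_version & (uint32_t)LUSTRE_VERSION_MASK;
}

/* Lower two bytes: the PtlRPC message version. */
static uint32_t pb_version_ptlrpc(uint32_t pb_version)
{
    return pb_version & ~(uint32_t)LUSTRE_VERSION_MASK;
}

/* Human-readable role name, or "unknown" for any other value. */
static const char *pb_version_role_name(uint32_t pb_version)
{
    switch (pb_version_role(pb_version)) {
    case LUSTRE_OBD_VERSION: return "OBD";
    case LUSTRE_MDS_VERSION: return "MDS";
    case LUSTRE_OST_VERSION: return "OST";
    case LUSTRE_DLM_VERSION: return "DLM";
    case LUSTRE_LOG_VERSION: return "LOG";
    case LUSTRE_MGS_VERSION: return "MGS";
    default:                 return "unknown";
    }
}
```

For example, the value 0x00020003 decodes to the MDS role with PtlRPC version 3.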

The 'pb_opc' value (operation code) gives the actual Lustre operation
that is the subject of this message. For example, MDS_CONNECT is a
Lustre operation (number 38). The following list gives the name used
and the value for each operation.
----
typedef enum {
    OST_REPLY                       =  0,
    OST_GETATTR                     =  1,
    <<ost-setattr-rpc,OST_SETATTR>>                     =  2,
    OST_READ                        =  3,
    OST_WRITE                       =  4,
    OST_CREATE                      =  5,
    OST_DESTROY                     =  6,
    OST_GET_INFO                    =  7,
    <<ost-connect-rpc,OST_CONNECT>>                     =  8,
    <<ost-connect-rpc,OST_DISCONNECT>>                  =  9,
    OST_PUNCH                       = 10,
    OST_OPEN                        = 11,
    OST_CLOSE                       = 12,
    OST_STATFS                      = 13,
    OST_SYNC                        = 16,
    OST_SET_INFO                    = 17,
    OST_QUOTACHECK                  = 18,
    OST_QUOTACTL                    = 19,
    OST_QUOTA_ADJUST_QUNIT          = 20,

    <<mds-getattr-rpc,MDS_GETATTR>>                     = 33,
    MDS_GETATTR_NAME                = 34,
    MDS_CLOSE                       = 35,
    MDS_REINT                       = 36,
    MDS_READPAGE                    = 37,
    <<mds-connect-rpc,MDS_CONNECT>>                     = 38,
    <<mds-disconnect,MDS_DISCONNECT>>                  = 39,
    <<mds-getstatus-rpc,MDS_GETSTATUS>>                   = 40,
    <<mds-statfs-rpc,MDS_STATFS>>                      = 41,
    MDS_PIN                         = 42,
    MDS_UNPIN                       = 43,
    MDS_SYNC                        = 44,
    MDS_DONE_WRITING                = 45,
    MDS_SET_INFO                    = 46,
    MDS_QUOTACHECK                  = 47,
    MDS_QUOTACTL                    = 48,
    <<mds-getxattr-rpc,MDS_GETXATTR>>                    = 49,
    MDS_SETXATTR                    = 50,
    MDS_WRITEPAGE                   = 51,
    MDS_IS_SUBDIR                   = 52,
    MDS_GET_INFO                    = 53,
    MDS_HSM_STATE_GET               = 54,
    MDS_HSM_STATE_SET               = 55,
    MDS_HSM_ACTION                  = 56,
    MDS_HSM_PROGRESS                = 57,
    MDS_HSM_REQUEST                 = 58,
    MDS_HSM_CT_REGISTER             = 59,
    MDS_HSM_CT_UNREGISTER           = 60,
    MDS_SWAP_LAYOUTS                = 61,

    <<ldlm-enqueue-rpc,LDLM_ENQUEUE>>                    = 101,
    LDLM_CONVERT                    = 102,
    <<ldlm-cancel-rpc,LDLM_CANCEL>>                     = 103,
    <<ldlm_bl_callback-rpc,LDLM_BL_CALLBACK>>                = 104,
    <<ldlm-cp-callback-rpc,LDLM_CP_CALLBACK>>                = 105,
    LDLM_GL_CALLBACK                = 106,
    LDLM_SET_INFO                   = 107,

    <<mgs-connect-rpc,MGS_CONNECT>>                     = 250,
    <<mgs-disconnect-rpc,MGS_DISCONNECT>>                  = 251,
    MGS_EXCEPTION                   = 252,
    MGS_TARGET_REG                  = 253,
    MGS_TARGET_DEL                  = 254,
    MGS_SET_INFO                    = 255,
    <<mgs-config-read-rpc,MGS_CONFIG_READ>>                 = 256,

    OBD_PING                        = 400,
    OBD_LOG_CANCEL                  = 401,
    OBD_QC_CALLBACK                 = 402,
    OBD_IDX_READ                    = 403,

    <<llog-origin-handle-create-rpc,LLOG_ORIGIN_HANDLE_CREATE>>       = 501,
    <<llog-origin-handle-next-block,LLOG_ORIGIN_HANDLE_NEXT_BLOCK>>   = 502,
    <<llog-origin-handle-read-header,LLOG_ORIGIN_HANDLE_READ_HEADER>>  = 503,
    LLOG_ORIGIN_HANDLE_WRITE_REC    = 504,
    LLOG_ORIGIN_HANDLE_CLOSE        = 505,
    LLOG_ORIGIN_CONNECT             = 506,
    LLOG_ORIGIN_HANDLE_PREV_BLOCK   = 508,
    LLOG_ORIGIN_HANDLE_DESTROY      = 509,

    QUOTA_DQACQ                     = 601,
    QUOTA_DQREL                     = 602,

    SEQ_QUERY                       = 700,

    SEC_CTX_INIT                    = 801,
    SEC_CTX_INIT_CONT               = 802,
    SEC_CTX_FINI                    = 803,

    FLD_QUERY                       = 900,
    FLD_READ                        = 901,

    UPDATE_OBJ                      = 1000
} cmd_t;
----
The symbols and values above identify the operations Lustre uses in
its protocol. They are examined in detail in the
<<lustre-operations,Lustre Operations>> section. Lustre carries out
each of these operations via the exchange of a pair of messages: a
request and a reply. The details of each message are specific to each
operation.  The <<lustre-messages,Lustre Messages>> chapter discusses
each message and its contents.

The 'pb_status' field was already mentioned above in conjunction with
the 'pb_type' field in replies.  In a request message 'pb_status' is
set to the 'pid' of the process making the request. In a reply
message, a zero indicates that the service successfully initiated the
requested operation. When an error is being reported the value will
encode a standard Linux kernel (POSIX) error code as initially
defined for the i386/x86_64 architecture. The 'pb_status' value is
returned as a negative number, so for example, a permissions error
would be indicated as -EPERM.
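
Because 'pb_status' is declared as an unsigned 32-bit field while error codes are negative, a reader of the field has to reinterpret it as signed. The following sketch assumes the value is carried in two's complement form and that the helper name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Reinterpret the raw 32-bit pb_status of a reply as a signed value
 * without relying on implementation-defined narrowing conversions. */
static int32_t reply_status(uint32_t raw)
{
    if (raw <= (uint32_t)INT32_MAX)
        return (int32_t)raw;                  /* zero or positive */
    /* Two's complement: raw encodes -(UINT32_MAX - raw) - 1. */
    return -(int32_t)(UINT32_MAX - raw) - 1;
}
```

With this helper a raw value of 0xffffffff reads back as -EPERM, since EPERM is 1.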

'pb_last_xid' and 'pb_last_seen' are not used.

The 'pb_last_committed' value is always zero in a request. In a reply
it is the highest transaction number that has been committed to
storage. The transaction numbers are maintained on a per-target basis
and each series of transaction numbers is a strictly increasing
sequence for modifications originating from any client. This field is
set in any kind of reply message including pings and non-modifying
transactions. If 'pb_last_committed' is larger than, or equal to, any
of the client's uncommitted requests (see 'pb_transno' below) then the
server is confirming those requests have been committed to stable
storage. At that point the client will free the request structures.
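
The release rule can be sketched as a scan over the client's saved requests. The structure and function names are hypothetical. The sketch also assumes that a saved request whose transaction number is still zero has not been assigned one yet and is therefore left alone.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one saved (uncommitted) request. */
struct saved_request {
    uint64_t transno; /* copied from the reply's pb_transno; 0 if none */
    bool     in_use;  /* false once the request has been released */
};

/* Release every saved request that the server has confirmed as
 * committed, i.e. whose transno is non-zero and not larger than
 * pb_last_committed. Returns the number of requests released. */
static size_t release_committed(struct saved_request *reqs, size_t count,
                                uint64_t pb_last_committed)
{
    size_t released = 0;

    for (size_t i = 0; i < count; i++) {
        if (!reqs[i].in_use || reqs[i].transno == 0)
            continue;
        if (reqs[i].transno <= pb_last_committed) {
            reqs[i].in_use = false;
            released++;
        }
    }
    return released;
}
```

Since any reply, including a ping, carries 'pb_last_committed', such a scan could run on every reply received from the target.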

The 'pb_transno' value is always zero in a new request. It is also
zero for replies to operations that do not modify the file system. For
replies to operations that do modify the file system it is the
target-unique, server-assigned transaction number for the client
request.  The 'pb_transno' assigned to each modifying request is in
strictly increasing order, but may not be sequential for a single
client, and the client may receive replies in a different order than
they were processed by the server. Upon receipt of the reply, the
client copies this transaction number from 'pb_transno' of the reply
to 'pb_transno' of the saved request.  If 'pb_transno' is larger than
'pb_last_committed' (above) then the request has only been processed at
the target and is not yet committed to stable storage.  The client
must save the request for later resend to the server in case the
target fails before the modification can be committed to disk. If the
request has to be replayed it will include the transaction number.
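
The client-side handling of a reply described here amounts to one copy and one comparison. The names below are hypothetical and the sketch covers only the decision, not the storage of the request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-request state kept by the client. */
struct request_state {
    uint64_t pb_transno;      /* 0 in a new request */
    bool     keep_for_replay; /* true while the change is uncommitted */
};

/* Record the server-assigned transaction number from a reply and
 * decide whether the request must be retained for a possible replay. */
static void note_reply(struct request_state *req,
                       uint64_t reply_transno,
                       uint64_t reply_last_committed)
{
    req->pb_transno = reply_transno;
    /* Larger than last_committed: processed but not yet on stable
     * storage. Non-modifying operations have transno 0 and are
     * therefore never retained. */
    req->keep_for_replay = reply_transno > reply_last_committed;
}
```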

The 'pb_flags' value governs the client state machine. Fixme: document
what the states and transitions are of this state machine. Currently,
only the bottom two bytes are used, and they encode state according to
the following values:
----
#define MSG_GEN_FLAG_MASK     0x0000ffff
#define MSG_LAST_REPLAY           0x0001
#define MSG_RESENT                0x0002
#define MSG_REPLAY                0x0004
#define MSG_DELAY_REPLAY          0x0010
#define MSG_VERSION_REPLAY        0x0020
#define MSG_REQ_REPLAY_DONE       0x0040
#define MSG_LOCK_REPLAY_DONE      0x0080
----

MSG_LAST_REPLAY is currently unused. It had been used to indicate that
this is the last RPC request to be replayed by this client during
recovery.  MSG_LAST_REPLAY has been replaced by MSG_REQ_REPLAY_DONE
and MSG_LOCK_REPLAY_DONE.

MSG_RESENT is set when this RPC request is being resent because no
reply was received.

MSG_REPLAY indicates this RPC request is being replayed after the
client received a reply but before it was committed to storage.  The
'pb_transno' field holds the server-assigned transaction number.

MSG_DELAY_REPLAY is currently unused.

MSG_VERSION_REPLAY indicates that a replayed request has
pb_pre_versions[] filled with the prior object versions and can be
used with Version Based Recovery.

MSG_LOCK_REPLAY_DONE indicates the client has completed lock replay,
and is ready to finish recovery.
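
Testing these flags is ordinary bit masking. The helper names in the following sketch are hypothetical and the constants are those listed above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSG_GEN_FLAG_MASK     0x0000ffff
#define MSG_RESENT                0x0002
#define MSG_REPLAY                0x0004
#define MSG_VERSION_REPLAY        0x0020
#define MSG_REQ_REPLAY_DONE       0x0040
#define MSG_LOCK_REPLAY_DONE      0x0080

/* Only the bottom two bytes of pb_flags are currently defined. */
static uint32_t general_flags(uint32_t pb_flags)
{
    return pb_flags & MSG_GEN_FLAG_MASK;
}

/* Request is being resent because no reply was received. */
static bool is_resent(uint32_t pb_flags)
{
    return (general_flags(pb_flags) & MSG_RESENT) != 0;
}

/* Request is being replayed after an uncommitted reply. */
static bool is_replay(uint32_t pb_flags)
{
    return (general_flags(pb_flags) & MSG_REPLAY) != 0;
}

/* A replayed request that also carries pb_pre_versions[]. */
static bool replay_has_pre_versions(uint32_t pb_flags)
{
    uint32_t want = MSG_REPLAY | MSG_VERSION_REPLAY;

    return (general_flags(pb_flags) & want) == want;
}
```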

The 'pb_op_flags' values are specific to a particular 'pb_opc', but
are currently only used by the *_CONNECT RPCs. The 'pb_op_flags' value
for connect operations governs the client connection status state
machine.

----
#define MSG_CONNECT_RECOVERING  0x00000001
#define MSG_CONNECT_RECONNECT   0x00000002
#define MSG_CONNECT_REPLAYABLE  0x00000004
#define MSG_CONNECT_LIBCLIENT   0x00000010
#define MSG_CONNECT_INITIAL     0x00000020
#define MSG_CONNECT_ASYNC       0x00000040
#define MSG_CONNECT_NEXT_VER    0x00000080
#define MSG_CONNECT_TRANSNO     0x00000100
----

MSG_CONNECT_RECOVERING indicates the server is in recovery.

MSG_CONNECT_RECONNECT indicates the client is reconnecting after
non-responsiveness from the server.

MSG_CONNECT_REPLAYABLE indicates the server connection supports RPC
replay (only OSTs support non-recoverable connections, but that is not
the default).

MSG_CONNECT_LIBCLIENT is for a 'liblustre' client. It is currently
unused.

The client sends MSG_CONNECT_INITIAL the first time it connects to
the server.  MSG_CONNECT_INITIAL connections are not allowed during
server recovery.

MSG_CONNECT_ASYNC is currently unused.

MSG_CONNECT_NEXT_VER indicates that the client can understand the next
higher protocol version, and the server can reply to the connect with
that RPC version if it is supported, otherwise it will reply with the
same RPC version as the request.  This allows RPC protocol versions to
be negotiated during a transition period (e.g. an upgrade from
LUSTRE_MSG_MAGIC_V1 to LUSTRE_MSG_MAGIC_V2).

In normal operation an initial request to connect will set
'pb_op_flags' to MSG_CONNECT_INITIAL (in some earlier versions
MSG_CONNECT_NEXT_VER was mistakenly included, though it did no
harm). The reply to that connection request (and all other,
non-connect, requests and replies) will set 'pb_op_flags' to 0.
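
One rule stated above lends itself to a direct check: initial connections are refused while the server is recovering. The function below is a hypothetical sketch of that single rule, not of the full connection state machine.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSG_CONNECT_RECONNECT   0x00000002
#define MSG_CONNECT_INITIAL     0x00000020

/* Returns false for a first-time connect that arrives while the
 * server is in recovery; every other combination is admitted by
 * this sketch. */
static bool connect_allowed(uint32_t pb_op_flags, bool server_in_recovery)
{
    if ((pb_op_flags & MSG_CONNECT_INITIAL) != 0 && server_in_recovery)
        return false;
    return true;
}
```

A reconnecting client sets MSG_CONNECT_RECONNECT rather than MSG_CONNECT_INITIAL, so this particular check does not turn it away during recovery.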

The 'pb_conn_cnt' (connection count) value in a request message
reports the client's "era", which is part of the client and server's
shared state. The value of the era is initialized to one when the
client first connects to the MDT. Each subsequent connection (after an
eviction) increments the era for the client. Since the 'pb_conn_cnt'
reflects the client's era at the time the message was composed the
server can use this value to discard late-arriving messages requesting
operations on out-of-date shared state.
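
The staleness test implied here is a single comparison. The sketch below uses a hypothetical function name and assumes the counter does not wrap around during the lifetime of a client.

```c
#include <stdbool.h>
#include <stdint.h>

/* A message composed in an earlier era than the one the server
 * currently holds for this client refers to out-of-date shared
 * state and can be discarded. */
static bool is_stale_message(uint32_t msg_conn_cnt, uint32_t current_conn_cnt)
{
    return msg_conn_cnt < current_conn_cnt;
}
```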

The 'pb_timeout' value in a request indicates how long (in seconds)
the requester plans to wait before timing out the operation. That is,
the corresponding reply for this message should arrive within this
time frame. The service may extend this time frame via an "early
reply", which is a reply to this message that notifies the requester
that it should extend its timeout interval by the value of the
'pb_timeout' field in the reply. The "early reply" does not indicate
the operation has actually been initiated.  Clients maintain multiple
request queues, called "portals", and each type of operation is
assigned to one of these queues. There is a timeout value associated
with each queue, and the timeout update affects all the messages
associated with the given queue, not just the specific message that
initiated the request. Finally, in a reply message (one that does
indicate the operation has been initiated) the timeout value updates
the timeout interval for the queue. (fixme: is this last point
different from the "early reply" update?)

The 'pb_service_time' value is zero in a request. In a reply it
indicates how long this particular operation actually took from the
time it first arrived in the request queue (at the service) to the
time the server replied. Note that the client can use this value and
the local elapsed time for the operation to calculate network latency.
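
The latency estimate mentioned above is a subtraction. The sketch assumes both values are expressed in the same unit, and the function name is hypothetical. The result covers the time spent on the network in both directions.

```c
#include <stdint.h>

/* Time spent outside the service: the locally measured elapsed time
 * for the whole exchange minus the pb_service_time reported in the
 * reply. Clamped at zero in case the two clocks disagree. */
static uint32_t network_latency(uint32_t local_elapsed, uint32_t pb_service_time)
{
    if (local_elapsed <= pb_service_time)
        return 0;
    return local_elapsed - pb_service_time;
}
```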

The 'pb_limit' value is zero in a request. In a reply it is a value
sent from a lock service to a client to set the maximum number of
locks available to the client. When dynamic lock LRUs are enabled
this allows for managing the size of the LRU.

The 'pb_slv' value is zero in a request. On a DLM service, the "server
lock volume" is a value that characterizes (estimates) the amount of
traffic, or load, on that lock service. It is calculated as the
product of the number of locks and their age. In a reply, the 'pb_slv'
value indicates to the client the available share of the total lock
load on the server that the client is allowed to consume. The client
is then responsible for reducing its number (or age) of locks to
stay within this limit.

The array of 'pb_pre_versions' values has four entries. They are
always zero in a new request message. They are also zero in replies to
operations that do not modify the file system. For an operation that
does modify the file system, the reply encodes the most recent
transaction numbers for the objects modified by this operation, and
the 'pb_pre_versions' values are copied into the original request when
the reply arrives. If the request needs to be replayed then the
updated 'pb_pre_versions' values accompany the replayed request.

'pb_padding' is reserved for future use.

The 'pb_jobid' (string) value gives a unique identifier associated
with the process on behalf of which this message was generated. The
identifier is assigned to the user process by a job scheduler, if any.