~~~~~~
Host systems that mount the Lustre file system and are generally
-refered to as "clients" of the file system services. Most, but not
+referred to as "clients" of the file system services. Most, but not
all, Lustre protocol actions (remote procedure calls, or RPCs) are
initiated by clients.
| OBD_MD_FLACL | ACL
|====
-As a bonus the MDT returnes layout information about the file, so that
+As a bonus the MDT returns layout information about the file, so that
Client1 can get attribute information from the OST(s) responsible
for the file's objects (if any).
See <<ldlm-enqueue-rpc>>.
-*4 - The OST invokes a glimps lock callback on Client2.*
+*4 - The OST invokes a glimpse lock callback on Client2.*
Client2 previously had a lock on the desired resource, and the glimpse
induces Client2 to flush its buffers, if needed, and update the OST
*5 - Client2 replies with LVB data for the OST.*
The OST is waiting to hear back from Client2 to update size and time
-attributes, if needed, due to Client2 chache being flushed to the
+attributes, if needed, due to Client2 cache being flushed to the
OST. The glimpse allows the information to return to the OST, and
thereby get passed to Client1, without taking the lock from Client2.
-------------------------
//////////////////////////////////////////////////////////////////////
-The 'ost_lvb' data from Client2 has atribute data to update the OST.
+The 'ost_lvb' data from Client2 has attribute data to update the OST.
*6 - The OST replies with the updated attribute information.*
-------------------------------------------------------
//////////////////////////////////////////////////////////////////////
-See <<ldlm-enqueue.txt>>.
+See <<ldlm-enqueue-rpc>>.
*2 - The MDT replies with an LDLM_ENQUEUE with the extended
attributes data.*
[glossary]
Object Storage Device (OSD)::
The OSD is the storage on a server that holds objects or metadata. The
-two kinds of file system available ofr OSDs are ldiskfs and ZFS.
+two kinds of file system available for OSDs are ldiskfs and ZFS.
Client::
A Lustre client is a computer taking advantage of the services
information is as small as possible given the maximum stripe count on
the file system. Clients, servers, and the distributed lock manager
will all need to be aware of this size, which is communicated in the
-'ocd_max_easize' fieled of the <<struct-obd-connect-data>> structure.
+'ocd_max_easize' field of the <<struct-obd-connect-data>> structure.
LNet::
A lower level protocol employed by PtlRPC to abstract the mechanisms
clients and servers. Lustre, in turn, layers its own protocol atop
LNet. This document describes the Lustre protocol.
-The remainder of the introduciton presents several concepts that
+The remainder of the introduction presents several concepts that
illuminate the operation of the Lustre protocol. In
<<file-system-operations>> a subsection is devoted to each of several
semantic operations (setattr, statfs, ...). That discussion introduces
'ldlm_request'::
The request RPC identifies the lock being canceled. Only the first
-'lock_handle' field is used. The rest of the 'ldlm_request' sturcture
+'lock_handle' field is used. The rest of the 'ldlm_request' structure
is not used. <<struct-ldlm-request>>
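As a rough illustration of the point above, a cancel request touches only the first handle. The sketch below uses simplified stand-in types and names, not the actual Lustre definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins; the real Lustre structures differ. */
struct lh_sketch { uint64_t cookie; };

struct ldlm_request_sketch {
    uint32_t lock_flags;
    uint32_t lock_count;
    struct lh_sketch lock_handle[2];
};

/* Fill a cancel request: only lock_handle[0] identifies the lock
 * being canceled; the rest of the structure is left zeroed. */
static void fill_cancel(struct ldlm_request_sketch *req, uint64_t cookie)
{
    memset(req, 0, sizeof(*req));
    req->lock_count = 1;
    req->lock_handle[0].cookie = cookie;
}
```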
.LDLM_CANCEL Reply Packet Structure
'EA data'::
The names of any extended attributes associated with the resource. The
-names are null-terminated strings concatenated into a single sequnce.
+names are null-terminated strings concatenated into a single sequence.
'EA vals'::
A block of data concatenating the values for the extended attributes
listed in "EA data".
'EA lens'::
-The sizes of the extended attirbute values. This is a sequence of
+The sizes of the extended attribute values. This is a sequence of
32-bit unsigned integers, one for each extended
attribute. The sizes give the length of the corresponding extended
attribute in the "EA vals" block of data. Thus the sum of those sizes
gives the size of the "EA vals" block.
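A minimal sketch of how a consumer might walk these three buffers; the helper names here are illustrative, not actual Lustre functions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sum the "EA lens" entries; this equals the size of the "EA vals"
 * block, since the values are simply concatenated in order. */
static uint32_t ea_vals_size(const uint32_t *ea_lens, unsigned count)
{
    uint32_t total = 0;
    for (unsigned i = 0; i < count; i++)
        total += ea_lens[i];
    return total;
}

/* Step to the next NUL-terminated name in the "EA data" block. */
static const char *ea_next_name(const char *name)
{
    return name + strlen(name) + 1;
}
```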
image::ldlm-gl-callback-request.png["LDLM_GL_CALLBACK Request Packet Structure",height=50]
//////////////////////////////////////////////////////////////////////
-The ldlm-gl-callback-request.png diagram resemgles this text
+The ldlm-gl-callback-request.png diagram resembles this text
art:
LDLM_GL_CALLBACK:
image::ldlm-gl-callback-reply.png["LDLM_GL_CALLBACK Reply Packet Structure",height=50]
//////////////////////////////////////////////////////////////////////
-The ldlm-gl-callback-reply.png diagram resemgles this text
+The ldlm-gl-callback-reply.png diagram resembles this text
art:
LDLM_GL_CALLBACK:
The Lustre log (llog) file may contain a number of different types of
data structures, including redo records for uncommitted distributed
-transactions such as unlink or ownership changes, configuration records
-for targets and clients, or ChangeLog records to track changes to the
-filesystem for external consumption, among others.
+transactions such as unlink or ownership changes, configuration
+records for targets and clients, or 'ChangeLog' records to track
+changes to the file system for external consumption, among others.
'client_uuid'::
A string with the client's own UUID. This is also an
<<struct-obd-uuid>>. The target sets the 'exp_client_uuid' field of
-its 'eport' structure to this value.
+its 'export' structure to this value.
'lustre_handle'::
See <<struct-lustre-handle>>. This 'lustre_handle' is distinct from
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[[mds-getstatus-rpc]]
-Get the attributes of the mountpoint of the file system.
+Get the attributes of the mount point of the file system.
.MDS_GETSTATUS Request Packet Structure
image::mds-getstatus-request.png["MDS_GETSTATUS Request Packet Structure",height=50]
See <<struct-mdt-body>>.
In the reply message, the 'mdt_body' contains only the FID of the
-filesystem ROOT in 'mbo_fid1'. The client can then use the returned
+file system ROOT in 'mbo_fid1'. The client can then use the returned
FID to fetch inode attributes for the root inode of the local
-mountpoint.
+mount point.
include::struct_mdt_rec_setxattr.txt[]
-Retruning to the remaining buffers in the REINT_SETXATTR RPC we again
+Returning to the remaining buffers in the REINT_SETXATTR RPC we again
have several optional buffers followed by the 'ldlm_request'.
'lustre_capa'::
[[mgs-connect-rpc]]
The general behavior of MGS_CONNECT RPCs closely resembles that of
-OST_CONNECT RPCs (See <<ost-connect-rpc>>) and MDS_CONNECt RPCs. The
+OST_CONNECT RPCs (See <<ost-connect-rpc>>) and MDS_CONNECT RPCs. The
actual RPC's 'pb_opc' value will be different as are the names of the
targets and the specific values in the 'obd_connect_data' structure.
'client_uuid'::
A string with the client's own UUID. This is also a
<<struct-obd-uuid>>. The target sets the 'exp_client_uuid' field of
-its 'eport' structure to this value.
+its 'export' structure to this value.
'lustre_handle'::
See <<struct-lustre-handle>>. This 'lustre_handle' is distinct from
the 'pb_handle' field in the 'ptlrpc_body'. This 'lustre_handle'
-provied a unique 'cookie' to identify this client for this connection
+provides a unique 'cookie' to identify this client for this connection
attempt. After a disconnect, a subsequent connect RPC will have a
different value. The target preserves this cookie in the 'exp_handle'
field of its 'obd_export' structure for this client. In that way it
Lustre file system the client must 'mount' the file system, and Lustre
services must already be started on the servers. A file system
mount may be initiated via the 'mount()' system call, passing the
-name of the MGS and Lustre filesystem to the kernel. The Lustre
+name of the MGS and Lustre file system to the kernel. The Lustre
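For example, such a mount is typically initiated with a command along these lines; the NID, file system name, and mount point shown are placeholders:

```shell
# Mount the Lustre file system "lustre0" whose MGS is reachable at the
# LNet NID 10.0.0.1@tcp on the local directory /mnt/lustre0.
mount -t lustre 10.0.0.1@tcp:/lustre0 /mnt/lustre0
```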
client exchanges a series of messages with the servers, beginning with
messages that retrieve the file system configuration <<llog>> from the
management server (MGS). This provides the client with the identities
include::struct_ost_body.txt[]
The above discussion described the structure of the OST_SETATTR
-request message. In this case the reply message sturcture looks much
-the same. It uses the same to buffers, and oly changes their contents
-to reflect that the message is a reply rather than a requsest.
+request message. In this case the reply message structure looks much
+the same. It uses the same two buffers, and only changes their contents
+to reflect that the message is a reply rather than a request.
.OST_SETATTR Reply Packet Structure
image::ost-setattr-reply.png["OST_SETATTR Reply Packet Structure",height=50]
Client1 it has to interact with Client2. The OST sends an
LDLM_BL_CALLBACK request to Client2 asking Client2 to finish up with
the lock it has. Client2 replies with a simple acknowledgment. When
-Client2 is no longer using the lock it will send an LDLM_CANEL RPC to
+Client2 is no longer using the lock it will send an LDLM_CANCEL RPC to
the OST. At that point the OST grants the original request sending an
LDLM_CP_CALLBACK request to Client1 to notify it. With that taken care
of, Client1 is finally able to issue the OST_PUNCH request that
Its effect is to notify the OST that the lock has been returned.
-*6 - The OST replies acknowleging the lock request.*
+*6 - The OST replies acknowledging the lock request.*
The ldlm_reply's lock descriptor acknowledges the request for an
extent write lock without granting it ('l_req_mode' == LCK_PW,
*11 - The OST acknowledges the LDLM_CANCEL (step 7) from Client2*
The OST finishes up with the lock cancel (after having notified
-Client1) by replying to Clietn2. This happens asynchronously with the
+Client1) by replying to Client2. This happens asynchronously with the
arrival of the OST_PUNCH request, and in <<truncate-rpcs>> it is shown
-occuring after the OST_PUNCH, but that is not required.
+occurring after the OST_PUNCH, but that is not required.
.LDLM_CANCEL Reply Packet Structure
image::ldlm-cancel-reply.png["LDLM_CANCEL Reply Packet Structure",height=50]
//////////////////////////////////////////////////////////////////////
The LDLM_CANCEL reply is a so-called "empty" RPC. Its only purpose is
-to acknowldge receipt of the LDLM_CANCEL request.
+to acknowledge receipt of the LDLM_CANCEL request.
*12 - The OST sends an OST_PUNCH reply.*
difference 'f_files' - 'f_ffree', which is the current number of
used objects. This is what "df" displays.
-The number of OST free objects is divided by the filesystem-wide
+The number of OST free objects is divided by the file-system-wide
default stripe count (i.e. the expected number of OST objects used per
MDT file), so that 'f_ffree' represents the expected minimum number of
files that can be created at the current time.
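As a worked example of the division described above (with made-up numbers): 1,000,000 free OST objects and a file-system default stripe count of 4 yield an expected 'f_ffree' of 250,000.

```c
#include <assert.h>
#include <stdint.h>

/* Expected minimum number of creatable files, per the description
 * above: free OST objects divided by the default stripe count.
 * The values used below are illustrative. */
static uint64_t expected_f_ffree(uint64_t free_objects, uint32_t stripe_count)
{
    return free_objects / stripe_count;
}
```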
types of layouts for different files, either 'lov_mds_md_v1' or
'lov_mds_md_v3' as of Lustre 2.7, though they are very similar in
structure. In an intent request (as opposed to a reply and as yet
-unimplemanted) it will modify the layout. It will not be included
+unimplemented) it will modify the layout. It will not be included
(zero length) in requests in current releases.
[source,c]
that there is a hole (missing object) within the layout, which is normally
caused by corruption or loss of the file layout that had to be rebuilt
by LFSCK. LOV_PATTERN_F_RELEASED is used by HSM to indicate that the
-file data is not resident in the filesystem, but rather in an external
+file data is not resident in the file system, but rather in an external
archive, so the layout is only a stub that describes the layout to use
when the file is restored.
The 'lmm_stripe_count' field gives how many OST objects the file is striped
over.
-The 'lmm_layout_gen' field is updated as the layout of the obeject is
+The 'lmm_layout_gen' field is updated as the layout of the object is
updated. If the 'lmm_layout_gen' is modified, then clients can detect
the layout has changed when refreshing the layout after having lost the
layout lock.
****
There are several reserved ranges of FID sequence values
(summarized in the list above), to allow for interoperability with
-older Lustre filesystems, to identify "well known" objects for
+older Lustre file systems, to identify "well known" objects for
internal or external use, as well as for future expansion.
The 'FID_SEQ_OST_MDT0' (0x0) range is reserved for OST objects created
-by MDT0 in non-DNE filesystems. Since all such OST objects used an
-'f_seq' value of zero these FIDs are not unique across the filesystem,
+by MDT0 in non-DNE file systems. Since all such OST objects used an
+'f_seq' value of zero these FIDs are not unique across the file system,
but the reservation of 'FID_SEQ_OST_MDT0' allows these FIDs to co-exist
with other FIDs in the same 128-bit identifier space.
The 'FID_SEQ_IGIF' (0xb-0xffffffff) range is reserved for 'inode
generation in FID' (IGIF) inodes allocated by MDSs before Lustre 2.0.
This corresponds to the 4 billion maximum inode number that could be
-allocated for such filesystems. The 'f_oid' field for IGIF FIDs
+allocated for such file systems. The 'f_oid' field for IGIF FIDs
contains the inode version number, and as such there is normally only
a single object for each 'f_seq' value.
The 'FID_SEQ_IDIF' (0x100000000-0x1fffffffff) range is reserved for
mapping OST objects that were created by MDT0 using 'FID_SEQ_OST_MDT0'
-to filesystem-unique FIDs. The second 16-bit field (bits 16-31) of the
+to file-system-unique FIDs. The second 16-bit field (bits 16-31) of the
'f_seq' field contains the OST index (0-65535). The low 16-bit field
(bits 0-15) of 'f_seq' contains the high (bits 32-47) bits of the OST
object ID, and the 32-bit 'f_oid' field contains the low 32 bits of
the object ID.
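The bit layout above can be sketched as follows; the structure and function names are illustrative stand-ins, not the actual Lustre definitions:

```c
#include <assert.h>
#include <stdint.h>

struct fid_sketch { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

/* Pack an (OST index, object ID) pair into an IDIF FID per the layout
 * described above: bits 16-31 of f_seq hold the OST index, bits 0-15
 * hold bits 32-47 of the object ID, and f_oid holds the low 32 bits. */
static struct fid_sketch idif_pack(uint32_t ost_idx, uint64_t obj_id)
{
    struct fid_sketch fid;
    fid.f_seq = 0x100000000ULL                       /* FID_SEQ_IDIF base */
              | ((uint64_t)(ost_idx & 0xffff) << 16) /* OST index */
              | ((obj_id >> 32) & 0xffff);           /* high bits of ID */
    fid.f_oid = (uint32_t)obj_id;                    /* low 32 bits of ID */
    fid.f_ver = 0;
    return fid;
}
```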
known" objects internal to the server and is not exposed to the network.
The 'FID_SEQ_DOT_LUSTRE' (0x200000002) range is reserved for files
-under the hidden ".lustre" directory in the root of the filesystem.
+under the hidden ".lustre" directory in the root of the file system.
The 'FID_SEQ_LOCAL_NAME' (0x200000003) range is reserved for objects
internal to the server that are allocated by name.
The 'mbo_aclsize' field indicates the size of the POSIX access control
lists (ACLs) in bytes.
-The 'mbo_max_mdsize' field indicates the maximu size of the file layout,
+The 'mbo_max_mdsize' field indicates the maximum size of the file layout,
as described in <<struct-lov-mds-md>>
The 'mbo_max_cookiesize' field is unused since Lustre 2.4. It
[[struct-mdt-rec-reint]]
An 'mdt_rec_reint' structure specifies the generic form for MDS_REINT
-requests. Each sub-operation, as defned by the 'rr_opcode' field, has
+requests. Each sub-operation, as defined by the 'rr_opcode' field, has
its own variant of this structure. Each variant has the same size as
the generic 'mdt_rec_reint', but interprets its fields slightly
differently. Note that in order for swabbing to take place correctly
the sequence of field sizes must be the same in every variant as it is in
-the generic version (not just the overal size of the sturcture).
+the generic version (not just the overall size of the structure).
[source,c]
----
The 'mgs_config_body' structure has information identifying to the MGS
which Lustre file system the client is requesting configuration information
-from. 'mcb_name' contains the filesystem name (fsname). 'mcb_offset'
+from. 'mcb_name' contains the file system name (fsname). 'mcb_offset'
contains the next record number in the configuration llog to process
(see <<llog>> for details), not the byte offset or bulk transfer units.
'mcb_bits' is the log2 of the units of minimum bulk transfer size,
example 'OBD_CONNECT_RDONLY' is optional depending on client mount
options). The request also contains other fields that are only valid
if the matching flag is set. The server replies in 'ocd_connect_flags'
-with the subset of feature flags that it understands and intends to honour.
+with the subset of feature flags that it understands and intends to honor.
The server may set fields in the reply for mutually-understood features.
The 'ocd_connect_flags' field encodes the connect flags giving the
If the OBD_CONNECT_INDEX flag is set then the 'ocd_index' field is
valid. The 'ocd_index' value is set in a request to hold the LOV
index of the OST or the LMV index of the MDT. The server's export for
-the target holds the correct value, and if the client send a value
-that does not match the server returns the -EBDF error.
+the target holds the correct value, and if the client sends a value
+that does not match, the server returns the -EBADF error.
If the OBD_CONNECT_BRW_SIZE flag is set then the 'ocd_brw_size' field
is valid. The 'ocd_brw_size' value sets the maximum supported bulk
If the OBD_CONNECT_GRANT_PARAM flag is set then the 'ocd_blocksize',
'ocd_inodespace', and 'ocd_grant_extent' fields are honored. A server
reply uses the 'ocd_blocksize' value to inform the client of the log
-base two of the size in bytes of the backend file system's blocks.
+base two of the size in bytes of the OSD's blocks.
A server reply uses the 'ocd_inodespace' value to inform the client of
the log base two of the size of an inode.
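For instance (illustrative values, not taken from a real reply), a server advertising 4 KiB blocks and 1 KiB inodes would send 'ocd_blocksize' = 12 and 'ocd_inodespace' = 10:

```c
#include <assert.h>
#include <stdint.h>

/* Decode a log2-encoded size from the connect data, as described
 * above; the log2 encoding keeps the field compact. */
static uint32_t decode_log2_size(uint8_t log2_size)
{
    return 1u << log2_size;
}
```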
of the maximum OST object size for this target. A file that is striped
uniformly across multiple OST objects (RAID-0) cannot be larger than the
number of stripes times the minimum 'ocd_maxbytes' value from any of its
-consituent objects.
+constituent objects.
The additional space in the 'obd_connect_data' structure is unused and
reserved for future use.
from this client.
If the OBD_CONNECT_SRVLOCK flag is set then the client and server
-support lockless IO. The server will take locks for client IO requests
+support lock-less IO. The server will take locks for client IO requests
with the OBD_BRW_SRVLOCK flag in the 'niobuf_remote' structure
flags. This is used for Direct IO or when there is significant lock
contention on a single OST object. The client takes no LDLM lock and
delegates locking to the server.
If the OBD_CONNECT_ACL flag is set then the server supports the ACL
-mount option for its filesystem. If the server is mounted with ACL
+mount option for its file system. If the server is mounted with ACL
support but the client does not pass OBD_CONNECT_ACL then the client
mount is refused.
If the OBD_CONNECT_XATTR flag is set then the server supports user
extended attributes. This is requested by the client if mounted
with the appropriate mount option, but is enabled or disabled by the
-mount options of the backend file system of MDT0000.
+mount options of the OSD for MDT0000.
If the OBD_CONNECT_TRUNCLOCK flag is set then the client and the
-server support lockless truncate. This is realized in an OST_PUNCH RPC
+server support lock-less truncate. This is realized in an OST_PUNCH RPC
by setting the 'obdo' structure's 'o_flag' field to include the
OBD_FL_SRVLOCK. In that circumstance the client takes no lock, and the
server must take a lock on the resource while performing the truncate.
If the OBD_CONNECT_CANCELSET is set then early batched cancels are
enabled. The ELC (Early Lock Cancel) feature allows client locks to
-be cancelled prior the cancellation callback if it is clear that lock
+be canceled prior to the cancellation callback if it is clear that the lock
is not needed anymore, for example after rename, after removing file
or directory, link, etc. This can reduce the number of RPCs significantly.
lock to the old clients that do not support this feature.
If the OBD_CONNECT_64BITHASH is set then the client supports 64-bit
-directory readdir cookie. The server will also use 64-bit hash mode
+directory 'readdir' cookie. The server will also use 64-bit hash mode
while working with ldiskfs.
If the OBD_CONNECT_JOBSTATS is set then the client fills jobid in
optimize a recovery of open requests.
If the OBD_CONNECT_OPEN_BY_FID is set then an open by FID won't pack
-the name in a request. This is used by HSM or other ChangeLog consumers
+the name in a request. This is used by HSM or other 'ChangeLog' consumers
for accessing objects by their FID via .lustre/fid/ instead of by name.
If the OBD_CONNECT_MDS_MDS is set then the current connection is an
MDS avoiding glimpse request to OSTs. This feature was implemented as
a demo feature and wasn't enabled by default. It was finally removed in
Lustre 2.7 because it causes quite complex recovery cases to handle
-with relatevely small benefits.
+with relatively small benefits.
The OBD_CONNECT_QUOTA64 flag was used prior to Lustre 2.4 for quota
purposes; it is obsolete due to the new quota design.
RPC error properly. The quota design requires that the client resend a
request that received an -EINPROGRESS error indefinitely, until successful
completion or another error. This flag is set on both client and
-server by default. Meanwhile this flag is not checked anywere, so does
+server by default. Meanwhile this flag is not checked anywhere, so it does
nothing.
If the OBD_CONNECT_FLOCK_OWNER is set then 1.8 clients have fixed flock
gets taken will increment the refcount. Callback invocations and
replay also lead to incrementing the 'ref_count'. The next four fields
- 'exp_rpc_count', 'exp_cb_count', 'exp_replay_count', and
-'exp_locks_count' - all subcategorize the 'exp_refcount'. The
+'exp_locks_count' - all sub-categorize the 'exp_refcount'. The
reference counter keeps the export alive while there are any users of
that export. The reference counter is also used for debug
purposes. Similarly, the 'exp_locks_list' and 'exp_locks_list_guard'
include::struct_obd_uuid.txt[]
The 'exp_client_uuid' holds the UUID of the client connected to this
-export. This UUID is randomly generated by the client and the same
-UUID is used by the client for connecting to all servers, so that the
-servers may identify the client amongst themselves if necessary. The
-client's UID appears in the *_CONNECT message (See
-<<ost-connect-rpc>>, <<mds-connect-rpc>>, and <<mgs-connect-rpc>>).
+export. This UUID is randomly generated by the client, and the same
+UUID is used by the client for connecting to all servers. The client's
+UUID appears in the *_CONNECT message (See <<ost-connect-rpc>>,
+<<mds-connect-rpc>>, and <<mgs-connect-rpc>>).
//////////////////////////////////////////////////////////////////////
////vvvv
For those requests that initiate file system modifying transactions
the request and its attendant locks need to be preserved until either
-a) the client acknowleges recieving the reply, or b) the transaction
+a) the client acknowledges receiving the reply, or b) the transaction
has been committed locally. This ensures a request can be replayed in
the event of a failure. The LDLM lock is kept until one of these
events occurs, to prevent any other modifications of the same object.
When the import is itself initialized it is set to
LUSTRE_IMP_NEW. When a client initiates a *_CONNECT RPC it sets the
state to LUSTRE_IMP_CONNECTING. Similarly, it sets the state to
-LUSTRE_IMP_DISCON as it initiates a *_DISCONNECT RPC. Reciving the
+LUSTRE_IMP_DISCON as it initiates a *_DISCONNECT RPC. Receiving the
reply to the *_DISCONNECT RPC will set the state to
LUSTRE_IMP_CLOSED. When a (successful) *_CONNECT RPC reply arrives the
state is set to LUSTRE_IMP_FULL. If a target signals a problem or a
recovery condition then the state will proceed through the replay and
recover states. When the target signals that the client connection is
-invalid for some reaon the state will be set to
+invalid for some reason the state will be set to
LUSTRE_IMP_EVICTED. See <<eviction>> and <<recovery>>.
//////////////////////////////////////////////////////////////////////
| imp_delayed_recovery | VBR: imp in delayed recovery
| imp_no_lock_replay | VBR: if gap was found then no lock replays
| imp_vbr_failed | recovery by versions failed
-| imp_force_verify | force an immidiate ping
+| imp_force_verify | force an immediate ping
| imp_force_next_verify | force a scheduled ping
| imp_pingable | target is pingable
| imp_resend_replay | resend for replay
OSD has a RAID configuration that is degraded or
rebuilding, the state is returned with the OS_STATE_DEGRADED (0x1) flag
set. If the file system has been set to read-only, either manually at
-mount or automatcially due to detected corruption of the underlying
-target filesystem, then 'os_state' is returned with OS_STATE_READONLY (0x2)
+mount or automatically due to detected corruption of the underlying
+target file system, then 'os_state' is returned with OS_STATE_READONLY (0x2)
set.
The 'os_fprecreated' field counts the number of pre-created objects
that value) to gain access to their shared state. In subsequent RPC
reply messages (after the *_CONNECT reply) the 'pb_handle' field is
0. The 'lustre_handle' is persistent across client reconnects to the
-same instance of the server, but if the client unmounts the filesystem
+same instance of the server, but if the client unmounts the file system
or is evicted then it must re-connect as a new client, with a new
'lustre_handle'.
| UPDATE_OBJ | 1000
|====
-The 'pb_status' field was already mentioned above in conjuction with
+The 'pb_status' field was already mentioned above in conjunction with
the 'pb_type' field in replies. In a request message 'pb_status' is
set to the 'pid' of the process making the request. In a reply
message, a zero indicates that the service successfully initiated the
server-assigned transaction number for the client request. See
<<transno>>. Upon receipt of the reply, the client copies this
transaction number from 'pb_transno' of the reply to 'pb_transno' of
-the saved request. If 'pb_transno' is larger than 'pb_last_commited'
+the saved request. If 'pb_transno' is larger than 'pb_last_committed'
(above) then the request has been processed at the target but is not
yet committed to stable storage. The client must save the request for
later resend to the server in case the target fails before the
Lustre servers manage permanent storage on behalf of the clients. That
storage is divided into logical resources for management data,
metadata (namespace and index data), and object storage. Each such
-storage resource is called a target. Managment data is stored on a
+storage resource is called a target. Management data is stored on a
single management server (MGS) with a single target, also called the
MGS. Namespace and index data are maintained on one or more metadata
servers (MDSs), and each MDS may have several targets (MDTs). Object