File System Operations
----------------------
[[file-system-operations]]

Lustre is a POSIX-compliant file system that provides namespace and data
storage services to clients. It implements all the usual file system
functionality, including creating, writing, reading, and removing files
and directories. These file system operations are implemented via <>,
which carry out communication and coordination with the servers. This
section presents, for a variety of file system operations, the sequence
of Lustre operations involved and their effects.

Mount
~~~~~

Before any other interaction can take place between a client and a
Lustre file system the client must 'mount' the file system, and Lustre
services must already be in place (on the servers). A file system mount
may be initiated at the Linux shell command line, which in turn invokes
the 'mount()' system call. The Lustre kernel modules then exchange a
series of messages with the servers, beginning with messages that
retrieve details about the file system from the management server
(MGS). This provides the client with the identities of all the metadata
servers (MDSs) and targets (MDTs) as well as all the object storage
servers (OSSs) and targets (OSTs). The client then steps through the
targets in turn, exchanging additional messages to initiate a
connection with each. The following sections present the details of the
Lustre operations that accomplish the file system mount.

Messages Between the Client and the MGS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to mount the Lustre file system the client needs to know the
identities of the various servers and targets so that it can initiate
connections to them. The following sequence of operations accomplishes
this.
----
MGS_CONNECT
LDLM_ENQUEUE (concurrent read)
LLOG_ORIGIN_HANDLE_CREATE (filename: lfs-sptlrpc)
LDLM_ENQUEUE (concurrent read)
LLOG_ORIGIN_HANDLE_CREATE (filename: lfs-client)
LLOG_ORIGIN_HANDLE_READ_HEADER
LLOG_ORIGIN_HANDLE_NEXT_BLOCK
LDLM_ENQUEUE (concurrent read)
MGS_CONFIG_READ (name: lfs-cliir)
LDLM_ENQUEUE (concurrent read)
LLOG_ORIGIN_HANDLE_CREATE (filename: params)
LLOG_ORIGIN_HANDLE_READ_HEADER
----

Prior to any other interaction between a client and a Lustre server (or
between two servers) the client must establish a 'connection'. The
connection establishes shared state between the two hosts. On the
client this connection state information is called an 'import', and
there is an import on the client for each target it connects to. On the
server this connection state is referred to as an 'export', and again
the server has an export for each client that has connected to it.
There is a separate export for each client for each target.

The client begins by carrying out the MGS_CONNECT Lustre operation,
which establishes the connection (creates the import and the export)
between the client and the MGS. The connect message from the client
includes a 'handle' to uniquely identify itself (subsequent messages to
the LDLM will refer to that client-handle). The connection data from
the client also proposes the set of <> appropriate to connecting to an
MGS.

.Flags for the client connection to an MGS
[options="header"]
|====
| obd_connect_data->ocd_connect_flags
| OBD_CONNECT_VERSION
| OBD_CONNECT_AT
| OBD_CONNECT_FULL20
| OBD_CONNECT_IMP_RECOV
| OBD_CONNECT_MNE_SWAB
| OBD_CONNECT_PINGLESS
|====

The MGS's reply to the connection request includes the handle that the
server and client will both use to identify this connection in
subsequent messages. This is the 'connection-handle' (as opposed to the
client-handle mentioned a moment ago). The MGS also replies with the
same set of connection flags.
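The bookkeeping described above (one import per target on the client,
one export per client per target on the server, and the two kinds of
handle) can be sketched as a toy model. This is only an illustration:
the class names, field names, and handle values are hypothetical and do
not correspond to Lustre's actual data structures or wire format.

```python
from dataclasses import dataclass, field
from itertools import count


# Hypothetical model; none of these names are Lustre's real structures.
@dataclass
class Import:
    """Client-side connection state for one target."""
    target: str
    client_handle: int
    connection_handle: int


@dataclass
class Export:
    """Server-side connection state for one (client, target) pair."""
    target: str
    client_handle: int
    connection_handle: int


@dataclass
class Server:
    targets: set
    # Keyed by (client_handle, target): one export per client per target.
    exports: dict = field(default_factory=dict)
    # Made-up source of connection-handles, starting at an arbitrary value.
    _handles: count = field(default_factory=lambda: count(1000))

    def connect(self, target, client_handle):
        """Create (or reuse) the export and return its connection-handle."""
        if target not in self.targets:
            raise LookupError(f"no such target: {target}")
        key = (client_handle, target)
        if key not in self.exports:
            self.exports[key] = Export(target, client_handle, next(self._handles))
        return self.exports[key].connection_handle


@dataclass
class Client:
    client_handle: int
    # Keyed by target: one import per target the client connects to.
    imports: dict = field(default_factory=dict)

    def connect(self, server, target):
        """Send the client-handle, record the returned connection-handle."""
        conn = server.connect(target, self.client_handle)
        self.imports[target] = Import(target, self.client_handle, conn)
        return self.imports[target]


mgs = Server(targets={"MGS"})
client = Client(client_handle=1)
mgs_import = client.connect(mgs, "MGS")
```

After the call, the client's import and the server's export for that
client and target hold the same pair of handles, which is the shared
state that later requests refer to.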
Once the connection is established the client gets configuration
information for the file system from the MGS in four stages. First, the
two exchange messages establishing the file system wide security policy
that will be followed in all subsequent communications. Second, the
client gets a bitmap instructing it as to which among the configuration
records on the MGS it needs. Third, reading those records from the MGS
gives the client the list of all the servers and targets it will need
to communicate with. Fourth, the client reads cluster wide
configuration data (the sort that might be set from the command line
with an 'lctl conf_param' command). The following paragraphs go into
these four stages in more detail.

Each time the client is going to read information from server storage
it first needs to acquire the appropriate lock. Since the client is
only reading data, the locks will be 'concurrent read' locks. The
LDLM_ENQUEUE command communicates this lock request to the MGS target.
The request identifies the target via the connection-handle from the
connection reply, and identifies the client (itself) with the
client-handle from its original connection request. The MGS's reply
grants that lock, if appropriate. If other clients were making some
sort of modification to the MGS data then the lock exchange might
result in a delay while the client waits. More details about the
behavior of the <> are in that section. For now, let's assume the locks
are granted for each of these four operations.

The first LLOG_ORIGIN_HANDLE_CREATE operation (the client is creating
its own local handle, not the target's file) asks for the security
configuration file ("lfs-sptlrpc"). <> discusses security, and for now
let's assume there is nothing to be done for security. That is,
subsequent messages will all use an "empty security flavor" and no
encryption will take place.
In this case the MGS's reply ('pb_status' == -2, ENOENT) indicates that
there is no such file, so nothing actually gets read.

Another LDLM_ENQUEUE and LLOG_ORIGIN_HANDLE_CREATE pair of operations
identifies the client configuration data file ("lfs-client"), and in
this case there is data to read. The LLOG_ORIGIN_HANDLE_CREATE reply
identifies the actual object of interest on the MGS via the
'llog_logid' field in the 'struct llogd_body'. The MGS stores
configuration data in log records. A header at the beginning of
"lfs-client" uses a bitmap to identify the log records that are
actually needed. The header indicates both which records to retrieve
and how large those records are. The LLOG_ORIGIN_HANDLE_READ_HEADER
request uses the 'llog_logid' to identify the desired log file, and the
reply provides the bitmap and size information identifying the records
that are actually needed. The LLOG_ORIGIN_HANDLE_NEXT_BLOCK operation
retrieves the data thus identified.

Knowing the specific configuration records it wants, the client then
proceeds to retrieve them. This requires another LDLM_ENQUEUE
operation, followed this time by the MGS_CONFIG_READ operation, which
gets the UUIDs for the servers and targets from the configuration log
("lfs-cliir").

A final LDLM_ENQUEUE, LLOG_ORIGIN_HANDLE_CREATE, and
LLOG_ORIGIN_HANDLE_READ_HEADER then retrieve the cluster wide
configuration data ("params").

Messages Between the Client and the MDSs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After the foregoing interaction with the MGS the client has a list of
the MDSs and MDTs in the file system. Next, the client invokes four
Lustre operations with each MDT on the list.

----
MDS_CONNECT
MDS_STATFS
MDS_GETSTATUS
MDS_GETATTR
----

The MDS_CONNECT operation establishes a connection between the client
and a specific target (MDT) on an MDS. Thus, if an MDS has multiple
targets, there is a separate MDS_CONNECT operation for each.
This creates an import for the target on the client and an export for
the client and target on the MDS. As with the connect operation for the
MGS, the connect message from the client includes a UUID to uniquely
identify this connection, and subsequent messages to the lock manager
on the server will refer to that UUID. The connection data from the
client also proposes the set of <> appropriate to connecting to an MDS.
The following flags are always included.

.Always included flags for the client connection to an MDS
[options="header"]
|====
| obd_connect_data->ocd_connect_flags
| OBD_CONNECT_RDONLY
| OBD_CONNECT_VERSION
| OBD_CONNECT_ACL
| OBD_CONNECT_XATTR
| OBD_CONNECT_IBITS
| OBD_CONNECT_NODEVOH
| OBD_CONNECT_ATTRFID
| OBD_CONNECT_CANCELSET
| OBD_CONNECT_AT
| OBD_CONNECT_RMT_CLIENT
| OBD_CONNECT_RMT_CLIENT_FORCE
| OBD_CONNECT_BRW_SIZE
| OBD_CONNECT_MDS_CAPA
| OBD_CONNECT_OSS_CAPA
| OBD_CONNECT_MDS_MDS
| OBD_CONNECT_FID
| LRU_RESIZE_CONNECT_FLAG
| OBD_CONNECT_VBR
| OBD_CONNECT_LOV_V3
| OBD_CONNECT_SOM
| OBD_CONNECT_FULL20
| OBD_CONNECT_64BITHASH
| OBD_CONNECT_JOBSTATS
| OBD_CONNECT_EINPROGRESS
| OBD_CONNECT_LIGHTWEIGHT
| OBD_CONNECT_UMASK
| OBD_CONNECT_LVB_TYPE
| OBD_CONNECT_LAYOUTLOCK
| OBD_CONNECT_PINGLESS
| OBD_CONNECT_MAX_EASIZE
| OBD_CONNECT_FLOCK_DEAD
| OBD_CONNECT_DISP_STRIPE
| OBD_CONNECT_LFSCK
| OBD_CONNECT_OPEN_BY_FID
| OBD_CONNECT_DIR_STRIPE
|====

.Optional flags for the client connection to an MDS
[options="header"]
|====
| obd_connect_data->ocd_connect_flags
| OBD_CONNECT_SOM
| OBD_CONNECT_LRU_RESIZE
| OBD_CONNECT_ACL
| OBD_CONNECT_UMASK
| OBD_CONNECT_RDONLY
| OBD_CONNECT_XATTR
| OBD_CONNECT_RMT_CLIENT_FORCE
|====

The MDS replies to the connect message with a subset of the flags
proposed by the client, and the client notes those values in its
import. The MDS's reply to the connection request also includes a UUID
that the server and client will both use to identify this connection in
subsequent messages.
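The flag exchange can be pictured as a bitmask negotiation. The sketch
below is a minimal illustration under two assumptions that the text
above does not state: that the server grants exactly those proposed
flags it also supports, and that each flag occupies one bit. The bit
values assigned here are arbitrary and are not the values Lustre uses,
and only a handful of the flag names from the tables appear.

```python
from enum import IntFlag, auto


class ConnectFlag(IntFlag):
    """A few flag names from the tables above; bit values are illustrative only."""
    RDONLY = auto()
    VERSION = auto()
    ACL = auto()
    XATTR = auto()
    IBITS = auto()
    AT = auto()
    FID = auto()
    FULL20 = auto()


def negotiate(proposed: ConnectFlag, supported: ConnectFlag) -> ConnectFlag:
    """Server side (assumed policy): grant the proposed flags it also supports."""
    return proposed & supported


def has_feature(granted: ConnectFlag, flag: ConnectFlag) -> bool:
    """Client side: consult the flags noted in the import before using a feature."""
    return flag in granted


# The client proposes a set of flags; the reply carries a subset of them.
proposed = ConnectFlag.VERSION | ConnectFlag.ACL | ConnectFlag.XATTR | ConnectFlag.FID
supported = ConnectFlag.VERSION | ConnectFlag.XATTR | ConnectFlag.FID | ConnectFlag.AT
granted = negotiate(proposed, supported)
```

Because the reply can only remove flags from the proposal, the client
can treat the granted set as the definitive list of features available
on that connection.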
The client next uses an MDS_STATFS operation to request 'statfs'
information from the target, and that data is returned in the reply
message. The actual fields closely resemble the results of a 'statfs'
system call. See the 'obd_statfs' structure in the <>.

The client uses the MDS_GETSTATUS operation to request information
about the mount point of the file system.

fixme: Does MDS_GETSTATUS only ask about the root (so it would seem)?

The server reply contains the 'fid' of the root directory of the file
system being mounted. If there is a security policy, the capabilities
of that security policy are included in the reply.

The client then uses the MDS_GETATTR operation to get further
information about the root directory of the file system. The request
message includes the above fid. It also includes the security
capability (if appropriate). The reply holds the same fid, and in this
case the 'mdt_body' has several additional fields filled in. These
include the mtime, atime, ctime, mode, uid, and gid, as well as the
size of the extended attributes and the size of the ACL information.
The reply message also includes the extended attributes and the ACL.
From the extended attributes the client can find out about striping
information for the root, if any.

Messages Between the Client and the OSSs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Additional CONNECT messages flow between the client and each OST
enumerated by the MGS.

----
OST_CONNECT
----

Unmount
~~~~~~~

----
OST_DISCONNECT
MDS_DISCONNECT
MGS_DISCONNECT
----

Create
~~~~~~

Further discussion of the 'creat()' system call.

include::getattr.txt[]

include::setattr.txt[]

include::statfs.txt[]

include::getxattr.txt[]