Export
^^^^^^
[[obd-export]]

An 'obd_export' structure for a given target is created on a server
for each client that connects to that target. The exports for all the
clients for a given target are managed together. The export represents
the connection state between the client and target as well as the
current state of any ongoing activity. Thus each pending request will
have a reference to the export. The export is discarded if the
connection goes away, but only after all the references to it have
been cleaned up.

Some state information for each export is also maintained on disk. In
the event of a server failure and recovery, the server can read the
export data from disk to allow the client to reconnect and participate
in recovery. A client without any export data on disk will not be
allowed to participate in recovery.

----
struct obd_export {
    struct portals_handle exp_handle;
    struct obd_uuid exp_client_uuid;
    struct obd_connect_data exp_connect_data;
};
----

//////////////////////////////////////////////////////////////////////
////vvvv
This is the full obd_export.
struct obd_export {
    struct portals_handle exp_handle;
    atomic_t exp_refcount;
    atomic_t exp_rpc_count;
    atomic_t exp_cb_count;
    atomic_t exp_replay_count;
    atomic_t exp_locks_count;
    struct obd_uuid exp_client_uuid;
    cfs_list_t exp_obd_chain;
    cfs_hlist_node_t exp_uuid_hash;
    cfs_hlist_node_t exp_nid_hash;
    cfs_list_t exp_obd_chain_timed;
    struct obd_device *exp_obd;
    struct obd_import *exp_imp_reverse;
    struct nid_stat *exp_nid_stats;
    struct ptlrpc_connection *exp_connection;
    __u32 exp_conn_cnt;
    cfs_hash_t *exp_lock_hash;
    cfs_hash_t *exp_flock_hash;
    cfs_list_t exp_outstanding_replies;
    cfs_list_t exp_uncommitted_replies;
    spinlock_t exp_uncommitted_replies_lock;
    __u64 exp_last_committed;
    cfs_time_t exp_last_request_time;
    cfs_list_t exp_req_replay_queue;
    spinlock_t exp_lock;
    struct obd_connect_data exp_connect_data;
    enum obd_option exp_flags;
    unsigned long exp_failed:1,
                  exp_in_recovery:1,
                  exp_disconnected:1,
                  exp_connecting:1,
                  exp_delayed:1,
                  exp_vbr_failed:1,
                  exp_req_replay_needed:1,
                  exp_lock_replay_needed:1,
                  exp_need_sync:1,
                  exp_flvr_changed:1,
                  exp_flvr_adapt:1,
                  exp_libclient:1,
                  exp_need_mne_swab:1;
    enum lustre_sec_part exp_sp_peer;
    struct sptlrpc_flavor exp_flvr;
    struct sptlrpc_flavor exp_flvr_old[2];
    cfs_time_t exp_flvr_expire[2];
    spinlock_t exp_rpc_lock;
    cfs_list_t exp_hp_rpcs;
    cfs_list_t exp_reg_rpcs;
    cfs_list_t exp_bl_list;
    spinlock_t exp_bl_list_lock;
    union {
        struct tg_export_data eu_target_data;
        struct mdt_export_data eu_mdt_data;
        struct filter_export_data eu_filter_data;
        struct ec_export_data eu_ec_data;
        struct mgs_export_data eu_mgs_data;
    } u;
    struct nodemap *exp_nodemap;
};
////^^^^
//////////////////////////////////////////////////////////////////////

The 'exp_handle' holds the cookie that the server generates at
*_CONNECT time to uniquely identify this connection from the
client. This cookie is also sent back to the client in the reply and
is then stored in the client's import.
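The role of the 'exp_handle' cookie can be illustrated with a minimal
sketch. Everything named 'sketch_*' below is hypothetical and exists
only for this illustration; it is not the Lustre implementation. The
sketch assumes a small fixed table of exports and a simple counter as
the cookie source, chosen purely for brevity. It shows the server
assigning a cookie when a connection is made and later using the
cookie echoed by the client to find the matching export.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real Lustre structures. */
#define SKETCH_MAX_EXPORTS 8

struct sketch_handle {
    uint64_t h_cookie;              /* 0 is treated as "no cookie" here */
};

struct sketch_export {
    struct sketch_handle exp_handle;
    int exp_in_use;
};

static struct sketch_export sketch_exports[SKETCH_MAX_EXPORTS];
/* Assumption: a plain counter stands in for the cookie generator. */
static uint64_t sketch_next_cookie = 0x1000;

/* Connect: claim a free export and return the cookie for the reply. */
uint64_t sketch_connect(void)
{
    for (size_t i = 0; i < SKETCH_MAX_EXPORTS; i++) {
        struct sketch_export *exp = &sketch_exports[i];

        if (!exp->exp_in_use) {
            exp->exp_in_use = 1;
            exp->exp_handle.h_cookie = sketch_next_cookie++;
            return exp->exp_handle.h_cookie;
        }
    }
    return 0;                       /* table full: no export created */
}

/* Later requests carry the cookie; map it back to the export. */
struct sketch_export *sketch_lookup(uint64_t cookie)
{
    if (cookie == 0)
        return NULL;
    for (size_t i = 0; i < SKETCH_MAX_EXPORTS; i++) {
        struct sketch_export *exp = &sketch_exports[i];

        if (exp->exp_in_use && exp->exp_handle.h_cookie == cookie)
            return exp;
    }
    return NULL;                    /* unknown or stale cookie */
}

/* Disconnect: release the export so its cookie no longer resolves. */
void sketch_disconnect(uint64_t cookie)
{
    struct sketch_export *exp = sketch_lookup(cookie);

    assert(exp != NULL);
    exp->exp_in_use = 0;
    exp->exp_handle.h_cookie = 0;
}
```

In this sketch a client would keep the value returned by
'sketch_connect()' and present it with each later request. A cookie
that no longer resolves in 'sketch_lookup()' corresponds to an export
that has already been discarded.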
//////////////////////////////////////////////////////////////////////
////vvvv
The 'exp_refcount' gets incremented whenever some aspect of the export
is "in use". The arrival of an otherwise unprocessed message for this
target will increment the refcount. A reference by an LDLM lock that
gets taken will increment the refcount. Callback invocations and
replay also lead to incrementing the 'exp_refcount'. The next four
fields - 'exp_rpc_count', 'exp_cb_count', 'exp_replay_count', and
'exp_locks_count' - all subcategorize the 'exp_refcount'. The
reference counter keeps the export alive while there are any users of
that export. The reference counter is also used for debug
purposes. Similarly, the 'exp_locks_list' and 'exp_locks_list_guard'
are further debug info that list the actual locks accounted for in
'exp_locks_count'.
////^^^^
//////////////////////////////////////////////////////////////////////

The 'exp_client_uuid' holds the UUID of the client connected to this
export. This UUID is randomly generated by the client and the same
UUID is used by the client for connecting to all servers, so that the
servers may identify the client amongst themselves if necessary.

//////////////////////////////////////////////////////////////////////
////vvvv
The server maintains all the exports for a given target on a circular
list. Each export's place on that list is maintained in the
'exp_obd_chain'. A common activity is to look up the export based on
the UUID or the nid of the client, and the 'exp_uuid_hash' and
'exp_nid_hash' fields maintain this export's place in hashes
constructed for that purpose.

Exports are also maintained on a list sorted by the last time the
corresponding client was heard from. The 'exp_obd_chain_timed' field
maintains the export's place on that list. When a message arrives from
the client the time is "now", so the export gets put at the end of the
list. Since the list is circular, the next export is then the oldest.
If the client has not been heard from within its timeout interval,
that export is marked for later eviction.

The 'exp_obd' points to the 'obd_device' structure for the device that
is the target of this export.

In the event of an LDLM call-back the export needs to have the ability
to initiate messages back to the client. The 'exp_imp_reverse'
provides a "reverse" import that manages this capability.

The '/proc' stats for the export (and the target) get updated via the
'exp_nid_stats'.

The 'exp_connection' points to the connection information for this
export. This is the information about the actual networking pathway(s)
that get used for communication.

The 'exp_conn_cnt' notes the connection count value from the client at
the time of the connection. In the event that more than one connection
request is issued before the connection is established, the
'exp_conn_cnt' will list the highest value. If a previous connection
attempt (with a lower value) arrives later it may be safely
discarded. Every request lists its connection count, so non-connection
requests with lower connection count values can also be
discarded. Note that this does not count how many times the client has
connected to the target. If a client is evicted the export is deleted
once it has been cleaned up and its 'exp_refcount' reduced to zero. A
new connection from the client will get a new export.

The 'exp_lock_hash' provides access to the locks granted to the
corresponding client for this target. If a lock cannot be granted it
is discarded. A file system lock ("flock") is also implemented through
the LDLM lock system, but not all LDLM locks are flocks. The ones that
are flocks are gathered in a hash 'exp_flock_hash'. This supports
deadlock detection.

For those requests that initiate file system modifying transactions,
the request and its attendant locks need to be preserved until either
a) the client acknowledges receiving the reply, or b) the transaction
has been committed locally.
This ensures a request can be replayed in the event of a failure. The
LDLM lock is kept until one of these events occurs to prevent any
other modifications of the same object. The reply is kept on the
'exp_outstanding_replies' list until the LNet layer notifies the
server that the reply has been acknowledged. A reply is kept on the
'exp_uncommitted_replies' list until the transaction (if any) has been
committed.

The 'exp_last_committed' value keeps the transaction number of the
last committed transaction. Every reply to a client includes this
value as a means of early-as-possible notification of transactions that
have been committed.

The 'exp_last_request_time' is self-explanatory.

During replay a request that is waiting for reply is maintained on the
list 'exp_req_replay_queue'.

The 'exp_lock' spin-lock is used for access control to the export's
flags, as well as the 'exp_outstanding_replies' list and the reverse
import, if any.
////^^^^
//////////////////////////////////////////////////////////////////////

The 'exp_connect_data' refers to an 'obd_connect_data' structure for
the connection established between this target and the client this
export refers to. The 'exp_connect_data' describes the mutually
supported features that were negotiated between the client and server
at connect time. See also the corresponding entry in the import and in
the connect messages passed between the hosts.

//////////////////////////////////////////////////////////////////////
////vvvv
The 'exp_flags' field encodes three directives as follows:

----
enum obd_option {
    OBD_OPT_FORCE = 0x0001,
    OBD_OPT_FAILOVER = 0x0002,
    OBD_OPT_ABORT_RECOV = 0x0004,
};
----

fixme: Are these set for some exports and a condition of their
existence? Or do they reflect a transient state the export is passing
through?

The 'exp_failed' flag gets set whenever the target has failed for any
reason or the export is otherwise due to be cleaned up. Once set it
will not be unset in this export.
Any subsequent connection between the client and the target would be
governed by a new export.

After a failure, export data is retrieved from disk and the exports
are recreated. Exports created in this way will have their
'exp_in_recovery' flag set. Once any outstanding requests and locks
have been recovered for the client, the export is recovered and
'exp_in_recovery' can be cleared. When all the client exports for a
given target have been recovered then the target is considered
recovered, and when all targets have been recovered the server is
considered recovered.

A *_DISCONNECT message from the client will set the 'exp_disconnected'
flag, as will any sort of failure of the target. Once set the export
will be cleaned up and deleted.

When a *_CONNECT message arrives the 'exp_connecting' flag is set. If
for some reason a second *_CONNECT request arrives from the client it
can be discarded when this flag is set.

The 'exp_delayed' flag is no longer used. In older code it indicated
that recovery had not completed in a timely fashion, but that a tardy
recovery would still be possible, since there were no dependencies on
the export.

The 'exp_vbr_failed' flag indicates a failure during the recovery
process. See <> for a more detailed discussion of recovery and
transaction replay. For a file system modifying request, the server
composes its reply including the 'pb_pre_versions' entries in
'ptlrpc_body', which indicate the most recent updates to the
object. The client updates the request with the 'pb_transno' and
'pb_pre_versions' from the reply, and keeps that request until the
target signals that the transaction has been committed to disk. If the
client times out without that confirmation then it will 'replay' the
request, which now includes the 'pb_pre_versions' information. During
a replay the target checks that the object has the same version as the
'pb_pre_versions' in the replayed request.
If this check fails then the object cannot be restored to the state it
was in before the failure. Usually that happens if the recovery
process fails for the connection between some other client and this
target, so part of the change needed for this client was not
restored. At that point the 'exp_vbr_failed' flag is set to indicate
that version based recovery failed. This will lead to the client being
evicted and this export being cleaned up and deleted.

At the start of recovery both the 'exp_req_replay_needed' and
'exp_lock_replay_needed' flags are set. As request replay is completed
the 'exp_req_replay_needed' flag is cleared. As lock replay is
completed the 'exp_lock_replay_needed' flag is cleared. Once both are
cleared the 'exp_in_recovery' flag can be cleared.

The 'exp_need_sync' flag supports an optimization. At mount time it is
likely that every client (potentially thousands) will create an export
and that export will need to be saved to disk synchronously. This can
lead to an unusually high and poorly performing interaction with the
disk. When the export is created the 'exp_need_sync' flag is set and
the actual writing to disk is delayed. As transactions arrive from
clients (in a much less coordinated fashion) the 'exp_need_sync' flag
indicates a need to have the export data on disk before proceeding
with a new transaction, so the next time the export data is updated
the transaction is done synchronously to commit all changes to
disk. At that point the flag is cleared (except see below).

In DNE (phase I) the export for an MDT managing the connection from
another MDT will want to always keep the 'exp_need_sync' flag set. For
that special case such an export sets the 'exp_keep_sync', which then
prevents the 'exp_need_sync' flag from ever being cleared. This will
no longer be needed in DNE Phase II.

The 'exp_flvr_changed' and 'exp_flvr_adapt' flags along with the
'exp_sp_peer', 'exp_flvr', 'exp_flvr_old', and 'exp_flvr_expire'
fields are all used to manage the security settings for the
connection. Security is discussed in the <> section. (fixme: or will
be.)

The 'exp_libclient' flag indicates that the export is for a client
based on "liblustre". This allows for simplified handling on the
server. (fixme: how is processing simplified? It sounds like I may
need a whole special section on liblustre.)

The 'exp_need_mne_swab' flag indicates the presence of an old bug that
affected one special case of failed swabbing. It is not part of
current processing.

As RPCs arrive they are first subjected to triage. Each request is
placed on the 'exp_hp_rpcs' list and examined to see if it is high
priority (PING, truncate, bulk I/O). If it is not high priority then
it is moved to the 'exp_reg_rpcs' list. The 'exp_rpc_lock' protects
both lists from concurrent access.

All arriving LDLM requests get put on the 'exp_bl_list' and access to
that list is controlled via the 'exp_bl_list_lock'.

The union provides for target specific data. The 'eu_target_data' is
for a common core of fields for a generic target. The others are
specific to particular target types: 'eu_mdt_data' for MDTs,
'eu_filter_data' for OSTs, 'eu_ec_data' for an "echo client" (fixme:
describe what an echo client is somewhere), and 'eu_mgs_data' for an
MGS.

The 'exp_bl_lock_at' field supports adaptive timeouts which will be
discussed separately. (fixme: so discuss it somewhere.)
////^^^^
//////////////////////////////////////////////////////////////////////
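
As a rough illustration of the "mutually supported features" recorded
in 'exp_connect_data', the sketch below models negotiation as the
intersection of two feature bitmasks. This is an assumption made for
the example. The 'sketch_*' names and the 'SKETCH_FEAT_*' bits are
hypothetical and do not correspond to real Lustre connect flags or to
the actual layout of 'obd_connect_data'.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define SKETCH_FEAT_A 0x1ULL
#define SKETCH_FEAT_B 0x2ULL
#define SKETCH_FEAT_C 0x4ULL

/* Hypothetical stand-in for the negotiated connect data. */
struct sketch_connect_data {
    uint64_t features;
};

/*
 * The client asks for a set of features and the server keeps only
 * those it also supports. The result is what both sides would record
 * for the lifetime of the connection.
 */
struct sketch_connect_data
sketch_negotiate(struct sketch_connect_data client_request,
                 uint64_t server_supported)
{
    struct sketch_connect_data agreed;

    agreed.features = client_request.features & server_supported;
    return agreed;
}

/* Returns nonzero only if every bit in 'feature' was agreed upon. */
int sketch_has_feature(const struct sketch_connect_data *ocd,
                       uint64_t feature)
{
    return (ocd->features & feature) == feature;
}
```

With a client requesting 'SKETCH_FEAT_A' and 'SKETCH_FEAT_C' and a
server supporting 'SKETCH_FEAT_A' and 'SKETCH_FEAT_B', only
'SKETCH_FEAT_A' survives the intersection, so later request handling
on either side would consult the agreed set rather than its own
preferences.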