An 'obd_export' structure for a given target is created on a server
for each client that connects to that target. The exports for all the
clients for a given target are managed together. The export represents
the connection state between the client and target as well as the
current state of any ongoing activity. Thus each pending request will
have a reference to the export. The export is discarded if the
connection goes away, but only after all the references to it have
been cleaned up. Some state information for each export is also
maintained on disk. In the event of a server failure and recovery,
the server can read the export data from disk to allow the client
to reconnect and participate in recovery. A client without any export
data on disk is not allowed to participate in recovery.

[source,c]
----
struct obd_export {
        struct portals_handle    exp_handle;
        struct obd_uuid          exp_client_uuid;
        struct obd_connect_data  exp_connect_data;
};
----

//////////////////////////////////////////////////////////////////////
This is the full obd_export.

[source,c]
----
struct obd_export {
        struct portals_handle     exp_handle;
        atomic_t                  exp_refcount;
        atomic_t                  exp_rpc_count;
        atomic_t                  exp_cb_count;
        atomic_t                  exp_replay_count;
        atomic_t                  exp_locks_count;
        struct obd_uuid           exp_client_uuid;
        cfs_list_t                exp_obd_chain;
        cfs_hlist_node_t          exp_uuid_hash;
        cfs_hlist_node_t          exp_nid_hash;
        cfs_list_t                exp_obd_chain_timed;
        struct obd_device        *exp_obd;
        struct obd_import        *exp_imp_reverse;
        struct nid_stat          *exp_nid_stats;
        struct ptlrpc_connection *exp_connection;
        __u32                     exp_conn_cnt;
        cfs_hash_t               *exp_lock_hash;
        cfs_hash_t               *exp_flock_hash;
        cfs_list_t                exp_outstanding_replies;
        cfs_list_t                exp_uncommitted_replies;
        spinlock_t                exp_uncommitted_replies_lock;
        __u64                     exp_last_committed;
        cfs_time_t                exp_last_request_time;
        cfs_list_t                exp_req_replay_queue;
        spinlock_t                exp_lock;
        struct obd_connect_data   exp_connect_data;
        enum obd_option           exp_flags;
        unsigned long             exp_failed:1,
                                  exp_in_recovery:1,
                                  exp_disconnected:1,
                                  exp_connecting:1,
                                  exp_delayed:1,
                                  exp_vbr_failed:1,
                                  exp_req_replay_needed:1,
                                  exp_lock_replay_needed:1,
                                  exp_need_sync:1,
                                  exp_flvr_changed:1,
                                  exp_flvr_adapt:1,
                                  exp_libclient:1,
                                  exp_need_mne_swab:1;
        enum lustre_sec_part      exp_sp_peer;
        struct sptlrpc_flavor     exp_flvr;
        struct sptlrpc_flavor     exp_flvr_old[2];
        cfs_time_t                exp_flvr_expire[2];
        spinlock_t                exp_rpc_lock;
        cfs_list_t                exp_hp_rpcs;
        cfs_list_t                exp_reg_rpcs;
        cfs_list_t                exp_bl_list;
        spinlock_t                exp_bl_list_lock;
        union {
                struct tg_export_data     eu_target_data;
                struct mdt_export_data    eu_mdt_data;
                struct filter_export_data eu_filter_data;
                struct ec_export_data     eu_ec_data;
                struct mgs_export_data    eu_mgs_data;
        } u;
        struct nodemap           *exp_nodemap;
};
----
//////////////////////////////////////////////////////////////////////

The 'exp_handle' holds the cookie that the server generates at
*_CONNECT time to uniquely identify this connection from the client.
This cookie is also sent back to the client in the *_CONNECT reply and
is then stored in the client's import.

//////////////////////////////////////////////////////////////////////
The 'exp_refcount' gets incremented whenever some aspect of the export
is "in use". The arrival of an otherwise unprocessed message for this
target will increment the refcount. A reference by an LDLM lock that
gets taken will increment the refcount. Callback invocations and
replay also lead to incrementing the 'exp_refcount'. The next four
fields ('exp_rpc_count', 'exp_cb_count', 'exp_replay_count', and
'exp_locks_count') all sub-categorize the 'exp_refcount'. The
reference counter keeps the export alive while there are any users of
that export. The reference counter is also used for debug
purposes. Similarly, the 'exp_locks_list' and 'exp_locks_list_guard'
are further debug info that list the actual locks accounted for in
'exp_locks_count'.
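
The reference counting described above can be sketched as follows. This is only an illustration under stated assumptions: 'struct export_sketch' and the four helper functions are hypothetical names that do not appear in the Lustre sources, plain integers stand in for the 'atomic_t' counters, and only the RPC sub-category is shown.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for 'struct obd_export'. Plain ints
 * replace the atomic_t counters, so this sketch is single-threaded only. */
struct export_sketch {
        int  exp_refcount;   /* all users of the export */
        int  exp_rpc_count;  /* the subset of users that are RPCs */
        bool destroyed;      /* set once the last reference is dropped */
};

/* Take a generic reference on the export. */
void export_get(struct export_sketch *exp)
{
        exp->exp_refcount++;
}

/* Drop a generic reference; the export goes away only at zero. */
void export_put(struct export_sketch *exp)
{
        if (--exp->exp_refcount == 0)
                exp->destroyed = true;
}

/* An arriving RPC is counted in its sub-category and in the total. */
void export_rpc_get(struct export_sketch *exp)
{
        exp->exp_rpc_count++;
        export_get(exp);
}

/* A finished RPC releases both counts. */
void export_rpc_put(struct export_sketch *exp)
{
        exp->exp_rpc_count--;
        export_put(exp);
}
```

In this sketch the connection itself is assumed to hold one reference, so the export survives the completion of individual RPCs and is only marked destroyed when that last reference is dropped.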
//////////////////////////////////////////////////////////////////////

include::struct_obd_uuid.txt[]

The 'exp_client_uuid' holds the UUID of the client connected to this
export. This UUID is randomly generated by the client, and the same
UUID is used by the client for connecting to all servers. The client's
UUID appears in the *_CONNECT message (see <<ost-connect-rpc>>,
<<mds-connect-rpc>>, and <<mgs-connect-rpc>>).

//////////////////////////////////////////////////////////////////////
The server maintains all the exports for a given target on a circular
list. Each export's place on that list is maintained in the
'exp_obd_chain'. A common activity is to look up the export based on
the UUID or the NID of the client, and the 'exp_uuid_hash' and
'exp_nid_hash' fields maintain this export's place in hashes
constructed for that purpose.

Exports are also maintained on a list sorted by the last time the
corresponding client was heard from. The 'exp_obd_chain_timed' field
maintains the export's place on that list. When a message arrives from
the client the time is "now", so the export gets put at the end of the
list. Since the list is circular, the next export is then the oldest.
If that export has not been heard from within its timeout interval it
is marked for later eviction.
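
A minimal sketch of the time-ordered list follows. It assumes a hand-rolled circular list with a sentinel head rather than the 'cfs_list_t' primitives. The names 'struct timed_export', 'timed_list_init', 'export_touch', and 'timed_list_check_oldest' are hypothetical, as is the 'evict' marker.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of an export used by the timed list. */
struct timed_export {
        struct timed_export *prev, *next; /* plays the role of 'exp_obd_chain_timed' */
        long last_request_time;           /* plays the role of 'exp_last_request_time' */
        bool evict;                       /* marked for later eviction */
};

/* The head is a sentinel node: an empty list points at itself. */
void timed_list_init(struct timed_export *head)
{
        head->prev = head->next = head;
}

/* A message arrived from the client: stamp the export with "now" and
 * move it to the tail, which keeps the list sorted by time. */
void export_touch(struct timed_export *head, struct timed_export *exp, long now)
{
        if (exp->next != NULL) {          /* already linked: unlink first */
                exp->prev->next = exp->next;
                exp->next->prev = exp->prev;
        }
        exp->last_request_time = now;
        exp->prev = head->prev;
        exp->next = head;
        head->prev->next = exp;
        head->prev = exp;
}

/* The entry after the tail (that is, the first one) is the oldest.
 * Mark it for eviction if it has been silent longer than the timeout. */
struct timed_export *timed_list_check_oldest(struct timed_export *head,
                                             long now, long timeout)
{
        struct timed_export *oldest = head->next;

        if (oldest == head)               /* empty list */
                return NULL;
        if (now - oldest->last_request_time > timeout) {
                oldest->evict = true;
                return oldest;
        }
        return NULL;
}
```

Because every update appends at the tail, only the first entry ever needs to be examined to find a client that has gone quiet. An export must be zero-initialized before its first 'export_touch' so that the "already linked" test is meaningful.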

The 'exp_obd' points to the 'obd_device' structure for the device that
is the target of this export.

In the event of an LDLM callback the export needs to have the ability to
initiate messages back to the client. The 'exp_imp_reverse' provides a
"reverse" import that manages this capability.

The '/proc' stats for the export (and the target) get updated via the
'exp_nid_stats'.

The 'exp_connection' points to the connection information for this
export. This is the information about the actual networking pathway(s)
that get used for communication.

The 'exp_conn_cnt' notes the connection count value from the client at
the time of the connection. In the event that more than one connection
request is issued before the connection is established, the
'exp_conn_cnt' holds the highest value. If a previous connection
attempt (with a lower value) arrives later it may be safely
discarded. Every request lists its connection count, so non-connection
requests with lower connection count values can also be discarded.
Note that this does not count how many times the client has connected
to the target. If a client is evicted the export is deleted once it
has been cleaned up and its 'exp_refcount' reduced to zero. A new
connection from the client will get a new export.
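
The connection count comparison can be illustrated with a short sketch. The structure 'struct conn_sketch' and the two functions are hypothetical names chosen for this example. Only the comparison itself is taken from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the connection count of an export. */
struct conn_sketch {
        uint32_t exp_conn_cnt;
};

/* Handle a connect attempt. A late-arriving attempt with a lower count
 * is stale and reported as such; otherwise the highest count is kept. */
bool connect_is_current(struct conn_sketch *exp, uint32_t req_conn_cnt)
{
        if (req_conn_cnt < exp->exp_conn_cnt)
                return false;
        exp->exp_conn_cnt = req_conn_cnt;
        return true;
}

/* Any other request carries its connection count too, so a request
 * that belongs to an older connection attempt can be discarded. */
bool request_is_current(const struct conn_sketch *exp, uint32_t req_conn_cnt)
{
        return req_conn_cnt >= exp->exp_conn_cnt;
}
```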

The 'exp_lock_hash' provides access to the locks granted to the
corresponding client for this target. If a lock cannot be granted it
is discarded. A file system lock ("flock") is also implemented through
the LDLM lock system, but not all LDLM locks are flocks. The ones that
are flocks are gathered in a hash 'exp_flock_hash'. This supports
deadlock detection.

For those requests that initiate file system modifying transactions
the request and its attendant locks need to be preserved until either
a) the client acknowledges receiving the reply, or b) the transaction
has been committed locally. This ensures a request can be replayed in
the event of a failure. The LDLM lock is kept until one of these
events occurs to prevent any other modifications of the same object.
The reply is kept on the 'exp_outstanding_replies' list until the LNet
layer notifies the server that the reply has been acknowledged. A reply
is kept on the 'exp_uncommitted_replies' list until the transaction
(if any) has been committed.
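
The "acknowledged or committed" rule can be written down as a small predicate. This is a sketch under assumptions: 'struct reply_sketch' and 'reply_state_releasable' are hypothetical names, a transaction number of zero is assumed to mean "no transaction", and the 'last_committed' argument stands for the highest transaction number known to be on disk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one saved reply and the state it pins. */
struct reply_sketch {
        uint64_t transno; /* transaction of the request, 0 if none (assumed) */
        bool     acked;   /* LNet reported that the client got the reply */
};

/* The saved request state and its locks may be released as soon as
 * either condition holds: the reply was acknowledged, or the
 * transaction is no newer than the last committed one. */
bool reply_state_releasable(const struct reply_sketch *reply,
                            uint64_t last_committed)
{
        return reply->acked || reply->transno <= last_committed;
}
```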

The 'exp_last_committed' value keeps the transaction number of the
last committed transaction. Every reply to a client includes this
value as a means of early-as-possible notification of transactions that
have been committed.

The 'exp_last_request_time' records the time of the most recent request
received from the client.

During replay a request that is waiting to be replayed is maintained on
the list 'exp_req_replay_queue'.

The 'exp_lock' spin-lock is used for access control to the export's
flags, as well as the 'exp_outstanding_replies' list and the reverse
import ('exp_imp_reverse').
//////////////////////////////////////////////////////////////////////

The 'exp_connect_data' refers to an 'obd_connect_data' structure for
the connection established between this target and the client this
export refers to. The 'exp_connect_data' describes the mutually
supported features that were negotiated between the client and server
at connect time. See also the corresponding entry in the import
(<<struct-obd-import>>) and the connect messages
(<<ost-connect-rpc>>, <<mds-connect-rpc>>, and <<mgs-connect-rpc>>)
passed between the hosts.

//////////////////////////////////////////////////////////////////////
The 'exp_flags' field encodes three directives as follows:

[source,c]
----
enum obd_option {
        OBD_OPT_FORCE       = 0x0001,
        OBD_OPT_FAILOVER    = 0x0002,
        OBD_OPT_ABORT_RECOV = 0x0004,
};
----

fixme: Are these set for some exports and a condition of their
existence? Or do they reflect a transient state the export is passing
through?

The 'exp_failed' flag gets set whenever the target has failed for any
reason or the export is otherwise due to be cleaned up. Once set it
will not be unset in this export. Any subsequent connection between
the client and the target would be governed by a new export.

After a failure, export data is retrieved from disk and the exports
are recreated. Exports created in this way will have their
'exp_in_recovery' flag set. Once any outstanding requests and locks
have been recovered for the client, the export is recovered and
'exp_in_recovery' can be cleared. When all the client exports for a
given target have been recovered then the target is considered
recovered, and when all targets have been recovered the server is
considered recovered.

A *_DISCONNECT message from the client will set the 'exp_disconnected'
flag, as will any sort of failure of the target. Once set the export
will be cleaned up and deleted.

When a *_CONNECT message arrives the 'exp_connecting' flag is set. If
for some reason a second *_CONNECT request arrives from the client it can
be discarded when this flag is set.

The 'exp_delayed' flag is no longer used. In older code it indicated
that recovery had not completed in a timely fashion, but that a tardy
recovery would still be possible, since there were no dependencies on
that client.

The 'exp_vbr_failed' flag indicates a failure during the recovery
process. See <<recovery>> for a more detailed discussion of recovery
and transaction replay. For a file system modifying request, the
server composes its reply including the 'pb_pre_versions' entries in
'ptlrpc_body', which indicate the most recent updates to the
object. The client updates the request with the 'pb_transno' and
'pb_pre_versions' from the reply, and keeps that request until the
target signals that the transaction has been committed to disk. If the
client times out without that confirmation then it will 'replay' the
request, which now includes the 'pb_pre_versions' information. During
a replay the target checks that the object has the same version as
the 'pb_pre_versions' in the replayed request. If this check fails then
the object cannot be restored to the state it was in before the
failure. Usually that happens if the recovery process fails for the
connection between some other client and this target, so part of the
change needed for this client was not restored. At that point the
'exp_vbr_failed' flag is set to indicate that version based recovery
failed. This will lead to the client being evicted and this export
being cleaned up and deleted.
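
The version check performed during replay can be sketched as below. This is a deliberate simplification: a single version number stands in for the 'pb_pre_versions' entries, and 'struct vbr_sketch' and 'vbr_check_replay' are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in carrying only the VBR failure flag of an export. */
struct vbr_sketch {
        unsigned int exp_vbr_failed:1;
};

/* Compare the version the object has now with the pre-operation version
 * recorded in the replayed request. A mismatch means an intervening
 * change was lost, so the replay cannot be applied safely. */
bool vbr_check_replay(struct vbr_sketch *exp,
                      uint64_t object_version, uint64_t pre_version)
{
        if (object_version != pre_version) {
                exp->exp_vbr_failed = 1; /* the client will be evicted */
                return false;
        }
        return true;
}
```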

At the start of recovery both the 'exp_req_replay_needed' and
'exp_lock_replay_needed' flags are set. As request replay is completed
the 'exp_req_replay_needed' flag is cleared. As lock replay is
completed the 'exp_lock_replay_needed' flag is cleared. Once both are
cleared the 'exp_in_recovery' flag can be cleared.
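
The interplay of the three recovery flags can be sketched as a tiny state machine. The flag names come from the structure above. 'struct recovery_sketch' and the three functions are hypothetical, and locking is omitted.

```c
/* Hypothetical stand-in carrying only the recovery flags of an export. */
struct recovery_sketch {
        unsigned int exp_in_recovery:1,
                     exp_req_replay_needed:1,
                     exp_lock_replay_needed:1;
};

/* Recovery begins: both kinds of replay are still outstanding. */
void recovery_start(struct recovery_sketch *exp)
{
        exp->exp_in_recovery = 1;
        exp->exp_req_replay_needed = 1;
        exp->exp_lock_replay_needed = 1;
}

/* The export leaves recovery only when neither replay is pending. */
static void recovery_maybe_finish(struct recovery_sketch *exp)
{
        if (!exp->exp_req_replay_needed && !exp->exp_lock_replay_needed)
                exp->exp_in_recovery = 0;
}

/* Request replay has completed for this export. */
void recovery_req_replay_done(struct recovery_sketch *exp)
{
        exp->exp_req_replay_needed = 0;
        recovery_maybe_finish(exp);
}

/* Lock replay has completed for this export. */
void recovery_lock_replay_done(struct recovery_sketch *exp)
{
        exp->exp_lock_replay_needed = 0;
        recovery_maybe_finish(exp);
}
```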

The 'exp_need_sync' supports an optimization. At mount time it is
likely that every client (potentially thousands) will create an export
and that export will need to be saved to disk synchronously. This can
lead to an unusually high and poorly performing interaction with the
disk. When the export is created the 'exp_need_sync' flag is set and
the actual writing to disk is delayed. As transactions arrive from
clients (in a much less coordinated fashion) the 'exp_need_sync' flag
indicates a need to have the export data on disk before proceeding
with a new transaction, so the next transaction that updates the
export data is done synchronously to commit all changes to disk. At
that point the flag is cleared (except see below).

In DNE (Phase I) the export for an MDT managing the connection from
another MDT will want to always keep the 'exp_need_sync' flag set. For
that special case such an export sets the 'exp_keep_sync', which then
prevents the 'exp_need_sync' flag from ever being cleared. This will
no longer be needed in DNE Phase II.
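
A sketch of how the two flags might interact on each transaction follows. The names 'struct sync_sketch' and 'transaction_needs_sync' are hypothetical, and the actual commit is reduced to a counter so that the behavior can be observed.

```c
#include <stdbool.h>

/* Hypothetical stand-in carrying only the sync-related export state. */
struct sync_sketch {
        unsigned int exp_need_sync:1,
                     exp_keep_sync:1;
        int sync_commits; /* number of synchronous commits performed */
};

/* Called for each new transaction. Returns true when the transaction
 * has to be committed synchronously. The flag is cleared afterwards
 * unless 'exp_keep_sync' pins it (the MDT-to-MDT case). */
bool transaction_needs_sync(struct sync_sketch *exp)
{
        if (!exp->exp_need_sync)
                return false;
        exp->sync_commits++;
        if (!exp->exp_keep_sync)
                exp->exp_need_sync = 0;
        return true;
}
```

With an ordinary client export only the first transaction pays for the synchronous commit. An export with 'exp_keep_sync' set pays for it every time.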

The 'exp_flvr_changed' and 'exp_flvr_adapt' flags along with
'exp_sp_peer', 'exp_flvr', 'exp_flvr_old', and 'exp_flvr_expire'
fields are all used to manage the security settings for the
connection. Security is discussed in the <<security>> section. (fixme:

The 'exp_libclient' flag indicates that the export is for a client
based on "liblustre". This allows for simplified handling on the
server. (fixme: how is processing simplified? It sounds like I may
need a whole special section on liblustre.)

The 'exp_need_mne_swab' flag indicates the presence of an old bug that
affected one special case of failed swabbing. It is not part of

As RPCs arrive they are first subjected to triage. Each request is
placed on the 'exp_hp_rpcs' list and examined to see if it is high
priority (PING, truncate, bulk I/O). If it is not high priority then
it is moved to the 'exp_reg_rpcs' list. The 'exp_rpc_lock' protects
both lists from concurrent access.
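
The triage step can be sketched with two singly linked lists. Everything except the two list names is an assumption made for illustration: 'enum rpc_kind' and its members, 'struct rpc_sketch', 'struct triage_sketch', and both functions are hypothetical, and the 'exp_rpc_lock' that would guard the lists is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request classes. The first three are the high priority
 * ones named in the text, RPC_OTHER stands for everything else. */
enum rpc_kind { RPC_PING, RPC_TRUNCATE, RPC_BULK_IO, RPC_OTHER };

struct rpc_sketch {
        enum rpc_kind      kind;
        struct rpc_sketch *next;
};

/* Hypothetical stand-in carrying only the two request lists. */
struct triage_sketch {
        struct rpc_sketch *exp_hp_rpcs;  /* high priority requests */
        struct rpc_sketch *exp_reg_rpcs; /* regular requests */
};

bool rpc_is_high_priority(enum rpc_kind kind)
{
        return kind == RPC_PING || kind == RPC_TRUNCATE || kind == RPC_BULK_IO;
}

/* Place the request on the high priority list first, then move it to
 * the regular list if examination shows it is not high priority. */
void rpc_triage(struct triage_sketch *exp, struct rpc_sketch *req)
{
        req->next = exp->exp_hp_rpcs;
        exp->exp_hp_rpcs = req;

        if (!rpc_is_high_priority(req->kind)) {
                exp->exp_hp_rpcs = req->next; /* unlink from the hp list */
                req->next = exp->exp_reg_rpcs;
                exp->exp_reg_rpcs = req;
        }
}
```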

All arriving LDLM requests get put on the 'exp_bl_list' and access to
that list is controlled via the 'exp_bl_list_lock'.

The union provides for target specific data. The 'eu_target_data' is
for a common core of fields for a generic target. The others are
specific to particular target types: 'eu_mdt_data' for MDTs,
'eu_filter_data' for OSTs, 'eu_ec_data' for an "echo client" (fixme:
describe what an echo client is somewhere), and 'eu_mgs_data' for
an MGS.

The 'exp_bl_lock_at' field supports adaptive timeouts which will be
discussed separately. (fixme: so discuss it somewhere.)
//////////////////////////////////////////////////////////////////////