The 'obd_import' structure holds the connection state between each
client and each target it is connected to.
enum lustre_imp_state imp_state;
struct lustre_handle imp_remote_handle;
struct obd_connect_data imp_connect_data;
19 //////////////////////////////////////////////////////////////////////
This is the rest of the info associated with obd_import:

#define IMP_STATE_HIST_LEN 16
struct import_state_hist {
enum lustre_imp_state ish_state;
struct portals_handle imp_handle;
atomic_t imp_refcount;
struct lustre_handle imp_dlm_handle;
struct ptlrpc_connection *imp_connection;
struct ptlrpc_client *imp_client;
cfs_list_t imp_pinger_chain;
cfs_list_t imp_zombie_chain;
cfs_list_t imp_replay_list;
cfs_list_t imp_sending_list;
cfs_list_t imp_delayed_list;
cfs_list_t imp_committed_list;
cfs_list_t *imp_replay_cursor;
struct obd_device *imp_obd;
struct ptlrpc_sec *imp_sec;
struct mutex imp_sec_mutex;
cfs_time_t imp_sec_expire;
wait_queue_head_t imp_recovery_waitq;
atomic_t imp_inflight;
atomic_t imp_unregistering;
atomic_t imp_replay_inflight;
atomic_t imp_inval_count;
atomic_t imp_timeouts;
enum lustre_imp_state imp_state;
struct import_state_hist imp_state_hist[IMP_STATE_HIST_LEN];
int imp_state_hist_idx;
int imp_last_generation_checked;
__u64 imp_last_replay_transno;
__u64 imp_peer_committed_transno;
__u64 imp_last_transno_checked;
struct lustre_handle imp_remote_handle;
cfs_time_t imp_next_ping;
__u64 imp_last_success_conn;
cfs_list_t imp_conn_list;
struct obd_import_conn *imp_conn_current;
imp_delayed_recovery:1,
imp_force_next_verify:1,
imp_no_pinger_recover:1,
imp_force_reconnect:1,
struct obd_connect_data imp_connect_data;
__u64 imp_connect_flags_orig;
int imp_connect_error;
__u32 imp_msghdr_flags; /* adjusted based on server capability */
struct ptlrpc_request_pool *imp_rq_pool; /* emergency request pool */
struct imp_at imp_at; /* adaptive timeout data */
time_t imp_last_reply_time; /* for health check */
94 //////////////////////////////////////////////////////////////////////
96 //////////////////////////////////////////////////////////////////////
The 'imp_handle' value is the unique id for the import, and is used as
a hash key to look it up. It does not appear in any of the Lustre
protocol messages; it is for internal reference only.

The 'imp_refcount' is also for internal use. The value is incremented
with each RPC created, and decremented as each request is freed. When
the reference count reaches zero the import can be freed, as when the
target is being disconnected.
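The counting rule for 'imp_refcount' can be sketched in a few lines of C. This is only an illustration under stated assumptions: 'import_sketch', 'import_get()' and 'import_put()' are hypothetical names, not Lustre functions, and C11 atomics stand in for the kernel's 'atomic_t'.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the import; only the reference count is modeled. */
struct import_sketch {
    atomic_int refcount;
};

/* Take a reference, for example when an RPC is created against the import. */
static void import_get(struct import_sketch *imp)
{
    atomic_fetch_add(&imp->refcount, 1);
}

/*
 * Drop a reference, for example when a request is freed.
 * Returns true when the count has reached zero, meaning the import
 * is no longer in use and may be freed.
 */
static bool import_put(struct import_sketch *imp)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    return atomic_fetch_sub(&imp->refcount, 1) == 1;
}
```

Returning the "reached zero" result from the same atomic operation that decrements avoids a separate read of the counter that another thread could race with.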
The 'imp_dlm_handle' is a reference to the LDLM export for this

There can be multiple paths through the network to a given
target, in which case there would be multiple 'obd_import_conn' items
on the 'imp_conn_list'. Each 'obd_import_conn' includes a
'ptlrpc_connection', so 'imp_connection' points to the one that is

The 'imp_client' identifies the (local) portals for sending and
receiving messages as well as the client's name. The information is
specific to either an MDC or an OSC.

The 'imp_pinger_chain' places the import on a linked list of imports
that need periodic pings.

The 'imp_zombie_chain' places the import on a list ready for being
freed. Unused imports ('imp_refcount' is zero) are deleted
asynchronously by a garbage collecting process.

In order to support recovery the client must keep requests that are in
the process of being handled by the target. The target replies to a
request as soon as it has made its local update in memory. When the
client receives that reply the request is put on the
'imp_replay_list'. In the event of a failure (target crash, lost
message) this list is then replayed for the target during the recovery
process. A request that has been sent but has not yet received a reply
is placed on the 'imp_sending_list'. In the event of a failure
those requests will simply be replayed after any recovery has been
completed. Finally, there may be requests that the client is delaying
before it sends them. This can happen if the client is in a degraded
mode, as when it is in recovery after a failure. These requests are
put on the 'imp_delayed_list' and are not processed until recovery is
complete and the 'imp_sending_list' has been replayed.
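The relationship between a request's progress and the three lists can be summarized in a small sketch. It is a simplification under stated assumptions: the 'req_phase_sketch' enum and both helper functions are hypothetical, and the real client moves requests between linked lists rather than computing names.

```c
/* Hypothetical phases of a request, following the description above. */
enum req_phase_sketch {
    REQ_DELAYED,        /* held back, e.g. while the import is in recovery */
    REQ_SENT_NO_REPLY,  /* sent, reply not yet received */
    REQ_REPLIED         /* reply received, kept for possible replay */
};

/* Name of the import list that holds a request in the given phase. */
static const char *list_for_phase(enum req_phase_sketch phase)
{
    switch (phase) {
    case REQ_REPLIED:       return "imp_replay_list";
    case REQ_SENT_NO_REPLY: return "imp_sending_list";
    case REQ_DELAYED:       return "imp_delayed_list";
    }
    return "unknown";
}

/*
 * Order in which the lists are handled after a failure (lower goes first):
 * the replay list during recovery, then the sending list, then the
 * delayed list.
 */
static int recovery_order(enum req_phase_sketch phase)
{
    switch (phase) {
    case REQ_REPLIED:       return 0;
    case REQ_SENT_NO_REPLY: return 1;
    case REQ_DELAYED:       return 2;
    }
    return -1;
}
```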
In order to support recovery 'open' requests must be preserved even
after they have completed. Those requests are placed on the
'imp_committed_list' and the 'imp_replay_cursor' allows for
accelerated access to those items.

The 'imp_obd' is a reference to the details about the target device
that is the subject of this import. There is a lot of state info in
there along with many implementation details that are not relevant to
the actual Lustre protocol. fixme: I'll want to go through all of the
fields in that structure to see which, if any, need more

The security policy and settings are kept in 'imp_sec', and
'imp_sec_mutex' helps manage access to that info. The 'imp_sec_expire'
setting is in support of security policies that have an expiration

Some processes may need the import to be in a fully connected state in
order to proceed. The 'imp_recovery_waitq' is where those threads will
wait during recovery.

The 'imp_inflight' field counts the number of in-flight requests. It
is incremented with each request sent and decremented with each reply

The client reserves buffers for the processing of requests and
replies, and then informs LNet about those buffers. Buffers may get
reused during subsequent processing, but then a point may come when
the buffer is no longer going to be used. The client increments the
'imp_unregistering' counter and informs LNet the buffer is no longer
needed. When LNet has freed the buffer it will notify the client and
then the 'imp_unregistering' counter can be decremented again.

During recovery the 'imp_replay_inflight' field counts the number of
requests from the replay list that have been sent and have not yet
been replied to.

The 'imp_inval_count' field counts how many threads are in the process
of cleaning up this connection or waiting for cleanup to complete. The
cleanup itself may be needed in the case there is an eviction or other
problem (fixme what other problem?). The cleanup may involve freeing
allocated resources, updating internal state, running replay lists,
and invalidating cache. Since it could take a while there may end up
being multiple threads waiting on this process to complete.

The 'imp_timeouts' field is a counter that is incremented every time
there is a timeout in communication with the target.
189 //////////////////////////////////////////////////////////////////////
The 'imp_state' tracks the state of the import. It draws from the
enumerated set of values:

.enum_lustre_imp_state
|====
| LUSTRE_IMP_CLOSED | 1
| LUSTRE_IMP_DISCON | 3
| LUSTRE_IMP_CONNECTING | 4
| LUSTRE_IMP_REPLAY | 5
| LUSTRE_IMP_REPLAY_LOCKS | 6
| LUSTRE_IMP_REPLAY_WAIT | 7
| LUSTRE_IMP_RECOVER | 8
| LUSTRE_IMP_FULL | 9
| LUSTRE_IMP_EVICTED | 10
|====
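As a C declaration, the values listed above might look like the following sketch. Only the values that appear in this section are included, so the gap at 2 is deliberate here. The 'imp_state_name_sketch()' helper is a hypothetical addition for printing a state, not a function taken from the Lustre source.

```c
/* Only the states and values listed in this section are declared. */
enum lustre_imp_state {
    LUSTRE_IMP_CLOSED       = 1,
    LUSTRE_IMP_DISCON       = 3,
    LUSTRE_IMP_CONNECTING   = 4,
    LUSTRE_IMP_REPLAY       = 5,
    LUSTRE_IMP_REPLAY_LOCKS = 6,
    LUSTRE_IMP_REPLAY_WAIT  = 7,
    LUSTRE_IMP_RECOVER      = 8,
    LUSTRE_IMP_FULL         = 9,
    LUSTRE_IMP_EVICTED      = 10
};

/* Hypothetical helper: map a state value to a printable name. */
static const char *imp_state_name_sketch(enum lustre_imp_state state)
{
    switch (state) {
    case LUSTRE_IMP_CLOSED:       return "CLOSED";
    case LUSTRE_IMP_DISCON:       return "DISCON";
    case LUSTRE_IMP_CONNECTING:   return "CONNECTING";
    case LUSTRE_IMP_REPLAY:       return "REPLAY";
    case LUSTRE_IMP_REPLAY_LOCKS: return "REPLAY_LOCKS";
    case LUSTRE_IMP_REPLAY_WAIT:  return "REPLAY_WAIT";
    case LUSTRE_IMP_RECOVER:      return "RECOVER";
    case LUSTRE_IMP_FULL:         return "FULL";
    case LUSTRE_IMP_EVICTED:      return "EVICTED";
    }
    /* Any value not listed above falls through to here. */
    return "UNKNOWN";
}
```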
210 //////////////////////////////////////////////////////////////////////
fixme: what are the transitions between these states? The
'imp_state_hist' array maintains a list of the last 16
(IMP_STATE_HIST_LEN) states the import was in, along with the time it
entered each (fixme: or is it when it left that state?). The list is
maintained as a circular buffer, so the 'imp_state_hist_idx' points to
the entry in the list for the most recently visited state.
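A circular history of this kind can be sketched as follows. The sketch takes the description above at face value, so the index names the most recently written entry and the recorded time is the time the state was entered; both points are flagged as open questions in the text. The struct and function names, and the 'entered_at' field, are hypothetical.

```c
#include <time.h>

#define IMP_STATE_HIST_LEN 16

/* Hypothetical history entry: a state and the time it was entered. */
struct state_hist_entry_sketch {
    int    state;
    time_t entered_at;
};

struct state_hist_sketch {
    struct state_hist_entry_sketch entries[IMP_STATE_HIST_LEN];
    int idx;  /* most recently written entry; -1 while the history is empty */
};

static void state_hist_init(struct state_hist_sketch *h)
{
    h->idx = -1;
}

/* Record a state change, overwriting the oldest entry once the buffer is full. */
static void state_hist_record(struct state_hist_sketch *h, int state, time_t now)
{
    h->idx = (h->idx + 1) % IMP_STATE_HIST_LEN;
    h->entries[h->idx].state = state;
    h->entries[h->idx].entered_at = now;
}
```

With 16 slots, the seventeenth recorded state lands in slot 0 again, so only the 16 most recent states survive.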
219 //////////////////////////////////////////////////////////////////////
The 'imp_generation' and 'imp_conn_cnt' fields are monotonically
increasing counters. Every time a connection request is sent to the
target the 'imp_conn_cnt' counter is incremented, and every time a
reply is received for the connection request the 'imp_generation'
counter is incremented.
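The two counters can be pictured with a minimal sketch. The struct and the two event functions are hypothetical names chosen for illustration, and the field types are assumptions since the declarations are not shown in this section.

```c
/* Hypothetical container for the two counters; field types are assumed. */
struct conn_counters_sketch {
    int          imp_generation;
    unsigned int imp_conn_cnt;
};

/* Called when a connection request is sent to the target. */
static void on_connect_request_sent(struct conn_counters_sketch *c)
{
    c->imp_conn_cnt++;
}

/* Called when a reply to a connection request is received. */
static void on_connect_reply_received(struct conn_counters_sketch *c)
{
    c->imp_generation++;
}
```

If a connection request goes unanswered, 'imp_conn_cnt' advances while 'imp_generation' does not, so the two values can drift apart.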
227 //////////////////////////////////////////////////////////////////////
The 'imp_last_generation_checked' implements an optimization. When a
replay process has successfully traversed the replay list the
'imp_generation' value is noted here. If the generation has not
incremented then the replay list does not need to be traversed again.

During replay the 'imp_last_replay_transno' is set to the transaction
number of the last request being replayed, and
'imp_peer_committed_transno' is set to the 'pb_last_committed' value
(of the <<ptlrpc_body>>) from replies if that value is higher than the
previous 'imp_peer_committed_transno'. The 'imp_last_transno_checked'
field implements an optimization. It is set to the
'imp_last_replay_transno' as its replay is initiated.
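The update rule for 'imp_peer_committed_transno' amounts to keeping a running maximum. A minimal sketch, in which the function name is hypothetical and 'uint64_t' stands in for '__u64':

```c
#include <stdint.h>

/*
 * Hypothetical helper: fold the pb_last_committed value from a reply into
 * the import's record of the highest transaction number the peer has
 * committed. The stored value never moves backwards.
 */
static void update_peer_committed(uint64_t *imp_peer_committed_transno,
                                  uint64_t pb_last_committed)
{
    if (pb_last_committed > *imp_peer_committed_transno)
        *imp_peer_committed_transno = pb_last_committed;
}
```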
If 'imp_last_transno_checked' is still equal to 'imp_last_replay_transno'
and 'imp_generation' is still equal to 'imp_last_generation_checked' then
there are no additional requests ready to be removed from the replay
list. Furthermore, 'imp_last_transno_checked' may no longer be needed,
since the committed transactions are now maintained on a separate list.
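The skip condition can be written as a single predicate. This sketch follows the wording of the paragraph above and is not taken from the Lustre source; the struct and function names are hypothetical, and 'uint64_t' stands in for '__u64'.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the import holding only the fields being compared. */
struct replay_check_sketch {
    int      imp_generation;
    int      imp_last_generation_checked;
    uint64_t imp_last_replay_transno;
    uint64_t imp_last_transno_checked;
};

/*
 * Returns true when the replay list should be scanned again. If neither the
 * generation nor the replay transaction number has changed since the last
 * check, there is nothing new to remove and the scan can be skipped.
 */
static bool replay_scan_needed(const struct replay_check_sketch *s)
{
    return s->imp_generation != s->imp_last_generation_checked ||
           s->imp_last_replay_transno != s->imp_last_transno_checked;
}
```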
248 //////////////////////////////////////////////////////////////////////
The 'imp_remote_handle' is the handle sent by the target in a
connection reply message to uniquely identify the export for this
target and client that is maintained on the server. This is the handle
used in all subsequent messages to the target.
255 //////////////////////////////////////////////////////////////////////
There are two separate ping intervals. If there are no uncommitted
messages for the target then the default ping interval, based on the
Adaptive Timeout value, is used to set the 'imp_next_ping' to the time
the next ping needs to be sent. If there are uncommitted requests then
a "short interval" of 7s is used to set the time for the next ping.
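The choice between the two intervals can be sketched as below. How the default interval is derived from the Adaptive Timeout value is not described here, so it is passed in as a parameter. The function and macro names are hypothetical, and plain 'time_t' seconds stand in for 'cfs_time_t'.

```c
#include <stdbool.h>
#include <time.h>

/* The "short interval" from the text, in seconds. */
#define SHORT_PING_INTERVAL_SEC 7

/*
 * Hypothetical helper: compute the value for imp_next_ping.
 * default_interval is assumed to have been derived elsewhere from the
 * Adaptive Timeout value.
 */
static time_t next_ping_time(time_t now, bool has_uncommitted_requests,
                             time_t default_interval)
{
    if (has_uncommitted_requests)
        return now + SHORT_PING_INTERVAL_SEC;
    return now + default_interval;
}
```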
The 'imp_last_success_conn' value is set to the time of the last
successful connection. fixme: The source says it is in 64 bit
jiffies, but does not further indicate how that value is calculated.

Since there can actually be multiple connection paths for a target
(due to failover or multihomed configurations) the import maintains a
list of all the possible connection paths in the list pointed to by
the 'imp_conn_list' field. The 'imp_conn_current' points to the one
currently in use. Compare with the 'imp_connection' field. The two
point to different structures, but each is reachable from the other.
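The relationship between 'imp_conn_list', 'imp_conn_current' and 'imp_connection' can be illustrated with simplified stand-in types. Everything in this sketch other than those three field names is hypothetical, including the 'conn' member and the singly linked 'next' pointer, and it models only the direction from a path entry to its connection.

```c
#include <stddef.h>

/* Stand-in for ptlrpc_connection. */
struct conn_sketch {
    const char *peer;
};

/* Stand-in for obd_import_conn: one possible path to the target. */
struct import_conn_sketch {
    struct conn_sketch        *conn;  /* connection used by this path */
    struct import_conn_sketch *next;  /* next path on the list */
};

/* Stand-in for the relevant part of the import. */
struct import_paths_sketch {
    struct import_conn_sketch *imp_conn_list;     /* all known paths */
    struct import_conn_sketch *imp_conn_current;  /* path in use */
    struct conn_sketch        *imp_connection;    /* its connection */
};

/* Make the given path current and keep imp_connection consistent with it. */
static void select_path(struct import_paths_sketch *imp,
                        struct import_conn_sketch *choice)
{
    imp->imp_conn_current = choice;
    imp->imp_connection = choice->conn;
}
```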
Most of the flag, state, and list information in the import needs to
be accessed atomically. The 'imp_lock' is used to maintain the
consistency of the import while it is manipulated by multiple threads.

The various flags are documented in the source code and are largely
obvious from those short comments, reproduced here:
|====
| imp_no_timeout | timeouts are disabled
| imp_invalid | client has been evicted
| imp_deactive | client administratively disabled
| imp_replayable | try to recover the import
| imp_dlm_fake | don't run recovery (timeout instead)
| imp_server_timeout | use 1/2 timeout on MDSs and OSCs
| imp_delayed_recovery | VBR: imp in delayed recovery
| imp_no_lock_replay | VBR: if gap was found then no lock replays
| imp_vbr_failed | recovery by versions was failed
| imp_force_verify | force an immediate ping
| imp_force_next_verify | force a scheduled ping
| imp_pingable | target is pingable
| imp_resend_replay | resend for replay
| imp_no_pinger_recover | disable normal recovery, for test only.
| imp_need_mne_swab | need IR MNE swab
| imp_force_reconnect | import must be reconnected, not new connection
| imp_connect_tried | import has tried to connect with server
|====
A few additional notes are in order. The 'imp_dlm_fake' flag signifies
that this is not a "real" import, but rather a "reverse" import
in support of the LDLM. When the LDLM invokes callback operations the
messages are initiated at the other end, so there needs to be a fake
import to receive the replies from the operation.

Prior to the introduction of adaptive timeouts the servers were given
fixed timeout values that were half those used for the clients. The
'imp_server_timeout' flag indicated that the import should use the
half-sized timeouts, but with the introduction of adaptive timeouts
this facility is no longer used.

"VBR" is "version based recovery", and it introduces a new possibility
for handling requests. Previously, if there were a gap in the
transaction number sequence the requests associated with the missing
transaction numbers would be discarded. With VBR those transactions
only need to be discarded if there is an actual dependency between the
ones that were skipped and the currently latest committed transaction
number.

fixme: What are the circumstances that would lead to setting the
'imp_force_next_verify' or 'imp_pingable' flags? During recovery, the
client sets the 'imp_no_pinger_recover' flag, which tells the process
to proceed from the current value of 'imp_last_replay_transno'. The
'imp_need_mne_swab' flag indicates a version dependent circumstance
where swabbing was inadvertently left out of one processing step.
326 //////////////////////////////////////////////////////////////////////