connection.txt

   1 Connections Between Lustre Entities
   2 -----------------------------------
   3 [[connection]]
   4
   5 The Lustre protocol is connection-based in that each two entities
   6 maintain shared, coordinated state information. The most common
   7 example of two such entities are a client and a target on some
   8 server. The target is identified by name to the client through an
   9 interaction with the management server. The client then 'connects' to
  10 the given target on the indicated server by sending the appropriate
  11 version of the *_CONNECT message (MGS_CONNECT, MDS_CONNECT, or
  12 OST_CONNECT - colectively *_CONNECT) and receiving back the
  13 corresponding *_CONNECT reply. The server creates an 'export' for the
  14 connection between the target and the client, and the export holds the
  15 server state information for that connection. When the client gets the
  16 reply it creates an 'import', and the import holds the client state
  17 information for that connection. Note that if a server has N targets
  18 and M clients have connected to them, the server will have N x M
  19 exports and each client will have N imports.
  20
  21 There are also connections between the servers: Each MDS and OSS has a
  22 connection to the MGS, where the MDS (respectively the OSS) plays the
  23 role of the client in the above discussion. That is, the MDS initiates
  24 the connection and has an import for the MGS, while the MGS has an
  25 export for each MDS. Each MDS connects to each OST, with an import on
  26 the MDS and an export on the OSS. This connection supports requests
  27 from the MDS to the OST to create and destroy data objects, to set
  28 attributes (such as permission bits), and for 'statfs' information for
  29 precreation needs.  Each OSS also connects to the first MDS to get
  30 access to auxiliary services, with an import on the OSS and an export
  31 on the first MDS.  The auxiliary services are: the File ID Location
  32 Database (FLDB), the quota master service, and the sequence
  33 controller. This connection for auxiliary services is a 'lightweight'
  34 one in that it has no replay functionality and consumes no space on
  35 the MDS for client data. Each MDS connects also to all other MDSs for
  36 DNE needs.
  37
  38 Finally, for some communications the roles of message initiation and
  39 message reply are reversed. This is the case, for instance, with
  40 call-back operations. In that case the entity which would normally
  41 have an import has, instead, a 'reverse-export' and the
  42 other end of the connection maintains a 'reverse-import'. The
  43 reverse-import uses the same structure as a regular import, and the
  44 reverse-export uses the same structure as a regular export.
  45
  46 Connection Structures
  47 ~~~~~~~~~~~~~~~~~~~~~
  48
  49 Connect Data
  50 ^^^^^^^^^^^^
  51
  52 An 'obd_connect_data' structure accompanies every connect operation in
  53 both the request message and in the reply message.
  54
  55 ----
  56 struct obd_connect_data {
  57     __u64 ocd_connect_flags;
  58     __u32 ocd_version;      /* OBD_CONNECT_VERSION */
  59     __u32 ocd_grant;        /* OBD_CONNECT_GRANT */
  60     __u32 ocd_index;        /* OBD_CONNECT_INDEX */
  61     __u32 ocd_brw_size;     /* OBD_CONNECT_BRW_SIZE */
  62     __u64 ocd_ibits_known;  /* OBD_CONNECT_IBITS */
  63     __u8  ocd_blocksize;    /* OBD_CONNECT_GRANT_PARAM */
  64     __u8  ocd_inodespace;   /* OBD_CONNECT_GRANT_PARAM */
  65     __u16 ocd_grant_extent; /* OBD_CONNECT_GRANT_PARAM */
  66     __u32 ocd_unused;
  67     __u64 ocd_transno;      /* OBD_CONNECT_TRANSNO */
  68     __u32 ocd_group;        /* OBD_CONNECT_MDS */
  69     __u32 ocd_cksum_types;  /* OBD_CONNECT_CKSUM */
  70     __u32 ocd_max_easize;   /* OBD_CONNECT_MAX_EASIZE */
  71     __u32 ocd_instance;
  72     __u64 ocd_maxbytes;     /* OBD_CONNECT_MAXBYTES */
  73     __u64 padding1;
  74     __u64 padding2;
  75     __u64 padding3;
  76     __u64 padding4;
  77     __u64 padding5;
  78     __u64 padding6;
  79     __u64 padding7;
  80     __u64 padding8;
  81     __u64 padding9;
  82     __u64 paddingA;
  83     __u64 paddingB;
  84     __u64 paddingC;
  85     __u64 paddingD;
  86     __u64 paddingE;
  87     __u64 paddingF;
  88 };
  89 ----
  90
  91 The 'ocd_connect_flags' field encodes the connect flags giving the
  92 capabilities of a connection between client and target. Several of
  93 those flags (noted in comments above and the discussion below)
  94 actually control whether the remaining fields of 'obd_connect_data'
  95 get used. The [[connect-flags]] flags are:
  96
  97 ----
  98 #define OBD_CONNECT_RDONLY                0x1ULL /*client has read-only access*/
  99 #define OBD_CONNECT_INDEX                 0x2ULL /*connect specific LOV idx */
 100 #define OBD_CONNECT_MDS                   0x4ULL /*connect from MDT to OST */
 101 #define OBD_CONNECT_GRANT                 0x8ULL /*OSC gets grant at connect */
 102 #define OBD_CONNECT_SRVLOCK              0x10ULL /*server takes locks for cli */
 103 #define OBD_CONNECT_VERSION              0x20ULL /*Lustre versions in ocd */
 104 #define OBD_CONNECT_REQPORTAL            0x40ULL /*Separate non-IO req portal */
 105 #define OBD_CONNECT_ACL                  0x80ULL /*access control lists */
 106 #define OBD_CONNECT_XATTR               0x100ULL /*client use extended attr */
 107 #define OBD_CONNECT_CROW                0x200ULL /*MDS+OST create obj on write*/
 108 #define OBD_CONNECT_TRUNCLOCK           0x400ULL /*locks on server for punch */
 109 #define OBD_CONNECT_TRANSNO             0x800ULL /*replay sends init transno */
 110 #define OBD_CONNECT_IBITS              0x1000ULL /*support for inodebits locks*/
 111 #define OBD_CONNECT_JOIN               0x2000ULL /*files can be concatenated.
 112                                                   *We do not support JOIN FILE
 113                                                   *anymore, reserve this flags
 114                                                   *just for preventing such bit
 115                                                   *to be reused.*/
 116 #define OBD_CONNECT_ATTRFID            0x4000ULL /*Server can GetAttr By Fid*/
 117 #define OBD_CONNECT_NODEVOH            0x8000ULL /*No open hndl on specl nodes*/
 118 #define OBD_CONNECT_RMT_CLIENT        0x10000ULL /*Remote client */
 119 #define OBD_CONNECT_RMT_CLIENT_FORCE  0x20000ULL /*Remote client by force */
 120 #define OBD_CONNECT_BRW_SIZE          0x40000ULL /*Max bytes per rpc */
 121 #define OBD_CONNECT_QUOTA64           0x80000ULL /*Not used since 2.4 */
 122 #define OBD_CONNECT_MDS_CAPA         0x100000ULL /*MDS capability */
 123 #define OBD_CONNECT_OSS_CAPA         0x200000ULL /*OSS capability */
 124 #define OBD_CONNECT_CANCELSET        0x400000ULL /*Early batched cancels. */
 125 #define OBD_CONNECT_SOM              0x800000ULL /*Size on MDS */
 126 #define OBD_CONNECT_AT              0x1000000ULL /*client uses AT */
 127 #define OBD_CONNECT_LRU_RESIZE      0x2000000ULL /*LRU resize feature. */
 128 #define OBD_CONNECT_MDS_MDS         0x4000000ULL /*MDS-MDS connection */
 129 #define OBD_CONNECT_REAL            0x8000000ULL /*real connection */
 130 #define OBD_CONNECT_CHANGE_QS      0x10000000ULL /*Not used since 2.4 */
 131 #define OBD_CONNECT_CKSUM          0x20000000ULL /*support several cksum algos*/
 132 #define OBD_CONNECT_FID            0x40000000ULL /*FID is supported by server */
 133 #define OBD_CONNECT_VBR            0x80000000ULL /*version based recovery */
 134 #define OBD_CONNECT_LOV_V3        0x100000000ULL /*client supports LOV v3 EA */
 135 #define OBD_CONNECT_GRANT_SHRINK  0x200000000ULL /* support grant shrink */
 136 #define OBD_CONNECT_SKIP_ORPHAN   0x400000000ULL /* don't reuse orphan objids */
 137 #define OBD_CONNECT_MAX_EASIZE    0x800000000ULL /* preserved for large EA */
 138 #define OBD_CONNECT_FULL20       0x1000000000ULL /* it is 2.0 client */
 139 #define OBD_CONNECT_LAYOUTLOCK   0x2000000000ULL /* client uses layout lock */
 140 #define OBD_CONNECT_64BITHASH    0x4000000000ULL /* client supports 64-bits
 141                                                   * directory hash */
 142 #define OBD_CONNECT_MAXBYTES     0x8000000000ULL /* max stripe size */
 143 #define OBD_CONNECT_IMP_RECOV   0x10000000000ULL /* imp recovery support */
 144 #define OBD_CONNECT_JOBSTATS    0x20000000000ULL /* jobid in ptlrpc_body */
 145 #define OBD_CONNECT_UMASK       0x40000000000ULL /* create uses client umask */
 146 #define OBD_CONNECT_EINPROGRESS 0x80000000000ULL /* client handles -EINPROGRESS
 147                                                   * RPC error properly */
 148 #define OBD_CONNECT_GRANT_PARAM 0x100000000000ULL/* extra grant params used for
 149                                                   * finer space reservation */
 150 #define OBD_CONNECT_FLOCK_OWNER 0x200000000000ULL /* for the fixed 1.8
 151                            * policy and 2.x server */
 152 #define OBD_CONNECT_LVB_TYPE    0x400000000000ULL /* variable type of LVB */
 153 #define OBD_CONNECT_NANOSEC_TIME 0x800000000000ULL /* nanosecond timestamps */
 154 #define OBD_CONNECT_LIGHTWEIGHT 0x1000000000000ULL/* lightweight connection */
 155 #define OBD_CONNECT_SHORTIO     0x2000000000000ULL/* short io */
 156 #define OBD_CONNECT_PINGLESS    0x4000000000000ULL/* pings not required */
 157 #define OBD_CONNECT_FLOCK_DEAD    0x8000000000000ULL/* deadlock detection */
 158 #define OBD_CONNECT_DISP_STRIPE 0x10000000000000ULL/* create stripe disposition*/
 159 #define OBD_CONNECT_OPEN_BY_FID    0x20000000000000ULL /* open by fid won't pack
 160                                name in request */
 161 ----
 162
 163 Each flag corresponds to a particular capability that the client and
 164 target together will honor. A client will send a message including
 165 some subset of these capabilities during a connection request to a
 166 specific target. It tells the server what capabilities it has. The
 167 server then replies with the subset of those capabilities it agrees to
 168 honor (for the given target).
 169
 170 If the OBD_CONNECT_VERSION flag is set then the 'ocd_version' field is
 171 honored. The 'ocd_version' gives an encoding of the Lustre
 172 version. For example, Version 2.7.32 would be hexadecimal number
 173 0x02073200.
 174
 175 If the OBD_CONNECT_GRANT flag is set then the 'ocd_grant' field is
 176 honored. The 'ocd_grant' value in a reply (to a connection request)
 177 sets the client's grant.
 178
 179 If the OBD_CONNECT_INDEX flag is set then the 'ocd_index' field is
 180 honored. The 'ocd_index' value is set in a reply to a connection
 181 request. It holds the LOV index of the target.
 182
 183 If the OBD_CONNECT_BRW_SIZE flag is set then the 'ocd_brw_size' field
 184 is honored. The 'ocd_brw_size' value sets the size of the maximum
 185 supported RPC. The client proposes a value in its connection request,
 186 and the server's reply will either agree or further limit the size.
 187
 188 If the OBD_CONNECT_IBITS flag is set then the 'ocd_ibits_known' field
 189 is honored. The 'ocd_ibits_known' value determines the handling of
 190 locks on inodes. See the discussion of inodes and extended attributes.
 191
 192 If the OBD_CONNECT_GRANT_PARAM flag is set then the 'ocd_blocksize',
 193 'ocd_inodespace', and 'ocd_grant_extent' fields are honored. A server
 194 reply uses the 'ocd_blocksize' value to inform the client of the log
 195 base two of the size in bytes of the backend file system's blocks.
 196
 197 A server reply uses the 'ocd_inodespace' value to inform the client of
 198 the log base two of the size of an inode.
 199
 200 Under some circumstances (for example when ZFS is the back end file
 201 system) there may be additional overhead in handling writes for each
 202 extent. The server uses the 'ocd_grant_extent' value to inform the
 203 client of the size in bytes consumed from its grant on the server when
 204 creating a new file. The client uses this value in calculating how
 205 much dirty write cache it has and whether it has reached the limit
 206 established by the target's grant.
 207
 208 If the OBD_CONNECT_TRANSNO flag is set then the 'ocd_transno' field is
 209 honored. A server uses the 'ocd_transno' value during recovery to
 210 inform the client of the transaction number at which it should begin
 211 replay.
 212
 213 If the OBD_CONNECT_MDS flag is set then the 'ocd_group' field is
 214 honored. When an MDT connects to an OST the 'ocd_group' field informs
 215 the OSS of the MDT's index. Objects on that OST for that MDT will be
 216 in a common namespace served by that MDT.
 217
 218 If the OBD_CONNECT_CKSUM flag is set then the 'ocd_cksum_types' field
 219 is honored. The client uses the 'ocd_checksum_types' field to propose
 220 to the server the client's available (presumably hardware assisted)
 221 checksum mechanisms. The server replies with the checksum types it has
 222 available. Finally, the client will employ the fastest of the agreed
 223 mechanisms.
 224
 225 If the OBD_CONNECT_MAX_EASIZE flag is set then the 'ocd_max_easize'
 226 field is honored. The server uses 'ocd_max_easize' to inform the
 227 client about the amount of space that can be allocated in each inode
 228 for extended attributes. The 'ocd_max_easize' specifically refers to
 229 the space used for striping information. This allows the client to
 230 determine the maximum layout size (and hence stripe count) that can be
 231 stored on the MDT.
 232
 233 The 'ocd_instance' field (alone) is not governed by an OBD_CONNECT_*
 234 flag. The MGS uses the 'ocd_instance' value in its reply to a
 235 connection request to inform the server and target of the "era" of its
 236 connection. The MGS initializes the era value for each server to zero
 237 and increments that value every time the target connects. This
 238 supports imperative recovery.
 239
 240 If the OBD_CONNECT_MAXBYTES flag is set then the 'ocd_maxbytes' field
 241 is honored. An OSS uses the 'ocd_maxbytes' value to inform the client
 242 of the maximum OST object size for this target.  A stripe on any OST
 243 for a multi-striped file cannot be larger than the minimum maxbytes
 244 value.
 245
 246 The additional space in the 'obd_connect_data' structure is unused and
 247 reserved for future use.
 248
 249 Other OBD_CONNECT_* flags have no corresponding field in
 250 obd_connect_data but still control client-server supported features.
 251
 252 If the OBD_CONNECT_RDONLY flag is set then the client is mounted in
 253 read-only mode and the server honors that by denying any modification
 254 from this client.
 255
 256 If the OBD_CONNECT_SRVLOCK flag is set then the client and server
 257 support lockless IO. The server will take locks for client IO requests
 258 with the OBD_BRW_SRVLOCK flag in the 'niobuf_remote' structure
 259 flags. This is used for Direct IO.  The client takes no LDLM lock and
 260 delegates locking to the server.
 261
 262 If the OBD_CONNECT_ACL flag is set then the server supports the ACL
 263 mount option for its filesystem. The client supports this mount option
 264 as well, in that case.
 265
 266 If the OBD_CONNECT_XATTR flag is set then the server supports user
 267 extended attributes. This is defined by the mount options of the
 268 servers of the backend file systems and is reflected on the client
 269 side by the same mount option but for the Lustre file system itself.
 270
 271 If the OBD_CONNECT_TRUNCLOCK flag is set then the client and the
 272 server support lockless truncate. This is realized in an OST_PUNCH RPC
 273 by setting the 'obdo' sturcture's 'o_flag' field to include the
 274 OBD_FL_SRVLOCK. In that circumstance the client takes no lock, and the
 275 server must take a lock on the resource.
 276
 277 If the OBD_CONNECT_ATTRFID flag is set then the server supports
 278 getattr requests by FID of file instead of name. This reduces
 279 unnecessary RPCs for DNE.
 280
 281 If the OBD_CONNECT_NODEVOH flag is set then the server provides no
 282 open handle for special inodes.
 283
 284 If the OBD_CONNECT_RMT_CLIENT is set then the client is set as
 285 'remote' with respect to the server. The client is considered as
 286 'local' if the user/group database on the client is identical to that
 287 on the server, otherwise the client is set as 'remote'. This
 288 terminology is part of Lustre Kerberos feature which is not supported
 289 now.
 290
 291 If the OBD_CONNECT_RMT_CLIENT_FORCE is set then client is set as
 292 remote client forcefully. If the server security level doesn't support
 293 remote clients then this client connect reply will return an -EACCESS
 294 error.
 295
 296 If the OBD_CONNECT_MDS_CAPA is set then MDS supports capability.
 297 Capabilities are part of Lustre Kerberos. The MDS prepares the
 298 capability when a file is opened and sends it to a client. A client
 299 has to present a capability when it wishes to perform an operation on
 300 that object.
 301
 302 If the OBD_CONNECT_OSS_CAPA is set then OSS supports capability.
 303 Capabilities are part of Lustre Kerberos. When the clients request the
 304 OSS to perform a modification operations on objects the capability
 305 authorizes these operations.
 306
 307 If the OBD_CONNECT_CANCELSET is set then early batched cancels are
 308 enabled.  The ELC (Early Lock Cancel) feature allows client locks to
 309 be cancelled prior the cancellation callback if it is clear that lock
 310 is not needed anymore, for example after rename, after removing file
 311 or directory, link, etc. This can reduce amount of RPCs significantly.
 312
 313 If the OBD_CONNECT_AT is set then client and server use Adaptive
 314 Timeout while request processing. Servers keep track of the RPCs
 315 processing time and report this information back to clients to
 316 estimate the time needed for future requests and set appropriate RPC
 317 timeouts.
 318
 319 If the OBD_CONNECT_LRU_RESIZE is set then the LRU self-adjusting is
 320 enabled.  This is set by the Lustre configurable option
 321 --enable-lru-resize, and is enabled by default.
 322
 323 If the OBD_CONNECT_FID is set then FID support is required by
 324 server. This compatibility flag was introduced in Lustre 2.0. All
 325 servers and clients are using FIDs nowadays. This flag is always set
 326 on server and used to filter out clients without FID support.
 327
 328 If the OBD_CONNECT_VBR is set then version based recovery is used on
 329 server.  The VBR uses an object version to track its changes on the
 330 server and to decide if the replay can be applied during recovery
 331 based on that version. This helps to complete recovery even if some
 332 clients were missed or evicted. That flag is always set on server
 333 since Lustre 1.8 and is used just to notify the server if client
 334 doesn't support VBR.
 335
 336 If the OBD_CONNECT_LOV_V3 is set then the client supports LOV vs
 337 EA. This type of the LOV extended attribute was introduced along with
 338 OST pools support and changed the internal structure of that EA. The
 339 OBD_CONNECT_LOV_V3 flag notifies a server if client doesn't support
 340 this type of LOV EA to handle requests from it properly.
 341
 342 If the OBD_CONNECT_GRANT_SHRINK is set then the client can release
 343 grant space when idle.
 344
 345 If the OBD_CONNECT_SKIP_ORPHAN is set then OST doesn't reuse orphan
 346 object ids after recovery. This connection flag is used between MDS
 347 and OST to agree about an object pre-creation policy after MDS
 348 recovery. If some of precreated objects weren't used but an MDT was
 349 restarted then an OST may re-use not used objects for new pre-create
 350 request or may not. The latter is preferred and is used by default
 351 nowadays.
 352
 353 If the OBD_CONNECT_FULL20 is set then the client is Lustre 2.x client.
 354 Clients that are using old 1.8 format protocol conventions are not
 355 allowed to connect. This flag should be set on all connections since
 356 2.0, it is no longer affects behaviour and will be disabled completely
 357 once Lustre interoperation with old clients is no longer needed.
 358
 359 If the OBD_CONNECT_LAYOUTLOCK is set then the client supports layout
 360 lock. The server will not grant a layout lock to the old clients
 361 having no such flag.
 362
 363 If the OBD_CONNECT_64BITHASH is set then the client supports 64-bit
 364 directory hash. The server will also use 64-bit hash mode while
 365 working with ldiskfs.
 366
 367 If the OBD_CONNECT_JOBSTATS is set then the client fills jobid in
 368 'ptlrpc_body' so server can provide extended statistics per jobid.
 369
 370 If the OBD_CONNECT_UMASK is set then create uses client umask. This is
 371 default flag for MDS but not for OST.
 372
 373 If the OBD_CONNECT_LVB_TYPE is set then the variable type of LVB is
 374 supported by a client. This flag was introduced along with DNE to
 375 recognize DNE-aware clients.
 376
 377 If the OBD_CONNECT_LIGHTWEIGHT is set then this connection is the
 378 'lightweight' one. A lightweight connection has no entry in last_rcvd
 379 file, so no recovery is possible, at the same time a lightweight
 380 connection can be set up while the target is in recovery, locks can
 381 still be acquired through this connection, although they won't be
 382 replayed. Such type of connection is used by services like quota
 383 manager, FLDB, etc.
 384
 385 If the OBD_CONNECT_PINGLESS is set then pings can be suppressed. If
 386 the client and server have this flag during connection and the ptlrpc
 387 module on server has the option "suppress_pings", then pings will be
 388 suppressed for this client.  There must be an external mechanism to
 389 notify the targets of client deaths, via the targets "evict_client"
 390 'procfs' entries. Pings can be disabled on OSTs only.
 391
 392 If the OBD_CONNECT_FLOCK_DEAD is set then the client support flock
 393 cancellation, which is used for the flock deadlock detection mechanism.
 394
 395 If the OBD_CONNECT_DISP_STRIPE is set then server returns a 'create
 396 stripe' disposition for open request from the client. This helps to
 397 optimize a recovery of open requests.
 398
 399 If the OBD_CONNECT_OPEN_BY_FID is set then an open by FID won't pack
 400 the name in a request. This is used by DNE.
 401
 402 If the OBD_CONNECT_MDS_MDS is set then the current connection is a
 403 MDS-MDS one. Such connections are distinguished because they provide
 404 more functionality specific to MDS-MDS interoperation.
 405
 406 If the OBD_CONNECT_IMP_RECOV is set then the Imperative Recovery is
 407 supported. Imperative recovery means the clients are notified
 408 explicitly when and where a failed target has restarted.
 409
 410 The OBD_CONNECT_REQPORTAL was used to specify that client may use
 411 OST_REQUEST_PORTAL for requests to don't interfere with IO portal,
 412 e.g. for MDS-OST interaction. Now it is default request portal for OSC
 413 and this flag does nothing though it is still set on client side
 414 during connection process.
 415
 416 The OBD_CONNECT_CROW flag was used for create-on-write functionality
 417 on OST, when data objects were created upon first write from the
 418 client. This wasn't implemented because of complex recovery problems.
 419
 420 The OBD_CONNECT_SOM flag was used to signal that MDS is capable to
 421 store file size in file attributes, so client may get it directly from
 422 MDS avoiding glimpse request to OSTs. This feature was implemented as
 423 demo feature and wasn't enabled by default. Finally it was disabled in
 424 Lustre 2.7 because it causes quite complex recovery cases to handle
 425 with relatevely small benefits.
 426
 427 The OBD_CONNECT_JOIN flag was used for the 'join files' feature, which
 428 allowed files to be concatenated. Lustre no longer supports that
 429 feature.
 430
 431 The OBD_CONNECT_QUOTA64 was used prior Lustre 2.4 for quota purposes,
 432 it is obsoleted due to new quota design.
 433
 434 The OBD_CONNECT_REAL is not real connection flag but used locally on
 435 client to distinguish real connection from local connections between
 436 layers.
 437
 438 The OBD_CONNECT_CHANGE_QS was used prior Lustre 2.4 for quota needs
 439 and it is obsoleted now due to new quota design.
 440
 441 If the OBD_CONNECT_EINPROGRESS is set then client handles -EINPROGRESS
 442 RPC error properly. The quota design requires that client must resend
 443 request with -EINPROGRESS error indefinitely, until successful
 444 completion or another error.  This flag is set on both client and
 445 server by default. Meanwhile this flag is not checked anywere, so does
 446 nothing.
 447
 448 If the OBD_CONNECT_FLOCK_OWNER is set then 1.8 clients has fixed flock
 449 policy and 2.x servers recognize them correctly. Meanwhile this flag
 450 is not checked anywhere, so does nothing.
 451
 452 If the OBD_CONNECT_NANOSEC_TIME is set then nanosecond timestamps are
 453 enabled.  This flag is not used nowadays, but reserved for future use.
 454
 455 If the OBD_CONNECT_SHORTIO is set then short IO feature is enabled on
 456 server.  The server will avoid bulk IO for small amount of data but
 457 data is incapsulated into ptlrpc request/reply. This flag is just
 458 reserved for future use and does nothing nowadays.
 459
 460 Import
 461 ^^^^^^
 462
 463 ----
 464 #define IMP_STATE_HIST_LEN 16
 465 struct import_state_hist {
 466         enum lustre_imp_state ish_state;
 467         time_t                ish_time;
 468 };
 469 struct obd_import {
 470     struct portals_handle     imp_handle;
 471     atomic_t                  imp_refcount;
 472     struct lustre_handle      imp_dlm_handle;
 473     struct ptlrpc_connection *imp_connection;
 474     struct ptlrpc_client     *imp_client;
 475     cfs_list_t        imp_pinger_chain;
 476     cfs_list_t        imp_zombie_chain;
 477     cfs_list_t        imp_replay_list;
 478     cfs_list_t        imp_sending_list;
 479     cfs_list_t        imp_delayed_list;
 480     cfs_list_t      imp_committed_list;
 481     cfs_list_t     *imp_replay_cursor;
 482     struct obd_device    *imp_obd;
 483     struct ptlrpc_sec    *imp_sec;
 484     struct mutex      imp_sec_mutex;
 485     cfs_time_t        imp_sec_expire;
 486     wait_queue_head_t     imp_recovery_waitq;
 487     atomic_t          imp_inflight;
 488     atomic_t          imp_unregistering;
 489     atomic_t          imp_replay_inflight;
 490     atomic_t          imp_inval_count;
 491     atomic_t          imp_timeouts;
 492     enum lustre_imp_state     imp_state;
 493     struct import_state_hist  imp_state_hist[IMP_STATE_HIST_LEN];
 494     int               imp_state_hist_idx;
 495     int               imp_generation;
 496     __u32             imp_conn_cnt;
 497     int               imp_last_generation_checked;
 498     __u64             imp_last_replay_transno;
 499     __u64             imp_peer_committed_transno;
 500     __u64             imp_last_transno_checked;
 501     struct lustre_handle      imp_remote_handle;
 502     cfs_time_t        imp_next_ping;
 503     __u64             imp_last_success_conn;
 504     cfs_list_t        imp_conn_list;
 505     struct obd_import_conn   *imp_conn_current;
 506     spinlock_t      imp_lock;
 507     /* flags */
 508     unsigned long
 509       imp_no_timeout:1,
 510       imp_invalid:1,
 511       imp_deactive:1,
 512       imp_replayable:1,
 513       imp_dlm_fake:1,
 514       imp_server_timeout:1,
 515       imp_delayed_recovery:1,
 516       imp_no_lock_replay:1,
 517       imp_vbr_failed:1,
 518       imp_force_verify:1,
 519       imp_force_next_verify:1,
 520       imp_pingable:1,
 521       imp_resend_replay:1,
 522       imp_no_pinger_recover:1,
 523       imp_need_mne_swab:1,
 524       imp_force_reconnect:1,
 525       imp_connect_tried:1;
 526     __u32             imp_connect_op;
 527     struct obd_connect_data   imp_connect_data;
 528     __u64             imp_connect_flags_orig;
 529     int               imp_connect_error;
 530     __u32             imp_msg_magic;
 531     __u32             imp_msghdr_flags;       /* adjusted based on server capability */
 532     struct ptlrpc_request_pool *imp_rq_pool;      /* emergency request pool */
 533     struct imp_at         imp_at;         /* adaptive timeout data */
 534     time_t            imp_last_reply_time;    /* for health check */
 535 };
 536 ----
 537
 538 The 'imp_handle' value is the unique id for the import, and is used as
 539 a hash key to gain access to it. It is not used in any of the Lustre
 540 protocol messages, but rather is just for internal reference.
 541
 542 The 'imp_refcount' is also for internal use. The value is incremented
 543 with each RPC created, and decremented as the request is freed. When
 544 the reference count is zero the import can be freed, as when the
 545 target is being disconnected.
 546
 547 The 'imp_dlm_handle' is a reference to the LDLM export for this
 548 client.
 549
 550 There can be multiple paths through the network to a given
 551 target, in which case there would be multiple 'obd_import_conn' items
 552 on the 'imp_conn_list'. Each 'obd_imp_conn' includes a
 553 'ptlrpc_connection', so 'imp_connection' points to the one that is
 554 actually in use.
 555
 556 The 'imp_client' identifies the (local) portals for sending and
 557 receiving messages as well as the client's name. The information is
 558 specific to either an MDC or an OSC.
 559
 560 The 'imp_ping_chain' places the import on a linked list of imports
 561 that need periodic pings.
 562
 563 The 'imp_zombie_chain' places the import on a list ready for being
 564 freed. Unused imports ('imp_refcount' is zero) are deleted
 565 asynchronously by a garbage collecting process.
 566
 567 In order to support recovery the client must keep requests that are in
 568 the process of being handled by the target.  The target replies to a
 569 request as soon as the target has made its local update to
 570 memory. When the client receives that reply the request is put on the
 571 'imp_replay_list'. In the event of a failure (target crash, lost
 572 message) this list is then replayed for the target during the recovery
 573 process. When a request has been sent but has not yet received a reply
 574 it is placed on the 'imp_sending_list'. In the event of a failure
 575 those will simply be replayed after any recovery has been
 576 completed. Finally, there may be requests that the client is delaying
 577 before it sends them. This can happen if the client is in a degraded
 578 mode, as when it is in recovery after a failure. These requests are
 579 put on the 'imp_delayed_list' and not processed until recovery is
 580 complete and the 'imp_sending_list' has been replayed.
 581
 582 In order to support recovery 'open' requests must be preserved even
 583 after they have completed. Those requests are placed on the
 584 'imp_committed_list' and the 'imp_replay_cursor' allows for
 585 accelerated access to those items.
 586
 587 The 'imp_obd' is a reference to the details about the target device
 588 that is the subject of this import. There is a lot of state info in
 589 there along with many implementation details that are not relevant to
 590 the actual Lustre protocol. fixme: I'll want to go through all of the
 591 fields in that structure to see which, if any need more
 592 documentation.
 593
 594 The security policy and settings are kept in 'imp_sec', and
 595 'imp_sec_mutex' helps manage access to that info. The 'imp_sec_expire'
 596 setting is in support of security policies that have an expiration
 597 strategy.
 598
 599 Some processes may need the import to be in a fully connected state in
 600 order to proceed. The 'imp_recovery_waitq' is where those threads will
 601 wait during recovery.
 602
 603 The 'imp_inflight' field counts the number of in-flight requests. It
 604 is incremented with each request sent and decremented with each reply
 605 received.
 606
 607 The client reserves buffers for the processing of requests and
 608 replies, and then informs LNet about those buffers. Buffers may get
 609 reused during subsequent processing, but then a point may come when
 610 the buffer is no longer going to be used. The client increments the
 611 'imp_unregistering' counter and informs LNet the buffer is no longer
 612 needed. When LNet has freed the buffer it will notify the client and
 613 then the 'imp_unregistering' can be decremented again.
 614
 615 During recovery the 'imp_reply_inflight' counts the number of requests
 616 from the reply list that have been sent and have not been replied to.
 617
 618 The 'imp_inval_count' field counts how many threads are in the process
 619 of cleaning up this connection or waiting for cleanup to complete. The
 620 cleanup itself may be needed in the case there is an eviction or other
 621 problem (fixme what other problem?). The cleanup may involve freeing
 622 allocated resources, updating internal state, running replay lists,
 623 and invalidating cache. Since it could take a while there may end up
 624 multiple threads waiting on this process to complete.
 625
 626 The 'imp_timeout' field is a counter that is incremented every time
 627 there is a timeout in communication with the target.
 628
 629 The 'imp_state' tracks the state of the import. It draws from the
 630 enumerated set of values:
 631
 632 .enum_lustre_imp_state
 633 [options="header"]
 634 |=====
 635 | state name              | value
 636 | LUSTRE_IMP_CLOSED       | 1
 637 | LUSTRE_IMP_NEW          | 2
 638 | LUSTRE_IMP_DISCON       | 3
 639 | LUSTRE_IMP_CONNECTING   | 4
 640 | LUSTRE_IMP_REPLAY       | 5
 641 | LUSTRE_IMP_REPLAY_LOCKS | 6
 642 | LUSTRE_IMP_REPLAY_WAIT  | 7
 643 | LUSTRE_IMP_RECOVER      | 8
 644 | LUSTRE_IMP_FULL         | 9
 645 | LUSTRE_IMP_EVICTED      | 10
 646 |=====
 647 fixme: what are the transitions between these states? The
 648 'imp_state_hist' array maintains a list of the last 16
 649 (IMP_STATE_HIST_LEN) states the import was in, along with the time it
 650 entered each (fixme: or is it when it left that  state?). The list is
 651 maintained in a circular manner, so the 'imp_state_hist_idx' points to
 652 the entry in the list for the most recently visited state.
 653
 654 The 'imp_generation' and 'imp_conn_cnt' fields are monotonically
 655 increasing counters. Every time a connection request is sent to the
 656 target the 'imp_conn_cnt' counter is incremented, and every time a
 657 reply is received for the connection request the 'imp_generation'
 658 counter is incremented.
 659
 660 The 'imp_last_generation_checked' implements an optimization. When a
 661 replay process has successfully traversed the reply list the
 662 'imp_generation' value is noted here. If the generation has not
 663 incremented then the replay list does not need to be traversed again.
 664
 665 During replay the 'imp_last_replay_transno' is set to the transaction
 666 number of the last request being replayed, and
 667 'imp_peer_committed_transno is set to the 'pb_last_committed' value
 668 (of the 'ptlrpc_body') from replies if that value is higher than the
 669 previous 'imp_peer_committed_transno'.  The 'imp_last_transno_checked'
 670 field implements an optimization. It is set to the
 671 'imp_last_replay_transno' as its replay is initiated. If
 672 'imp_last_transno_checked' is still 'imp_last_replay_transno' and
 673 'imp_generation' is still 'imp_last_generation_checked' then  there
 674 are no additional requests ready to be removed from the replay
 675 list. Furthermore, 'imp_last_transno_checked' may no longer be needed,
 676 since the committed transactions are now maintained on a separate list.
 677
 678 The 'imp_remote_handle' is the handle sent by the target in a
 679 connection reply message to uniquely identify the export for this
 680 target and client that is maintained on the server. This is the handle
 681 used in all subsequent messages to the target.
 682
 683 There are two separate ping intervals (fixme: what are the
 684 values?). If there are no uncommitted messages for the target then the
 685 default ping interval is used to set the 'imp_next_ping' to the time
 686 the next ping needs to be sent. If there are uncommitted requests then
 687 a "short interval" is used to set the time for the next ping.
 688
 689 The 'imp_last_success_conn' value is set to the time of the last
 690 successful connection. fixme: The source says it is in 64 bit
 691 jiffies, but does not further indicate how that value is calculated.
 692
 693 Since there can actually be multiple connection paths for a target
 694 (due to failover or multihomed configurations) the import maintains a
 695 list of all the possible connection paths in the list pointed to by
 696 the 'imp_conn_list' field. The 'imp_conn_current' points to the one
 697 currently in use. Compare with the 'imp_connection' fields. They point
 698 to different structures, but each is reachable from the other.
 699
 700 Most of the flag, state, and list information in the import needs to
 701 be accessed atomically. The 'imp_lock' is used to maintain the
 702 consistency of the import while it is manipulated by multiple threads.
 703
 704 The various flags are documented in the source code and are largely
 705 obvious from those short comments, reproduced here:
 706
 707 .import flags
 708 [options="header"]
 709 |=====
 710 | flag                    | explanation
 711 | imp_no_timeout          | timeouts are disabled
 712 | imp_invalid             | client has been evicted
 713 | imp_deactive            | client administratively disabled
 714 | imp_replayable          | try to recover the import
 715 | imp_dlm_fake            | don't run recovery (timeout instead)
 716 | imp_server_timeout      | use 1/2 timeout on MDSs and OSCs
 717 | imp_delayed_recovery    | VBR: imp in delayed recovery
 718 | imp_no_lock_replay      | VBR: if gap was found then no lock replays
 719 | imp_vbr_failed          | recovery by versions was failed
 720 | imp_force_verify        | force an immidiate ping
 721 | imp_force_next_verify   | force a scheduled ping
 722 | imp_pingable            | target is pingable
 723 | imp_resend_replay       | resend for replay
 724 | imp_no_pinger_recover   | disable normal recovery, for test only.
 725 | imp_need_mne_swab       | need IR MNE swab
 726 | imp_force_reconnect     | import must be reconnected, not new connection
 727 | imp_connect_tried       | import has tried to connect with server
 728 |=====
 729 A few additional notes are in order. The 'imp_dlm_fake' flag signifies
 730 that this is not a "real" import, but rather it is a "reverse"import
 731 in support of the LDLM. When the LDLM invokes callback operations the
 732 messages are initiated at the other end, so there need to a fake
 733 import to receive the replies from the operation. Prior to the
 734 introduction of adaptive timeouts the servers were given fixed timeout
 735 value that were half those used for the clients. The
 736 'imp_server_timeout' flag indicated that the import should use the
 737 half-sized timeouts, but with the introduction of adaptive timeouts
 738 this facility is no longer used. "VBR" is "version based recovery",
 739 and it introduces a new possibility for handling requests. Previously,
 740 f there were a gap in the transaction number sequence the the requests
 741 associated with the missing transaction numbers would be
 742 discarded. With VBR those transaction only need to be discarded if
 743 there is an actual dependency between the ones that were skipped and
 744 the currently latest committed transaction number. fixme: What are the
 745 circumstances that would lead to setting the 'imp_force_next_verify'
 746 or 'imp_pingable' flags? During recovery, the client sets the
 747 'imp_no_pinger_recover' flag, which tells the process to proceed from
 748 the current value of 'imp_replay_last_transno'. The
 749 'imp_need_mne_swab' flag indicates a version dependent circumstance
 750 where swabbing was inadvertently left out of one processing step.
 751
 752
 753 Export
 754 ^^^^^^
 755
 756 An 'obd_export' structure for a given target is created on a server
 757 for each client that connects to that target. The exports for all the
 758 clients for a given target are managed together. The export represents
 759 the connection state between the client and target as well as the
 760 current state of any ongoing activity. Thus each pending request will
 761 have a reference to the export. The export is discarded if the
 762 connection goes away, but only after all the references to it have
 763 been cleaned up. The state information for each export is also
 764 maintained on disk. In the event of a server failure, that or another
 765 server can read the export date from disk to enable recovery.
 766
 767 ----
 768 struct obd_export {
 769     struct portals_handle     exp_handle;
 770     atomic_t   exp_refcount;
 771     atomic_t   exp_rpc_count;
 772     atomic_t   exp_cb_count;
 773     atomic_t   exp_replay_count;
 774     atomic_t   exp_locks_count;
 775 #if LUSTRE_TRACKS_LOCK_EXP_REFS
 776     cfs_list_t exp_locks_list;
 777     spinlock_t      exp_locks_list_guard;
 778 #endif
 779     struct obd_uuid       exp_client_uuid;
 780     cfs_list_t exp_obd_chain;
 781     cfs_hlist_node_t      exp_uuid_hash;
 782     cfs_hlist_node_t      exp_nid_hash;
 783     cfs_list_t            exp_obd_chain_timed;
 784     struct obd_device    *exp_obd;
 785     struct obd_import    *exp_imp_reverse;
 786     struct nid_stat      *exp_nid_stats;
 787     struct ptlrpc_connection *exp_connection;
 788     __u32       exp_conn_cnt;
 789     cfs_hash_t *exp_lock_hash;
 790     cfs_hash_t *exp_flock_hash;
 791     cfs_list_t  exp_outstanding_replies;
 792     cfs_list_t  exp_uncommitted_replies;
 793     spinlock_t  exp_uncommitted_replies_lock;
 794     __u64       exp_last_committed;
 795     cfs_time_t  exp_last_request_time;
 796     cfs_list_t  exp_req_replay_queue;
 797     spinlock_t  exp_lock;
 798     struct obd_connect_data   exp_connect_data;
 799     enum obd_option       exp_flags;
 800     unsigned long
 801       exp_failed:1,
 802       exp_in_recovery:1,
 803       exp_disconnected:1,
 804       exp_connecting:1,
 805       exp_delayed:1,
 806       exp_vbr_failed:1,
 807       exp_req_replay_needed:1,
 808       exp_lock_replay_needed:1,
 809       exp_need_sync:1,
 810       exp_flvr_changed:1,
 811       exp_flvr_adapt:1,
 812       exp_libclient:1,
 813       exp_need_mne_swab:1;
 814     enum lustre_sec_part      exp_sp_peer;
 815     struct sptlrpc_flavor     exp_flvr;
 816     struct sptlrpc_flavor     exp_flvr_old[2];
 817     cfs_time_t exp_flvr_expire[2];
 818     spinlock_t exp_rpc_lock;
 819     cfs_list_t exp_hp_rpcs;
 820     cfs_list_t exp_reg_rpcs;
 821     cfs_list_t exp_bl_list;
 822     spinlock_t exp_bl_list_lock;
 823     union {
 824         struct tg_export_data     eu_target_data;
 825         struct mdt_export_data    eu_mdt_data;
 826         struct filter_export_data eu_filter_data;
 827         struct ec_export_data     eu_ec_data;
 828         struct mgs_export_data    eu_mgs_data;
 829     } u;
 830     struct nodemap      *exp_nodemap;
 831 };
 832 ----
 833
 834 The 'exp_handle' is a little extra information as compared with a
 835 'struct lustre_handle', which is just the cookie. The cookie that the
 836 server generates to uniquely identify this connection gets put into
 837 this structure along with their information about the device in
 838 question. This is the cookie the *_CONNECT reply sends back to the
 839 client and is then stored int he client's import.
 840
 841 The 'exp_refcount' gets incremented whenever some aspect of the export
 842 is "in use". The arrival of an otherwise unprocessed message for this
 843 target will increment the refcount. A reference by an LDLM lock that
 844 gets taken will increment the refcount. Callback invocations and
 845 replay also lead to incrementing the 'ref_count'. The next four fields
 846 - 'exp_rpc_count', exp_cb_count', and 'exp_replay_count', and
 847 'exp_locks_count' - all subcategorize the 'exp_refcount'. The
 848 reference counter keeps the export alive while there are any users of
 849 that export. The reference counter is also used for debug
 850 purposes. Similarly, the 'exp_locks_list' and 'exp_locks_list_guard'
 851 are further debug info that list the actual locks accounted for in
 852 'exp_locks_count'.
 853
 854 The 'exp_client_uuid' gives the UUID of the client connected to this
 855 export. Fixme: when and how does the UUID get generated?
 856
 857 The server maintains all the exports for a given target on a circular
 858 list. Each export's place on that list is maintained in the
 859 'exp_obd_chain'. A common activity is to look up the export based on
 860 the UUID or the nid of the client, and the 'exp_uuid_hash' and
 861 'exp_nid_hash' fields maintain this export's place in hashes
 862 constructed for that purpose.
 863
 864 Exports are also maintained on a list sorted by the last time the
 865 corresponding client was heard from. The 'exp_obd_chain_timed' field
 866 maintains the export's place on that list. When a message arrives from
 867 the client the time is "now" so the export gets put at the end of the
 868 list. Since it is circular, the next export is then the oldest. If it
 869 has not been heard of within its timeout interval that export is
 870 marked for later eviction.
 871
 872 The 'exp_obd' points to the 'obd_device' structure for the device that
 873 is the target of this export.
 874
 875 In the event of an LDLM call-back the export needs to have a the ability to
 876 initiate messages back to the client. The 'exp_imp_reverse' provides a
 877 "reverse" import that manages this capability.
 878
 879 The '/proc' stats for the export (and the target) get updated via the
 880 'exp_nid_stats'.
 881
 882 The 'exp_connection' points to the connection information for this
 883 export. This is the information about the actual networking pathway(s)
 884 that get used for communication.
 885
 886
 887 The 'exp_conn_cnt' notes the connection count value from the client at
 888 the time of the connection. In the event that more than one connection
 889 request is issued before the connection is established then the
 890 'exp_conn_cnt' will list the highest value. If a previous connection
 891 attempt (with a lower value) arrives later it may be safely
 892 discarded. Every request lists its connection count, so non-connection
 893 requests with lower connection count values can also be discarded.
 894 Note that this does not count how many times the client has connected
 895 to the target. If a client is evicted the export is deleted once it
 896 has been cleaned up and its 'exp_ref_count' reduced to zero. A new
 897 connection from the client will get a new export.
 898
 899 The 'exp_lock_hash' provides access to the locks granted to the
 900 corresponding client for this target. If a lock cannot be granted it
 901 is discarded. A file system lock ("flock") is also implemented through
 902 the LDLM lock system, but not all LDLM locks are flocks. The ones that
 903 are flocks are gathered in a hash 'exp_flock_hash'. This supports
 904 deadlock detection.
 905
 906 For those requests that initiate file system modifying transactions
 907 the request and its attendant locks need to be preserved until either
 908 a) the client acknowleges recieving the reply, or b) the transaction
 909 has been committed locally. This ensures a request can be replayed in
 910 the event of a failure. The LDLM lock is being kept until one of these
 911 event occurs to prevent any other modifications of the same object.
 912 The reply is kept on the 'exp_outstanding_replies' list until the LNet
 913 layer notifies the server that the reply has been acknowledged. A reply
 914 is kept on the 'exp_uncommitted_replies' list until the transaction
 915 (if any) has been committed.
 916
 917 The 'exp_last_committed' value keeps the transaction number of the
 918 last committed transaction. Every reply to a client includes this
 919 value as a means of early-as-possible notification of transactions that
 920 have been committed.
 921
 922 The 'exp_last_request_time' is self explanatory.
 923
 924 During reply a request that is waiting for reply is maintained on the
 925 list 'exp_req_replay_queue'.
 926
 927 The 'exp_lock' spin-lock is used for access control to the exports
 928 flags, as well as the 'exp_outstanding_replies' list and the revers
 929 import, if any.
 930
 931 The 'exp_connect_data' refers to an 'obd_connect_data' structure for
 932 the connection established between this target and the client this
 933 export refers to. See also the corresponding entry in the import and
 934 in the connect messages passed between the hosts.
 935
 936 The 'exp_flags' field encodes three directives as follows:
 937 ----
 938 enum obd_option {
 939         OBD_OPT_FORCE =         0x0001,
 940         OBD_OPT_FAILOVER =      0x0002,
 941         OBD_OPT_ABORT_RECOV =   0x0004,
 942 };
 943 ----
 944 fixme: Are the set for some exports and a condition of their
 945 existence? or do they reflect a transient state the export is passing
 946 through?
 947
 948 The 'exp_failed' flag gets set whenever the target has failed for any
 949 reason or the export is otherwise due to be cleaned up. Once set it
 950 will not be unset in this export. Any subsequent connection between
 951 the client and the target would be governed by a new export.
 952
 953 After a failure export data is retrieved from disk and the exports
 954 recreated. Exports created in this way will have their
 955 'exp_in_recovery' flag set. Once any outstanding requests and locks
 956 have been recovered for the client, then the export is recovered and
 957 'exp_in_recovery' can be cleared. When all the client exports for a
 958 given target have been recovered then the target is considered
 959 recovered, and when all targets have been recovered the server is
 960 considered recovered.
 961
 962 A *_DISCONNECT message from the client will set the 'exp_disconnected'
 963 flag, as will any sort of failure of the target. Once set the export
 964 will be cleaned up and deleted.
 965
 966 When a *_CONNECT message arrives the 'exp_connecting' flag is set. If
 967 for some reason a second *_CONNECT request arrives from the client it can
 968 be discarded when this flag is set.
 969
 970 The 'exp_delayed' flag is no longer used. In older code it indicated
 971 that recovery had not completed in a timely fashion, but that a tardy
 972 recovery would still be possible, since there were no dependencies on
 973 the export.
 974
 975 The 'exp_vbr_failed' flag indicates a failure during the recovery
 976 process. See <<recovery>> for a more detailed discussion of recovery
 977 and transaction replay. For a file system modifying request, the
 978 server composes its reply including the 'pb_pre_versions' entries in
 979 'ptlrpc_body', which indicate the most recent updates to the
 980 object. The client updates the request with the 'pb_transno' and
 981 'pb_pre_versions' from the reply, and keeps that request until the
 982 target signals that the transaction has been committed to disk. If the
 983 client times-out without that confirmation then it will 'replay' the
 984 request, which now includes the 'pb_pre_versions' information. During
 985 a replay the target checks that the object has the same version as
 986 'pb_pre_versions' in replay. If this check fails then the object can't
 987 be restored in the same state as it was in before failure. Usually that
 988 happens if the recovery process fails for the connection between some
 989 other client and this target, so part of change needed for this client
 990 wasn't restored. At that point the 'exp_vbr_failed' flag is set
 991 to indicate version based recovery failed. This will lead to the client
 992 being evicted and this export being cleaned up and deleted.
 993
 994 At the start of recovery both the 'exp_req_replay_needed' and
 995 'exp_lock_replay_needed' flags are set. As request replay is completed
 996 the 'exp_req_replay_needed' flag is cleared. As lock replay is
 997 completed the 'exp_lock_replay_needed' flag is cleared. Once both are
 998 cleared the 'exp_in_recovery' flag can be cleared.
 999
1000 The 'exp_need_sync' supports an optimization. At mount time it is
1001 likely that every client (potentially thousands) will create an export
1002 and that export will need to be saved to disk synchronously. This can
1003 lead to an unusually high and poorly performing interaction with the
1004 disk. When the export is created the 'exp_need_sync' flag is set and
1005 the actual writing to disk is delayed. As transactions arrive from
1006 clients (in a much less coordinated fashion) the 'exp_need_sync' flag
1007 indicates a need to have the export data on disk before proceeding
1008 with a new transaction, so as it is next updated the transaction is
1009 done synchronously to commit all changes on disk. At that point the
1010 flag is cleared (except see below).
1011
1012 In DNE (phase I) the export for an MDT managing the connection from
1013 another MDT will want to always keep the 'exp_need_sync' flag set. For
1014 that special case such an export sets the 'exp_keep_sync', which then
1015 prevents the 'exp_need_sync' flag from ever being cleared. This will
1016 no longer be needed in DNE Phase II.
1017
1018 The 'exp_flvr_changed' and 'exp_flvr_adapt' flags along with
1019 'exp_sp_peer', 'exp_flvr', 'exp_flvr_old', and 'exp_flvr_expire'
1020 fields are all used to manage the security settings for the
1021 connection. Security is discussed in the <<security>> section. (fixme:
1022 or will be.)
1023
1024 The 'exp_libclient' flag indicates that the export is for a client
1025 based on "liblustre". This allows for simplified handling on the
1026 server. (fixme: how is processing simplified? It sounds like I may
1027 need a whole special section on liblustre.)
1028
1029 The 'exp_need_mne_swab' flag indicates the presence of an old bug that
1030 affected one special case of failed swabbing. It is not part of
1031 current processing.
1032
1033 As RPCs arrive they are first subjected to triage. Each request is
1034 placed on the 'exp_hp_rpcs' list and examined to see if it is high
1035 priority (PING, truncate, bulk I/O). If it is not high priority then
1036 it is moved to the 'exp_reg_prcs' list. The 'exp_rpc_lock' protects
1037 both lists from concurrent access.
1038
1039 All arriving LDLM requests get put on the 'exp_bl_list' and access to
1040 that list is controlled via the 'exp_bl_list_lock'.
1041
1042 The union provides for target specific data. The 'eu_target_data' is
1043 for a common core of fields for a generic target. The others are
1044 specific to particular target types: 'eu_mdt_data' for MDTs,
1045 'eu_filter_data' for OSTs, 'eu_ec_data' for an "echo client" (fixme:
1046 describe what an echo client is somewhere), and 'eu_mgs_data' is for
1047 an MGS.
1048
1049 The 'exp_bl_lock_at' field supports adaptive timeouts which will be
1050 discussed separately. (fixme: so discuss it somewhere.)
1051
1052 Connection Count
1053 ^^^^^^^^^^^^^^^^
1054
1055 Each export maintains a connection count.