lustre/doc/clio.txt

   1                ******************************************************
   2                * Overview of the Lustre Client I/O (CLIO) subsystem *
   3                ******************************************************
   4
   5 Original Authors:
   6 =================
   7 Nikita Danilov <Nikita_Danilov@xyratex.com>
   8
   9 Topics
  10 ======
  11
  12 1. Overview
  13         1.1. Goals
  14         1.2. Terminology
  15                 i.   I/O vs. Transfer
  16                 ii.  Top-{object,lock,page}, Sub-{object,lock,page}
  17                 iii. VVP, SLP, CCC
  18         1.3. Main Differences with the Pre-CLIO Client Code
  19         1.4. Layered objects, Slices, Headers
  20         1.5. Instantiation
  21         1.6. Life Cycle
  22         1.7. State Machines
  23         1.8. Finalization
  24         1.9. Code Structure
  25 2. Layers
  26         2.1. VVP, SLP, Echo-client
  27         2.2. LOV, LOVSUB (layouts)
  28         2.3. OSC
  29 3. Objects
  30         3.1. FID, Hashing, Caching, LRU
  31         3.2. Top-object, Sub-object
  32         3.3. Object Operations
  33         3.4. Object Attributes
  34 4. Pages
  35         4.1. Page Indexing
  36         4.2. Page Ownership
  37         4.3. Page Transfer Locking
  38         4.4. Page Operations
  39 5. Locks
  40         5.1. Lock Life Cycle
  41         5.2. Top-lock and Sub-locks
  42         5.3. Lock State Machine
  43         5.4. Lock Concurrency
  44         5.5. Shared Sub-locks
  45         5.6. Use Case: Lock Invalidation
  46 6. IO
  47         6.1. Fixed IO Types
  48         6.2. IO State Machine
  49         6.3. Parallel IO
  50         6.4. Data-flow: From Stack to IO Slice
  51 7. Transfer
  52         7.1. Immediate vs. Opportunistic Transfers
  53         7.2. Page Lists
  54         7.3. Transfer States: Prepare, Completion
  55         7.4. Page Completion Handlers, Synchronous Transfer
  56 8. LU Environment (lu_env)
  57         8.1. Motivation, Server Environment Usage
  58         8.2. Client Usage
  59         8.3. Sub-environments
  60 9. Use Cases
  61         9.1. Inode Creation
  62         9.2. First IO to a File
  63                 i. Read, Read-ahead
  64                 ii. Write
  65         9.3. Cached IO
  66         9.4. Lock-less and No-cache IO
  67
  68 ================
  69 = 1. Overview =
  70 ================
  71
  72 1.1. Goals
  73 ==========
  74
  75 CLIO is a re-write of interfaces between layers in the client data-path (read,
  76 write, truncate). Its goals are:
  77
  78 - Reduce the number of bugs in the IO path;
  79
  80 - Introduce more logical layer interfaces instead of the current all-in-one
  81   OBD device interface;
  82
  83 - Define clear and precise semantics for the interface entry points;
  84
  85 - Simplify the structure of the client code;
  86
  87 - Support upcoming features:
  88
  89         . SNS,
  90         . p2p caching,
  91         . parallel non-blocking IO,
  92         . pNFS;
  93
  94 - Reduce stack consumption.
  95
  96 Restrictions:
  97
  98 - No meta-data changes;
  99 - No support for 2.4 kernels;
 100 - Portable code;
 101 - No changes to recovery;
 102 - The same layers with mostly the same functionality;
 103 - As few changes to the core logic of each Lustre data-stack layer as possible
 104   (e.g., no changes to the read-ahead or OSC RPC logic).
 105
 106 1.2. Terminology
 107 ================
 108
 109 Any discussion of client functionality has to talk about `read' and `write'
 110 system calls on the one hand and about `read' and `write' requests to the
 111 server on the other hand. To avoid confusion, the former high level operations
 112 are called `IO', while the latter are called `transfer'.
 113
 114 Many concepts apply uniformly to pages, locks, files, and IO contexts, such as
 115 reference counting, caching, etc. To describe such situations, a common term is
 116 needed to denote things from any of the above classes. `Object' would be a
 117 natural choice, but files and especially stripes are already called objects, so
 118 the term `entity' is used instead.
 119
 120 Because of striping, an entity is often composed of multiple entities of the
 121 same kind: a file is composed of stripe objects, a logical lock on a file is
 122 composed of stripe locks on the file's stripes, etc. In these cases we shall
 123 talk about a top-object, top-lock, top-page, top-IO, etc. being constructed
 124 from sub-objects, sub-locks, sub-pages, and sub-IOs respectively.
 125
 126 The topmost module in the Linux client is traditionally known as `llite'. The
 127 corresponding CLIO layer is called `VVP' (VFS, VM, POSIX) to reflect its
 128 functional responsibilities. The top-level layer for liblustre is called `SLP'.
 129 VVP and SLP share a lot of logic and data-types. Their common functions and
 130 types are prefixed with `ccc' which stands for "Common Client Code".
 131
 132 1.3. Main Differences with the Pre-CLIO Client Code
 133 ===================================================
 134
 135 - Locks on files (as opposed to locks on stripes) are first-class objects (i.e.
 136   layered entities);
 137
 138 - Sub-objects (stripes) are first-class objects;
 139
 140 - Stripe-related logic is moved out of llite (almost);
 141
 142 - IO control flow is different:
 143
 144         . Pre-CLIO: llite implements control flow, calling underlying OBD
 145           methods as necessary;
 146
 147         . CLIO: generic code (cl_io_loop()) controls IO logic calling all
 148           layers, including VVP.
 149
 150   In other words, VVP (or any other top-layer), rather than only calling some
 151   pre-existing `lustre interface', also implements parts of this interface.
 152
 153 - The lu_env allocator from MDT is used on the client.
 154
 155 1.4. Layered Objects
 156 ====================
 157
 158 CLIO continues the layered object approach that was found to be useful for the
 159 MDS rewrite in Lustre 2.0. In this approach, instances of key entity types
 160 (files, pages, locks, etc.) are represented as a header, containing attributes
 161 shared by all layers. Each header contains a linked list of per-layer `slices',
 162 each of which contains a pointer to a vector of function pointers. Generic
 163 operations on layered objects are implemented by going through the list of
 164 slices and invoking the corresponding function from the operation vector at
 165 every layer.  In this way generic object behavior is delegated to the layers.
 166
 167 For example, a page entity is represented by struct cl_page, off which hangs
 168 a list of cl_page_slice structures, one for each layer in the stack.
 169 cl_page_slice contains a pointer to struct cl_page_operations.
 170 cl_page_operations has the field
 171
 172         void (*cpo_completion)(const struct lu_env *env,
 173               const struct cl_page_slice *slice, int ioret);
 174
 175 When transfer of a page is finished, ->cpo_completion() methods are called in a
 176 particular order (bottom to top in this case).
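
As a rough illustration of the dispatch described above, the following
self-contained C sketch keeps a list of per-layer slices hanging off a header
and calls a completion method at every layer, bottom to top. All demo_* names
are hypothetical and are not the CLIO definitions; the real
cl_page_operations methods also take a struct lu_env argument, which is
omitted here.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CALLS 8

struct demo_page_slice;

/* Per-layer operation vector (hypothetical, modelled on cl_page_operations). */
struct demo_page_ops {
        void (*dpo_completion)(const struct demo_page_slice *slice, int ioret);
};

/* One slice per layer, linked into the header from top to bottom. */
struct demo_page_slice {
        const char                 *dps_layer;
        const struct demo_page_ops *dps_ops;
        struct demo_page_slice     *dps_next;   /* towards the bottom layer */
        struct demo_page_slice     *dps_prev;   /* towards the top layer */
};

/* Header shared by all layers. */
struct demo_page {
        struct demo_page_slice *dp_top;
        struct demo_page_slice *dp_bottom;
};

/* Record of the invocation order, kept only to make the sketch observable. */
static const char *demo_calls[DEMO_MAX_CALLS];
static int demo_ncalls;

static void demo_completion(const struct demo_page_slice *slice, int ioret)
{
        (void)ioret;
        if (demo_ncalls < DEMO_MAX_CALLS)
                demo_calls[demo_ncalls++] = slice->dps_layer;
}

static const struct demo_page_ops demo_ops = {
        .dpo_completion = demo_completion,
};

/* Append a slice at the tail, i.e. below all slices added so far. */
static void demo_slice_add(struct demo_page *page,
                           struct demo_page_slice *slice, const char *layer)
{
        slice->dps_layer = layer;
        slice->dps_ops   = &demo_ops;
        slice->dps_next  = NULL;
        slice->dps_prev  = page->dp_bottom;
        if (page->dp_bottom != NULL)
                page->dp_bottom->dps_next = slice;
        else
                page->dp_top = slice;
        page->dp_bottom = slice;
}

/* Generic operation: call ->dpo_completion() at every layer, bottom to top. */
static void demo_page_completion(const struct demo_page *page, int ioret)
{
        const struct demo_page_slice *s;

        for (s = page->dp_bottom; s != NULL; s = s->dps_prev)
                if (s->dps_ops->dpo_completion != NULL)
                        s->dps_ops->dpo_completion(s, ioret);
}
```

The generic code never needs to know which layers are present: it only walks
the list, so adding or removing a layer does not change the caller.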
 177
 178 Allocation of slices is done during instance creation. If a layer needs some
 179 private state for an object, it embeds the slice into its own data structure.
 180 For example, the OSC layer defines
 181
 182         struct osc_lock {
 183                 struct cl_lock_slice ols_cl;
 184                 struct ldlm_lock *ols_lock;
 185                 ...
 186         };
 187
 188 When an operation from cl_lock_operations is called, it is given a pointer to
 189 struct cl_lock_slice, and the layer casts it to its private structure (for
 190 example, struct osc_lock) to access per-layer state.
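
The cast from a slice to the enclosing private structure can be sketched as
follows. This is a minimal illustration, assuming a container_of()-style
macro built on offsetof(); the demo_* names and the integer handle are
hypothetical stand-ins, not the OSC definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as the kernel's container_of(): recover the enclosing struct. */
#define demo_container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic slice, as seen by the layer-independent code. */
struct demo_lock_slice {
        int dls_layer_id;
};

/* Layer-private lock state with the slice embedded in it. */
struct demo_osc_lock {
        struct demo_lock_slice dol_cl;
        int                    dol_handle;   /* stand-in for private state */
};

/* Convert the slice handed to an operation back to the private structure. */
static struct demo_osc_lock *
demo_slice2lock(const struct demo_lock_slice *slice)
{
        return demo_container_of(slice, struct demo_osc_lock, dol_cl);
}

/* A layer method only receives the slice, yet can reach its own state. */
static int demo_lock_handle(const struct demo_lock_slice *slice)
{
        return demo_slice2lock(slice)->dol_handle;
}
```

Because the conversion uses the member offset, it keeps working even if the
slice is not the first field of the private structure.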
 191
 192 The following types of layered objects exist in CLIO:
 193
 194 - File system objects (files and stripes): struct cl_object_header, slices
 195   are of type struct cl_object;
 196
 197 - Cached pages with data: struct cl_page, slices are of type
 198   cl_page_slice;
 199
 200 - Extent locks: struct cl_lock, slices are of type cl_lock_slice;
 201
 202 - IO context: struct cl_io, slices are of type cl_io_slice;
 203
 204 - Transfer request: struct cl_req, slices are of type cl_req_slice.
 205
 206 1.5. Instantiation
 207 ==================
 208
 209 Entities with different sequences of slices can co-exist. A typical example of
 210 this is a local vs. remote object on the MDS. A local object, based on some
 211 file in the local file system, has MDT, MDD, LOD and OSD as its layers,
 212 whereas a remote object (representing an object local to some other MDT) has
 213 MDT, MDD, LOD, and OSP layers.
 214
 215 When the client is being mounted, its device stack is configured according to
 216 llog configuration records. The typical configuration is
 217
 218                            vvp_device
 219                                 |
 220                                 V
 221                            lov_device
 222                                 |
 223                         +---+---+---+---+
 224                         |   |   |   |   |
 225                         V   V   V   V   V
 226                       .... osc_device's ....
 227
 228 In this tree every node knows its descendants. When a new file (inode) is
 229 created, every layer, starting from the top, creates a slice with a state and
 230 an operation vector for this layer, appends this slice to the tail of a list
 231 anchored at the object header, and then calls the corresponding lower layer
 232 device to do the same. That is, the file object structure is determined by the
 233 configuration of devices to which this file belongs.
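
The top-down instantiation can be sketched as a walk over a chain of devices,
each appending its slice and then delegating to the device below. This is a
simplified illustration under stated assumptions: it models a single chain
only (the LOV fan-out to several OSC devices is ignored), a slice is reduced
to the name of its layer, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_LAYERS 4

/* A device in the stack; each one knows the device directly below it. */
struct demo_device {
        const char               *dd_name;
        const struct demo_device *dd_next;   /* NULL at the bottom */
};

/* Object header with its slices recorded in top-to-bottom order. */
struct demo_object_header {
        const char *doh_layers[DEMO_MAX_LAYERS];
        int         doh_nr;
};

/* Each layer appends its slice, then asks the device below to do the same. */
static int demo_object_init(struct demo_object_header *hdr,
                            const struct demo_device *dev)
{
        if (dev == NULL)
                return 0;
        if (hdr->doh_nr == DEMO_MAX_LAYERS)
                return -1;
        hdr->doh_layers[hdr->doh_nr++] = dev->dd_name;
        return demo_object_init(hdr, dev->dd_next);
}
```

The resulting slice order mirrors the device configuration, which is the
sense in which the object structure is determined by the device stack.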
 234
 235 Pages and locks, in turn, belong to the file objects, and when a new page is
 236 created for a given object, slices of this object are iterated through and
 237 every slice is asked to initialize a new page, which includes (usually)
 238 allocation of a new page slice and its insertion into a list of page slices.
 239 Lock and IO context instantiation is handled similarly.
 240
 241 1.6. Life Cycle
 242 ===============
 243
 244 All layered objects except IO contexts and transfer requests (which leaves file
 245 objects, pages and locks) are reference counted and cached. They have a uniform
 246 caching mechanism:
 247
 248 - Objects are kept in some sort of an index (global FID hash for file objects,
 249   per-file radix tree for pages, and per-file list for locks);
 250
 251 - A reference to an object is acquired by the cl_{object,page,lock}_find()
 252   functions, which search the index and, if the object is not there, create a
 253   new one and insert it into the index;
 254
 255 - A reference is released by cl_{object,page,lock}_put() functions. When the
 256   last reference is released, the object is returned to the cache (still in the
 257   index), except when the user has explicitly set the `do not cache' flag for
 258   this object. In the latter case the object is destroyed immediately.
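
The find/put protocol above can be sketched with a toy index. This is a
single-threaded illustration only: the index is a plain linked list, no
locking is shown, and the demo_* names are hypothetical rather than the
cl_*_find() and cl_*_put() implementations.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_entity {
        unsigned long       de_key;
        int                 de_ref;
        int                 de_nocache;   /* the `do not cache' flag */
        struct demo_entity *de_next;      /* index linkage */
};

struct demo_index {
        struct demo_entity *di_head;
        int                 di_freed;     /* number of destroyed entities */
};

/* Look the key up; create and index a new entity when it is missing. */
static struct demo_entity *demo_find(struct demo_index *idx, unsigned long key)
{
        struct demo_entity *e;

        for (e = idx->di_head; e != NULL; e = e->de_next)
                if (e->de_key == key && !e->de_nocache) {
                        e->de_ref++;
                        return e;
                }
        e = calloc(1, sizeof(*e));
        if (e == NULL)
                return NULL;
        e->de_key    = key;
        e->de_ref    = 1;
        e->de_next   = idx->di_head;
        idx->di_head = e;
        return e;
}

/* Drop a reference; destroy on last put only if marked `do not cache'. */
static void demo_put(struct demo_index *idx, struct demo_entity *e)
{
        struct demo_entity **p;

        assert(e->de_ref > 0);
        if (--e->de_ref > 0 || !e->de_nocache)
                return;                   /* still in use, or stays cached */
        for (p = &idx->di_head; *p != NULL; p = &(*p)->de_next)
                if (*p == e) {
                        *p = e->de_next;
                        break;
                }
        free(e);
        idx->di_freed++;
}
```

An entity whose last reference is dropped stays in the index and is found
again by the next lookup, unless the flag was set before the final put.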
 259
 260 IO contexts are owned by a thread (or, potentially a group of threads) doing
 261 IO, and need neither reference counting nor indexing. Similarly, transfer
 262 requests are owned by an OSC device, and their lifetime is from RPC creation
 263 until completion notification.
 264
 265 1.7. State Machines
 266 ===================
 267
 268 All types of layered objects contain a state-machine, although for the transfer
 269 requests this machine is trivial (CREATED -> PREPARED -> INFLIGHT ->
 270 COMPLETED), and for the file objects it is very simple.  Page, lock, and IO
 271 state machines are described in more detail below.
 272
 273 As a generic rule, state machine transitions are made under some kind of lock:
 274 the VM lock for a page, a per-lock mutex for a cl_lock, and the LU site
 275 spin-lock for an object. After an event that might cause a state transition,
 276 this lock is taken, and the object state is analysed to check whether the
 277 transition is possible. If it is, the state machine is advanced and the
 278 lock is released. IO state transitions do not require concurrency control.
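
The check-then-advance pattern can be sketched on the trivial transfer
request machine mentioned above. The state names follow the text; the enum,
the structure, and the functions are hypothetical. The sketch is
single-threaded, so the lock that would guard the transition in CLIO is only
indicated in a comment.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_req_state {
        DRS_CREATED,
        DRS_PREPARED,
        DRS_INFLIGHT,
        DRS_COMPLETED
};

struct demo_req {
        enum demo_req_state dr_state;
};

/* The linear machine only allows a step to the immediate successor. */
static bool demo_req_can_advance(const struct demo_req *req,
                                 enum demo_req_state next)
{
        return (int)next == (int)req->dr_state + 1;
}

/*
 * Check whether the transition is possible and, if so, advance. For the
 * entities that need concurrency control, the caller would take the
 * guarding lock before this call and release it afterwards.
 */
static bool demo_req_advance(struct demo_req *req, enum demo_req_state next)
{
        if (!demo_req_can_advance(req, next))
                return false;
        req->dr_state = next;
        return true;
}
```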
 279
 280 1.8. Finalization
 281 =================
 282
 283 State machine and reference counting interact during object destruction. In
 284 addition to temporary pointers to an entity (that are counted in its reference
 285 counter), an entity is reachable through
 286
 287 - Indexing structures described above;
 288
 289 - Pointers internal to some layer of this entity. For example, a page is
 290   reachable through a pointer from VM page, lock might be reachable through a
 291   ldlm_lock::l_ast_data pointer, and sub-{lock,object,page} might be reachable
 292   through a pointer from its top-entity.
 293
 294 Entity destruction happens in three phases:
 295
 296 - First, a decision is made to destroy an entity, when, for example, a lock is
 297   cancelled, or a page is truncated from a file. At this point the `do not
 298   cache' bit is set in the entity header, and all ways to reach the entity
 299   from internal pointers are severed.
 300
 301   cl_{page,lock,object}_get() functions never return an entity with the `do not
 302   cache' bit set, so from this moment no new internal pointers can be
 303   obtained.
 304
 305   See: cl_page_delete(), cl_lock_delete();
 306
 307 - Pointers `drain' for some time as existing references are released. In
 308   this phase the entity is reachable through
 309
 310         . temporary pointers, counted in its reference counter, and
 311         . possibly a pointer in the indexing structure.
 312
 313 - When the last reference is released, the entity can be safely freed (after
 314   possibly removing it from the index).
 315
 316   See: lu_object_put(), cl_page_put(), cl_lock_put().
 317
 318 1.9. Code Structure
 319 ===================
 320
 321 The CLIO code resides in the following files:
 322
 323   {llite,lov,osc}/*_{dev,object,lock,page,io}.c
 324   liblustre/llite_cl.c
 325   lclient/*.c
 326   obdclass/cl_*.c
 327   include/cl_object.h
 328
 329 All global CLIO data-types are defined in the include/cl_object.h header,
 330 which contains detailed documentation. Generic CLIO code is in
 331 obdclass/cl_{object,page,lock,io}.c
 332
 333 An implementation of CLIO interfaces for a layer foo is located in
 334 foo/foo_{dev,object,page,lock,io}.c files, with the (temporary) exception of
 335 liblustre code that is located in liblustre/llite_cl.c.
 336
 337 Definitions of data-structures shared within a layer are in
 338 foo/foo_cl_internal.h
 339
 340 VVP and SLP share most of the CLIO functionality and data-structures. Common
 341 functions are defined in the lustre/lclient directory, and common types are in
 342 the lustre/include/lclient.h header.
 343
 344 =============
 345 = 2. Layers =
 346 =============
 347
 348 This section briefly outlines the responsibilities of every layer in the
 349 stack. A more detailed description of their functionality is given in the
 350 following sections on objects, pages and locks.
 351
 352 2.1. VVP, SLP, Echo-client
 353 ==========================
 354
 355 There are currently three options for the top-most Lustre layer:
 356
 357 - VVP: Linux kernel client,
 358 - SLP: liblustre client, and
 359 - echo-client: special client used by the Lustre testing sub-system.
 360
 361 Other possibilities are:
 362
 363 - Client ports to other operating systems (OSX, Windows, Solaris),
 364 - pNFS and NFS exports.
 365
 366 The responsibilities of the top-most layer include:
 367
 368 - Definition of the entry points through which Lustre is accessed by the
 369   applications;
 370 - Interaction with the hosting VM/MM system;
 371 - Interaction with the hosting VFS or equivalent;
 372 - Implementation of the desired semantics on top of Lustre (e.g. POSIX
 373   or Win32 semantics).
 374
 375 Let's look at VVP in more detail. First, VVP implements VFS entry points
 376 required by the Linux kernel interface: ll_file_{read,write,sendfile}(). Then,
 377 VVP implements VM entry points: ll_{write,invalidate,release}page().
 378
 379 For file objects, the VVP slice (ccc_object, shared with liblustre) contains a
 380 pointer to an inode.
 381
 382 For pages, the VVP slice (ccc_page) contains a pointer to the VM page
 383 (cfs_page_t), a `defer up to date' flag to track read-ahead hits (similar to
 384 the pre-CLIO client), and fields necessary for synchronous transfer (see
 385 below).  VVP is responsible for implementation of the interaction between
 386 client page (cl_page) and the VM.
 387
 388 There is no special VVP private state for locks.
 389
 390 For IO, VVP implements
 391
 392 - Mapping from Linux specific entry points (readv, writev, sendfile, etc.)
 393   to the Lustre IO loop,
 394
 395 - mmap,
 396
 397 - POSIX features like short reads, O_APPEND atomicity, etc.
 398
 399 - Read-ahead (this is arguably not the best layer in which to implement
 400   read-ahead, as the existing read-ahead algorithm is network-aware).
 401
 402 2.2. LOV, LOVSUB
 403 ================
 404
 405 The LOV layer implements RAID-0 striping. It maps top-entities (file objects,
 406 locks, pages, IOs) to one or more sub-entities. LOVSUB is a companion layer
 407 that does the reverse mapping.
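
The forward and reverse mappings can be illustrated with plain RAID-0
arithmetic. This is a sketch of the textbook round-robin layout, assuming a
fixed stripe size and stripe count; the structures and function names are
hypothetical and are not the LOV or LOVSUB code.

```c
#include <assert.h>
#include <stdint.h>

struct demo_layout {
        uint64_t dl_stripe_size;    /* bytes per stripe unit */
        uint32_t dl_stripe_count;   /* number of sub-objects */
};

struct demo_sub_offset {
        uint32_t dso_stripe;        /* index of the sub-object */
        uint64_t dso_offset;        /* byte offset within the sub-object */
};

/* Top to sub: which sub-object holds this file offset, and where. */
static struct demo_sub_offset demo_raid0_map(const struct demo_layout *lo,
                                             uint64_t file_off)
{
        struct demo_sub_offset sub;
        uint64_t unit;

        assert(lo->dl_stripe_size > 0 && lo->dl_stripe_count > 0);
        unit = file_off / lo->dl_stripe_size;
        sub.dso_stripe = (uint32_t)(unit % lo->dl_stripe_count);
        sub.dso_offset = (unit / lo->dl_stripe_count) * lo->dl_stripe_size +
                         file_off % lo->dl_stripe_size;
        return sub;
}

/* Sub to top: the reverse mapping back to a file offset. */
static uint64_t demo_raid0_unmap(const struct demo_layout *lo,
                                 const struct demo_sub_offset *sub)
{
        uint64_t row;

        assert(lo->dl_stripe_size > 0 && sub->dso_stripe < lo->dl_stripe_count);
        row = sub->dso_offset / lo->dl_stripe_size;
        return (row * lo->dl_stripe_count + sub->dso_stripe) *
               lo->dl_stripe_size + sub->dso_offset % lo->dl_stripe_size;
}
```

For example, with a stripe size of 100 bytes and 4 stripes, file offset 530
falls into stripe unit 5, which lives on sub-object 1 at offset 130.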
 408
 409 2.3. OSC
 410 ========
 411
 412 The OSC layer deals with networking:
 413
 414 - It decides when an efficient RPC can be formed from cached data;
 415
 416 - It calls LNET to initiate a transfer and to get notification of completion;
 417
 418 - It calls LDLM to implement distributed cache coherency, and to get
 419   notifications of lock cancellation requests.
 420
 421 ==============
 422 = 3. Objects =
 423 ==============
 424
 425 3.1. FID, Hashing, Caching, LRU
 426 ===============================
 427
 428 Files and stripes are collectively known as (file system) `objects'. The CLIO
 429 client reuses support for layered objects from the MDT stack. Both client and
 430 MDT objects are based on struct lu_object type, representing a slice of a file
 431 system object. lu_object's for a given object are linked through the
 432 ->lo_linkage field into a list hanging off field ->loh_layers of struct
 433 lu_object_header, that represents a whole layered object.
 434
 435 lu_object and lu_object_header provide functionality common between a client
 436 and a server:
 437
 438 - An object is uniquely identified by a FID; all objects are kept in a hash
 439   table indexed by a FID;
 440
 441 - Objects are reference counted. When the last reference to an object is
 442   released it is returned back into the cache, unless it has been explicitly
 443   marked for deletion, in which case it is immediately destroyed;
 444
 445 - Objects in the cache are kept in a LRU list that is scanned to keep cache
 446   size under control.
 447
 448 On the MDT, lu_object is wrapped into struct md_object where additional state
 449 that all server-side objects have is stored. Similarly, on a client, lu_object
 450 and lu_object_header are embedded into struct cl_object and struct
 451 cl_object_header where additional client state is stored.
 452
 453 cl_object_header contains the following additional state:
 454
 455 - ->coh_tree: a radix tree of cached pages for this object. In this tree pages
 456   are indexed by their logical offset from the beginning of this object. This
 457   tree is protected by ->coh_page_guard spin-lock;
 458
 459 - ->coh_locks: a doubly linked list of all locks for this object. Locks in all
 460   possible states (see Locks section below) reside in this list without any
 461   particular ordering.
 462
 463 3.2. Top-object, Sub-object
 464 ===========================
 465
 466 An important distinction from the server side, where md_object and dt_object
 467 are used, is that cl_object "fans out" at the LOV level: depending on the file
 468 layout, a single file is represented as a set of "sub-objects" (stripes). At
 469 the implementation level, struct lov_object contains an array of cl_objects.
 470 Each sub-object is a full-fledged cl_object, having its FID and living in the
 471 LRU and hash table. Each sub-object has its own radix tree of pages, and its
 472 own list of locks.
 473
 474 This leads to the next important difference from the server side: on the
 475 client, it is quite usual to have objects with different sequences of layers.
 476 For example, a typical top-object is composed of the following layers:
 477
 478 - VVP
 479 - LOV
 480
 481 whereas its sub-objects are composed of layers:
 482
 483 - LOVSUB
 484 - OSC
 485
 486 Here "LOVSUB" is a mostly dummy layer, whose purpose is to keep track of the
 487 object-subobject relationship:
 488
 489                   cl_object_header-+--->radix tree of pages
 490                           |        |
 491                           V        +--->list of locks
 492            inode<----ccc_object
 493                           |
 494                           V
 495                      lov_object
 496                           |
 497                   +---+---+---+---+
 498                   |   |   |   |   |
 499                   V   |   |   |   |
 500      cl_object_header |   .   .   .
 501            |          |   .   .   .
 502            |          V
 503            .  cl_object_header-+--->radix tree of pages
 504                       |        |
 505                       V        +--->list of locks
 506                 lovsub_object
 507                       |
 508                       V
 509                  osc_object
 510
 511 Sub-objects are not cached independently: when a top-object is about to be
 512 discarded from memory, all its sub-objects are torn down and destroyed too.
 513
 514 3.3. Object Operations
 515 ======================
 516
 517 In addition to the lu_object_operations vector, each cl_object slice has a
 518 cl_object_operations vector. lu_object_operations deals with the creation and
 519 destruction of objects. Client-specific cl_object_operations fall into two
 520 categories:
 521
 522 - Creation of dependent entities: these are ->coo_{page,lock,io}_init()
 523   methods called at every layer when a new page, lock or IO context is being
 524   created, and
 525
 526 - Object attributes: ->coo_attr_{get,set}() methods that are called to get or
 527   set common client object attributes (struct cl_attr): size, [mac]times, etc.
 528
 529 3.4. Object Attributes
 530 ======================
 531
 532 A cl_object has a set of attributes defined by struct cl_attr. Attributes
 533 include object size, object known-minimum-size (KMS), access, change and
 534 modification times, and ownership identifiers. A description of KMS is beyond
 535 the scope of this document; refer to the (non-)existent Lustre documentation
 536 on the subject.
 537
 538 Both top-objects and sub-objects have attributes. Consistency of the attributes
 539 is protected by a lock on the top-object, accessible through
 540 cl_object_attr_{un,}lock() calls. This allows a sub-object and its top-object
 541 attributes to be changed atomically.
 542
 543 Attributes are accessible through cl_object_attr_{g,s}et() functions that call
 544 per-layer ->coo_attr_{s,g}et() object methods. Top-object attributes are
 545 calculated from the sub-object ones by lov_attr_get(), which optimizes for the
 546 case when none of the sub-object attributes have changed since the last call
 547 to lov_attr_get().
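
The caching idea can be sketched as follows. This is not lov_attr_get(): it
is a hypothetical illustration that assumes two attributes only, takes the
top-object modification time as the maximum over sub-objects and the block
count as their sum, and uses a validity flag that is cleared whenever a
sub-object attribute is set. Size and KMS, which depend on the stripe
layout, are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_SUBS 8

struct demo_attr {
        uint64_t da_mtime;
        uint64_t da_blocks;
};

struct demo_top {
        struct demo_attr dt_sub[DEMO_MAX_SUBS];
        int              dt_nr;       /* number of sub-objects in use */
        struct demo_attr dt_cached;   /* last computed top attributes */
        bool             dt_valid;    /* cached copy is up to date */
        int              dt_recalcs;  /* how often the merge actually ran */
};

/* Every sub-object change goes through here and invalidates the cache. */
static void demo_sub_attr_set(struct demo_top *top, int i,
                              const struct demo_attr *attr)
{
        assert(i >= 0 && i < top->dt_nr);
        top->dt_sub[i] = *attr;
        top->dt_valid  = false;
}

/* Return cached top attributes, merging sub-objects only when needed. */
static struct demo_attr demo_top_attr_get(struct demo_top *top)
{
        int i;

        if (top->dt_valid)
                return top->dt_cached;
        top->dt_cached.da_mtime  = 0;
        top->dt_cached.da_blocks = 0;
        for (i = 0; i < top->dt_nr; i++) {
                if (top->dt_sub[i].da_mtime > top->dt_cached.da_mtime)
                        top->dt_cached.da_mtime = top->dt_sub[i].da_mtime;
                top->dt_cached.da_blocks += top->dt_sub[i].da_blocks;
        }
        top->dt_valid = true;
        top->dt_recalcs++;
        return top->dt_cached;
}
```

The flag is only trustworthy if no code path modifies sub-object attributes
behind its back, which is why a single setter is used in the sketch.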
 548
 549 As a further potential optimization, one can recalculate top-object attributes
 550 at the moment when any sub-object attribute is changed. This would avoid
 551 collecting cumulative attributes over all sub-objects. To implement this
 552 optimization _all_ changes of sub-object attributes must go through
 553 cl_object_attr_set().
 554
 555 ============
 556 = 4. Pages =
 557 ============
 558
 559 A cl_page represents a portion of a file cached in memory. All pages of a
 560 given file are of the same size, and are kept in the radix tree hanging off
 561 the cl_object_header.
 562
 563 A cl_page is associated with a VM page of the hosting environment (struct page
 564 in the Linux kernel, for example), cfs_page_t. It is assumed that this
 565 association is implemented by one of cl_page layers (top layer in the current
 566 design) that
 567
 568 - intercepts per-VM-page call-backs made by the host environment (e.g., memory
 569   pressure),
 570
 571 - translates state (page flag bits) and locking between lustre and the host
 572   environment.
 573
 574 The association between cl_page and cfs_page_t is immutable and established
 575 when cl_page is created. It is possible to imagine a setup where different
 576 pages get their backing VM buffers from different sources. For example, in the
 577 case of pNFS export, some pages might be backed by local DMU buffers, while
 578 others (representing data in remote stripes), by normal VM pages.
 579
 580 4.1. Page Indexing
 581 ==================
 582
 583 Pages within a given object are linearly ordered. The page index is stored in
 584 the ->cpo_index field. In a typical Lustre setup, a top-object has an array of
 585 sub-objects, and every page in a top-object corresponds to a page in one of its
 586 sub-objects. This second page (a sub-page of the first) is a first-class
 587 cl_page, and, in particular, it is inserted into the sub-object's radix tree,
 588 where it is indexed by its offset within the sub-object. Sub-page and top-page
 589 are linked together through the ->cp_child and ->cp_parent fields in struct
 590 cl_page:
 591
 592                        +------>radix tree of pages
 593                        |                  /|\
 594                        |                 /.|.\
 595                        |                 ..V..
 596            cl_object_header<------------cl_page<-----------------+
 597                    |          ->cp_obj     |                     |
 598                    V                       V                     |
 599     inode<----ccc_object<---------------ccc_page---->cfs_page_t  |
 600                    |        ->cpl_obj      |                     |
 601                    V                       V                     | ->cp_child
 602               lov_object<---------------lov_page                 |
 603                    |        ->cpl_obj                            | ->cp_parent
 604            +---+---+---+---+                                     |
 605            |   |   |   |   |                                     |
 606            .   |   .   .   .                                     |
 607                |                                                 |
 608                |   +------>radix tree of pages                   |
 609                |   |                       /|\                   |
 610                |   |                      /.|.\                  |
 611                V   |                      ..V..                  |
 612        cl_object_header<-----------------cl_page<----------------+
 613                |            ->cp_obj        |
 614                V                            V
 615          lovsub_object<-----------------lovsub_page
 616                |            ->cpl_obj       |
 617                V                            V
 618           osc_object<--------------------osc_page
 619                             ->cpl_obj
 620
 621 4.2. Page Ownership
 622 ===================
 623
 624 A cl_page can be "owned" by a particular cl_io (see below), guaranteeing this
 625 IO exclusive access to this page with regard to other IO attempts and
 626 various events changing page state (such as transfer completion, or eviction of
 627 the page from memory). Note that in general a cl_io cannot be identified with
 628 a particular thread, and page ownership is not exactly equal to the current
 629 thread holding a lock on the page. The layer implementing the association
 630 between cl_page and cfs_page_t has to implement ownership on top of available
 631 synchronization mechanisms.
 632
 633 While the Lustre client maintains the notion of page ownership by IO, the
 634 hosting MM/VM usually has its own page concurrency control mechanisms. For
 635 example, in Linux, page access is synchronized by the per-page PG_locked
 636 bit-lock, and generic kernel code (generic_file_*()) takes care to acquire and
 637 release such locks as necessary around the calls to the file system methods
 638 (->readpage(), ->prepare_write(), ->commit_write(), etc.). This leads to the
 639 situation when there are two different ways to own a page in the client:
 640
 641 - Client code explicitly and voluntarily owns the page (cl_page_own());
 642
 643 - The hosting VM locks a page and then calls the client, which has to "assume"
 644   ownership from the VM (cl_page_assume()).
 645
 646 The dual release methods are cl_page_disown() and cl_page_unassume().
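
The two ways of gaining and releasing ownership can be sketched with a toy
page that has a VM lock bit and an owner pointer. This is a hypothetical
model, assuming that own takes the VM lock itself, that assume adopts a lock
already taken by the VM, and that unassume leaves the page locked for the VM
to unlock. It is single-threaded and fails instead of waiting; the demo_*
names are not the cl_page_*() functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_io {
        int di_id;
};

struct demo_page {
        bool                  dp_vm_locked;  /* stand-in for the VM lock */
        const struct demo_io *dp_owner;      /* owning IO, or NULL */
};

/* Voluntary ownership: the client takes the VM lock, then records the owner. */
static int demo_page_own(struct demo_page *pg, const struct demo_io *io)
{
        if (pg->dp_vm_locked)
                return -1;        /* a real implementation would wait */
        pg->dp_vm_locked = true;
        pg->dp_owner     = io;
        return 0;
}

/* The VM has already locked the page; the IO only adopts the ownership. */
static void demo_page_assume(struct demo_page *pg, const struct demo_io *io)
{
        assert(pg->dp_vm_locked && pg->dp_owner == NULL);
        pg->dp_owner = io;
}

/* Counterpart of own: drop the ownership and the VM lock. */
static void demo_page_disown(struct demo_page *pg, const struct demo_io *io)
{
        assert(pg->dp_owner == io);
        pg->dp_owner     = NULL;
        pg->dp_vm_locked = false;
}

/* Counterpart of assume: drop the ownership, leave the VM lock to the VM. */
static void demo_page_unassume(struct demo_page *pg, const struct demo_io *io)
{
        assert(pg->dp_owner == io);
        pg->dp_owner = NULL;
}
```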
 647
 648 4.3. Page Transfer Locking
 649 ==========================
 650
 651 cl_page implements a simple locking design: as noted above, a page is protected
 652 by a VM lock while IO owns it. The same lock is kept while the page is in
 653 transfer.  Note that this is different from the standard Linux kernel behavior
 654 where page write-out is protected by a lock (PG_writeback) separate from VM
 655 lock (PG_locked). It is felt that this single-lock design is more portable and,
 656 moreover, Lustre cannot benefit much from a separate write-out lock due to LDLM
 657 locking.
 658
 659 4.4. Page Operations
 660 ====================
 661
 662 See documentation for cl_object.h:cl_page_operations. See cl_page state
 663 descriptions in documentation for cl_object.h:cl_page_state.
 664
 665 ============
 666 = 5. Locks =
 667 ============
 668
 669 A struct cl_lock represents an extent lock on cached file or stripe data.
 670 cl_lock is used only to maintain distributed cache coherency and provides no
 671 intra-node synchronization. It should be noted that, as with other Lustre DLM
 672 locks, cl_lock is actually a lock _request_, rather than a lock itself.
 673
 674 As locks protect cached data, and the unit of data caching is a page, locks are
 675 of page granularity.
 676
 677 5.1. Lock Life Cycle
 678 ====================
 679
 680 Locks for a given file are cached in a per-file doubly linked list. The overall
 681 lock life cycle is as follows:
 682
 683 - The lock is created in the CLS_NEW state. At this moment the lock doesn't
 684   actually protect anything;
 685
 686 - The lock is enqueued, that is, sent to the server, passing through the
 687   CLS_QUEUING state. In this state multiple network communications with
 688   multiple servers may occur;
 689
 690 - Once fully enqueued, the lock moves into the CLS_ENQUEUED state where it
 691   waits for a final reply from the server or servers;
 692
 693 - When a reply granting this lock is received, the lock moves into the CLS_HELD
 694   state. In this state the lock protects file data, and pages in the lock
 695   extent can be cached (and dirtied for a write lock);
 696
 697 - When the lock is not actively used, it is `unused' and, moving through the
 698   CLS_UNLOCKING state, lands in the CLS_CACHED state. In this state the lock
 699   still protects cached data. The difference from the CLS_HELD state is that a
 700   lock in the CLS_CACHED state can be cancelled;
 701
 702 - Ultimately, the lock is either cancelled, or destroyed without cancellation.
 703   In any case, it is moved into the CLS_FREEING state and eventually freed.
 704
 705   A lock can be cancelled by a client either voluntarily (in reaction to memory
 706   pressure, by explicit user request, or as part of early cancellation), or
 707   involuntarily, when a blocking AST arrives.
 708
 709   A lock can be destroyed without cancellation when its object is destroyed
 710   (there should be no cached data at this point), or during eviction (when
 711   cached data are invalid);
 712
 713 - If an unrecoverable error occurs at any point (e.g., due to network timeout,
 714   or a server's refusal to grant a lock), the lock is moved into the
 715   CLS_FREEING state.
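
The life cycle listed above can be restated as a small transition check. This
is only a user-space sketch: the toy_ names are hypothetical, the enum merely
mirrors the state names used in the text, and only the transitions mentioned
in this section (plus re-use of a cached lock) are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy copy of the states named in the text; not the Lustre definition. */
enum toy_lock_state {
    TOY_CLS_NEW,
    TOY_CLS_QUEUING,
    TOY_CLS_ENQUEUED,
    TOY_CLS_HELD,
    TOY_CLS_UNLOCKING,
    TOY_CLS_CACHED,
    TOY_CLS_FREEING,
    TOY_CLS_NR
};

/* Returns true if the life cycle described above allows from -> to. */
static bool toy_lock_can_move(enum toy_lock_state from,
                              enum toy_lock_state to)
{
    assert(from < TOY_CLS_NR && to < TOY_CLS_NR);

    /* Errors, cancellation and destruction lead to FREEING from any
     * state other than FREEING itself. */
    if (to == TOY_CLS_FREEING)
        return from != TOY_CLS_FREEING;

    switch (from) {
    case TOY_CLS_NEW:       return to == TOY_CLS_QUEUING;
    case TOY_CLS_QUEUING:   return to == TOY_CLS_ENQUEUED;
    case TOY_CLS_ENQUEUED:  return to == TOY_CLS_HELD;
    case TOY_CLS_HELD:      return to == TOY_CLS_UNLOCKING;
    case TOY_CLS_UNLOCKING: return to == TOY_CLS_CACHED;
    case TOY_CLS_CACHED:    return to == TOY_CLS_HELD; /* cached lock re-used */
    default:                return false;
    }
}
```

For example, toy_lock_can_move(TOY_CLS_NEW, TOY_CLS_HELD) is false, because a
lock has to pass through the QUEUING and ENQUEUED states first.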
 716
 717 The description above matches the slow IO path. In the common fast path there
 718 is already a cached lock covering the extent which the IO is against. In this
 719 case, the cl_lock_find() function finds the cached lock. If the found lock is
 720 in the CLS_HELD state, it can be used for IO immediately. If the found lock is
 721 in CLS_CACHED state, it is removed from the cache and transitions to CLS_HELD.
 722 If the lock is in the CLS_QUEUING or CLS_ENQUEUED state, some other IO is
 723 currently in the process of enqueuing it, and the current thread helps that
 724 other thread by continuing the enqueue operation.
 725
 726 The actual process of finding a lock in the cache is in fact more involved than
 727 the above description, because there are cases when a lock matching the IO
 728 extent and mode still cannot be used for this IO. For example, locks covering
 729 multiple stripes cannot be used for regular IO, due to the danger of cascading
 730 evictions. For such situations, every layer can optionally define a
 731 cl_lock_operations::clo_fits_into() method that might declare a given lock
 732 unsuitable for a given IO. See lov_lock_fits_into() as an example.
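
A rough model of such a lookup is sketched below. It is not cl_lock_find():
all toy_ names are hypothetical, lock modes are left out, and the
single-stripe policy is only an assumed stand-in for what a LOV-like layer
might check in its fits-into method.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_extent { unsigned long start, end; };  /* page indices, inclusive */

struct toy_lock {
    struct toy_extent ext;
    int               nr_stripes;  /* how many stripes the lock spans */
    struct toy_lock  *next;        /* per-file lock list */
};

/* Per-layer veto, in the spirit of clo_fits_into(). */
typedef bool (*toy_fits_into_t)(const struct toy_lock *lock,
                                const struct toy_extent *need);

/* Hypothetical policy: refuse locks that span several stripes. */
static bool toy_single_stripe_only(const struct toy_lock *lock,
                                   const struct toy_extent *need)
{
    (void)need;
    return lock->nr_stripes == 1;
}

static bool toy_covers(const struct toy_lock *lock,
                       const struct toy_extent *need)
{
    assert(need->start <= need->end);
    return lock->ext.start <= need->start && need->end <= lock->ext.end;
}

/* Walk the per-file list; return the first lock that every layer accepts. */
static struct toy_lock *toy_lock_lookup(struct toy_lock *head,
                                        const struct toy_extent *need,
                                        const toy_fits_into_t *layers,
                                        size_t nr_layers)
{
    for (struct toy_lock *l = head; l != NULL; l = l->next) {
        bool ok = toy_covers(l, need);

        for (size_t i = 0; ok && i < nr_layers; i++)
            ok = layers[i](l, need);
        if (ok)
            return l;
    }
    return NULL;  /* the caller falls back to the slow path */
}
```

With the veto installed, a wide multi-stripe lock at the head of the list is
skipped even though its extent covers the IO, and a narrower single-stripe
lock further down the list is returned instead.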
 733
 734 5.2. Top-lock and Sub-locks
 735 ===========================
 736
 737 A top-lock protects cached pages of a top-object, and is based on a set of
 738 sub-locks, protecting cached pages of sub-objects:
 739
 740                     +--------->list of locks
 741                     |                   |
 742                     |                   V
 743         cl_object_header<------------cl_lock
 744                 |          ->cld_obj    |
 745                 V                       V
 746            ccc_object<---------------ccc_lock
 747                 |          ->cls_obj    |
 748                 V                       V
 749            lov_object<---------------lov_lock
 750                 |          ->cls_obj    |
 751         +---+---+---+---+           +---+---+---+---+
 752         |   |   |   |   |           |   |   |   |   |
 753         .   |   .   .   .           .   .   .   |   .
 754             |                                   |
 755             |   +-------------->list of locks   |
 756             |   |                           |   |
 757             V   |                           V   V
 758     cl_object_header<----------------------cl_lock
 759             |              ->cld_obj          |
 760             V                                 V
 761       lovsub_object<---------------------lovsub_lock
 762             |              ->cls_obj          |
 763             V                                 V
 764        osc_object<------------------------osc_lock
 765                            ->cls_obj
 766
 767 When a top-lock is created, it creates sub-locks based on the striping method
 768 (RAID0 currently). Sub-locks are `created' in the same manner as top-locks: by
 769 calling cl_lock_find() function to go through the lock cache. To enqueue a
 770 top-lock all of its sub-locks have to be enqueued also, with ordering
 771 constraints defined by enqueue options:
 772
 773 - To enqueue a regular top-lock, each sub-lock has to be enqueued and granted
 774   before the next one can be enqueued. This is necessary to avoid deadlock;
 775
 776 - For a `try-lock' style top-lock (e.g., a glimpse request, or O_NONBLOCK IO
 777   locks), requests can be enqueued in parallel, because deadlock is not
 778   possible in this case.
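
The difference between the two orderings can be made concrete by recording the
order of events. The sketch below is a toy model with hypothetical names; it
sends no requests and only logs what would be sent or waited for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_LOG_MAX 16

enum toy_event { TOY_SEND, TOY_WAIT_GRANT };

struct toy_log {
    enum toy_event ev[TOY_LOG_MAX];
    int            sub[TOY_LOG_MAX];  /* index of the sub-lock involved */
    size_t         nr;
};

static void toy_record(struct toy_log *log, enum toy_event ev, int sub)
{
    assert(log->nr < TOY_LOG_MAX);
    log->ev[log->nr] = ev;
    log->sub[log->nr] = sub;
    log->nr++;
}

/*
 * Enqueue nr_subs sub-locks.  With parallel == false every request is
 * granted before the next one is sent; with parallel == true (try-lock
 * style) all requests are sent first and the grants are collected later.
 */
static void toy_enqueue_top(struct toy_log *log, int nr_subs, bool parallel)
{
    if (parallel) {
        for (int i = 0; i < nr_subs; i++)
            toy_record(log, TOY_SEND, i);
        for (int i = 0; i < nr_subs; i++)
            toy_record(log, TOY_WAIT_GRANT, i);
    } else {
        for (int i = 0; i < nr_subs; i++) {
            toy_record(log, TOY_SEND, i);
            toy_record(log, TOY_WAIT_GRANT, i);
        }
    }
}
```

For two sub-locks the regular ordering logs send(0), grant(0), send(1),
grant(1), whereas the try-lock ordering logs send(0), send(1), grant(0),
grant(1).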
 779
 780 Sub-lock state depends on its top-lock state:
 781
 782 - When a top-lock is being enqueued, its sub-locks are in QUEUING, ENQUEUED,
 783   or HELD state;
 784
 785 - When a top-lock is in HELD state, its sub-locks are in HELD state too;
 786
 787 - When a top-lock is in CACHED state, its sub-locks are in CACHED state too;
 788
 789 - When a top-lock is in FREEING state, it detaches itself from all sub-locks,
 790   and those are usually deleted too.
 791
 792 A sub-lock can be cancelled while its top-lock is in CACHED state. To maintain
 793 the invariant that a CACHED lock is immediately ready for re-use by IO, the
 794 top-lock is moved into NEW state. The next attempt to use this lock will
 795 enqueue it again, resulting in the creation and enqueue of any missing
 796 sub-locks.  As follows from the description above, the top-lock provides
 797 somewhat weaker guarantees than one might expect:
 798
 799 - Some of its sub-locks can be missing, and
 800
 801 - Top-lock does not necessarily protect the whole of its extent.
 802
 803 In other words, a top-lock is potentially porous, and in effect, it is just a
 804 hint, describing what sub-locks are likely to exist. Nonetheless, in the most
 805 important cases of a file per client, and of clients working in disjoint
 806 areas of a shared file, this hint is precise.
 807
 808 5.3. Lock State Machine
 809 =======================
 810
 811 A cl_lock is a state machine. This requires some clarification. One of the
 812 goals of CLIO is to make the IO path non-blocking, or at least easier to make
 813 non-blocking in the future. Here `non-blocking' means that when a
 814 system call (read, write, truncate) reaches a situation where it has to wait
 815 for a communication with the server, it should--instead of waiting--remember
 816 its current state and switch to some other work. That is, instead of waiting
 817 for a lock enqueue, the client should proceed doing IO on the next stripe, etc.
 818 Obviously this is a rather radical redesign, and it is not planned to be fully
 819 implemented at this time. Instead we are putting some infrastructure in place
 820 that would make it easier to do asynchronous non-blocking IO in the future.
 821 Specifically, where the old locking code goes to sleep (waiting for enqueue,
 822 for example), the new code returns cl_lock_transition::CLO_WAIT. When the
 823 enqueue reply comes, its completion handler signals that the lock state-machine
 824 is ready to move to the next state. There is some generic code in cl_lock.c
 825 that sleeps, waiting for these signals. As a result, for users of this
 826 cl_lock.c code, it looks like locking is done in the normal blocking fashion,
 827 and at the same time it is possible to switch to the non-blocking locking
 828 (simply by returning cl_lock_transition::CLO_WAIT from cl_lock.c functions).
 829
 830 For a description of state machine states and transitions see enum
 831 cl_lock_state.
 832
 833 There are two ways to restrict a set of states which a lock might move to:
 834
 835 - Placing a "hold" on a lock guarantees that the lock will not be moved into
 836   cl_lock_state::CLS_FREEING state until the hold is released. A hold can only
 837   be acquired on a lock that is not in cl_lock_state::CLS_FREEING. All holds on
 838   a lock are counted in cl_lock::cll_holds. A hold protects the lock from
 839   cancellation and destruction. Requests to cancel and destroy a lock on hold
 840   will be recorded, but only honoured when the last hold on a lock is released;
 841
 842 - Placing a "user" on a lock guarantees that the lock will not leave the set of
 843   states cl_lock_state::CLS_NEW, cl_lock_state::CLS_QUEUING,
 844   cl_lock_state::CLS_ENQUEUED and cl_lock_state::CLS_HELD, once it enters this
 845   set. That is, if a user is added onto a lock in a state not from this set, it
 846   doesn't immediately force the lock to move to this set, but once the lock
 847   enters this set it will remain there until all users are removed. Lock users
 848   are counted in cl_lock::cll_users.
 849
 850   A user is used to ensure that the lock is not cancelled or destroyed while it
 851   is being enqueued or actively used by some IO.
 852
 853   Currently, a user always comes with a hold (cl_lock_invariant() checks that
 854   the number of holds is not less than the number of users).
 855
 856 Lock "users" are used by the top-level IO code to guarantee that a lock is not
 857 cancelled while the IO it protects is ongoing. Lock "holds" are used by a
 858 top-lock (LOV code) to guarantee that its sub-locks are in an expected state.
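
The interplay of holds, users and deferred cancellation can be sketched with
two counters. This is a simplified model with hypothetical toy_ names: it
tracks only whether the lock has been freed, not the full state machine, and
it ignores locking.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_lock {
    int  holds;           /* counterpart of cl_lock::cll_holds */
    int  users;           /* counterpart of cl_lock::cll_users */
    bool cancel_pending;  /* cancel was requested while on hold */
    bool freeing;         /* lock has reached the FREEING state */
};

/* Invariant from the text: a user always comes with a hold. */
static void toy_lock_invariant(const struct toy_lock *l)
{
    assert(l->users >= 0 && l->holds >= l->users);
}

/* A hold can only be taken on a lock that is not being freed. */
static bool toy_lock_hold(struct toy_lock *l)
{
    if (l->freeing)
        return false;
    l->holds++;
    toy_lock_invariant(l);
    return true;
}

static bool toy_lock_user_add(struct toy_lock *l)
{
    if (!toy_lock_hold(l))
        return false;
    l->users++;
    toy_lock_invariant(l);
    return true;
}

/* While holds exist, a cancellation request is only recorded. */
static void toy_lock_cancel(struct toy_lock *l)
{
    if (l->holds > 0)
        l->cancel_pending = true;
    else
        l->freeing = true;
}

/* Releasing the last hold honours a recorded cancellation. */
static void toy_lock_release(struct toy_lock *l)
{
    assert(l->holds > 0);
    l->holds--;
    if (l->holds == 0 && l->cancel_pending) {
        l->cancel_pending = false;
        l->freeing = true;
    }
    toy_lock_invariant(l);
}

static void toy_lock_user_del(struct toy_lock *l)
{
    assert(l->users > 0);
    l->users--;
    toy_lock_release(l);
}
```

A cancel issued between toy_lock_user_add() and toy_lock_user_del() therefore
takes effect only inside toy_lock_user_del(), once the last hold is dropped.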
 859
 860 5.4. Lock Concurrency
 861 =====================
 862
 863 The following describes how the lock state-machine operates. The fields of
 864 struct cl_lock are protected by the cl_lock::cll_guard mutex.
 865
 866 - The mutex is taken, and cl_lock::cll_state is examined.
 867
 868 - For every state there are possible target states which the lock can move
 869   into. They are tried in order. Attempts to move into the next state are
 870   done by _try() functions in cl_lock.c:cl_{enqueue,unuse,wait}_try().
 871
 872 - If the transition can be performed immediately, the state is changed and the
 873   mutex is released.
 874
 875 - If the transition requires blocking, the _try() function returns
 876   cl_lock_transition::CLO_WAIT. The caller unlocks the mutex and goes to sleep,
 877   waiting for the possibility of a lock state change. It is woken up when some
 878   event occurs that makes lock state change possible (e.g., the reception of
 879   the reply from the server), and repeats the loop.
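
The loop above can be sketched as a blocking wrapper around a non-blocking
_try() function. The model below is single-threaded and uses hypothetical
names: the mutex is omitted, and `sleeping' is simulated by a function that
pretends the server reply has arrived.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_CLO_WAIT = 1 };  /* stand-in for cl_lock_transition::CLO_WAIT */

enum toy_state { TOY_ENQUEUED, TOY_HELD };

struct toy_lock {
    enum toy_state state;
    bool           reply_arrived;  /* set by the "completion handler" */
    int            sleeps;         /* how many times the caller waited */
};

/* Non-blocking attempt: advance the state or ask the caller to wait. */
static int toy_wait_try(struct toy_lock *l)
{
    if (l->state == TOY_HELD)
        return 0;
    if (!l->reply_arrived)
        return TOY_CLO_WAIT;
    l->state = TOY_HELD;
    return 0;
}

/* Stand-in for sleeping until the server reply wakes the thread up. */
static void toy_sleep_until_signalled(struct toy_lock *l)
{
    assert(!l->reply_arrived);  /* sleeping only makes sense before the reply */
    l->sleeps++;
    l->reply_arrived = true;
}

/*
 * Blocking wrapper.  In a threaded implementation the lock mutex would be
 * held around toy_wait_try() and dropped while sleeping.
 */
static int toy_wait(struct toy_lock *l)
{
    int rc;

    while ((rc = toy_wait_try(l)) == TOY_CLO_WAIT)
        toy_sleep_until_signalled(l);
    return rc;
}
```

A caller of toy_wait() sees ordinary blocking behaviour, while a caller that
wants to stay non-blocking can call toy_wait_try() directly and do other work
when it returns TOY_CLO_WAIT.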
 880
 881 Top-lock and sub-lock have separate mutexes and the latter has to be taken
 882 first to avoid deadlock.
 883
 884 To see an example of how all these issues interact, take a look at the
 885 lov_cl.c:lov_lock_enqueue() function. It is called as part of cl_enqueue_try(),
 886 and tries to advance top-lock to the ENQUEUED state by advancing the
 887 state-machines of its sub-locks (lov_lock_enqueue_one()). Note also that it
 888 uses trylock to take the sub-lock mutex to avoid deadlock. It also has to
 889 handle CEF_ASYNC enqueue, when sub-lock enqueues have to be done in parallel
 890 (this is used for glimpse locks which cannot deadlock).
 891
 892               +------------------>NEW
 893               |                    |
 894               |                    | cl_enqueue_try()
 895               |                    |
 896               |    cl_unuse_try()  V
 897               |  +--------------QUEUING (*)
 898               |  |                 |
 899               |  |                 | cl_enqueue_try()
 900               |  |                 |
 901               |  | cl_unuse_try()  V
 902      sub-lock |  +-------------ENQUEUED (*)
 903     cancelled |  |                 |
 904               |  |                 | cl_wait_try()
 905               |  |                 |
 906               |  |                 V
 907               |  |                HELD<---------+
 908               |  |                 |            |
 909               |  |                 |            |
 910               |  | cl_unuse_try()  |            |
 911               |  |                 |            |
 912               |  |                 V            | cached
 913               |  +------------>UNLOCKING (*)    | lock found
 914               |                    |            |
 915               |    cl_unuse_try()  |            |
 916               |                    |            |
 917               |                    |            | cl_use_try()
 918               |                    V            |
 919               +------------------CACHED---------+
 920                                    |
 921                               (Cancelled)
 922                                    |
 923                                    V
 924                                 FREEING
 925
 926 5.5. Shared Sub-locks
 927 =====================
 928
 929 For various reasons, the same sub-lock can be shared by multiple top-locks. For
 930 example, a large sub-lock can match multiple small top-locks. In general, a
 931 sub-lock keeps a list of all its parents, and propagates certain events to
 932 them, e.g., as described above, when a sub-lock is cancelled, it moves _all_ of
 933 its top-locks from CACHED to NEW state.
 934
 935 This leads to a curious situation, where an operation on some top-lock (e.g.,
 936 enqueue) changes the state of one of its sub-locks, and this change has to be
 937 propagated to the other top-locks of this sub-lock. The resulting locking
 938 pattern is top->bottom->top, which is obviously not deadlock safe. To avoid
 939 deadlocks, try-locking is used in such situations. See
 940 cl_object.h:cl_lock_closure documentation for details.
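
The propagation of a sub-lock cancellation to all of its parents can be
sketched as follows. The toy_ names and the fixed-size parent table are
assumptions made for brevity, and the try-locking needed for deadlock
avoidance is not modelled.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_NEW, TOY_HELD, TOY_CACHED, TOY_FREEING };

struct toy_top_lock { enum toy_state state; };

#define TOY_MAX_PARENTS 8

struct toy_sub_lock {
    enum toy_state       state;
    struct toy_top_lock *parents[TOY_MAX_PARENTS];
    size_t               nr_parents;
};

/* Link a top-lock to a sub-lock; returns 0, or -1 if the table is full. */
static int toy_sub_lock_add_parent(struct toy_sub_lock *sub,
                                   struct toy_top_lock *top)
{
    assert(top != NULL);
    if (sub->nr_parents == TOY_MAX_PARENTS)
        return -1;
    sub->parents[sub->nr_parents++] = top;
    return 0;
}

/*
 * Cancel a sub-lock and tell every parent.  A CACHED parent drops back to
 * NEW, so that its next use re-enqueues the missing sub-lock.
 */
static void toy_sub_lock_cancel(struct toy_sub_lock *sub)
{
    for (size_t i = 0; i < sub->nr_parents; i++) {
        if (sub->parents[i]->state == TOY_CACHED)
            sub->parents[i]->state = TOY_NEW;
    }
    sub->nr_parents = 0;
    sub->state = TOY_FREEING;
}
```

After toy_sub_lock_cancel() every formerly CACHED top-lock that shared the
sub-lock is back in the NEW state, which matches the `porous' top-lock
behaviour described in section 5.2.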
 941
 942 5.6. Use Case: Lock Invalidation
 943 ================================
 944
 945 To demonstrate how objects, pages and lock data-structures interact, let's look
 946 at the example of stripe lock invalidation.
 947
 948 Imagine that on the client C0 there is a file object F, striped over stripes
 949 S0, S1 and S2 (files and stripes are represented by cl_object_header). Further,
 950 there is a write lock LF, for the extent [a, b] (recall that lock extents are
 951 measured in pages) of file F. This lock is based on write sub-locks LS0, LS1
 952 and LS2 for the corresponding extents of S0, S1 and S2 respectively.
 953
 954 All locks are in CACHED state. Each LSi sub-lock has an osc_lock slice, where a
 955 pointer to the struct ldlm_lock is stored. The ->l_ast_data field of ldlm_lock
 956 points back to the sub-lock's osc_lock.
 957
 958 The client caches clean and dirty pages for F, some in [a, b] and some outside
 959 of it (these latter are necessarily covered by some other locks). Each of these
 960 pages is in F's radix tree, and points through cl_page::cp_child to a sub-page
 961 which is in the radix tree of one of the Si's.
 962
 963 Some other client requests a lock that conflicts with LS1. The OST where S1
 964 lives sends a blocking AST to C0.
 965
 966 C0's LDLM invokes lock->l_blocking_ast(), which is osc_ldlm_blocking_ast(),
 967 which eventually acquires a mutex on the sub-lock and calls
 968 cl_lock_cancel(sub-lock). cl_lock_cancel() ascends through sub-lock's slices
 969 (which are osc_lock and lovsub_lock), calling the ->clo_cancel() method at
 970 every layer, that is, calling osc_lock_cancel() (the LOVSUB layer doesn't
 971 define ->clo_cancel()).
 972
 973 osc_lock_cancel() calls cl_lock_page_out() to invalidate all pages cached under
 974 this lock after sending dirty ones back to stripe S1's server.
 975
 976 To do this, cl_lock_page_out() obtains the sub-lock's object and sweeps through
 977 its radix tree from the starting to ending offset of the sub-lock (recall that
 978 a sub-lock extent is measured in page offsets within a sub-object). For every
 979 page thus found, the cl_page_unmap() function is called to invalidate it. This
 980 function goes through sub-page slices bottom-to-top, then follows the
 981 ->cp_parent pointer to go to the top-page and repeats the same process.
 982 Eventually vvp_page_unmap() is called, which unmaps a page (top-page by this
 983 time) from the page tables.
 984
 985 After a page is invalidated, it is prepared for transfer if it is dirty. This
 986 step also includes a bottom-to-top scan of the page and top-page slices, and
 987 calls to ->cpo_prep() methods at each layer, allowing vvp_page_prep_write() to
 988 announce to the VM that the VM page is being written.
 989
 990 Once all pages are written, they are removed from radix-trees and destroyed.
 991 This completes invalidation of a sub-lock, and osc_lock_cancel() exits.  Note
 992 that:
 993
 994 - No special cancellation logic for the top-lock is necessary;
 995
 996 - Specifically, VVP knows nothing about striping and there is no need to
 997   handle the case where only part of the top-lock is cancelled;
 998
 999 - There is no need to convert between file and stripe offsets during this
1000   process;
1001
1002 - There is no need to keep track of locks protecting the given page.
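
The sweep performed during sub-lock cancellation can be sketched as a single
pass over the sub-object's page cache. In this toy model (hypothetical names
throughout) a fixed array stands in for the radix tree, and the per-layer
unmap, prep and transfer steps are collapsed into flag updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool present;  /* page is cached in the sub-object */
    bool dirty;
    bool mapped;   /* mapped into user page tables */
};

#define TOY_NR_PAGES 16

struct toy_sub_object {
    struct toy_page pages[TOY_NR_PAGES];  /* stand-in for the radix tree */
    int             written;              /* pages sent to the server */
};

/*
 * Sweep [start, end] (page indices within the sub-object, inclusive):
 * unmap each cached page, write it out if dirty, then drop it.
 * Returns the number of pages removed from the cache.
 */
static int toy_lock_page_out(struct toy_sub_object *obj,
                             size_t start, size_t end)
{
    int dropped = 0;

    assert(start <= end);
    for (size_t i = start; i <= end && i < TOY_NR_PAGES; i++) {
        struct toy_page *pg = &obj->pages[i];

        if (!pg->present)
            continue;
        pg->mapped = false;      /* the unmap step */
        if (pg->dirty) {         /* the prepare and transfer steps */
            pg->dirty = false;
            obj->written++;
        }
        pg->present = false;     /* remove from the cache */
        dropped++;
    }
    return dropped;
}
```

Note that the sweep uses sub-object page indices only, and pages outside the
lock extent are left untouched.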
1003
1004 =========
1005 = 6. IO =
1006 =========
1007
1008 An IO context (struct cl_io) is a layered object describing the state of an
1009 ongoing IO operation (such as a system call).
1010
1011 6.1. Fixed IO Types
1012 ===================
1013
1014 There are two classes of IO contexts, represented by cl_io:
1015
1016 - An IO for a specific type of client activity, enumerated by enum cl_io_type:
1017
1018         . CIT_READ: read system call including read(2), readv(2), pread(2),
1019           sendfile(2);
1020         . CIT_WRITE: write system call;
1021         . CIT_TRUNC: truncate system call;
1022         . CIT_FAULT: page fault handling;
1023
1024 - A `catch-all' CIT_MISC IO type for all other IO activity:
1025
1026         . cancellation of an extent lock,
1027         . VM induced page write-out,
1028         . glimpse,
1029         . other miscellaneous stuff.
1030
1031 The difference between CIT_MISC and other IO types is that CIT_MISC IO is
1032 merely a context in which pages are owned and locks are enqueued, whereas
1033 other IO types, in addition to being a context, are also state machines.
1034
1035 6.2. IO State Machine
1036 =====================
1037
1038 The idea behind the cl_io state machine is that initial `work' that has to be
1039 done (e.g., writing a 3MB user buffer into a given file) is done as a sequence
1040 of `iterations', and an iteration is executed as the following idiomatic
1041 sequence of steps:
1042
1043 - Prepare: determine what work is to be done at this iteration;
1044
1045 - Lock: enqueue and acquire all locks necessary to perform this iteration;
1046
1047 - Start: either perform iteration work synchronously, or post it
1048   asynchronously, or both;
1049
1050 - End: wait for the completion of asynchronous work;
1051
1052 - Unlock: release locks, acquired at the "lock" step;
1053
1054 - Finalize: finalize iteration state.
1055
1056 cl_io is a layered entity and each step above is performed by invoking the
1057 corresponding cl_io_operations method on every layer. As will be explained
1058 below, this is especially important in the `prepare' step, as it allows layers
1059 to cooperate in determining the scope of the current iteration.
1060
1061 For CIT_READ or CIT_WRITE IO, a typical scenario is splitting the original user
1062 buffer into chunks that map completely inside of a single stripe in the target
1063 file, and processing each chunk as a separate iteration. In this case, it is
1064 the LOV layer that (in lov_io_rw_iter_init() function) determines the extent of
1065 the current iteration.
1066
1067 Once the iteration is prepared, the `lock' step acquires all necessary DLM
1068 locks to cover the region of a file that is affected by the current iteration.
1069 The `start' step does the actual processing, which for write means placing
1070 pages from the user buffer into the cache, and for read means fetching pages
1071 from the server, including read-ahead pages (see `immediate transfer' below).
1072 Truncate and page fault are currently executed in one iteration (it would be
1073 easy to change the truncate implementation to, for instance, truncate each
1074 stripe in a separate iteration, should the need arise).
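
The way a read or write is cut into per-stripe iterations can be illustrated
with a small loop. This is a sketch under simplifying assumptions: the names
are hypothetical, the stripe size is fixed and non-zero, and the lock, start,
end and unlock steps are left as comments.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long toy_off_t;

struct toy_io {
    toy_off_t pos;          /* current file position */
    toy_off_t remaining;    /* bytes still to process */
    toy_off_t stripe_size;
    int       iterations;   /* iterations completed so far */
};

/* "Prepare": clamp the iteration so that it stays inside one stripe. */
static toy_off_t toy_iter_prepare(const struct toy_io *io)
{
    toy_off_t stripe_end;
    toy_off_t room;

    assert(io->stripe_size > 0);
    stripe_end = (io->pos / io->stripe_size + 1) * io->stripe_size;
    room = stripe_end - io->pos;
    return io->remaining < room ? io->remaining : room;
}

/*
 * Run the whole IO as a sequence of iterations and return their number.
 * chunks[] receives the size of each iteration (up to max_chunks).
 */
static int toy_io_loop(struct toy_io *io, toy_off_t *chunks,
                       size_t max_chunks)
{
    while (io->remaining > 0) {
        toy_off_t len = toy_iter_prepare(io);  /* prepare */

        /* lock:   enqueue locks covering [pos, pos + len) */
        /* start:  copy data or post transfers             */
        /* end:    wait for asynchronous work              */
        /* unlock: release the locks taken above           */

        if ((size_t)io->iterations < max_chunks)
            chunks[io->iterations] = len;
        io->pos += len;                        /* finalize */
        io->remaining -= len;
        io->iterations++;
    }
    return io->iterations;
}
```

For a 3MB write starting at offset 512KB with a 1MB stripe size, the loop
runs four iterations of 512KB, 1MB, 1MB and 512KB.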
1075
1076 6.3. Parallel IO
1077 ================
1078
1079 One important planned generalization of this model is out-of-order execution
1080 of iterations.
1081
1082 A motivating example for this is a write of a large user level buffer,
1083 overlapping with multiple stripes. Typically, a busy Lustre client has its
1084 per-OSC caches for the dirty pages nearly full, which means that the write
1085 often has to block, waiting for the cache to drain. Instead of blocking the
1086 whole IO operation, CIT_WRITE might switch to the next stripe and try to do IO
1087 there.  Without such a `non-blocking' IO, a slow OST or an unfair network
1088 degrades the performance of the whole cluster.
1089
1090 Another example is a legacy single-threaded application running on a multi-core
1091 client machine, where IO throughput is limited by the single thread copying
1092 data between the user buffer and the kernel pages. Multiple concurrent IO
1093 iterations that can be scheduled independently on the available processors
1094 eliminate this bottleneck by copying the data in parallel.
1095
1096 Obviously, parallel IO is not compatible with the usual `sequential IO'
1097 semantics. For example, POSIX read and write have a very simple failure model,
1098 where some initial (possibly empty) segment of a user buffer is processed
1099 successfully, and none of the remaining bytes are read or written. Parallel
1100 IO can fail in much more complex ways.
1101
1102 For now, only sequential iterations are supported.
1103
1104 6.4. Data-flow: From Stack to IO Slice
1105 ======================================
1106
1107 The parallel IO design outlined above implies that an ongoing IO can be
1108 preempted by other IO and later resumed, all potentially in the same thread.
1109 This means that IO state cannot be kept on a stack, as it is customarily done
1110 in UNIX file system drivers. Instead, the layered cl_io is used to store
1111 information about the current iteration and progress within it.  Coincidentally
1112 (almost) this is similar to the way IO requests are used by the Windows driver
1113 stack.
1114
1115 A set of common fields in struct cl_io describe the IO and are shared by all
1116 layers.  Important properties so described include:
1117
1118 - The IO type;
1119
1120 - A file (struct cl_object) against which this IO is executed;
1121
1122 - A position in a file where the read or write is taking place, and a count of
1123   bytes remaining to be processed (for CIT_READ and CIT_WRITE);
1124
1125 - A size to which the file is being truncated or expanded (for CIT_TRUNC);
1126
1127 - A list of locks acquired for this IO.
1128
1129 Each layer keeps IO state in its `IO slice', described below, with all slices
1130 chained to the list hanging off of struct cl_io:
1131
1132 - vvp_io, ccc_io: these two slices are used by the top-most layer of the Linux
1133   kernel client. ccc_io is a state common between kernel client and liblustre,
1134   and vvp_io is a state private to the kernel client.
1135
1136   The most important state in ccc_io is an array of struct iovec, describing
1137   user space buffers from or to which IO is taking place. Note that other
1138   layers in the IO stack have no idea that data actually came from user space.
1139
1140   vvp_io contains kernel specific fields, such as VM information describing a
1141   page fault, or the sendfile target.
1142
1143 - lov_io: IO state private to the LOV layer is kept here. The most important IO
1144   state at the LOV layer is an array of sub-IO's. Each sub-IO is a normal
1145   struct cl_io, representing a part of the IO process for a given iteration.
1146   With current sequential iterations, only one sub-IO is active at a time.
1147
1148 - osc_io: this slice stores IO state private to the OSC layer that exists within
1149   each sub-IO created by LOV.
1150
1151 ===============
1152 = 7. Transfer =
1153 ===============
1154
1155 7.1. Immediate vs. Opportunistic Transfers
1156 ==========================================
1157
1158 There are two possible modes of transfer initiation on the client:
1159
1160 - Immediate transfer: this is started when a high level IO wants a page or a
1161   collection of pages to be transferred right away. Examples: read-ahead,
1162   a synchronous read in the case of non-page aligned write, page write-out as
1163   part of an extent lock cancellation, page write-out as a part of memory
1164   cleansing. Immediate transfer can be both cl_req_type::CRT_READ and
1165   cl_req_type::CRT_WRITE;
1166
1167 - Opportunistic transfer (cl_req_type::CRT_WRITE only), that happens when IO
1168   wants to transfer a page to the server some time later, when it can be done
1169   efficiently. Example: pages dirtied by the write(2) path. Pages submitted for
1170   an opportunistic transfer are kept in a "staging area".
1171
1172 In any case, a transfer takes place in the form of a cl_req, which is a
1173 representation for a network RPC.
1174
1175 Pages queued for an opportunistic transfer are placed into a staging area
1176 (represented as a set of per-object and per-device queues at the OSC layer)
1177 until it is decided that an efficient RPC can be composed of them. This
1178 decision is made by a "req-formation engine", currently implemented as part of
1179 the OSC layer. Req-formation depends on many factors: the size of the resulting
1180 RPC, RPC alignment, whether or not multi-object RPCs are supported by the
1181 server, max-RPC-in-flight limitations, size of the staging area, etc. CLIO uses
1182 unmodified RPC formation logic from OSC, so it is not discussed here.
1183
1184 For an immediate transfer the IO submits a cl_page_list which the req-formation
1185 engine slices into cl_req's, possibly adding cached pages to some of the
1186 resulting req's.
1187
1188 Whenever a page from cl_page_list is added to a newly constructed req, its
1189 cl_page_operations::cpo_prep() layer methods are called. At that moment, the
1190 page state is atomically changed from cl_page_state::CPS_OWNED to
1191 cl_page_state::CPS_PAGEOUT or cl_page_state::CPS_PAGEIN, cl_page::cp_owner is
1192 zeroed, and cl_page::cp_req is set to the req. cl_page_operations::cpo_prep()
1193 method at a particular layer might return -EALREADY to indicate that it does
1194 not need to submit this page at all. This is possible, for example, if a page
1195 submitted for read became up-to-date in the meantime; and for write, if the
1196 page doesn't have the dirty bit set. See cl_io_submit_rw() for details.
1197
1198 Whenever a staged page is added to a newly constructed req, its
1199 cl_page_operations::cpo_make_ready() layer methods are called. At that moment,
1200 the page state is atomically changed from cl_page_state::CPS_CACHED to
1201 cl_page_state::CPS_PAGEOUT, and cl_page::cp_req is set to req. The
1202 cl_page_operations::cpo_make_ready() method at a particular layer might return
1203 -EAGAIN to indicate that this page is not currently eligible for the transfer.
1204
1205 The RPC engine guarantees that once the ->cpo_prep() or ->cpo_make_ready()
1206 method has been called, the page completion routine (->cpo_completion() layer
1207 method) will eventually be called (either as a result of successful page
1208 transfer completion, or due to timeout).
1209
1210 To summarize, there are two main entry points into the transfer sub-system:
1211
1212 - cl_io_submit_rw(): submits a list of pages for immediate transfer;
1213
1214 - cl_page_cache_add(): places a page into the staging area for future
1215   opportunistic transfer.
1216
1217 7.2. Page Lists
1218 ===============
1219
1220 To submit a group of pages for immediate transfer, struct cl_2queue is used. It
1221 contains two page lists: qin (input queue) and qout (output queue). Pages are
1222 linked into these queues by cl_page::cp_batch list heads. Qin is populated with
1223 the pages to be submitted to the transfer, and pages that were actually
1224 submitted are placed onto qout. Not all pages from qin might end up on qout due
1225 to:
1226
1227 - ->cpo_prep() methods deciding that page should not be transferred, or
1228
1229 - unrecoverable submission error.
1230
1231 Pages not moved to qout remain on qin. It is up to the transfer submitter to
1232 decide when to remove pages from qin and qout. Remaining pages on qin are
1233 usually removed from this list right after (partially unsuccessful) transfer
1234 submission. Pages are usually left on qout until transfer completion. This way
1235 the caller can determine when all pages from the list were transferred.
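
The qin/qout protocol, together with the -EALREADY convention of ->cpo_prep()
described in section 7.1, can be sketched as below. This is a toy model for
the write case only; the names, the fixed-size lists and the rule that a full
qout leaves pages on qin are assumptions of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_page_state { TOY_OWNED, TOY_PAGEOUT };

struct toy_page {
    bool                dirty;
    enum toy_page_state state;
};

#define TOY_QMAX 8

struct toy_page_list {
    struct toy_page *pages[TOY_QMAX];
    size_t           nr;
};

struct toy_2queue {
    struct toy_page_list qin;   /* pages offered for transfer */
    struct toy_page_list qout;  /* pages actually submitted */
};

/* Write-side prep: a clean page does not need to be sent. */
static int toy_page_prep_write(struct toy_page *pg)
{
    assert(pg->state == TOY_OWNED);
    if (!pg->dirty)
        return -EALREADY;
    pg->state = TOY_PAGEOUT;
    return 0;
}

/*
 * Offer every page on qin for write.  Pages accepted by prep move to
 * qout; all other pages stay on qin.  Returns the number of pages on qout.
 */
static size_t toy_submit_write(struct toy_2queue *q)
{
    size_t kept = 0;

    for (size_t i = 0; i < q->qin.nr; i++) {
        struct toy_page *pg = q->qin.pages[i];

        if (q->qout.nr < TOY_QMAX && toy_page_prep_write(pg) == 0)
            q->qout.pages[q->qout.nr++] = pg;
        else
            q->qin.pages[kept++] = pg;
    }
    q->qin.nr = kept;
    return q->qout.nr;
}
```

After submission the caller typically discards whatever is left on qin and
keeps qout around until the transfer completes.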
1236
1237 The association between a page and an immediate transfer queue is protected by
1238 cl_page::cp_mutex. This mutex is acquired when a cl_page is added to a
1239 cl_page_list and released when a page is removed from the list.
1240
1241 When an RPC is formed, all of its constituent pages are linked together through
1242 cl_page::cp_flight list hanging off of cl_req::crq_pages. Pages are removed
1243 from this list just before the transfer completion method is invoked. No
1244 special lock protects this list, as pages in transfer are under a VM lock.
1245
1246 7.3. Transfer States: Prepare, Completion
1247 =========================================
1248
1249 The transfer (cl_req) state machine is trivial, and it is not explicitly coded.
1250 A newly created transfer is in the "prepare" state while pages are collected.
1251 When all pages are gathered, the transfer enters the "in-flight" state where it
1252 remains until it reaches the "completion" state where page completion handlers
1253 are invoked.
1254
1255 The per-layer ->cro_prep() transfer method is called when transfer preparation
1256 is completed and the transfer is about to enter the in-flight state. Similarly,
1257 the per-layer ->cro_completion() method is called when the transfer completes,
1258 before per-page completion methods are called.
1259
1260 Additionally, before moving a transfer out of the prepare state, the RPC engine
1261 calls the cl_req_attr_set() function.  This function invokes ->cro_attr_set()
1262 methods on every layer to fill in the RPC header that the server uses to
1263 determine where to get or put data. This replaces the old
1264 ->ap_{update,fill}_obdo() methods.
1265
1266 Further, cl_req's are not reference counted and access to them is not
1267 synchronized. This is because they are accessed only by the RPC engine in OSC
1268 which fully controls RPC life-time, and it uses an internal OSC lock
1269 (client_obd::cl_loi_list_lock spin-lock) for serialization.
1270
1271 7.4. Page Completion Handlers, Synchronous Transfer
1272 ===================================================
1273
1274 When a transfer completes, cl_req completion methods are called on every layer.
1275 Then, for every transfer page, per-layer page completion methods
1276 ->cpo_completion() are invoked. The page is still under the VM lock at this
1277 moment.  Completion methods are called bottom-to-top and it is the
1278 responsibility of the last of them (i.e., the completion method of the
1279 top-most layer---VVP) to release the VM lock.
1280
1281 Both immediate and opportunistic transfers are asynchronous in the sense that
1282 control can return to the caller before the transfer completes. CLIO doesn't
1283 provide a synchronous transfer interface at all and it is up to a particular
1284 caller to implement it if necessary. The simplest way to wait for the transfer
1285 completion is to wait on the page VM lock. This approach is used implicitly by
1286 the Linux kernel. There is a case, though, where one wants to do transfer
1287 completely synchronously without releasing the page VM lock: when
1288 ->prepare_write() method determines that a write goes from a non page-aligned
1289 buffer into a not up-to-date page, a portion of a page has to be fetched from
1290 the server. The VM page lock cannot be used to synchronize transfer completion
1291 in this case, because it is used to mark the page as owned by IO. To handle
1292 this, VVP attaches struct cl_sync_io to struct vvp_page. cl_sync_io contains a
1293 number of pages still in IO and a synchronization primitive (struct completion)
1294 which is signalled when transfer of the last page completes. The VVP page
1295 completion handler (vvp_page_completion_common()) checks for attached
1296 cl_sync_io and if it is there, decreases the number of in-flight pages and
1297 signals completion when that number drops to 0. A similar mechanism is used for
1298 direct-IO.
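
The counting scheme behind cl_sync_io can be sketched in a few lines. The
model below is single-threaded and uses hypothetical names: a plain boolean
stands in for struct completion, and a kernel implementation would need an
atomic counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counterpart of struct cl_sync_io: a page counter plus a flag. */
struct toy_sync_io {
    int  nr_pages;  /* pages still in flight */
    bool done;      /* stands in for struct completion */
};

static void toy_sync_io_init(struct toy_sync_io *anchor, int nr_pages)
{
    assert(nr_pages >= 0);
    anchor->nr_pages = nr_pages;
    anchor->done = (nr_pages == 0);
}

/* Called once per page from the page completion handler. */
static void toy_sync_io_note(struct toy_sync_io *anchor)
{
    assert(anchor->nr_pages > 0);
    if (--anchor->nr_pages == 0)
        anchor->done = true;  /* the last page signals completion */
}
```

The submitter initializes the anchor with the number of pages it queued and
then waits for the flag; each page completion calls toy_sync_io_note(), and
only the last one releases the waiter.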
1299
1300 =============
1301 = 8. lu_env =
1302 =============
1303
1304 8.1. Motivation, Server Environment Usage
1305 =========================================
1306
1307 lu_env and related data-types (struct lu_context and struct lu_context_key)
1308 together implement a memory pre-allocation interface that Lustre uses to
1309 decrease stack consumption without resorting to fully dynamic allocation.
1310
1311 Stack space is severely limited in the Linux kernel. Lustre traditionally
1312 allocated a lot of automatic variables, resulting in spurious stack overflows
1313 that are hard to trigger (they usually need a certain combination of driver
1314 calls and interrupts to happen, making them extremely difficult to reproduce)
1315 and debug (as stack overflow can easily result in corruption of thread-related
1316 data-structures in the kernel memory, confusing the debugger).
1317
1318 The simplest way to handle this is to replace automatic variables with calls
1319 to the generic memory allocator, but
1320
1321 - The generic allocator has scalability problems, and
1322
1323 - Additional code to free allocated memory is needed.
1324
1325 The lu_env interface was originally introduced in the MDS rewrite for Lustre
1326 2.0 and matches the server-side threading model very well. Roughly speaking,
1327 lu_context represents a context in which computation is executed and
1328 lu_context_key is a description of per-context data. In the simplest case
1329 lu_context corresponds to a server thread; then lu_context_key is effectively a
1330 thread-local storage (TLS). For a similar idea see the user-level pthreads
1331 interface pthread_key_create().
1332
1333 More formally, lu_context_key defines a constructor-destructor pair and a tags
1334 bit-mask. When lu_context is initialized (with a given tag bit-mask), a global
1335 array of all registered lu_context_keys is scanned, constructors for all keys
1336 with matching tags are invoked and their return values are stored in
1337 lu_context.
1338
1339 Once lu_context has been initialized, a value of any key allocated for this
1340 context can be retrieved very efficiently by indexing in the per-context
1341 array. The lu_context_key_get() function is used for this.
1342
1343 When a context is finalized, destructors are called for all keys allocated in
1344 this context.
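
The key/context life cycle just described can be modelled in user space with a
short sketch. All toy_* names are hypothetical; the real lu_context
constructors take additional arguments, and "matching tags" is assumed here to
mean that the two bit-masks overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_KEYS 8

/* Hypothetical analogue of lu_context_key. */
struct toy_key {
        unsigned tk_tags;               /* contexts this key applies to */
        void  *(*tk_init)(void);        /* constructor */
        void   (*tk_fini)(void *value); /* destructor */
        int      tk_index;              /* slot, assigned at registration */
};

/* Hypothetical analogue of lu_context: one value slot per registered key. */
struct toy_context {
        void *tc_value[TOY_MAX_KEYS];
};

static struct toy_key *toy_keys[TOY_MAX_KEYS];  /* global key registry */
static int toy_nr_keys;

static int toy_key_register(struct toy_key *key)
{
        if (toy_nr_keys == TOY_MAX_KEYS)
                return -1;
        key->tk_index = toy_nr_keys;
        toy_keys[toy_nr_keys++] = key;
        return 0;
}

/* Call destructors for every value allocated in this context. */
static void toy_context_fini(struct toy_context *ctx)
{
        for (int i = 0; i < toy_nr_keys; i++) {
                if (ctx->tc_value[i] != NULL) {
                        toy_keys[i]->tk_fini(ctx->tc_value[i]);
                        ctx->tc_value[i] = NULL;
                }
        }
}

/* Scan the registry and construct values for all keys with matching tags. */
static int toy_context_init(struct toy_context *ctx, unsigned tags)
{
        for (int i = 0; i < TOY_MAX_KEYS; i++)
                ctx->tc_value[i] = NULL;
        for (int i = 0; i < toy_nr_keys; i++) {
                if (!(toy_keys[i]->tk_tags & tags))
                        continue;
                ctx->tc_value[i] = toy_keys[i]->tk_init();
                if (ctx->tc_value[i] == NULL) {
                        toy_context_fini(ctx);  /* undo partial construction */
                        return -1;
                }
        }
        return 0;
}

/* Retrieval is a single array index. */
static void *toy_key_get(const struct toy_context *ctx,
                         const struct toy_key *key)
{
        return ctx->tc_value[key->tk_index];
}

/* Scratch data that would otherwise be an automatic variable. */
struct foo_bar { char fb_buf[256]; };
struct foo_thread_info { struct foo_bar fti_bar; };

static void *foo_key_init(void)
{
        return calloc(1, sizeof(struct foo_thread_info));
}

static void foo_key_fini(void *value)
{
        free(value);
}

static struct toy_key foo_thread_key = {
        .tk_tags = 0x1, .tk_init = foo_key_init, .tk_fini = foo_key_fini,
};
```

A context initialized with a tag mask that does not overlap 0x1 gets no
foo_thread_info at all, which is how unrelated thread types avoid paying for
each other's scratch data.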
1345
1346 The typical server usage is to have a lu_context for every server thread,
1347 initialized when the thread is started. To reduce stack consumption by the
1348 code running in this thread, a lu_context_key is registered that allocates in
1349 its constructor a struct containing as fields values otherwise allocated on
1350 the stack. See {mdt,osd,cmm,mdd}_thread_info for examples. Instead of doing
1351
1352         int function(args) {
1353                 /* structure "bar" in module "foo" */
1354                 struct foo_bar bar;
1355                 ...
1356
1357 the code roughly does
1358
1359         struct foo_thread_info {
1360                 struct foo_bar fti_bar;
1361                 ...
1362         };
1363
1364         int function(const struct lu_env *env, args) {
1365                 struct foo_bar *bar;
1366                 ...
1367                 bar = &lu_context_key_get(&env->le_ctx, &foo_thread_key)->fti_bar;
1368
1369 etc.
1370
1371 struct lu_env contains 2 contexts:
1372
1373 - le_ctx: this context is embedded in lu_env. By convention, this context is
1374   used _only_ to avoid allocations on the stack, and it should never be used to
1375   pass parameters between functions or layers. The reason for this restriction
1376   is that using contexts for implicit state sharing leads to code that is
1377   difficult to understand and modify.
1378
1379 - le_ses: this is a pointer to a context shared by all threads handling a given
1380   RPC. The context itself is embedded into struct ptlrpc_request. Currently a
1381   request is always processed by a single thread, but this might change in the
1382   future in a design where a small pool of threads processes RPCs
1383   asynchronously.
1384
1385 Additionally, state kept in env->le_ses context is shared by multiple layers.
1386 For example, remote user credentials are stored there.
1387
1388 8.2. Client Environment Usage
1389 =============================
1390
1391 On a client there is a lu_env associated with every thread executing Lustre
1392 code. As on the server, env->le_ctx is used to reduce stack consumption, and
1393 env->le_ses is used to share state between all threads handling a given IO
1394 (currently an IO is always processed by a single thread). env->le_ses is also
1395 used to efficiently allocate cl_io slices ({vvp,lov,osc}_io).
1396
1397 There are three important differences from lu_env usage on the server:
1398
1399 - While on the server there is a fixed pool of threads, any client thread can
1400   execute Lustre code. This makes it impractical to pre-allocate and
1401   pre-initialize lu_context for every thread. Instead, contexts are constructed
1402   on demand and after use returned into a global cache that amortizes creation
1403   cost;
1404
1405 - Client call-chains frequently cross Lustre-VFS and Lustre-VM boundaries.
1406   This means that just passing lu_env as a first parameter to every Lustre
1407   function and method is not enough. To work around this problem, a pointer to
1408   lu_env is stored in a field in the kernel data-structure associated with the
1409   current thread (task_struct::journal_info), from where it is recovered when
1410   Lustre code is re-entered from VFS or VM;
1411
1412 - Sometimes client code is re-entered in a fashion that precludes re-use of the
1413   higher level lu_env. For example, when a read or write incurs a page fault
1414   in the user space buffer memory-mapped from a Lustre file, page fault
1415   handling is a separate IO, independent of the already ongoing system call.
1416   The Lustre page fault handler allocates a new lu_env (by calling
1417   lu_env_get_nested()) in which the nested IO is going on. A similar situation
1418   occurs when client DLM lock LRU shrinking code is invoked in the context of a
1419   system call.
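
The second and third points can be illustrated with a small user-space sketch.
The names below are hypothetical: a C11 thread-local pointer stands in for the
per-task field, and toy_env_enter() and toy_env_exit() merely model the
save/restore discipline that a nested environment needs. They are not the
signatures of the real helpers.

```c
#include <assert.h>
#include <stddef.h>

struct toy_env {
        int te_depth;   /* 0 for the outermost env, 1 for a nested one, ... */
};

/* Stand-in for the per-task field that holds the env pointer. */
static _Thread_local struct toy_env *toy_current_env;

/*
 * Make `env` the current thread's environment. The previous pointer is
 * returned so that the caller can restore it when the (possibly nested)
 * operation finishes.
 */
static struct toy_env *toy_env_enter(struct toy_env *env)
{
        struct toy_env *prev = toy_current_env;

        assert(env != NULL);
        env->te_depth = prev != NULL ? prev->te_depth + 1 : 0;
        toy_current_env = env;
        return prev;
}

static void toy_env_exit(struct toy_env *prev)
{
        toy_current_env = prev;
}

/* Used when Lustre code is re-entered from VFS or VM without an env. */
static struct toy_env *toy_env_recover(void)
{
        return toy_current_env;
}
```

A page fault taken in the middle of a write would call toy_env_enter() with a
fresh environment and restore the outer one on exit, so the two IOs never share
per-IO state.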
1420
1421 8.3. Sub-environments
1422 =====================
1423
1424 As described above, lu_env (specifically, lu_env->le_ses) is used on a client
1425 to allocate per-IO state, including foo_io data on every layer. This leads to a
1426 complication at the LOV layer, which maintains multiple sub-IOs. As layers
1427 below LOV allocate their IO slices in lu_env->le_ses, LOV has to allocate an
1428 lu_env for every sub-IO and to carefully juggle them when invoking lower layer
1429 methods. The case of a single IO is optimized by re-using the top-environment.
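
A minimal sketch of the environment selection, assuming a fixed upper bound on
the number of sub-IOs; toy_lov_io and toy_sub_env() are hypothetical names, not
the LOV implementation:

```c
#include <assert.h>

#define TOY_MAX_SUBS 4

struct toy_env { int te_id; };

struct toy_lov_io {
        struct toy_env *tli_top_env;                /* env of the top-IO */
        struct toy_env  tli_sub_env[TOY_MAX_SUBS];  /* one env per sub-IO */
        int             tli_nr_subs;
};

/* Pick the environment in which lower-layer methods of sub-IO `idx` run. */
static struct toy_env *toy_sub_env(struct toy_lov_io *lio, int idx)
{
        assert(idx >= 0 && idx < lio->tli_nr_subs);
        /* Single sub-IO: nothing can clash, so re-use the top environment. */
        if (lio->tli_nr_subs == 1)
                return lio->tli_top_env;
        return &lio->tli_sub_env[idx];
}
```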
1430
1431 ================
1432 = 9. Use cases =
1433 ================
1434
1435 9.1. Inode Creation
1436 ===================
1437
1438 Lookup ends up calling ll_update_inode() to set up a new inode with a given
1439 meta-data descriptor (obtained from the meta-data path). cl_inode_init() calls
1440 cl_object_find(), which eventually calls lu_object_find_try(), which either
1441 finds a cl_object in the cache or allocates a new one, calling
1442 lu_device_operations::ldo_object_{alloc,init}() methods on every layer top to
1443 bottom. Every layer allocates its private data structure ({vvp,lov}_object) and
1444 links it into an object header (cl_object_header) by calling lu_object_add().
1445 At the VVP layer, vvp_object contains a pointer to the inode. The LOV layer
1446 allocates a lov_object containing an array of pointers to sub-objects that are
1447 found in the cache or allocated by calling cl_object_find (recursively). These
1448 sub-objects have LOVSUB and OSC layer data.
1449
1450 A top-object and its sub-objects are inserted into a global FID-based hash
1451 table and a global LRU list.
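
The find-or-allocate step can be sketched as below. This is a toy model under
stated assumptions: the FID is reduced to a 64-bit integer, layer slices are
represented by their names only, and toy_object_find() is a hypothetical
function rather than the real lookup path. The LRU list is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_HASH_SIZE  16
#define TOY_MAX_LAYERS 4

struct toy_object {
        uint64_t           to_fid;
        const char        *to_slice[TOY_MAX_LAYERS]; /* layers, top first */
        int                to_nr_slices;
        struct toy_object *to_hash_next;             /* hash chain link */
};

static struct toy_object *toy_hash[TOY_HASH_SIZE];

/* Layers of a hypothetical top-object stack, top to bottom. */
static const char *const toy_layers[] = { "vvp", "lov" };

static struct toy_object *toy_object_find(uint64_t fid)
{
        struct toy_object **head = &toy_hash[fid % TOY_HASH_SIZE];
        struct toy_object *obj;

        for (obj = *head; obj != NULL; obj = obj->to_hash_next)
                if (obj->to_fid == fid)
                        return obj;     /* cache hit */

        obj = calloc(1, sizeof(*obj));
        if (obj == NULL)
                return NULL;
        obj->to_fid = fid;
        /* Each layer attaches its slice to the header, top to bottom. */
        for (size_t i = 0; i < sizeof(toy_layers) / sizeof(toy_layers[0]); i++)
                obj->to_slice[obj->to_nr_slices++] = toy_layers[i];
        assert(obj->to_nr_slices <= TOY_MAX_LAYERS);
        /* Publish the new object in the FID hash. */
        obj->to_hash_next = *head;
        *head = obj;
        return obj;
}
```

A second lookup with the same FID returns the cached object without running
any layer initialization again.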
1452
1453 9.2. First IO to a File
1454 =======================
1455
1456 After an object is instantiated as described in the previous use case, the
1457 first IO call against this object has to create DLM locks. The following
1458 operations re-use cached locks (see below).
1459
1460 A read call starts at ll_file_readv() which eventually calls
1461 ll_file_io_generic(). This function calls cl_io_init() to initialize an IO
1462 context, which calls the cl_object_operations::coo_io_init() method on every
1463 layer.  As in the case of object instantiation, these methods allocate
1464 layer-private IO state ({vvp,lov}_io) and add it to the list hanging off of the
1465 IO context header cl_io by calling cl_io_add(). At the VVP layer, vvp_io_init()
1466 handles special cases (like count == 0), updates statistic counters, and in the
1467 case of write it takes a per-inode semaphore to avoid possible deadlock.
1468
1469 At the LOV layer, lov_io_init_raid0() allocates a struct lov_io and stores in
1470 it the original IO parameters (starting offset and byte count). This is needed
1471 because LOV is going to modify these parameters. Sub-IOs are not allocated at
1472 this point---they are lazily instantiated later.
1473
1474 Once the top-IO has been initialized, ll_file_io_generic() enters the main IO
1475 loop cl_io_loop() that drives IO iterations, going through
1476
1477 - cl_io_iter_init() calling cl_io_operations::cio_iter_init() top-to-bottom
1478 - cl_io_lock() calling cl_io_operations::cio_lock() top-to-bottom
1479 - cl_io_start() calling cl_io_operations::cio_start() top-to-bottom
1480 - cl_io_end() calling cl_io_operations::cio_end() bottom-to-top
1481 - cl_io_unlock() calling cl_io_operations::cio_unlock() bottom-to-top
1482 - cl_io_iter_fini() calling cl_io_operations::cio_iter_fini() bottom-to-top
1483 - cl_io_rw_advance() calling cl_io_operations::cio_advance() bottom-to-top
1484
1485 repeatedly until cl_io::ci_continue remains 0 after an iteration. These "IO
1486 iterations" move an IO context through consecutive states (see enum
1487 cl_io_state).  ->cio_iter_init() decides at each layer what part of the
1488 remaining IO is to be done during the current iteration. Currently,
1489 lov_io_rw_iter_init() is the only non-trivial implementation of this method. It
1490 does the following:
1491
1492 - Except for the cases of truncate and O_APPEND write, it shrinks the IO extent
1493   recorded in the top-IO (starting offset and byte count) so that this extent
1494   is fully contained within a single stripe. This avoids "cascading evictions";
1495
1496 - It allocates sub-IOs for all stripes intersecting with the resulting IO range
1497   (which, for a read or a non-append write, means creating a single sub-IO) by
1498   calling cl_io_init() that (as above) creates a cl_io context with lovsub_io
1499   and osc_io layers. The initialized cl_io is primed from the top-IO
1500   (lov_io_sub_inherit()) and cl_io_iter_init() is called against it;
1501
1502 - Finally all sub-IOs for the current iteration are linked together into a
1503   lov_io::lis_active list.
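
The interplay between the iteration loop and the extent shrinking can be
sketched as follows. The toy_io type and both functions are hypothetical, and
the layout is assumed to be simple striping in which every stripe_size-byte
chunk of the file belongs to a single stripe.

```c
#include <assert.h>
#include <stdint.h>

struct toy_io {
        uint64_t ti_pos;        /* current file offset */
        uint64_t ti_count;      /* bytes left in the whole IO */
        uint64_t ti_it_count;   /* bytes handled by the current iteration */
        int      ti_continue;   /* analogue of cl_io::ci_continue */
};

/* Shrink [pos, pos + count) so that it does not cross a stripe boundary. */
static uint64_t toy_stripe_clamp(uint64_t pos, uint64_t count,
                                 uint64_t stripe_size)
{
        uint64_t stripe_end = (pos / stripe_size + 1) * stripe_size;

        return count > stripe_end - pos ? stripe_end - pos : count;
}

/* Drive the IO to completion; returns the number of iterations used. */
static int toy_io_loop(struct toy_io *io, uint64_t stripe_size)
{
        int iterations = 0;

        assert(stripe_size > 0);
        do {
                /* iter_init: decide how much to do in this iteration. */
                io->ti_it_count = toy_stripe_clamp(io->ti_pos, io->ti_count,
                                                   stripe_size);
                /* lock, start, end, unlock and iter_fini would run here. */

                /* advance: account for the bytes just processed. */
                io->ti_pos += io->ti_it_count;
                io->ti_count -= io->ti_it_count;
                io->ti_continue = io->ti_count > 0;
                iterations++;
        } while (io->ti_continue);
        return iterations;
}
```

For example, a 3000-byte read at offset 1000 with a 1024-byte stripe size is
split into four iterations of 24, 1024, 1024 and 928 bytes.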
1504
1505 Now we have a top-IO and its sub-IO in CIS_IT_STARTED state. cl_io_lock()
1506 collects locks on all layers without actually enqueuing them: vvp_io_rw_lock()
1507 requests a lock on the IO extent (possibly shrunk by LOV, see above) and
1508 optionally on extents of Lustre files that happen to be memory-mapped onto the
1509 user-level buffer used for this IO. In the future layers like SNS might request
1510 additional locks, e.g., to protect parity blocks.
1511
1512 Locks requested by ->cio_lock() methods are added to the cl_lockset embedded
1513 into top cl_io. The lockset contains 3 lock queues: "todo", "current" and
1514 "done".  Locks are initially placed in the todo queue. Once locks from all
1515 layers have been collected, they are sorted to avoid deadlocks
1516 (cl_io_locks_sort()) and then enqueued by cl_lockset_lock(). The latter can
1517 enqueue multiple locks concurrently if the enqueuing mode guarantees this is
1518 safe (e.g., lock is a try-lock). Locks being enqueued are in the "current"
1519 queue, from which they are moved to the "done" queue when the lock is granted.
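
A sketch of the sort-then-enqueue discipline follows. The sort key used here
(object first, then extent start) is an assumption made for illustration, as
are all toy_* names; the point is only that every IO acquires its locks in the
same global order, which rules out lock-order deadlocks between two IOs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum toy_queue { TOY_TODO, TOY_CURRENT, TOY_DONE };

struct toy_lock_req {
        uint64_t       tl_object;   /* object the lock is requested on */
        uint64_t       tl_start;    /* start of the locked extent */
        enum toy_queue tl_queue;    /* which queue the request is on */
};

/* Total order on (object, start). */
static int toy_lock_cmp(const void *a, const void *b)
{
        const struct toy_lock_req *x = a;
        const struct toy_lock_req *y = b;

        if (x->tl_object != y->tl_object)
                return x->tl_object < y->tl_object ? -1 : 1;
        if (x->tl_start != y->tl_start)
                return x->tl_start < y->tl_start ? -1 : 1;
        return 0;
}

/* Sort the todo queue, then enqueue strictly in order. Returns grants. */
static int toy_lockset_lock(struct toy_lock_req *reqs, size_t nr)
{
        int granted = 0;

        assert(reqs != NULL || nr == 0);
        qsort(reqs, nr, sizeof(reqs[0]), toy_lock_cmp);
        for (size_t i = 0; i < nr; i++) {
                reqs[i].tl_queue = TOY_CURRENT; /* enqueue in progress */
                /* A real client would wait for the grant here. */
                reqs[i].tl_queue = TOY_DONE;    /* granted */
                granted++;
        }
        return granted;
}
```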
1520
1521 At this stage we have top- and sub-IO in the CIS_LOCKED state with all needed
1522 locks held. cl_io_start() moves cl_io into the CIS_IO_GOING state and calls the
1523 ->cio_start() method. In the VVP layer this method invokes some version of the
1524 generic_file_{read,write}() function.
1525
1526 In the case of read, generic_file_read() calls for every non-up-to-date page
1527 the a_ops->readpage() method that eventually (after obtaining cl_page
1528 corresponding to the VM page supplied to it) calls cl_io_read_page() which in
1529 turn calls cl_io_operations::cio_read_page().
1530
1531 vvp_io_read_page() populates a queue with the target page and pages from the
1532 read-ahead window. The resulting queue is then submitted for immediate transfer
1533 by calling cl_io_submit_rw(), which ends up calling osc_io_submit_page() for
1534 every not-up-to-date page in the queue.
1535
1536 ->readpage() returns at this point, and the VM waits on the VM page lock. The
1537 transfer completion handler releases that lock, after which the VM copies the
1538 page data to the user buffer.
1539
1540 In the case of write, generic_file_write() calls the a_ops->prepare_write() and
1541 a_ops->commit_write() address space methods that end up calling
1542 cl_io_prepare_write() and cl_io_commit_write() respectively. These functions
1543 follow the normal Linux protocol for write, including a possible synchronous
1544 read of a non-overwritten part of a page (vvp_page_sync_io() call in
1545 vvp_io_prepare_partial()). In the normal case it ends up placing the dirtied
1546 page into the staging area (cl_page_cache_add() call in vvp_io_commit_write()).
1547 If the staging area is full already, cl_page_cache_add() fails with -EDQUOT and
1548 the page is transferred immediately by calling vvp_page_sync_io().
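
The fallback at the end of the write path can be sketched like this. The
toy_cache type and the three functions are hypothetical stand-ins that only
model the control flow; they do not reproduce the real signatures of
cl_page_cache_add() or vvp_page_sync_io().

```c
#include <assert.h>
#include <errno.h>

struct toy_cache {
        int tc_used;            /* pages currently staged */
        int tc_limit;           /* capacity of the staging area */
        int tc_sync_writes;     /* pages that had to be written at once */
};

/* Try to stage a dirtied page for a later, opportunistic transfer. */
static int toy_page_cache_add(struct toy_cache *c)
{
        assert(c->tc_limit >= 0);
        if (c->tc_used >= c->tc_limit)
                return -EDQUOT;         /* staging area is full */
        c->tc_used++;
        return 0;
}

/* Transfer the page immediately and wait for it to complete. */
static int toy_page_sync_io(struct toy_cache *c)
{
        c->tc_sync_writes++;
        return 0;
}

static int toy_commit_write(struct toy_cache *c)
{
        int rc = toy_page_cache_add(c);

        if (rc == -EDQUOT)
                rc = toy_page_sync_io(c);       /* fall back to sync write */
        return rc;
}
```

The caller of toy_commit_write() never sees -EDQUOT: a full staging area only
changes how the page reaches the server, not whether the write succeeds.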
1549
1550 9.3. Cached IO
1551 ==============
1552
1553 Subsequent IO calls will, most likely, find suitable locks already cached on
1554 the client. This happens because the server tries to grant as large a lock as
1555 possible, to reduce future enqueue RPC traffic for a given file from a given
1556 client. Cached locks are kept (in no particular order) on a
1557 cl_object_header::coh_locks list. When, in the cl_io_lock() step, a layer
1558 requests a lock, this list is scanned for a matching lock. If a found lock is
1559 in the HELD or CACHED state it can be re-used immediately by simply calling the
1560 cl_lock_use() method, which eventually calls ldlm_lock_addref_try() to protect
1561 the underlying DLM lock from a concurrent cancellation while IO is going on. If
1562 a lock in another (NEW, QUEUING or ENQUEUED) state is found, it is enqueued as
1563 usual.
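
The matching step can be sketched as a linear scan over the per-object lock
list. The code below is a simplification with hypothetical names: lock modes
are ignored, a match is assumed to mean that the cached extent fully covers
the requested one, and the three outcomes are reported as an enum instead of
being acted upon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_lock_state {
        TOY_NEW, TOY_QUEUING, TOY_ENQUEUED, TOY_HELD, TOY_CACHED
};

struct toy_lock {
        uint64_t            tl_start;   /* first byte covered */
        uint64_t            tl_end;     /* last byte covered (inclusive) */
        enum toy_lock_state tl_state;
        int                 tl_users;
        struct toy_lock    *tl_next;    /* link in the per-object list */
};

enum toy_action { TOY_CREATE, TOY_ENQUEUE, TOY_REUSE };

/* Find a lock whose extent covers [start, end]; the list is unordered. */
static struct toy_lock *toy_lock_match(struct toy_lock *head,
                                       uint64_t start, uint64_t end)
{
        assert(start <= end);
        for (struct toy_lock *l = head; l != NULL; l = l->tl_next)
                if (l->tl_start <= start && end <= l->tl_end)
                        return l;
        return NULL;
}

static enum toy_action toy_lock_request(struct toy_lock *head,
                                        uint64_t start, uint64_t end)
{
        struct toy_lock *lock = toy_lock_match(head, start, end);

        if (lock == NULL)
                return TOY_CREATE;      /* nothing cached: new lock needed */
        if (lock->tl_state == TOY_HELD || lock->tl_state == TOY_CACHED) {
                lock->tl_users++;       /* pin against cancellation */
                lock->tl_state = TOY_HELD;
                return TOY_REUSE;
        }
        return TOY_ENQUEUE;             /* found, but not yet granted */
}
```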
1564
1565 9.4. Lock-less and No-cache IO
1566 ==============================
1567
1568 An IO context has a "locking mode", one of MAYBE, NEVER, or MANDATORY
1569 (enum cl_io_lock_dmd), that specifies what degree of distributed cache
1570 coherency is assumed by this IO. MANDATORY mode requires all caches accessed by
1571 this IO to be protected by distributed locks. In NEVER mode no distributed
1572 coherency is needed at the expense of not caching the data. This mode is
1573 required for cases where a client cannot or will not participate in the
1574 cache coherency protocol (e.g., a liblustre client that cannot respond to the
1575 lock blocking call-backs while in the compute phase). In MAYBE mode some of the
1576 caches involved in this IO are used and are globally coherent, and some other
1577 caches are bypassed.
1578
1579 O_APPEND writes and truncates are always executed in MANDATORY mode. All other
1580 calls are executed in NEVER mode by liblustre (see below) and in MAYBE mode by
1581 a normal Linux client.
1582
1583 In MAYBE mode every OSC individually decides whether to use DLM. An OST might
1584 return -EUSERS to an enqueue RPC indicating that the stripe in question is
1585 contended and that the client should switch to the lockless IO mode. If this
1586 happens, OSC, instead of using ldlm_lock, creates a special "lockless OSC lock"
1587 that is not backed by a DLM lock. This lock conflicts with any other lock in
1588 its range and self-cancels when its last user is removed. As a result, when IO
1589 proceeds to the stripe that is in lockless mode, all conflicting extent locks
1590 are cancelled, purging the cache. When IO against this stripe ends, the lock is
1591 cancelled, sending dirty pages (just placed in the cache by IO) back to the
1592 server and invalidating the cache again. "Lockless locks" allow lockless and
1593 no-cache IO mode to be implemented by the same code paths as cached IO.
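
The mode selection rules of this section can be condensed into a short sketch.
The enum and function names are hypothetical (the real enumeration is enum
cl_io_lock_dmd), and treating MANDATORY as "never lockless" is an assumption
drawn from the description of that mode above.

```c
#include <errno.h>
#include <stdbool.h>

enum toy_lock_dmd { TOY_LOCK_NEVER, TOY_LOCK_MAYBE, TOY_LOCK_MANDATORY };
enum toy_io_kind  { TOY_IO_READ, TOY_IO_WRITE, TOY_IO_APPEND,
                    TOY_IO_TRUNCATE };

/* Choose the locking mode for a whole IO. */
static enum toy_lock_dmd toy_io_lock_mode(enum toy_io_kind kind,
                                          bool liblustre)
{
        if (kind == TOY_IO_APPEND || kind == TOY_IO_TRUNCATE)
                return TOY_LOCK_MANDATORY;
        return liblustre ? TOY_LOCK_NEVER : TOY_LOCK_MAYBE;
}

/*
 * Decide, per stripe, whether IO proceeds under a "lockless lock".
 * enqueue_rc is the result of the DLM enqueue for that stripe.
 */
static bool toy_stripe_is_lockless(enum toy_lock_dmd mode, int enqueue_rc)
{
        switch (mode) {
        case TOY_LOCK_NEVER:
                return true;                    /* never uses DLM */
        case TOY_LOCK_MAYBE:
                return enqueue_rc == -EUSERS;   /* server reports contention */
        case TOY_LOCK_MANDATORY:
        default:
                return false;                   /* always uses DLM (assumed) */
        }
}
```

Because the decision is taken per stripe, a single MAYBE-mode IO can use
cached, DLM-protected pages on one stripe and lockless transfers on another.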
1594
1595 * * * END * * *