lustre/doc/clio.txt

   1                ******************************************************
   2                * Overview of the Lustre Client I/O (CLIO) subsystem *
   3                ******************************************************
   4
   5 Original Authors:
   6 =================
   7 Nikita Danilov <Nikita_Danilov@xyratex.com>
   8
   9 Topics
  10 ======
  11
  12 1. Overview
  13         1.1. Goals
  14         1.2. Terminology
  15                 i.   I/O vs. Transfer
  16                 ii.  Top-{object,lock,page}, Sub-{object,lock,page}
  17                 iii. VVP, llite
  18         1.3. Main Differences with the Pre-CLIO Client Code
  19         1.4. Layered objects, Slices, Headers
  20         1.5. Instantiation
  21         1.6. Life Cycle
  22         1.7. State Machines
  23         1.8. Finalization
  24         1.9. Code Structure
  25 2. Layers
  26         2.1. VVP, Echo-client
  27         2.2. LOV, LOVSUB (layouts)
  28         2.3. OSC
  29 3. Objects
  30         3.1. FID, Hashing, Caching, LRU
  31         3.2. Top-object, Sub-object
  32         3.3. Object Operations
  33         3.4. Object Attributes
  34         3.5. Object Layout
  35 4. Pages
  36         4.1. Page Indexing
  37         4.2. Page Ownership
  38         4.3. Page Transfer Locking
  39         4.4. Page Operations
  40         4.5. Page Initialization
  41 5. Locks
  42         5.1. Lock Life Cycle
  43         5.2. cl_lock and LDLM Lock
  44         5.3. Use Case: Lock Invalidation
  45 6. IO
  46         6.1. Fixed IO Types
  47         6.2. IO State Machine
  48         6.3. Parallel IO
  49         6.4. Data-flow: From Stack to IO Slice
  50 7. Transfer
  51         7.1. Immediate vs. Opportunistic Transfers
  52         7.2. Page Lists
  53         7.3. Transfer States: Prepare, Completion
  54         7.4. Page Completion Handlers, Synchronous Transfer
  55 8. LU Environment (lu_env)
  56         8.1. Motivation, Server Environment Usage
  57         8.2. Client Usage
  58         8.3. Sub-environments
  59 9. Use Cases
  60         9.1. Inode Creation
  61         9.2. First IO to a File
  62                 i. Read, Read-ahead
  63                 ii. Write
  64         9.3. Lock-less and No-cache IO
  65
  66 ===============
  67 = 1. Overview =
  68 ===============
  69
  70 1.1. Goals
  71 ==========
  72
  73 CLIO is a re-write of interfaces between layers in the client data-path (read,
  74 write, truncate). Its goals are:
  75
  76 - Reduce the number of bugs in the IO path;
  77
  78 - Introduce more logical layer interfaces instead of current all-in-one OBD
  79   device interface;
  80
  81 - Define clear and precise semantics for the interface entry points;
  82
  83 - Simplify the structure of the client code;
  84
  85 - Support upcoming features:
  86
  87         . SNS,
  88         . p2p caching,
  89         . parallel non-blocking IO,
  90         . pNFS;
  91
  92 - Reduce stack consumption.
  93
  94 Restrictions:
  95
  96 - No meta-data changes;
  97 - No support for 2.4 kernels;
  98 - Portable code;
  99 - No changes to recovery;
 100 - The same layers with mostly the same functionality;
 101 - As few changes to the core logic of each Lustre data-stack layer as possible
 102   (e.g., no changes to the read-ahead or OSC RPC logic).
 103
 104 1.2. Terminology
 105 ================
 106
 107 Any discussion of client functionality has to talk about `read' and `write'
 108 system calls on the one hand and about `read' and `write' requests to the
 109 server on the other hand. To avoid confusion, the former high level operations
 110 are called `IO', while the latter are called `transfer'.
 111
 112 Many concepts apply uniformly to pages, locks, files, and IO contexts, such as
 113 reference counting, caching, etc. To describe such situations, a common term is
 114 needed to denote things from any of the above classes. `Object' would be a
 115 natural choice, but files and especially stripes are already called objects, so
 116 the term `entity' is used instead.
 117
 118 Because of striping, it is often the case that an entity is composed of
 119 multiple entities of the same kind: a file is composed of stripe objects, a
 120 logical lock on a file is composed of stripe locks on the file's stripes, etc.
 121 In these cases we talk about a top-object, top-lock, top-page, top-IO, etc.
 122 being constructed from sub-objects, sub-locks, sub-pages, sub-IOs respectively.
 123
 124 The topmost module in the Linux client is traditionally known as `llite'.  The
 125 corresponding CLIO layer is called `VVP' (VFS, VM, POSIX) to reflect its
 126 functional responsibilities.
 127
 128 1.3. Main Differences with the Pre-CLIO Client Code
 129 ===================================================
 130
 131 - Locks on files (as opposed to locks on stripes) are first-class objects (i.e.
 132   layered entities);
 133
 134 - Sub-objects (stripes) are first-class objects;
 135
 136 - Stripe-related logic is moved out of llite (almost);
 137
 138 - IO control flow is different:
 139
 140         . Pre-CLIO: llite implements control flow, calling underlying OBD
 141           methods as necessary;
 142
 143         . CLIO: generic code (cl_io_loop()) controls IO logic calling all
 144           layers, including VVP.
 145
 146   In other words, VVP (or any other top-layer), rather than only calling some
 147   pre-existing `lustre interface', also implements parts of this interface.
 148
 149 - The lu_env allocator from MDT is used on the client.
 150
 151 1.4. Layered Objects
 152 ====================
 153
 154 CLIO continues the layered object approach that was found to be useful for the
 155 MDS rewrite in Lustre 2.0. In this approach, instances of key entity types
 156 (files, pages, locks, etc.) are represented as a header, containing attributes
 157 shared by all layers. Each header contains a linked list of per-layer `slices',
 158 each of which contains a pointer to a vector of function pointers. Generic
 159 operations on layered objects are implemented by going through the list of
 160 slices and invoking the corresponding function from the operation vector at
 161 every layer.  In this way generic object behavior is delegated to the layers.
 162
 163 For example, a page entity is represented by struct cl_page, off which hangs
 164 a list of cl_page_slice structures, one for each layer in the stack.
 165 cl_page_slice contains a pointer to struct cl_page_operations.
 166 cl_page_operations has the field
 167
 168         void (*cpo_completion)(const struct lu_env *env,
 169               const struct cl_page_slice *slice, int ioret);
 170
 171 When transfer of a page is finished, ->cpo_completion() methods are called in a
 172 particular order (bottom to top in this case).
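
The header-and-slices arrangement described above can be sketched in a few
lines of C. This is only an illustration under simplifying assumptions: all
demo_* names are hypothetical, the slice list is a plain singly linked list
rather than the list type used by the real code, and lu_env is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct demo_page;
struct demo_page_slice;

/* Per-layer operation vector (hypothetical stand-in for cl_page_operations). */
struct demo_page_ops {
	void (*dpo_completion)(struct demo_page_slice *slice, int ioret);
};

/* One slice per layer; each slice points at the slice of the layer above. */
struct demo_page_slice {
	const struct demo_page_ops *dps_ops;
	struct demo_page           *dps_page;
	struct demo_page_slice     *dps_up;    /* towards the top layer */
	int                         dps_order; /* position in the call order */
};

/* Header shared by all layers (hypothetical stand-in for cl_page). */
struct demo_page {
	struct demo_page_slice *dp_bottom;
	int                     dp_calls;
};

/* Attach a slice below the slices already attached to the page. */
void demo_page_slice_add(struct demo_page *page, struct demo_page_slice *slice,
			 const struct demo_page_ops *ops)
{
	slice->dps_ops = ops;
	slice->dps_page = page;
	slice->dps_up = page->dp_bottom;
	slice->dps_order = 0;
	page->dp_bottom = slice;
}

/* Generic operation: visit every layer, bottom to top. */
void demo_page_completion(struct demo_page *page, int ioret)
{
	struct demo_page_slice *slice;

	for (slice = page->dp_bottom; slice != NULL; slice = slice->dps_up) {
		assert(slice->dps_ops != NULL);
		if (slice->dps_ops->dpo_completion != NULL)
			slice->dps_ops->dpo_completion(slice, ioret);
	}
}

/* Example layer method: record when this layer was called. */
void demo_completion_record(struct demo_page_slice *slice, int ioret)
{
	(void)ioret;
	slice->dps_order = ++slice->dps_page->dp_calls;
}
```

If a "VVP" slice is added first and an "OSC" slice second, the generic
demo_page_completion() reaches the OSC slice before the VVP slice, which
mirrors the bottom-to-top order mentioned above.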
 173
 174 Allocation of slices is done during instance creation. If a layer needs some
 175 private state for an object, it embeds the slice into its own data structure.
 176 For example, the OSC layer defines
 177
 178         struct osc_lock {
 179                 struct cl_lock_slice ols_cl;
 180                 struct ldlm_lock *ols_dlmlock;
 181                 ...
 182         };
 183
 184 When an operation from cl_lock_operations is called, it is given a pointer to
 185 struct cl_lock_slice, and the layer casts it to its private structure (for
 186 example, struct osc_lock) to access per-layer state.
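
One common way to get from an embedded slice back to the enclosing private
structure is pointer arithmetic with offsetof(), in the style of the kernel's
container_of(). The sketch below assumes that approach; the demo_* types are
hypothetical and much smaller than the real cl_lock_slice and osc_lock.

```c
#include <assert.h>
#include <stddef.h>

/* Generic slice, as seen by layer-independent code (hypothetical). */
struct demo_lock_slice {
	int dls_layer_id;
};

/* Private per-layer state that embeds the slice. */
struct demo_osc_lock {
	int                    dol_flags;
	struct demo_lock_slice dol_cl;      /* need not be the first member */
	void                  *dol_dlmlock; /* stand-in for an LDLM lock pointer */
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* What a layer method would do first: slice in, private state out. */
struct demo_osc_lock *demo_slice2osc(struct demo_lock_slice *slice)
{
	assert(slice != NULL);
	return demo_container_of(slice, struct demo_osc_lock, dol_cl);
}
```

Because the offset is computed by the compiler, the slice can sit anywhere
inside the private structure; a plain cast would only work if the slice were
the first member.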
 187
 188 The following types of layered objects exist in CLIO:
 189
 190 - File system objects (files and stripes): struct cl_object_header, slices
 191   are of type struct cl_object;
 192
 193 - Cached pages with data: struct cl_page, slices are of type
 194   cl_page_slice;
 195
 196 - Extent locks: struct cl_lock, slices are of type cl_lock_slice;
 197
 198 - IO context: struct cl_io, slices are of type cl_io_slice;
 199
 200 - Transfer request: struct cl_req, slices are of type cl_req_slice.
 201
 202 1.5. Instantiation
 203 ==================
 204
 205 Entities with different sequences of slices can co-exist. A typical example of
 206 this is a local vs. remote object on the MDS. A local object, based on some
 207 file in the local file system has MDT, MDD, LOD and OSD as its layers, whereas
 208 a remote object (representing an object local to some other MDT) has MDT, MDD,
 209 LOD, and OSP layers.
 210
 211 When the client is being mounted, its device stack is configured according to
 212 llog configuration records. The typical configuration is
 213
 214                            vvp_device
 215                                 |
 216                                 V
 217                            lov_device
 218                                 |
 219                         +---+---+---+---+
 220                         |   |   |   |   |
 221                         V   V   V   V   V
 222                       .... osc_device's ....
 223
 224 In this tree every node knows its descendants. When a new file (inode) is
 225 created, every layer, starting from the top, creates a slice with a state and
 226 an operation vector for this layer, appends this slice to the tail of a list
 227 anchored at the object header, and then calls the corresponding lower layer
 228 device to do the same. That is, the file object structure is determined by the
 229 configuration of devices to which this file belongs.
 230
 231 Pages and locks, in turn, belong to the file objects, and when a new page is
 232 created for a given object, slices of this object are iterated through and
 233 every slice is asked to initialize a new page, which includes (usually)
 234 allocation of a new page slice and its insertion into a list of page slices.
 235 Locks and IO context instantiation is handled similarly.
 236
 237 1.6. Life cycle
 238 ===============
 239
 240 All layered objects except locks, IO contexts and transfer requests (that
 241 is, file objects and pages) are reference counted and cached. They have a
 242 uniform caching mechanism:
 243
 244 - Objects are kept in some sort of an index (global FID hash for file objects,
 245   per-file radix tree for pages, and per-file list for locks);
 246
 247 - A reference for an object can be acquired by cl_{object,page}_find() functions
 248   that search the index, and if object is not there, create a new one
 249   and insert it into the index;
 250
 251 - A reference is released by cl_{object,page}_put() functions. When the last
 252   reference is released, the object is returned to the cache (still in the
 253   index), except when the user explicitly set `do not cache' flag for this
 254   object. In the latter case the object is destroyed immediately.
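
The find/put protocol above can be illustrated with a toy index. This is a
sketch under stated assumptions, not the real implementation: the index is a
small fixed array instead of a hash table or radix tree, there is no locking,
and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_INDEX_SIZE 8

struct demo_obj {
	unsigned long do_key;     /* stand-in for a FID or a page index */
	int           do_ref;     /* number of outstanding references */
	int           do_nocache; /* the `do not cache' flag */
};

struct demo_index {
	struct demo_obj *di_slot[DEMO_INDEX_SIZE];
};

/* Look the key up; on a miss, create the object and insert it. */
struct demo_obj *demo_find(struct demo_index *idx, unsigned long key)
{
	struct demo_obj *obj;
	int i, free_slot = -1;

	for (i = 0; i < DEMO_INDEX_SIZE; i++) {
		obj = idx->di_slot[i];
		if (obj == NULL) {
			if (free_slot < 0)
				free_slot = i;
		} else if (obj->do_key == key && !obj->do_nocache) {
			obj->do_ref++;        /* cache hit */
			return obj;
		}
	}
	if (free_slot < 0)
		return NULL;                  /* toy index is full */
	obj = calloc(1, sizeof(*obj));
	if (obj == NULL)
		return NULL;
	obj->do_key = key;
	obj->do_ref = 1;
	idx->di_slot[free_slot] = obj;
	return obj;
}

/* Drop a reference. Returns 1 if the object was destroyed, 0 otherwise. */
int demo_put(struct demo_index *idx, struct demo_obj *obj)
{
	int i;

	assert(obj->do_ref > 0);
	if (--obj->do_ref > 0 || !obj->do_nocache)
		return 0;     /* still referenced, or left in the cache */
	for (i = 0; i < DEMO_INDEX_SIZE; i++)
		if (idx->di_slot[i] == obj)
			idx->di_slot[i] = NULL;
	free(obj);
	return 1;
}
```

Note that an object whose last reference is dropped stays in the index with a
zero reference count, so a later demo_find() for the same key returns the same
object; only the `do not cache' flag turns the last put into destruction.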
 255
 256 Locks (cl_lock) are owned by individual IO threads. cl_lock is a container of
 257 lock requirements for underlying DLM locks. cl_lock is allocated by
 258 cl_lock_request() and destroyed by cl_lock_cancel(). DLM locks are cacheable
 259 and can be reused by cl_locks.
 260
 261 IO contexts are owned by a thread (or, potentially a group of threads) doing
 262 IO, and need neither reference counting nor indexing. Similarly, transfer
 263 requests are owned by an OSC device, and their lifetime is from RPC creation
 264 until completion notification.
 265
 266 1.7. State Machines
 267 ===================
 268
 269 All types of layered objects contain a state-machine, although for the transfer
 270 requests this machine is trivial (CREATED -> PREPARED -> INFLIGHT ->
 271 COMPLETED), and for the file objects it is very simple.  Page, lock, and IO
 272 state machines are described in more detail below.
 273
 274 As a generic rule, state machine transitions are made under some kind of lock:
 275 VM lock for a page, and LU site spin-lock for an object. After some event that
 276 might cause a state transition, such a lock is taken, and the object state is
 277 analysed to check whether the transition is possible. If so, the state machine
 278 is advanced to the new state and the lock is released. IO state transitions do
 279 not require concurrency control.
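
As an illustration of the rule above, here is a minimal sketch of the trivial
transfer request machine (CREATED -> PREPARED -> INFLIGHT -> COMPLETED) with
transitions checked and applied under a lock. The lock is modelled with a C11
atomic_flag used as a spin-lock, which is an assumption made only to keep the
example self-contained; the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum demo_req_state {
	DRS_CREATED,
	DRS_PREPARED,
	DRS_INFLIGHT,
	DRS_COMPLETED
};

struct demo_req {
	atomic_flag         dr_guard; /* stand-in for the protecting lock */
	enum demo_req_state dr_state;
};

void demo_req_init(struct demo_req *req)
{
	assert(req != NULL);
	atomic_flag_clear(&req->dr_guard);
	req->dr_state = DRS_CREATED;
}

/*
 * Take the lock, check whether the transition is possible, advance the
 * state if it is, and release the lock. Only a step to the immediately
 * following state is accepted. Returns 1 on success and 0 if refused.
 */
int demo_req_advance(struct demo_req *req, enum demo_req_state next)
{
	int ok;

	while (atomic_flag_test_and_set(&req->dr_guard))
		; /* spin until the lock is free */
	ok = (int)next == (int)req->dr_state + 1;
	if (ok)
		req->dr_state = next;
	atomic_flag_clear(&req->dr_guard);
	return ok;
}
```

An attempt to skip a state (for example CREATED directly to INFLIGHT) or to
advance past COMPLETED is refused and leaves the state unchanged.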
 280
 281 1.8. Finalization
 282 =================
 283
 284 State machine and reference counting interact during object destruction. In
 285 addition to temporary pointers to an entity (that are counted in its reference
 286 counter), an entity is reachable through
 287
 288 - Indexing structures described above
 289
 290 - Pointers internal to some layer of this entity. For example, a page is
 291   reachable through a pointer from VM page, and sub-{object, lock, page} might
 292   be reachable through a pointer from its top-entity.
 293
 294 Entity destruction happens in three phases:
 295
 296 - First, a decision is made to destroy an entity, when, for example, a page is
 297   truncated from a file, or an inode is destroyed. At this point the `do not
 298   cache' bit is set in the entity header, and all ways to reach the entity from
 299   internal pointers are severed.
 300
 301   cl_{page,object}_get() functions never return an entity with the `do not
 302   cache' bit set, so from this moment no new internal pointers can be
 303   obtained.
 304
 305 - Pointers `drain' for some time as existing references are released. In
 306   this phase the entity is reachable through
 307
 308         . temporary pointers, counted in its reference counter, and
 309         . possibly a pointer in the indexing structure.
 310
 311 - When the last reference is released, the entity can be safely freed (after
 312   possibly removing it from the index).
 313
 314   See lu_object_put(), cl_page_put().
 315
 316 1.9. Code Structure
 317 ===================
 318
 319 The CLIO code resides in the following files:
 320
 321   {llite,lov,osc}/*_{dev,object,lock,page,io}.c
 322   obdclass/cl_*.c
 323   include/cl_object.h
 324
 325 All global CLIO data-types are defined in include/cl_object.h header which
 326 contains detailed documentation. Generic CLIO code is in
 327 obdclass/cl_{object,page,lock,io}.c
 328
 329 An implementation of CLIO interfaces for a layer foo is located in
 330 foo/foo_{dev,object,page,lock,io}.c files.
 331
 332 Definitions of data-structures shared within a layer are in
 333 foo/foo_cl_internal.h.
 334
 335 =============
 336 = 2. Layers =
 337 =============
 338
 339 This section briefly outlines the responsibilities of every layer in the
 340 stack. More detailed descriptions of functionality are in the following
 341 sections on objects, pages and locks.
 342
 343 2.1. VVP, Echo-client
 344 =====================
 345
 346 There are currently two options for the top-most Lustre layer:
 347
 348 - VVP: Linux kernel client,
 349 - echo-client: special client used by the Lustre testing sub-system.
 350
 351 Other possibilities are:
 352
 353 - Client ports to other operating systems (OSX, Windows, Solaris),
 354 - pNFS and NFS exports.
 355
 356 The responsibilities of the top-most layer include:
 357
 358 - Definition of the entry points through which Lustre is accessed by the
 359   applications;
 360 - Interaction with the hosting VM/MM system;
 361 - Interaction with the hosting VFS or equivalent;
 362 - Implementation of the desired semantics on top of Lustre (e.g. POSIX
 363   or Win32 semantics).
 364
 365 Let's look at VVP in more detail. First, VVP implements VFS entry points
 366 required by the Linux kernel interface: ll_file_{read,write,sendfile}(). Then,
 367 VVP implements VM entry points: ll_{write,invalidate,release}page().
 368
 369 For file objects, the VVP slice (vvp_object) contains a pointer to an inode.
 370
 371 For pages, the VVP slice (vvp_page) contains a pointer to the VM page
 372 (struct page), a `defer up to date' flag to track read-ahead hits (similar to
 373 the pre-CLIO client), and fields necessary for synchronous transfer (see
 374 below).  VVP is responsible for implementation of the interaction between
 375 client page (cl_page) and the VM.
 376
 377 There is no special VVP private state for locks.
 378
 379 For IO, VVP implements
 380
 381 - Mapping from Linux specific entry points (readv, writev, sendfile, etc.)
 382   to Lustre IO loop,
 383
 384 - mmap,
 385
 386 - POSIX features like short reads, O_APPEND atomicity, etc.
 387
 388 - Read-ahead (this is arguably not the best layer in which to implement
 389   read-ahead, as the existing read-ahead algorithm is network-aware).
 390
 391 2.2. LOV, LOVSUB
 392 ================
 393
 394 The LOV layer implements RAID-0 striping. It maps top-entities (file objects,
 395 locks, pages, IOs) to one or more sub-entities. LOVSUB is a companion layer
 396 that does the reverse mapping.
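
For readers unfamiliar with RAID-0, the offset arithmetic behind such a mapping
can be sketched as follows. This is a generic illustration, not the LOV code:
it assumes a fixed stripe size and stripe count, and the demo_* names and
structure fields are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct demo_layout {
	uint64_t dl_stripe_size;  /* bytes per stripe unit */
	uint32_t dl_stripe_count; /* number of sub-objects */
};

struct demo_stripe_pos {
	uint32_t dsp_stripe; /* index of the sub-object */
	uint64_t dsp_offset; /* byte offset within that sub-object */
};

/* Map a byte offset in the file to a (sub-object, offset) pair. */
struct demo_stripe_pos demo_file_to_stripe(const struct demo_layout *lo,
					   uint64_t off)
{
	struct demo_stripe_pos pos;
	uint64_t unit;

	assert(lo->dl_stripe_size > 0 && lo->dl_stripe_count > 0);
	unit = off / lo->dl_stripe_size;       /* stripe unit in the file */
	pos.dsp_stripe = (uint32_t)(unit % lo->dl_stripe_count);
	pos.dsp_offset = (unit / lo->dl_stripe_count) * lo->dl_stripe_size +
			 off % lo->dl_stripe_size;
	return pos;
}

/* Inverse arithmetic: (sub-object, offset) back to a file offset. */
uint64_t demo_stripe_to_file(const struct demo_layout *lo, uint32_t stripe,
			     uint64_t sub_off)
{
	uint64_t row;

	assert(lo->dl_stripe_size > 0 && stripe < lo->dl_stripe_count);
	row = sub_off / lo->dl_stripe_size;    /* stripe unit in sub-object */
	return (row * lo->dl_stripe_count + stripe) * lo->dl_stripe_size +
	       sub_off % lo->dl_stripe_size;
}
```

With a 1 MiB stripe size and 4 stripes, file offset 5 MiB + 100 falls into
sub-object 1 at offset 1 MiB + 100, and the inverse function maps it back.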
 397
 398 2.3. OSC
 399 ========
 400
 401 The OSC layer handles networking:
 402
 403 - It decides when an efficient RPC can be formed from cached data;
 404
 405 - It calls LNET to initiate a transfer and to get notification of completion;
 406
 407 - It calls LDLM to implement distributed cache coherency, and to get
 408   notifications of lock cancellation requests.
 409
 410 ==============
 411 = 3. Objects =
 412 ==============
 413
 414 3.1. FID, Hashing, Caching, LRU
 415 ===============================
 416
 417 Files and stripes are collectively known as (file system) `objects'. The CLIO
 418 client reuses support for layered objects from the MDT stack. Both client and
 419 MDT objects are based on struct lu_object type, representing a slice of a file
 420 system object. lu_object's for a given object are linked through the
 421 ->lo_linkage field into a list hanging off field ->loh_layers of struct
 422 lu_object_header, that represents a whole layered object.
 423
 424 lu_object and lu_object_header provide functionality common between a client
 425 and a server:
 426
 427 - An object is uniquely identified by a FID; all objects are kept in a hash
 428   table indexed by a FID;
 429
 430 - Objects are reference counted. When the last reference to an object is
 431   released it is returned back into the cache, unless it has been explicitly
 432   marked for deletion, in which case it is immediately destroyed;
 433
 434 - Objects in the cache are kept in a LRU list that is scanned to keep cache
 435   size under control.
 436
 437 On the MDT, lu_object is wrapped into struct md_object where additional state
 438 that all server-side objects have is stored. Similarly, on a client, lu_object
 439 and lu_object_header are embedded into struct cl_object and struct
 440 cl_object_header where additional client state is stored.
 441
 442 3.2. Top-object, Sub-object
 443 ===========================
 444
 445 An important distinction from the server side, where md_object and dt_object
 446 are used, is that cl_object "fans out" at the LOV level: depending on the file
 447 layout, a single file is represented as a set of "sub-objects" (stripes). At
 448 the implementation level, struct lov_object contains an array of cl_objects.
 449 Each sub-object is a full-fledged cl_object, having its FID and living in the
 450 LRU and hash table. Each sub-object has its own radix tree of pages, and its
 451 own list of locks.
 452
 453 This leads to the next important difference with the server side: on the
 454 client, it's quite usual to have objects with different sequences of layers.
 455 For example, a typical top-object is composed of the following layers:
 456
 457 - VVP
 458 - LOV
 459
 460 whereas its sub-objects are composed of layers:
 461
 462 - LOVSUB
 463 - OSC
 464
 465 Here "LOVSUB" is a mostly dummy layer, whose purpose is to keep track of the
 466 object-subobject relationship:
 467
 468                   cl_object_header-+--->object LRU list
 469                           |
 470                           V
 471            inode<----vvp_object
 472                           |
 473                           V
 474                      lov_object
 475                           |
 476                   +---+---+---+---+
 477                   |   |   |   |   |
 478                   V   |   |   |   |
 479      cl_object_header |   .   .   .
 480            |          |   .   .   .
 481            |          V
 482            .  cl_object_header-+--->object LRU list
 483                       |
 484                       V
 485                 lovsub_object
 486                       |
 487                       V
 488                  osc_object
 489
 490 Sub-objects are not cached independently: when a top-object is about to be
 491 discarded from memory, all its sub-objects are torn down and destroyed too.
 492
 493 3.3. Object Operations
 494 ======================
 495
 496 In addition to the lu_object_operations vector, each cl_object slice has
 497 cl_object_operations. lu_object_operations deals with creation and
 498 destruction of objects. Client specific cl_object_operations fall into two
 499 categories:
 500
 501 - Creation of dependent entities: these are ->coo_{page,lock,io}_init()
 502   methods called at every layer when a new page, lock or IO context is being
 503   created, and
 504
 505 - Object attributes: ->coo_attr_{get,set}() methods that are called to get or
 506   set common client object attributes (struct cl_attr): size, [mac]times, etc.
 507
 508 3.4. Object Attributes
 509 ======================
 510
 511 A cl_object has a set of attributes defined by struct cl_attr. Attributes
 512 include object size, object known-minimum-size (KMS), access, change and
 513 modification times and ownership identifiers. Description of KMS is beyond the
 514 scope of this document, refer to the (non-)existent Lustre documentation on the
 515 subject.
 516
 517 Both top-objects and sub-objects have attributes. Consistency of the attributes
 518 is protected by a lock on the top-object, accessible through
 519 cl_object_attr_{un,}lock() calls. This allows a sub-object and its top-object
 520 attributes to be changed atomically.
 521
 522 Attributes are accessible through cl_object_attr_{g,s}et() functions that call
 523 per-layer ->coo_attr_{g,s}et() object methods. Top-object attributes are
 524 calculated from the sub-object ones by lov_attr_get() that optimizes for the
 525 case when none of sub-object attributes have changed since the last call to
 526 lov_attr_get().
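
To make "calculated from the sub-object ones" concrete, the sketch below merges
sub-object sizes into a file size under plain RAID-0 striping. It is an
illustration of the arithmetic only, under the assumption of a fixed stripe
size and count; it is not the code of lov_attr_get(), it ignores KMS and the
caching optimization, and demo_merge_size() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * For every non-empty sub-object, find where its last byte lands in the
 * file; the file size is one past the largest such position.
 */
uint64_t demo_merge_size(uint64_t stripe_size, uint32_t stripe_count,
			 const uint64_t *sub_size)
{
	uint64_t size = 0;
	uint32_t i;

	assert(stripe_size > 0);
	for (i = 0; i < stripe_count; i++) {
		uint64_t last, row, end;

		if (sub_size[i] == 0)
			continue;
		last = sub_size[i] - 1;      /* last byte in the sub-object */
		row = last / stripe_size;    /* its stripe unit */
		end = (row * stripe_count + i) * stripe_size +
		      last % stripe_size + 1;
		if (end > size)
			size = end;
	}
	return size;
}
```

For example, with a stripe size of 100 bytes and three stripes, sub-object
sizes of 100, 50 and 0 bytes correspond to a 150-byte file.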
 527
 528 As a further potential optimization, one can recalculate top-object attributes
 529 at the moment when any sub-object attribute is changed. This would avoid
 530 collecting cumulative attributes over all sub-objects. To implement this
 531 optimization _all_ changes of sub-object attributes must go through
 532 cl_object_attr_set().
 533
 534 3.5. Object Layout
 535 ==================
 536
 537 The layout of an object decides how data of the file are placed onto OSTs. The
 538 object layout can be changed, and if that happens, a client has to reconfigure
 539 the object layout by calling cl_conf_set() before it can do anything to this
 540 object.
 541
 542 To be notified of a layout change, the client has to keep an inodebits lock,
 543 MDS_INODELOCK_LAYOUT, cached in memory. To change an object's layout, the MDT
 544 takes the layout lock in EX mode, so all clients that have this object cached
 545 are notified.
 546
 547 Reconfiguring the layout of an object is expensive because it has to clean up
 548 the page cache and rebuild the sub-objects. The lsm_layout_gen field in
 549 lov_stripe_md must be increased whenever the layout of an object is changed.
 550 Therefore, if the layout lock was revoked only because of false sharing of the
 551 ibits lock, lsm_layout_gen is unchanged and the client can reuse the page
 552 cache and sub-objects.
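
The generation check amounts to a single comparison. The sketch below assumes
the client remembers the generation it last configured and compares it with the
freshly fetched one; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_layout_action {
	DLA_REUSE,       /* keep page cache and sub-objects */
	DLA_RECONFIGURE  /* tear down and rebuild */
};

struct demo_cached_layout {
	uint32_t dcl_gen; /* generation seen when the cache was built */
};

/* Decide what to do after the layout lock has been re-acquired. */
enum demo_layout_action demo_layout_check(struct demo_cached_layout *cached,
					  uint32_t fetched_gen)
{
	assert(cached != NULL);
	if (cached->dcl_gen == fetched_gen)
		return DLA_REUSE;       /* lock was lost to false sharing */
	cached->dcl_gen = fetched_gen;  /* remember the new layout */
	return DLA_RECONFIGURE;
}
```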
 553
 554 CLIO uses ll_refresh_layout() to make sure that a valid layout is fetched. This
 555 function must be called before any IO can be started to an object.
 556
 557 ============
 558 = 4. Pages =
 559 ============
 560
 561 A cl_page represents a portion of a file, cached in the memory. All pages of
 562 the given file are of the same size, and are kept in the radix tree of kernel
 563 VM.
 564
 565 A cl_page is associated with a VM page of the hosting environment (struct page
 566 in the Linux kernel, for example), cfs_page_t. It is assumed that this
 567 association is implemented by one of cl_page layers (top layer in the current
 568 design) that
 569
 570 - intercepts per-VM-page call-backs made by the host environment (e.g., memory
 571   pressure),
 572
 573 - translates state (page flag bits) and locking between lustre and the host
 574   environment.
 575
 576 The association between cl_page and cfs_page_t is immutable and established
 577 when cl_page is created. It is possible to imagine a setup where different
 578 pages get their backing VM buffers from different sources. For example, in the
 579 case of pNFS export, some pages might be backed by local DMU buffers, while
 580 others (representing data in remote stripes), by normal VM pages.
 581
 582 Unlike other entities in CLIO, a cl_page has no sub-page.
 583
 584 4.1. Page Indexing
 585 ==================
 586
 587 Pages within a given object are linearly ordered. The page index is stored in
 588 the ->cpl_index field in cl_page_slice. In a typical Lustre setup, a top-object
 589 has an array of sub-objects, and every page in a top-object corresponds to a
 590 page slice in one of its sub-objects.
 591
 592 There is a radix tree for pages on the OSC layer. When an LDLM lock is being
 593 cancelled, the OSC looks up the radix tree so that the pages belonging to
 594 the corresponding lock extent can be destroyed.
 595
 596 4.2. Page Ownership
 597 ===================
 598
 599 A cl_page can be "owned" by a particular cl_io (see below), guaranteeing this
 600 IO an exclusive access to this page with regard to other IO attempts and
 601 various events changing page state (such as transfer completion, or eviction of
 602 the page from memory). Note, that in general a cl_io cannot be identified with
 603 a particular thread, and page ownership is not exactly equal to the current
 604 thread holding a lock on the page. The layer implementing the association
 605 between cl_page and cfs_page_t has to implement ownership on top of available
 606 synchronization mechanisms.
 607
 608 While the Lustre client maintains the notion of page ownership by IO, the
 609 hosting MM/VM usually has its own page concurrency control mechanisms. For
 610 example, in Linux, page access is synchronized by the per-page PG_locked
 611 bit-lock, and generic kernel code (generic_file_*()) takes care to acquire and
 612 release such locks as necessary around the calls to the file system methods
 613 (->readpage(), ->write_begin(), ->write_end(), etc.). This leads to the
 614 situation when there are two different ways to own a page in the client:
 615
 616 - Client code explicitly and voluntarily owns the page (cl_page_own());
 617
 618 - The hosting VM locks a page and then calls the client, which has to "assume"
 619   ownership from the VM (cl_page_assume()).
 620
 621 Dual methods to release ownership are cl_page_disown() and cl_page_unassume().
 622
 623 4.3. Page Transfer Locking
 624 ==========================
 625
 626 cl_page implements a simple locking design: as noted above, a page is protected
 627 by a VM lock while IO owns it. The same lock is kept while the page is in
 628 transfer.  Note that this is different from the standard Linux kernel behavior
 629 where page write-out is protected by a lock (PG_writeback) separate from VM
 630 lock (PG_locked). It is felt that this single-lock design is more portable and,
 631 moreover, Lustre cannot benefit much from a separate write-out lock due to LDLM
 632 locking.
 633
 634 4.4. Page Operations
 635 ====================
 636
 637 See documentation for cl_object.h:cl_page_operations. See cl_page state
 638 descriptions in documentation for cl_object.h:cl_page_state.
 639
 640 4.5. Page Initialization
 641 ========================
 642
 643 cl_page is the most frequently allocated and freed entity in the CLIO stack. To
 644 improve the performance of allocation and freeing, a cl_page, along with the
 645 corresponding cl_page_slice for each layer, is allocated as a single memory
 646 buffer.
 647
 648 CLIO supports different types of object layout, and each layout may lead to a
 649 different cl_page size. When an object is initialized, the object
 650 initialization method ->loo_object_init() of each layer decides the size of the
 651 buffer for its cl_page_slice by calling cl_object_page_init().
 652 cl_object_page_init() adds the size to coh_page_bufsize of the top
 653 cl_object_header, and co_slice_off of the corresponding cl_object remembers the
 654 offset of the page slice for this object.
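
The bookkeeping described above can be sketched as follows. The demo_* names
are hypothetical and the 8-byte rounding is an assumption of this example; the
point is only that each layer reserves room in one shared buffer and remembers
its own offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_ALIGN 8u

struct demo_obj_header {
	size_t doh_page_bufsize; /* total bytes to allocate per page */
};

struct demo_obj_layer {
	size_t dol_slice_off;    /* where this layer's page slice starts */
};

static size_t demo_round_up(size_t n)
{
	return (n + DEMO_ALIGN - 1) & ~(size_t)(DEMO_ALIGN - 1);
}

/* Start with room for the page header itself. */
void demo_header_init(struct demo_obj_header *hdr, size_t page_size)
{
	hdr->doh_page_bufsize = demo_round_up(page_size);
}

/* Called once per layer at object initialization time. */
void demo_object_page_init(struct demo_obj_header *hdr,
			   struct demo_obj_layer *layer, size_t slice_size)
{
	layer->dol_slice_off = hdr->doh_page_bufsize;
	hdr->doh_page_bufsize += demo_round_up(slice_size);
}

/* One allocation covers the page header and every layer's slice. */
void *demo_page_alloc(const struct demo_obj_header *hdr)
{
	assert(hdr->doh_page_bufsize > 0);
	return calloc(1, hdr->doh_page_bufsize);
}

/* Locate a layer's slice inside an allocated page buffer. */
void *demo_page_slice_at(void *page, const struct demo_obj_layer *layer)
{
	return (char *)page + layer->dol_slice_off;
}
```

With a 40-byte page header, a 20-byte slice for one layer and a 16-byte slice
for the next, the slices start at offsets 40 and 64 and a page needs 80 bytes.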
 655
 656 ============
 657 = 5. Locks =
 658 ============
 659
 660 A struct cl_lock represents an extent lock on cached file or stripe data.
 661 cl_lock is used only to collect the lock requirements for completing an IO.
 662 Because of mmap, more than one lock may be required to complete an IO.
 663 Except in the cases of direct IO and lockless IO, a cl_lock is attached to an
 664 LDLM lock on the OSC layer.
 665
 666 As locks protect cached data, and the unit of data caching is a page, locks are
 667 of page granularity.
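
Page granularity means that a byte range has to be widened to whole pages
before it becomes a lock extent. A minimal sketch of that rounding, assuming
4 KiB pages and hypothetical demo_* names:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Lock extent in page indices, both ends inclusive. */
struct demo_extent {
	uint64_t de_start;
	uint64_t de_end;
};

/* Smallest page extent covering `count' bytes starting at byte `pos'. */
struct demo_extent demo_bytes_to_pages(uint64_t pos, uint64_t count)
{
	struct demo_extent ext;

	assert(count > 0);
	ext.de_start = pos >> DEMO_PAGE_SHIFT;
	ext.de_end = (pos + count - 1) >> DEMO_PAGE_SHIFT;
	return ext;
}
```

A two-byte IO at offset 4095 touches the last byte of page 0 and the first
byte of page 1, so its extent is [0, 1].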
 668
 669 5.1. Lock Life Cycle
 670 ====================
 671
 672 The lock requirements are collected in cl_io_lock(). In cl_io_lock(), the
 673 ->cio_lock() method of each layer is invoked to decide the lock extent from
 674 the IO region, layout, and buffers. For example, the VVP layer has to search
 675 the buffers of the IO, and if a buffer belongs to an mmap region of a Lustre
 676 file, a lock for the corresponding file is required.
 677
 678 Once the lock requirements are collected, cl_lock_request() is called to create
 679 and initialize individual locks. In cl_lock_request(), ->clo_enqueue() is called
 680 for each layer. In particular, on the OSC layer, osc_lock_enqueue() is called
 681 to match or create an LDLM lock to fulfill the lock requirement.
 682
 683 cl_lock is not cacheable. The locks are destroyed after the IO is complete.
 684 Lock destruction starts from cl_io_unlock(), where cl_lock_release() is
 685 called for each cl_lock. In cl_lock_release(), ->clo_cancel() methods are called
 686 for each layer to release the resources held by the cl_lock. The most important
 687 resource held by a cl_lock is the LDLM lock on the OSC layer. It is released
 688 by osc_lock_cancel(). LDLM locks can still be cached in memory after being
 689 detached from cl_lock.
 690
 691 5.2. cl_lock and LDLM Lock
 692 ==========================
 693
 694 As the result of enqueue, an LDLM lock is attached to a cl_lock_slice on the OSC
 695 layer. The field ols_dlmlock in the osc_lock points to the LDLM lock.
 696
 697 When an LDLM lock is attached to an osc_lock, its use count (l_readers or
 698 l_writers) is increased, so it cannot be revoked during this time. An LDLM lock
 699 can be shared by multiple osc_locks; in that case, each osc_lock holds the LDLM
 700 lock according to its use type, i.e., increases l_readers or l_writers.
 701
 702 When a cl_lock is cancelled, the corresponding LDLM lock will be released.
 703 Cancellation of cl_lock does not necessarily cause the underlying LDLM lock to
 704 be cancelled. The LDLM lock can stay cached in memory unless it is being
 705 cancelled by the OST.
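
The hold/release accounting can be pictured with two counters. The sketch
below is a simplification with hypothetical names: it models only the use
counts and says nothing about how LDLM actually processes cancellation.

```c
#include <assert.h>

enum demo_lock_use { DEMO_USE_READ, DEMO_USE_WRITE };

/* Stand-in for the use counts of an LDLM lock. */
struct demo_dlm_lock {
	int ddl_readers;
	int ddl_writers;
};

/* An osc_lock-like user attaches to the shared lock. */
void demo_dlm_hold(struct demo_dlm_lock *lock, enum demo_lock_use use)
{
	if (use == DEMO_USE_READ)
		lock->ddl_readers++;
	else
		lock->ddl_writers++;
}

/* The user detaches; the shared lock itself stays cached. */
void demo_dlm_release(struct demo_dlm_lock *lock, enum demo_lock_use use)
{
	int *count = use == DEMO_USE_READ ? &lock->ddl_readers
					  : &lock->ddl_writers;

	assert(*count > 0);
	(*count)--;
}

/* The lock can be given up only when nobody is using it. */
int demo_dlm_is_idle(const struct demo_dlm_lock *lock)
{
	return lock->ddl_readers == 0 && lock->ddl_writers == 0;
}
```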
 706
 707 To cache a page in the client memory, the page index must be covered by at
 708 least one LDLM lock's extent. Please refer to section 5.3 for the details of
 709 pages and locks.
 710
 711 5.3. Use Case: Lock Invalidation
 712 ================================
 713
 714 To demonstrate how objects, pages and lock data-structures interact, let's look
 715 at the example of stripe lock invalidation.
 716
 717 Imagine that on the client C0 there is a file object F, striped over stripes
 718 S0, S1 and S2 (files and stripes are represented by cl_object_header). Further,
 719 C0 has just finished a write IO to file F's offset [a, b], leaving some clean
 720 and dirty pages on C0 together with LDLM locks LS0, LS1 and LS2 for the
 721 stripes S0, S1 and S2 respectively. As described in section 4.1, the cached
 722 pages stay in the radix trees of S0, S1 and S2.
 723
 724 Some other client requests a lock that conflicts with LS1. The OST where S1
 725 lives sends a blocking AST to C0.
 726
 727 C0's LDLM invokes lock->l_blocking_ast(), which is osc_ldlm_blocking_ast(),
 728 which eventually calls osc_cache_writeback_range() with the corresponding LDLM
 729 lock's extent as parameters. To make it possible to find the pages covered by
 730 an LDLM lock, the lock stores a pointer to the osc_object in its l_ast_data.
 731
 732 osc_cache_writeback_range() checks whether there are dirty pages within the
 733 extent of this lock. If that is the case, an OST_WRITE RPC is issued to write
 734 the dirty pages back.
 735
 736 Once all pages are written, they are removed from radix-trees and destroyed.
 737 osc_lock_discard_pages() is called for this purpose. It looks up the radix tree
 738 and then discards every page the extent covers.
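
The two phases of this use case (write back, then discard) can be sketched in
user space as follows. The per-stripe radix tree is replaced by a small array,
and every toy_* name is hypothetical; the functions only mirror the roles that
the text assigns to osc_cache_writeback_range() and osc_lock_discard_pages().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

enum toy_page_state { TOY_PAGE_ABSENT, TOY_PAGE_CLEAN, TOY_PAGE_DIRTY };

/* One slot per page index of a stripe; stands in for the radix tree. */
struct toy_stripe_cache {
	enum toy_page_state pages[TOY_NR_PAGES];
};

/* Phase 1: write back dirty pages in [start, end]; returns pages written. */
static int toy_writeback_range(struct toy_stripe_cache *c,
			       size_t start, size_t end)
{
	int written = 0;

	assert(start <= end);
	for (size_t i = start; i <= end && i < TOY_NR_PAGES; i++) {
		if (c->pages[i] == TOY_PAGE_DIRTY) {
			/* Pretend the write RPC for this page completed. */
			c->pages[i] = TOY_PAGE_CLEAN;
			written++;
		}
	}
	return written;
}

/* Phase 2: drop every cached page in [start, end]; returns pages dropped. */
static int toy_discard_range(struct toy_stripe_cache *c,
			     size_t start, size_t end)
{
	int discarded = 0;

	assert(start <= end);
	for (size_t i = start; i <= end && i < TOY_NR_PAGES; i++) {
		if (c->pages[i] != TOY_PAGE_ABSENT) {
			c->pages[i] = TOY_PAGE_ABSENT;
			discarded++;
		}
	}
	return discarded;
}

/* What a blocking AST for a lock covering [start, end] boils down to. */
static void toy_blocking_ast(struct toy_stripe_cache *c,
			     size_t start, size_t end)
{
	toy_writeback_range(c, start, end);
	toy_discard_range(c, start, end);
}
```

Pages outside the extent of the conflicting lock are untouched, which is why
only the cache of stripe S1 is affected in the scenario above.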
 739
 740 =========
 741 = 6. IO =
 742 =========
 743
 744 An IO context (struct cl_io) is a layered object describing the state of an
 745 ongoing IO operation (such as a system call).
 746
 747 6.1. Fixed IO Types
 748 ===================
 749
 750 There are two classes of IO contexts, represented by cl_io:
 751
 752 - An IO for a specific type of client activity, enumerated by enum cl_io_type:
 753
 754         . CIT_READ: read system call including read(2), readv(2), pread(2),
 755           sendfile(2);
 756         . CIT_WRITE: write system call;
 757         . CIT_SETATTR: truncate and utime system call;
 758         . CIT_FAULT: page fault handling;
 759         . CIT_FSYNC: fsync(2) system call, ->writepages() writeback request;
 760
 761 - A `catch-all' CIT_MISC IO type for all other IO activity:
 762
 763         . cancellation of an extent lock,
 764         . VM induced page write-out,
 765         . glimpse,
 766         . other miscellaneous stuff.
 767
 768 The difference between CIT_MISC and other IO types is that CIT_MISC IO is
 769 merely a context in which pages are owned and locks are enqueued, whereas
 770 other IO types, in addition to being a context, are also state machines.
 771
 772 6.2. IO State Machine
 773 =====================
 774
 775 The idea behind the cl_io state machine is that the initial `work' that has to
 776 be done (e.g., writing a 3MB user buffer into a given file) is done as a
 777 sequence of `iterations', and each iteration is executed by following an
 778 idiomatic sequence of steps:
 779
 780 - Prepare: determine what work is to be done at this iteration;
 781
 782 - Lock: enqueue and acquire all locks necessary to perform this iteration;
 783
 784 - Start: either perform iteration work synchronously, or post it
 785   asynchronously, or both;
 786
 787 - End: wait for the completion of asynchronous work;
 788
 789 - Unlock: release the locks acquired at the "lock" step;
 790
 791 - Finalize: finalize iteration state.
 792
 793 cl_io is a layered entity and each step above is performed by invoking the
 794 corresponding cl_io_operations method on every layer. As will be explained
 795 below, this is especially important in the `prepare' step, as it allows layers
 796 to cooperate in determining the scope of the current iteration.
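
A rough sketch of one iteration is given below. It records the order in which
the per-layer methods would run, using the ordering listed in section 9.2
(the first three steps top-to-bottom, the last three bottom-to-top). The
layers are flattened into a single three-entry stack, which is a
simplification, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_NR_LAYERS 3

enum toy_step {
	TOY_PREPARE, TOY_LOCK, TOY_START,
	TOY_END, TOY_UNLOCK, TOY_FINALIZE,
	TOY_NR_STEPS
};

static const char *const toy_layer_name[TOY_NR_LAYERS] = {
	"vvp", "lov", "osc"	/* index 0 is the top-most layer */
};

static const char *const toy_step_name[TOY_NR_STEPS] = {
	"prepare", "lock", "start", "end", "unlock", "finalize"
};

/* Records "layer:step " entries in call order. */
struct toy_trace {
	char buf[512];
};

static void toy_trace_add(struct toy_trace *t, int layer, int step)
{
	size_t len = strlen(t->buf);

	assert(len < sizeof(t->buf));
	snprintf(t->buf + len, sizeof(t->buf) - len, "%s:%s ",
		 toy_layer_name[layer], toy_step_name[step]);
}

/* Prepare, lock and start descend the stack; the other steps ascend it. */
static int toy_step_is_top_down(int step)
{
	return step <= TOY_START;
}

/* Run one iteration: every step visits every layer. */
static void toy_io_iteration(struct toy_trace *t)
{
	for (int step = TOY_PREPARE; step < TOY_NR_STEPS; step++) {
		for (int i = 0; i < TOY_NR_LAYERS; i++) {
			int layer = toy_step_is_top_down(step) ?
				    i : TOY_NR_LAYERS - 1 - i;

			toy_trace_add(t, layer, step);
		}
	}
}
```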
 797
 798 Before an IO can be started, the client has to make sure that the object's
 799 layout is valid. The client checks whether a valid layout lock is cached in
 800 client-side memory; if not, ll_layout_refresh() has to be called to fetch an
 801 up-to-date layout from the MDT.
 802
 803 For CIT_READ or CIT_WRITE IO, a typical scenario is splitting the original user
 804 buffer into chunks that map completely inside of a single stripe in the target
 805 file, and processing each chunk as a separate iteration. In this case, it is
 806 the LOV layer that (in lov_io_rw_iter_init() function) determines the extent of
 807 the current iteration.
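
The chunking arithmetic can be illustrated with a short sketch. It assumes
that a chunk "inside a single stripe" means a region that does not cross a
multiple of the stripe size; stripe_size and the toy_* functions are
assumptions made for illustration and are not taken from
lov_io_rw_iter_init().

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the current iteration may process without crossing a boundary. */
static uint64_t toy_iter_bytes(uint64_t pos, uint64_t count,
			       uint64_t stripe_size)
{
	uint64_t to_boundary;

	assert(stripe_size > 0);
	to_boundary = stripe_size - pos % stripe_size;
	return count < to_boundary ? count : to_boundary;
}

/* Number of iterations needed for an IO of count bytes starting at pos. */
static unsigned int toy_count_iterations(uint64_t pos, uint64_t count,
					 uint64_t stripe_size)
{
	unsigned int iterations = 0;

	while (count > 0) {
		uint64_t chunk = toy_iter_bytes(pos, count, stripe_size);

		pos += chunk;
		count -= chunk;
		iterations++;
	}
	return iterations;
}
```

With a 1MB stripe size, a 3MB write starting at offset 0 takes three
iterations, while the same write starting at offset 512KB takes four (512KB,
1MB, 1MB, 512KB).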
 808
 809 Once the iteration is prepared, the `lock' step acquires all necessary DLM
 810 locks to cover the region of a file that is affected by the current iteration.
 811 The `start' step does the actual processing, which for write means placing
 812 pages from the user buffer into the cache, and for read means fetching pages
 813 from the server, including read-ahead pages (see `immediate transfer' below).
 814 Truncate and page fault are executed in one iteration (currently, that is; it
 815 would be easy to change the truncate implementation to, for instance, truncate
 816 each stripe in a separate iteration, should the need arise).
 817
 818 6.3. Parallel IO
 819 ================
 820
 821 One important planned generalization of this model is an out of order execution
 822 of iterations.
 823
 824 A motivating example for this is a write of a large user level buffer,
 825 overlapping with multiple stripes. Typically, a busy Lustre client has its
 826 per-OSC caches for the dirty pages nearly full, which means that the write
 827 often has to block, waiting for the cache to drain. Instead of blocking the
 828 whole IO operation, CIT_WRITE might switch to the next stripe and try to do IO
 829 there.  Without such a `non-blocking' IO, a slow OST or an unfair network
 830 degrades the performance of the whole cluster.
 831
 832 Another example is a legacy single-threaded application running on a multi-core
 833 client machine, where IO throughput is limited by the single thread copying
 834 data between the user buffer and the kernel pages. Multiple concurrent IO
 835 iterations that can be scheduled independently on the available processors
 836 eliminate this bottleneck by copying the data in parallel.
 837
 838 Obviously, parallel IO is not compatible with the usual `sequential IO'
 839 semantics. For example, POSIX read and write have a very simple failure model,
 840 where some initial (possibly empty) segment of a user buffer is processed
 841 successfully, and none of the remaining bytes were read or written. Parallel
 842 IO can fail in much more complex ways.
 843
 844 For now, only sequential iterations are supported.
 845
 846 6.4. Data-flow: From Stack to IO Slice
 847 ======================================
 848
 849 The parallel IO design outlined above implies that an ongoing IO can be
 850 preempted by other IO and later resumed, all potentially in the same thread.
 851 This means that IO state cannot be kept on a stack, as it is customarily done
 852 in UNIX file system drivers. Instead, the layered cl_io is used to store
 853 information about the current iteration and progress within it.  Coincidentally
 854 (almost) this is similar to the way IO requests are used by the Windows driver
 855 stack.
 856
 857 A set of common fields in struct cl_io describe the IO and are shared by all
 858 layers.  Important properties so described include:
 859
 860 - The IO type;
 861
 862 - A file (struct cl_object) against which this IO is executed;
 863
 864 - A position in a file where the read or write is taking place, and a count of
 865   bytes remaining to be processed (for CIT_READ and CIT_WRITE);
 866
 867 - A size to which file is being truncated or expanded (for CIT_SETATTR);
 868
 869 - A list of locks acquired for this IO;
 870
 871 Each layer keeps IO state in its `IO slice', described below, with all slices
 872 chained to the list hanging off of struct cl_io:
 873
 874 - vvp_io is used by the top-most layer of the Linux kernel client.
 875
 876   The most important state in vvp_io is an array of struct iovec, describing
 877   user space buffers from or to which IO is taking place. Note that other
 878   layers in the IO stack have no idea that data actually came from user space.
 879
 880   vvp_io contains kernel specific fields, such as VM information describing a
 881   page fault, or the sendfile target.
 882
 883 - lov_io: IO state private for the LOV layer is kept here. The most important IO
 884   state at the LOV layer is an array of sub-IO's. Each sub-IO is a normal
 885   struct cl_io, representing a part of the IO process for a given iteration.
 886   With current sequential iterations, only one sub-IO is active at a time.
 887
 888 - osc_io: this slice stores IO state private to the OSC layer that exists within
 889   each sub-IO created by LOV.
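
The relationship between the common IO header and the per-layer slices can be
sketched as follows. The structures are deliberately reduced and all names
are hypothetical; in particular, the real code would typically recover the
containing structure with container_of() rather than the first-member cast
used here.

```c
#include <assert.h>
#include <stddef.h>

enum toy_layer { TOY_LAYER_VVP, TOY_LAYER_LOV, TOY_LAYER_OSC };

struct toy_io;

/* Per-layer part of an IO; embedded in each layer's private state. */
struct toy_io_slice {
	enum toy_layer layer;
	struct toy_io *io;		/* back pointer to the common header */
	struct toy_io_slice *next;	/* slice of the next layer down */
};

/* Common header shared by all layers. */
struct toy_io {
	int type;			/* IO type */
	unsigned long long pos;		/* current position in the file */
	unsigned long long count;	/* bytes remaining */
	struct toy_io_slice *head;	/* top-most slice */
	struct toy_io_slice *tail;	/* bottom-most slice */
};

/* Layer-private state; the slice must stay the first member. */
struct toy_vvp_io {
	struct toy_io_slice slice;
	const void *user_buf;
};

struct toy_lov_io {
	struct toy_io_slice slice;
	int nr_subios;
};

/* Chain a layer's slice to the end of the IO's slice list. */
static void toy_io_slice_add(struct toy_io *io, struct toy_io_slice *slice,
			     enum toy_layer layer)
{
	slice->layer = layer;
	slice->io = io;
	slice->next = NULL;
	if (io->tail != NULL)
		io->tail->next = slice;
	else
		io->head = slice;
	io->tail = slice;
}

/* Find the slice that a given layer attached, or NULL. */
static struct toy_io_slice *toy_io_slice_find(const struct toy_io *io,
					      enum toy_layer layer)
{
	for (struct toy_io_slice *s = io->head; s != NULL; s = s->next) {
		if (s->layer == layer)
			return s;
	}
	return NULL;
}
```

A layer that needs its private state looks up its slice and converts it back
to the enclosing structure, so no IO state has to live on the stack.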
 890
 891 =================
 892 = 7. RPC Engine =
 893 =================
 894
 895 7.1. Immediate vs. Opportunistic Transfers
 896 ==========================================
 897
 898 There are two possible modes of transfer initiation on the client:
 899
 900 - Immediate transfer: this is started when a high level IO wants a page or a
 901   collection of pages to be transferred right away. Examples: read-ahead,
 902   a synchronous read in the case of non-page aligned write, page write-out as
 903   part of an extent lock cancellation, page write-out as a part of memory
 904   cleansing. Immediate transfer can be both cl_req_type::CRT_READ and
 905   cl_req_type::CRT_WRITE;
 906
 907 - Opportunistic transfer (cl_req_type::CRT_WRITE only), that happens when IO
 908   wants to transfer a page to the server some time later, when it can be done
 909   efficiently. Example: pages dirtied by the write(2) path. Pages submitted for
 910   an opportunistic transfer are kept in a "staging area".
 911
 912 In any case, a transfer takes place in the form of a cl_req, which is a
 913 representation for a network RPC.
 914
 915 Pages queued for an opportunistic transfer are placed into a staging area
 916 (represented as a set of per-object and per-device queues at the OSC layer)
 917 until it is decided that an efficient RPC can be composed of them. This
 918 decision is made by "a req-formation engine", currently implemented as part of
 919 the OSC layer. Req-formation depends on many factors: the size of the resulting
 920 RPC, RPC alignment, whether or not multi-object RPCs are supported by the
 921 server, max-RPC-in-flight limitations, size of the staging area, etc. CLIO uses
 922 osc_extent to group pages for req-formation. osc_extent are further managed in
 923 a per-object red-black tree for efficient RPC formatting.
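
As a very reduced illustration of req-formation, the sketch below counts how
many RPCs result when staged pages are grouped under only two constraints:
pages in one RPC must be contiguous, and an RPC holds at most max_pages
pages. Both constraints and the toy_count_rpcs() name are assumptions made
for this example; the real engine weighs the additional factors listed above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the RPCs needed for a sorted array of unique page indices when one
 * RPC may carry at most max_pages contiguous pages.
 */
static size_t toy_count_rpcs(const unsigned long *idx, size_t nr,
			     size_t max_pages)
{
	size_t rpcs = 0;
	size_t in_rpc = 0;	/* pages already placed in the current RPC */

	assert(max_pages > 0);
	for (size_t i = 0; i < nr; i++) {
		int contiguous = i > 0 && idx[i] == idx[i - 1] + 1;

		/* Start a new RPC on a gap or when the current one is full. */
		if (in_rpc == 0 || !contiguous || in_rpc == max_pages) {
			rpcs++;
			in_rpc = 0;
		}
		in_rpc++;
	}
	return rpcs;
}
```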
 924
 925 For an immediate transfer the IO submits a cl_page_list which the req-formation
 926 engine slices into cl_req's, possibly adding cached pages to some of the
 927 resulting req's.
 928
 929 Whenever a page from cl_page_list is added to a newly constructed req, its
 930 cl_page_operations::cpo_prep() layer methods are called. At that moment, the
 931 page state is atomically changed from cl_page_state::CPS_OWNED to
 932 cl_page_state::CPS_PAGEOUT or cl_page_state::CPS_PAGEIN, cl_page::cp_owner is
 933 zeroed, and cl_page::cp_req is set to the req. cl_page_operations::cpo_prep()
 934 method at a particular layer might return -EALREADY to indicate that it does
 935 not need to submit this page at all. This is possible, for example, if a page
 936 submitted for read became up-to-date in the meantime; and for write, if the
 937 page does not have the dirty bit set. See cl_io_submit_rw() for details.
 938
 939 Whenever a staged page is added to a newly constructed req, its
 940 cl_page_operations::cpo_make_ready() layer methods are called. At that moment,
 941 the page state is atomically changed from cl_page_state::CPS_CACHED to
 942 cl_page_state::CPS_PAGEOUT, and cl_page::cp_req is set to req. The
 943 cl_page_operations::cpo_make_ready() method at a particular layer might return
 944 -EAGAIN to indicate that this page is not currently eligible for the transfer.
 945
 946 The RPC engine guarantees that once the ->cpo_prep() or ->cpo_make_ready()
 947 method has been called, the page completion routine (->cpo_completion() layer
 948 method) will eventually be called (either as a result of successful page
 949 transfer completion, or due to timeout).
 950
 951 To summarize, there are two main entry points into transfer sub-system:
 952
 953 - cl_io_submit_rw(): submits a list of pages for immediate transfer;
 954
 955 - cl_io_commit_async(): places a list of pages into staging area for future
 956   opportunistic transfer.
 957
 958 7.2. Page Lists
 959 ===============
 960
 961 To submit a group of pages for immediate transfer, struct cl_2queue is used. It
 962 contains two page lists: qin (input queue) and qout (output queue). Pages are
 963 linked into these queues by cl_page::cp_batch list heads. Qin is populated with
 964 the pages to be submitted to the transfer, and pages that were actually
 965 submitted are placed onto qout. Not all pages from qin might end up on qout due
 966 to
 967
 968 - ->cpo_prep() methods deciding that page should not be transferred, or
 969
 970 - unrecoverable submission error.
 971
 972 Pages not moved to qout remain on qin. It is up to the transfer submitter to
 973 decide when to remove pages from qin and qout. Remaining pages on qin are
 974 usually removed from this list right after (partially unsuccessful) transfer
 975 submission. Pages are usually left on qout until transfer completion. This way
 976 the caller can determine when all pages from the list were transferred.
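
The qin/qout protocol can be sketched in user space as below. Fixed-size
arrays replace the kernel lists, toy_page_prep_read() plays the role of a
->cpo_prep() method for a read, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16

struct toy_page {
	unsigned long index;
	int uptodate;		/* non-zero if the page needs no read */
};

struct toy_page_list {
	struct toy_page *pages[TOY_QUEUE_MAX];
	size_t nr;
};

/* Input queue and output queue, as in the description above. */
struct toy_2queue {
	struct toy_page_list qin;
	struct toy_page_list qout;
};

static void toy_list_add(struct toy_page_list *list, struct toy_page *page)
{
	assert(list->nr < TOY_QUEUE_MAX);
	list->pages[list->nr++] = page;
}

/* Stand-in for ->cpo_prep() on a read: skip pages already up to date. */
static int toy_page_prep_read(const struct toy_page *page)
{
	return page->uptodate ? -EALREADY : 0;
}

/*
 * Submit everything on qin for read. Submitted pages move to qout, the
 * others stay on qin. Returns the number of pages submitted.
 */
static size_t toy_submit_read(struct toy_2queue *queue)
{
	size_t kept = 0;
	size_t submitted = 0;

	for (size_t i = 0; i < queue->qin.nr; i++) {
		struct toy_page *page = queue->qin.pages[i];

		if (toy_page_prep_read(page) == 0) {
			toy_list_add(&queue->qout, page);
			submitted++;
		} else {
			queue->qin.pages[kept++] = page;
		}
	}
	queue->qin.nr = kept;
	return submitted;
}
```

After submission the caller inspects qin to see which pages were not sent,
and keeps qout around until the transfer completes.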
 977
 978 The association between a page and an immediate transfer queue is protected by
 979 cl_page::cl_mutex. This mutex is acquired when a cl_page is added to a
 980 cl_page_list and released when the page is removed from the list.
 981
 982 When an RPC is formed, all of its constituent pages are linked together through
 983 cl_page::cp_flight list hanging off of cl_req::crq_pages. Pages are removed
 984 from this list just before the transfer completion method is invoked. No
 985 special lock protects this list, as pages in transfer are under a VM lock.
 986
 987 7.3. Transfer States: Prepare, Completion
 988 =========================================
 989
 990 The transfer (cl_req) state machine is trivial, and it is not explicitly coded.
 991 A newly created transfer is in the "prepare" state while pages are collected.
 992 When all pages are gathered, the transfer enters the "in-flight" state where it
 993 remains until it reaches the "completion" state where page completion handlers
 994 are invoked.
 995
 996 The per-layer ->cro_prep() transfer method is called when transfer preparation
 997 is completed and transfer is about to enter the in-flight state. Similarly, the
 998 per-layer ->cro_completion() method is called when the transfer completes
 999 before per-page completion methods are called.
1000
1001 Additionally, before moving a transfer out of the prepare state, the RPC engine
1002 calls the cl_req_attr_set() function.  This function invokes ->cro_attr_set()
1003 methods on every layer to fill in the RPC header that the server uses to
1004 determine where to get or put data. This replaces the old
1005 ->ap_{update,fill}_obdo() methods.
1006
1007 Further, cl_req's are not reference counted and access to them is not
1008 synchronized. This is because they are accessed only by the RPC engine in OSC
1009 which fully controls RPC life-time, and it uses an internal OSC lock
1010 (client_obd::cl_loi_list_lock spin-lock) for serialization.
1011
1012 7.4. Page Completion Handlers, Synchronous Transfer
1013 ===================================================
1014
1015 When a transfer completes, cl_req completion methods are called on every layer.
1016 Then, for every transfer page, per-layer page completion methods
1017 ->cpo_completion() are invoked. The page is still under the VM lock at this
1018 moment.  Completion methods are called bottom-to-top and it is the
1019 responsibility of the last of them (i.e., the completion method of the top-most
1020 layer, VVP) to release the VM lock.
1021
1022 Both immediate and opportunistic transfers are asynchronous in the sense that
1023 control can return to the caller before the transfer completes. CLIO doesn't
1024 provide a synchronous transfer interface at all and it is up to a particular
1025 caller to implement it if necessary. The simplest way to wait for the transfer
1026 completion is to wait on a page VM lock. This approach is used implicitly by the
1027 Linux kernel. There is a case, though, where one wants to do transfer
1028 completely synchronously without releasing the page VM lock: when
1029 ->prepare_write() method determines that a write goes from a non page-aligned
1030 buffer into a not up-to-date page, a portion of a page has to be fetched from
1031 the server. The VM page lock cannot be used to synchronize transfer completion
1032 in this case, because it is used to mark the page as owned by IO. To handle
1033 this, VVP attaches struct cl_sync_io to struct vvp_page. cl_sync_io contains a
1034 number of pages still in IO and a synchronization primitive (struct completion)
1035 which is signalled when transfer of the last page completes. The VVP page
1036 completion handler (vvp_page_completion_common()) checks for attached
1037 cl_sync_io and if it is there, decreases the number of in-flight pages and
1038 signals completion when that number drops to 0. A similar mechanism is used for
1039 direct-IO.
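
The counting scheme behind cl_sync_io can be sketched with C11 atomics. This
is only an illustration: struct toy_sync_io and its helpers are hypothetical,
and a plain flag stands in for struct completion, so a waiter here would have
to poll instead of sleeping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Anchor shared by all pages of one synchronous transfer. */
struct toy_sync_io {
	atomic_int nr_pages;	/* pages still in flight */
	atomic_bool done;	/* stands in for struct completion */
};

static void toy_sync_io_init(struct toy_sync_io *anchor, int nr_pages)
{
	assert(nr_pages > 0);
	atomic_init(&anchor->nr_pages, nr_pages);
	atomic_init(&anchor->done, false);
}

/* Called once from the completion handler of every page. */
static void toy_sync_io_note(struct toy_sync_io *anchor)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&anchor->nr_pages, 1) == 1)
		atomic_store(&anchor->done, true);
}

/* True once the last page of the transfer has completed. */
static bool toy_sync_io_is_done(struct toy_sync_io *anchor)
{
	return atomic_load(&anchor->done);
}
```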
1040
1041 =============
1042 = 8. lu_env =
1043 =============
1044
1045 8.1. Motivation, Server Environment Usage
1046 =========================================
1047
1048 lu_env and related data-types (struct lu_context and struct lu_context_key)
1049 together implement a memory pre-allocation interface that Lustre uses to
1050 decrease stack consumption without resorting to fully dynamic allocation.
1051
1052 Stack space is severely limited in the Linux kernel. Lustre traditionally
1053 allocated a lot of automatic variables, resulting in spurious stack overflows
1054 that are hard to trigger (they usually need a certain combination of driver
1055 calls and interrupts to happen, making them extremely difficult to reproduce)
1056 and debug (as stack overflow can easily result in corruption of thread-related
1057 data-structures in the kernel memory, confusing the debugger).
1058
1059 The simplest way to handle this is to replace automatic variables with calls
1060 to the generic memory allocator, but
1061
1062 - The generic allocator has scalability problems, and
1063
1064 - Additional code to free allocated memory is needed.
1065
1066 The lu_env interface was originally introduced in the MDS rewrite for Lustre
1067 2.0 and matches the server-side threading model very well. Roughly speaking,
1068 lu_context represents a context in which computation is executed and
1069 lu_context_key is a description of per-context data. In the simplest case
1070 lu_context corresponds to a server thread; then lu_context_key is effectively a
1071 thread-local storage (TLS). For a similar idea see the user-level pthreads
1072 interface pthread_key_create().
1073
1074 More formally, lu_context_key defines a constructor-destructor pair and a tags
1075 bit-mask. When lu_context is initialized (with a given tag bit-mask), a global
1076 array of all registered lu_context_keys is scanned, constructors for all keys
1077 with matching tags are invoked and their return values are stored in
1078 lu_context.
1079
1080 Once lu_context has been initialized, the value of any key allocated for this
1081 context can be retrieved very efficiently by indexing into the per-context
1082 array. The lu_context_key_get() function is used for this.
1083
1084 When a context is finalized, destructors are called for all keys allocated in
1085 this context.
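
The key mechanism described above (registration, tag matching, constructors,
indexed lookup, destructors) can be sketched in user space as follows. All
toy_* names are hypothetical and the sketch ignores locking and module
unloading, which the real lu_context code has to handle.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_KEYS 8

/* Description of one kind of per-context data. */
struct toy_context_key {
	unsigned int tags;		/* contexts this key applies to */
	void *(*init)(void);		/* constructor */
	void (*fini)(void *value);	/* destructor */
	int index;			/* slot, assigned at registration */
};

struct toy_context {
	unsigned int tags;
	void *values[TOY_MAX_KEYS];	/* one slot per registered key */
};

/* Global array of all registered keys. */
static struct toy_context_key *toy_keys[TOY_MAX_KEYS];
static int toy_nr_keys;

static void toy_key_register(struct toy_context_key *key)
{
	assert(toy_nr_keys < TOY_MAX_KEYS);
	key->index = toy_nr_keys;
	toy_keys[toy_nr_keys++] = key;
}

/* Call destructors for every value allocated in this context. */
static void toy_context_fini(struct toy_context *ctx)
{
	for (int i = 0; i < toy_nr_keys; i++) {
		if (ctx->values[i] != NULL) {
			toy_keys[i]->fini(ctx->values[i]);
			ctx->values[i] = NULL;
		}
	}
}

/* Run constructors of all keys whose tags match; 0 on success. */
static int toy_context_init(struct toy_context *ctx, unsigned int tags)
{
	ctx->tags = tags;
	for (int i = 0; i < TOY_MAX_KEYS; i++)
		ctx->values[i] = NULL;
	for (int i = 0; i < toy_nr_keys; i++) {
		if ((toy_keys[i]->tags & tags) == 0)
			continue;
		ctx->values[i] = toy_keys[i]->init();
		if (ctx->values[i] == NULL) {
			toy_context_fini(ctx);
			return -1;
		}
	}
	return 0;
}

/* Constant-time lookup: index the per-context array by the key's slot. */
static void *toy_context_key_get(const struct toy_context *ctx,
				 const struct toy_context_key *key)
{
	return ctx->values[key->index];
}

/* Example key: scratch space that would otherwise live on the stack. */
struct toy_thread_info {
	char scratch[256];
};

static void *toy_info_init(void)
{
	return calloc(1, sizeof(struct toy_thread_info));
}

static void toy_info_fini(void *value)
{
	free(value);
}
```

A context initialized with a given tag mask only pays for the keys whose tags
match, and lookups never search: they index the array directly.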
1086
1087 The typical server usage is to have a lu_context for every server thread,
1088 initialized when the thread is started. To reduce stack consumption by the
1089 code running in this thread, a lu_context_key is registered that allocates in
1090 its constructor a struct containing as fields values otherwise allocated on
1091 the stack. See {mdt,osd,cmm,mdd}_thread_info for examples. Instead of doing
1092
1093         int function(args) {
1094                 /* structure "bar" in module "foo" */
1095                 struct foo_bar bar;
1096                 ...
1097
1098 the code roughly does
1099
1100         struct foo_thread_info {
1101                 struct foo_bar fti_bar;
1102                 ...
1103         };
1104
1105         int function(const struct lu_env *env, args) {
1106                 struct foo_bar *bar;
1107                 ...
1108                 bar = &lu_context_key_get(&env->le_ctx, &foo_thread_key)->fti_
1109
1110 etc.
1111
1112 struct lu_env contains 2 contexts:
1113
1114 - le_ctx: this context is embedded in lu_env. By convention, this context is
1115   used _only_ to avoid allocations on the stack, and it should never be used to
1116   pass parameters between functions or layers. The reason for this restriction
1117   is that using contexts for implicit state sharing leads to code that is
1118   difficult to understand and modify.
1119
1120 - le_ses: this is a pointer to a context shared by all threads handling a given
1121   RPC. The context itself is embedded into struct ptlrpc_request. Currently a
1122   request is always processed by a single thread, but this might change in the
1123   future in a design where a small pool of threads processes RPCs
1124   asynchronously.
1125
1126 Additionally, state kept in env->le_ses context is shared by multiple layers.
1127 For example, remote user credentials are stored there.
1128
1129 8.2. Client Environment Usage
1130 =============================
1131
1132 On a client there is a lu_env associated with every thread executing Lustre
1133 code. Again, it contains &env->le_ctx context used to reduce stack consumption.
1134 env->le_ses is used to share state between all threads handling a given IO.
1135 Again, currently an IO is processed by a single thread. env->le_ses is used to
1136 efficiently allocate cl_io slices ({vvp,lov,osc}_io).
1137
1138 There are three important differences with lu_env usage on the server:
1139
1140 - While on the server there is a fixed pool of threads, any client thread can
1141   execute Lustre code. This makes it impractical to pre-allocate and
1142   pre-initialize lu_context for every thread. Instead, contexts are constructed
1143   on demand and after use returned into a global cache that amortizes creation
1144   cost;
1145
1146 - Client call-chains frequently cross Lustre-VFS and Lustre-VM boundaries.
1147   This means that just passing lu_env as a first parameter to every Lustre
1148   function and method is not enough. To work around this problem, a pointer to
1149   lu_env is stored in a field in the kernel data-structure associated with the
1150   current thread (task_struct::journal_info), from where it is recovered when
1151   Lustre code is re-entered from VFS or VM;
1152
1153 - Sometimes client code is re-entered in a fashion that precludes re-use of the
1154   higher level lu_env. For example, when a read or write incurs a page fault
1155   in the user space buffer memory-mapped from a Lustre file, page fault
1156   handling is a separate IO, independent of the already ongoing system call.
1157   The Lustre page fault handler allocates a new lu_env (by calling
1158   lu_env_get_nested()) in which the nested IO is going on. A similar situation
1159   occurs when client DLM lock LRU shrinking code is invoked in the context of a
1160   system call.
1161
1162 8.3. Sub-environments
1163 =====================
1164
1165 As described above, lu_env (specifically, lu_env->le_ses) is used on a client
1166 to allocate per-IO state, including foo_io data on every layer. This leads to a
1167 complication at the LOV layer, which maintains multiple sub-IOs. As layers
1168 below LOV allocate their IO slices in lu_env->le_ses, LOV has to allocate an
1169 lu_env for every sub-IO and to carefully juggle them when invoking lower layer
1170 methods. The case of a single IO is optimized by re-using the top-environment.
1171
1172 ================
1173 = 9. Use cases =
1174 ================
1175
1176 9.1. Inode Creation
1177 ===================
1178
1179 Lookup ends up calling ll_update_inode() to set up a new inode with a given
1180 meta-data descriptor (obtained from the meta-data path). cl_inode_init() calls
1181 cl_object_find() eventually calling lu_object_find_try() that either finds a
1182 cl_object in the cache or allocates a new one, calling
1183 lu_device_operations::ldo_object_{alloc,init}() methods on every layer top to
1184 bottom. Every layer allocates its private data structure ({vvp,lov}_object) and
1185 links it into an object header (cl_object_header) by calling lu_object_add().
1186 At the VVP layer, vvp_object contains a pointer to the inode. The LOV layer
1187 allocates a lov_object containing an array of pointers to sub-objects that are
1188 found in the cache or allocated by calling cl_object_find() (recursively). These
1189 sub-objects have LOVSUB and OSC layer data.
1190
1191 A top-object and its sub-objects are inserted into a global FID-based hash
1192 table and a global LRU list.
1193
1194 9.2. First IO to a File
1195 =======================
1196
1197 After an object is instantiated as described in the previous use case, the
1198 first IO call against this object has to create DLM locks. The following
1199 operations re-use cached locks (see below).
1200
1201 A read call starts at ll_file_readv() which eventually calls
1202 ll_file_io_generic(). This function calls cl_io_init() to initialize an IO
1203 context, which calls the cl_object_operations::coo_io_init() method on every
1204 layer.  As in the case of object instantiation, these methods allocate
1205 layer-private IO state ({vvp,lov}_io) and add it to the list hanging off of the
1206 IO context header cl_io by calling cl_io_add(). At the VVP layer, vvp_io_init()
1207 handles special cases (like count == 0), updates statistic counters, and in the
1208 case of write it takes a per-inode semaphore to avoid possible deadlock.
1209
1210 At the LOV layer, lov_io_init_raid0() allocates a struct lov_io and stores in
1211 it the original IO parameters (starting offset and byte count). This is needed
1212 because LOV is going to modify these parameters. Sub-IOs are not allocated at
1213 this point---they are lazily instantiated later.
1214
1215 Once the top-IO has been initialized, ll_file_io_generic() enters the main IO
1216 loop cl_io_loop() that drives IO iterations, going through
1217
1218 - cl_io_iter_init() calling cl_io_operations::cio_iter_init() top-to-bottom
1219 - cl_io_lock() calling cl_io_operations::cio_lock() top-to-bottom
1220 - cl_io_start() calling cl_io_operations::cio_start() top-to-bottom
1221 - cl_io_end() calling cl_io_operations::cio_end() bottom-to-top
1222 - cl_io_unlock() calling cl_io_operations::cio_unlock() bottom-to-top
1223 - cl_io_iter_fini() calling cl_io_operations::cio_iter_fini() bottom-to-top
1224 - cl_io_rw_advance() calling cl_io_operations::cio_advance() bottom-to-top
1225
1226 repeatedly until cl_io::ci_continue remains 0 after an iteration. These "IO
1227 iterations" move an IO context through consecutive states (see enum
1228 cl_io_state).  ->cio_iter_init() decides at each layer what part of the
1229 remaining IO is to be done during current iteration. Currently,
1230 lov_io_rw_iter_init() is the only non-trivial implementation of this method. It
1231 does the following:
1232
1233 - Except for the cases of truncate and O_APPEND write, it shrinks the IO extent
1234   recorded in the top-IO (starting offset and bytes count) so that this extent
1235   is fully contained within a single stripe. This avoids "cascading evictions";
1236
1237 - It allocates sub-IOs for all stripes intersecting with the resulting IO range
1238   (which, for a read or a non-append write, means creating a single sub-IO) by
1239   calling cl_io_init() that (as above) creates a cl_io context with lovsub_io
1240   and osc_io layers. The initialized cl_io is primed from the top-IO
1241   (lov_io_sub_inherit()) and cl_io_iter_init() is called against it;
1242
1243 - Finally, all sub-IOs for the current iteration are linked together into a
1244   lov_io::lis_active list.
1245
1246 Now we have a top-IO and its sub-IO in the CIS_IT_STARTED state. cl_io_lock()
1247 collects locks on all layers without actually enqueuing them: vvp_io_rw_lock()
1248 requests a lock on the IO extent (possibly shrunk by LOV, see above) and
1249 optionally on extents of Lustre files that happen to be memory-mapped onto the
1250 user-level buffer used for this IO. In the future layers like SNS might request
1251 additional locks, e.g., to protect parity blocks.
1252
1253 Locks requested by ->cio_lock() methods are added to the cl_lockset embedded
1254 into top cl_io. The lockset contains 2 lock queues: "todo" and "done". Locks
1255 are initially placed in the todo queue. Once locks from all layers have been
1256 collected, they are sorted to avoid deadlocks (cl_io_locks_sort()) and then
1257 enqueued by cl_lockset_lock(). Locks are moved from the todo queue to the done
1258 queue when they are granted.
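
The collect, sort and enqueue sequence can be sketched as below. The sort key
used here (object identifier first, then start offset) is an assumption chosen
for illustration; the text only states that locks are sorted to avoid
deadlocks. All toy_* names are hypothetical, and granting is simulated.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_LOCKS 8

/* What a layer asks for: an extent on some object. */
struct toy_lock_descr {
	unsigned long obj_id;
	unsigned long start;
	unsigned long end;
};

struct toy_lockset {
	struct toy_lock_descr todo[TOY_MAX_LOCKS];
	size_t nr_todo;
	struct toy_lock_descr done[TOY_MAX_LOCKS];
	size_t nr_done;
};

/* Layers call this from their lock-collection method. */
static void toy_lockset_add(struct toy_lockset *set,
			    struct toy_lock_descr descr)
{
	assert(set->nr_todo < TOY_MAX_LOCKS);
	set->todo[set->nr_todo++] = descr;
}

/* One global order for all threads: by object, then by start offset. */
static int toy_lock_cmp(const void *a, const void *b)
{
	const struct toy_lock_descr *x = a;
	const struct toy_lock_descr *y = b;

	if (x->obj_id != y->obj_id)
		return x->obj_id < y->obj_id ? -1 : 1;
	if (x->start != y->start)
		return x->start < y->start ? -1 : 1;
	return 0;
}

/* Sort the todo queue, then "enqueue" each lock and move it to done. */
static void toy_lockset_lock(struct toy_lockset *set)
{
	qsort(set->todo, set->nr_todo, sizeof(set->todo[0]), toy_lock_cmp);
	for (size_t i = 0; i < set->nr_todo; i++) {
		assert(set->nr_done < TOY_MAX_LOCKS);
		/* Pretend the lock was granted immediately. */
		set->done[set->nr_done++] = set->todo[i];
	}
	set->nr_todo = 0;
}
```

Because every IO acquires its locks in the same global order, two IOs that
need overlapping sets of locks cannot each end up holding a lock the other
is waiting for.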
1259
1260 At this stage we have top- and sub-IO in the CIS_LOCKED state with all needed
1261 locks held. cl_io_start() moves cl_io into CIS_IO_GOING mode and calls
1262 ->cio_start() method. In the VVP layer this method invokes some version of
1263 generic_file_{read,write}() function.
1264
1265 In the case of read, generic_file_read() calls for every non-up-to-date page
1266 the a_ops->readpage() method that eventually (after obtaining cl_page
1267 corresponding to the VM page supplied to it) calls cl_io_read_page() which in
1268 turn calls cl_io_operations::cio_read_page().
1269
1270 vvp_io_read_page() populates a queue with the target page and pages from the
1271 read-ahead window. The resulting queue is then submitted for immediate transfer
1272 by calling cl_io_submit_rw(), which ends up calling osc_io_submit_page() for
1273 every not-up-to-date page in the queue.
1274
1275 ->readpage() returns at this point, and the VM waits on a VM page lock, which
1276 is released by the transfer completion handler before copying page data to the
1277 user buffer.
1278
1279 In the case of write, generic_file_write() calls the a_ops->write_begin() and
1280 a_ops->write_end() address space methods that end up calling ll_write_begin()
1281 and ll_write_end() respectively. These functions follow the normal Linux
1282 protocol for write, including a possible synchronous read of a non-overwritten
1283 part of a page (ll_page_sync_io() call in ll_prepare_partial_page()). The pages
1284 are placed on the vui_queue list of vvp_io. In the normal case the pages are
1285 committed after all pages are handled, by calling vvp_io_write_commit(). In
1286 vvp_io_commit_write(), cl_io_commit_async() is called to submit the dirty pages
1287 into the OSC writeback cache, where grant is allocated and the pages are added
1288 to a red-black tree of osc_extent. If there is not enough grant on the client
1289 side, cl_io_commit_async() fails with -EDQUOT and the pages are transferred
1290 immediately by calling ll_page_sync_io().
1291
1292 9.3. Lock-less and No-cache IO
1293 ==============================
1294
1295 IO context has a "locking mode" selected from MAYBE, NEVER or MANDATORY set
1296 (enum cl_io_lock_dmd), that specifies what degree of distributed cache
1297 coherency is assumed by this IO. MANDATORY mode requires all caches accessed by
1298 this IO to be protected by distributed locks. In NEVER mode no distributed
1299 coherency is needed at the expense of not caching the data. This mode is
1300 required for the cases where the client cannot or will not participate in the
1301 cache coherency protocol (e.g., a liblustre client that cannot respond to the
1302 lock blocking call-backs while in the compute phase). In MAYBE mode some of the
1303 caches involved in this IO are used and are globally coherent, and some other
1304 caches are bypassed.
1305
1306 O_APPEND writes and truncates are always executed in MANDATORY mode. All other
1307 calls are executed in NEVER mode by liblustre (see below) and in MAYBE mode by
1308 a normal Linux client.
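
The mode selection rule stated above fits in a few lines. The enum values and
toy_pick_lock_mode() below are hypothetical names used only to restate the
rule; they are not the definitions from enum cl_io_lock_dmd.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_lock_mode { TOY_LOCK_MAYBE, TOY_LOCK_NEVER, TOY_LOCK_MANDATORY };

enum toy_io_kind {
	TOY_IO_READ,
	TOY_IO_WRITE,
	TOY_IO_APPEND_WRITE,
	TOY_IO_TRUNCATE
};

/* Pick the locking mode for an IO, following the rule in the text. */
static enum toy_lock_mode toy_pick_lock_mode(enum toy_io_kind kind,
					     bool is_liblustre)
{
	/* O_APPEND writes and truncates always need distributed locks. */
	if (kind == TOY_IO_APPEND_WRITE || kind == TOY_IO_TRUNCATE)
		return TOY_LOCK_MANDATORY;
	/* Everything else: liblustre never locks, Linux clients may. */
	return is_liblustre ? TOY_LOCK_NEVER : TOY_LOCK_MAYBE;
}
```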
1309
1310 In MAYBE mode every OSC individually decides whether to use DLM. An OST might
1311 return -EUSERS to an enqueue RPC indicating that the stripe in question is
1312 contended and that the client should switch to the lockless IO mode. If this
1313 happens, OSC, instead of using ldlm_lock, creates a special "lockless OSC lock"
1314 that is not backed up by a DLM lock. This lock conflicts with any other lock in
1315 its range and self-cancels when its last user is removed. As a result, when IO
1316 proceeds to the stripe that is in lockless mode, all conflicting extent locks
1317 are cancelled, purging the cache. When IO against this stripe ends, the lock is
1318 cancelled, sending dirty pages (just placed in the cache by IO) back to the
1319 server and invalidating the cache again. "Lockless locks" allow lockless and
1320 no-cache IO mode to be implemented by the same code paths as cached IO.
1321
1322 * * * END * * *