               ****************************************************
               * Overview of the Lustre Object Storage Device API *
               ****************************************************

Original Authors:
=================
Alex    Zhuravlev <alexey.zhuravlev@intel.com>
Andreas Dilger    <andreas.dilger@intel.com>
Johann  Lombardi  <johann.lombardi@intel.com>
Li      Wei       <wei.g.li@intel.com>
Niu     Yawei     <yawei.niu@intel.com>

Last Updated: October 9, 2012

 Copyright (c) 2012, 2013, Intel Corporation.

This file is released under the GPLv2.

Topics
======

I.   Introduction
        1. What OSD API is
        2. What OSD API is Not
        3. Layering
        4. Audience/Goal
II.  Backend Storage Subsystem Requirements
        1. Atomicity of Updates
        2. Object Attributes
                i.  Standard POSIX Attributes
                ii. Extended Attributes
        3. Efficient Index
        4. Commit Callbacks
        5. Space Accounting
III. OSD & LU Infrastructure
        1. Devices
                i.   Device Overview
                ii.  Device Type & Operations
                iii. Device Operations
                iv.  OBD Methods
        2. Objects
                i.   Object Overview
                ii.  Object Lifecycle
                iii. Special Objects
                iv.  Object Operations
        3. Lustre Environment
IV.  Data (DT) API
        1. Data Device
        2. Data Objects
                i.   Common Storage Operations
                ii.  Data Object Operations
                iii. Index Operations
        3. Transactions
                i.   Description
                ii.  Lifetime
                iii. Methods
        4. Locking
                i.   Description
                ii.  Methods
V.   Quota Enforcement
        1. Overview
        2. QSD API
Appendix 1. A brief note on Lustre configuration.
Appendix 2. Sample Code

===================
= I. Introduction =
===================

1. What OSD API is
==================

The OSD API is the interface used to access and modify data that is meant to be
stored persistently. This layer is the interface to the code that bridges
individual file systems, such as ext4 or ZFS, to Lustre.
The API is a generic interface to transactional, journaling file systems, so
many backend file systems can be supported in a Lustre implementation.
Data can be cached within the OSD or backend target and may be destroyed
before it reaches storage, but in general the final target is persistent
storage. This API opens up many possibilities, including the use of
object-storage devices or other new persistent storage technologies.

2. What OSD API is Not
======================

The OSD API should not be used to control in-core-only state (such as LDLM
locking), configuration, etc. The upper layers of the IO/metadata stack should
not be involved with the underlying layout or allocation in the OSD storage.

3. Layering
===========

Lustre is composed of different kernel modules, each implementing a different
layer in the software stack in an object-oriented approach. Generally, each
layer builds (or stacks) upon another, and each object is a child of the
generic LU object class. Hence the term "LU stack" is often used to refer to
this hierarchy of Lustre modules and objects.

Each layer (e.g. mdt/mdd/lod/osp/ofd/osd) defines its own generic items
(lu_object/lu_device), which are gathered in a compound item (lu_site/
lu_object_layer) representing the multi-layered stack. Different classes of
operations can then be implemented by each layer, depending on its nature.

As a result, each OSD is expected to implement:
- the generic LU API used to manage the device stack and objects (see chapter
  III)
- the DT API (most commonly called the OSD API) used to manipulate on-disk
  structures (see chapter IV).

4. Audience/Goal
================

The goal of this document is to provide the reader with the information
necessary to accurately construct a new Object Storage Device (OSD) module
interface layer for Lustre in order to use a new backend file system with
Lustre 2.4 and greater.

==============================================
= II. Backend Storage Subsystem Requirements =
==============================================

The purpose of this section is to gather the requirements for the storage
subsystems below the OSD API.

1. Atomicity of Updates
=======================

The underlying OSD storage must be able to provide some form of atomic commit
for multiple arbitrary updates to OSD storage within a single transaction.
The OSD storage will always know, before the transaction starts, which objects
will be modified and how they will be modified.

If any of the updates associated with a transaction is stored persistently
(i.e. some state in the OSD is modified), then all of the updates in that
transaction must also be stored persistently (Atomic). If the OSD should fail
in some manner that prevents all the updates of a transaction from being
completed, then none of the updates shall be completed (Consistent).
Once the updates have been reported as committed to the caller (i.e. the commit
callbacks have been run), they cannot be rolled back for any reason (Durable).
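
The all-or-nothing rule above can be illustrated with a small user-space
sketch. This is not Lustre code: the toy_* names, the fixed-size "store" and
the shadow-copy commit are assumptions made purely for illustration. Updates
are staged first and become visible only if the whole set is published.

```c
#include <assert.h>

#define TOY_SLOTS	4	/* hypothetical store: four integer slots */
#define TOY_MAX_UPDATES	8

struct toy_store { int slot[TOY_SLOTS]; };
struct toy_update { int idx; int val; };

struct toy_txn {
	struct toy_store *store;
	struct toy_update upd[TOY_MAX_UPDATES];
	int nr;
};

static void toy_txn_begin(struct toy_txn *t, struct toy_store *s)
{
	t->store = s;
	t->nr = 0;
}

/* Stage one update; nothing is visible in the store yet. */
static int toy_txn_add(struct toy_txn *t, int idx, int val)
{
	if (t->nr == TOY_MAX_UPDATES || idx < 0 || idx >= TOY_SLOTS)
		return -1;
	t->upd[t->nr].idx = idx;
	t->upd[t->nr].val = val;
	t->nr++;
	return 0;
}

/*
 * Apply all staged updates to a shadow copy, then publish the copy in one
 * step. A non-zero fail_before_publish simulates a failure in the middle of
 * the commit: the store is left untouched.
 */
static int toy_txn_commit(struct toy_txn *t, int fail_before_publish)
{
	struct toy_store shadow = *t->store;
	int i;

	for (i = 0; i < t->nr; i++)
		shadow.slot[t->upd[i].idx] = t->upd[i].val;
	if (fail_before_publish)
		return -1;
	*t->store = shadow;
	t->nr = 0;
	return 0;
}
```

A real backend achieves the same effect with a journal or copy-on-write
rather than by copying the whole store.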

2. Object Attributes
====================

i. Standard POSIX Attributes
----------------------------
The OSD object should be able to store normal POSIX attributes on each object
as specified by Lustre:
- user ID (32 bits)
- group ID (32 bits)
- object type (16 bits)
- access mode (16 bits)
- metadata change time (96 bits, 64-bit seconds, 32-bit nanoseconds)
- data modification time (96 bits, 64-bit seconds, 32-bit nanoseconds)
- data access time (96 bits, 64-bit seconds, 32-bit nanoseconds)
- creation time (96 bits, 64-bit seconds, 32-bit nanoseconds, optional)
- object size (64 bits)
- link count (32 bits)
- flags (32 bits)
- object version (64 bits)

The OSD object shall not modify these attributes itself.

In addition, it is desirable to track the object allocation size (“blocks”),
which the OSD manages itself. Lustre will query the object allocation size, but
will never modify it. If these attributes are not managed by the OSD natively
as part of the object itself, they can be stored in an extended attribute
associated with the object.

ii. Extended Attributes
-----------------------
The OSD should have an efficient mechanism for storing small extended attributes
with each object. This implies that the extended attributes can be accessed at
the same time as the object (without extra seek/read operations). There is also
a requirement to store larger extended attributes in some cases (over 1kB in
size), but access to such attributes may be slower, in proportion to the
attribute size.

3. Efficient Index
==================

The OSD must provide a mechanism for efficient key=value retrieval, for both
fixed-length and variable-length keys and values. It is expected that an index
may hold tens of millions of keys, and it must be able to do random key lookups
in an efficient manner. It must also provide a mechanism for iterating over all
of the keys in a particular index and returning these to the caller in a
consistent order across multiple calls. It must be able to provide a cookie that
identifies the position in the index at which the iteration currently stands,
and must be able to continue iteration from that position at a later time.
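
As an illustration of cookie-based iteration, here is a minimal user-space
sketch. It is not the OSD index API: the toy_* names are hypothetical, the
index is assumed to be a sorted array with keys greater than zero, and the
cookie is simply the last key returned (0 meaning "start from the beginning").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_rec { uint64_t key; uint64_t val; };

/* Hypothetical index: records are kept sorted by key, keys are >= 1. */
struct toy_index { const struct toy_rec *rec; size_t nr; };

/*
 * Copy up to max records with key > *cookie into out, in key order, and
 * advance *cookie to the last key returned. Returns the number of records
 * copied; 0 means the iteration is finished.
 */
static size_t toy_index_iterate(const struct toy_index *idx, uint64_t *cookie,
				struct toy_rec *out, size_t max)
{
	size_t lo = 0, hi = idx->nr, n = 0;

	/* Binary search for the first record with key > *cookie. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx->rec[mid].key <= *cookie)
			lo = mid + 1;
		else
			hi = mid;
	}
	while (lo < idx->nr && n < max)
		out[n++] = idx->rec[lo++];
	if (n > 0)
		*cookie = out[n - 1].key;
	return n;
}
```

Because the cookie is a key rather than an array position, iteration can
resume correctly even if records were inserted or removed between calls.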

4. Commit Callbacks
===================

The OSD must provide some mechanism to register multiple arbitrary callback
functions for each transaction, and call these functions after the transaction
with which they are associated has committed to persistent storage.
It is not required that they be called immediately at transaction commit time,
but they cannot be delayed for an arbitrarily long time, or other parts of the
system may suffer resource exhaustion. If this mechanism is not implemented by
the underlying storage, then it needs to be provided in some manner by the OSD
implementation itself.
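
When the backend has no such mechanism, one way an OSD could provide it is to
keep a per-transaction list of callbacks and run them once the backend reports
the transaction durable. The sketch below is a user-space illustration only;
the toy_* names are hypothetical and are not the dt_txn_commit_cb interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback descriptor, linked into a per-transaction list. */
struct toy_commit_cb {
	void (*func)(void *data, int err);
	void *data;
	struct toy_commit_cb *next;
};

struct toy_txn_cbs { struct toy_commit_cb *head; };

/* Register one more callback on the transaction. */
static void toy_cb_add(struct toy_txn_cbs *t, struct toy_commit_cb *cb)
{
	cb->next = t->head;
	t->head = cb;
}

/*
 * Run every registered callback exactly once. Meant to be called after the
 * backend has made the transaction durable. Returns the number of callbacks
 * that were run.
 */
static int toy_txn_committed(struct toy_txn_cbs *t, int err)
{
	int n = 0;

	while (t->head != NULL) {
		struct toy_commit_cb *cb = t->head;

		t->head = cb->next;	/* unlink first: func may free cb */
		cb->func(cb->data, err);
		n++;
	}
	return n;
}

/* Example callback: count successful commits. */
static void toy_count_cb(void *data, int err)
{
	if (err == 0)
		(*(int *)data)++;
}
```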

5. Space Accounting
===================

In order to provide quota functionality for the OSD, it must be able to track
the object allocation size against at least two different keys (typically User
ID and Group ID). The actual mechanism of tracking this allocation is internal
to the OSD. Lustre will specify the owners of the object against which to track
this space. Space accounting information will be accessed by Lustre via the
index API on special objects dedicated to space allocation management.

================================
= III. OSD & LU Infrastructure =
================================

As a member of the LU stack, each OSD module is expected to implement the
generic LU API used to manage devices and objects.

1. Devices
==========

i. Device Overview
------------------
Each layer in the stack is represented by a lu_device structure, which holds
very basic data such as a reference counter, a reference to the site (the
in-core collection of Lustre objects, very similar to the inode cache) and a
reference to struct lu_device_type, which in turn describes this specific type
of device (type name, operations, etc).

The OSD device is created and initialized at mount time to let the
configuration component access the data it needs before the whole Lustre stack
is ready. The OSD device is destroyed once all the devices using it have been
destroyed. Usually this happens when the server stack shuts down at unmount
time.

There may be several OSD devices of a given type (say, several zfs-osd and
ldiskfs-osd devices). The type stores the methods common to all OSD instances
of that type (below, they start with the ldto_ prefix). Every instance of an
OSD device can then have its own specific methods (below, they start with the
ldo_ prefix).

To connect devices into a stack, the ->o_connect() method is used (see struct
obd_ops). Currently the OSD should implement this method to track all its
users. To disconnect, the ->o_disconnect() method is used. The OSD should
implement this method, track the remaining users and, once no users are left,
call the class_manual_cleanup() function, which initiates removal of the OSD.

As the stack involves many devices and there may be cross-references between
them, it is easier to break the whole shutdown procedure into two steps and
not impose a specific order in which the different devices shut down: in the
first step the devices release all the resources they use internally (the
so-called pre-cleanup procedure), and in the second step they are actually
destroyed.

ii. Device Type & Operations
----------------------------
The first thing to do when developing a new OSD is to define a lu_device_type
structure to define and register the new OSD type. The following fields of
lu_device_type need to be filled in appropriately:
ldt_tags
        is the type of device, typically data, metadata or client (see
        lu_device_tag). An OSD device is of the data type and should always
        register as such by setting this field to LU_DEVICE_DT.
ldt_name
        is the name associated with the new OSD type.
        See LUSTRE_OSD_{LDISKFS,ZFS}_NAME for reference.
ldt_ops
        is the vector of lu_device_type operations; see below for further
        details.
ldt_ctxt_type
        is the lu_context_tag to be used for operations.
        This should be set to LCT_LOCAL for OSDs.

In the original 2.0 MDS stack the devices were built from the top down and the
OSD was the final device to be set up. This scheme does not work very well when
on-disk data has to be accessed early or when an OSD is shared among several
services (e.g. MDS + MGS on the same storage). So the scheme was reversed: the
mount procedure sets up the correct OSD, then the stack is built from the
bottom up. Instead of introducing another set of methods, we decided to use the
existing obd_connect() and obd_disconnect(), given that many existing devices
have already been configured this way by the configuration component. Notice
also that configuration profiles are organized in this order (LOV/LOD go first,
then MDT). Given that the device “below” is ready at every step, there is no
point in calling a separate init method.

Due to the complexity of other modules, where the device itself can be
referenced by a number of entities such as exports, RPCs, transactions,
callbacks and procfs accesses, the notion of precleanup was introduced to be
able to stop all activity safely before the actual cleanup takes place.
Accordingly, ->ldto_device_fini() and ->ldto_device_free() were introduced:
the former should be used to break any interaction with the outside, the latter
to actually free the device.

So, when the configuration component meets a SETUP command in the configuration
profile (see Appendix 1), it finds the appropriate device and calls
->ldto_device_alloc() to set it up as an LU device.

The prototypes of the device type operations are the following:

struct lu_device *(*ldto_device_alloc)(const struct lu_env *,
                                       struct lu_device_type *,
                                       struct lustre_cfg *);
struct lu_device *(*ldto_device_free)(const struct lu_env *,
                                      struct lu_device *);
int  (*ldto_device_init)(const struct lu_env *, struct lu_device *,
                         const char *, struct lu_device *);
struct lu_device *(*ldto_device_fini)(const struct lu_env *,
                                      struct lu_device *);
int  (*ldto_init)(struct lu_device_type *t);
void (*ldto_fini)(struct lu_device_type *t);
void (*ldto_start)(struct lu_device_type *t);
void (*ldto_stop)(struct lu_device_type *t);

ldto_device_alloc
        The method is called by the configuration component (for a disk file
        system OSD, this is lustre/obdclass/obd_mount.c) to allocate a device.
        Notice that the generic struct lu_device does not hold a pointer to
        private data. Instead, the OSD should embed struct lu_device into its
        own structure (like struct osd_device) and return the address of the
        lu_device in that structure.
ldto_device_fini
        The method is called when the OSD is about to be released. The OSD
        should detach from resources like the disk file system and procfs,
        release objects it holds internally, etc. This is the so-called
        precleanup procedure.
ldto_device_free
        The method is called to actually release the memory allocated in
        ->ldto_device_alloc().
ldto_device_init
        The method is not used by OSD currently.
ldto_init
        The method is called when a specific type of OSD is registered in the
        system. Currently the method is used to register OSD-specific data for
        environments (see Lustre Environment in section III.3).
        See the LU_TYPE_INIT_FINI() macro as an example.
ldto_fini
        The method is called when a specific type of OSD unregisters.
        Currently used to unregister OSD-specific data from environments.
ldto_start
        The method is called when the first device of this type is being
        instantiated. Currently used to fill existing environments with
        OSD-specific data.
ldto_stop
        The method is called when the last instance of a specific OSD has gone.
        Currently used to release OSD-specific data from environments.
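
The embedding pattern described for ->ldto_device_alloc() and
->ldto_device_free() can be sketched in plain user-space C as follows. The
toy_* structures are stand-ins, not the real struct lu_device or struct
osd_device, and the helper mirrors the usual container_of idiom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic device; the real struct lu_device differs. */
struct toy_lu_device { int ld_ref; };

/* Hypothetical OSD-private device that embeds the generic one. */
struct toy_osd_device {
	int od_backend_handle;		/* private, backend-specific state */
	struct toy_lu_device od_lu;	/* embedded generic device */
};

/* Recover the private structure from a pointer to the embedded member. */
static struct toy_osd_device *toy_osd_dev(struct toy_lu_device *d)
{
	return (struct toy_osd_device *)
		((char *)d - offsetof(struct toy_osd_device, od_lu));
}

/* Same shape as ->ldto_device_alloc(): return the embedded generic device. */
static struct toy_lu_device *toy_device_alloc(int backend_handle)
{
	struct toy_osd_device *o = calloc(1, sizeof(*o));

	if (o == NULL)
		return NULL;
	o->od_backend_handle = backend_handle;
	return &o->od_lu;
}

/* Same shape as ->ldto_device_free(): release what alloc allocated. */
static void toy_device_free(struct toy_lu_device *d)
{
	free(toy_osd_dev(d));
}
```

Generic code only ever sees the embedded structure; the OSD converts back to
its private structure whenever one of its methods is invoked.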

iii. Device Operations
----------------------
Now that the OSD device can be set up, we need to export methods to handle
device-level operations. All those methods are listed in the
lu_device_operations structure, which includes:

struct lu_object *(*ldo_object_alloc)(const struct lu_env *,
                                      const struct lu_object_header *,
                                      struct lu_device *);
int (*ldo_process_config)(const struct lu_env *, struct lu_device *,
                          struct lustre_cfg *);
int (*ldo_recovery_complete)(const struct lu_env *, struct lu_device *);
int (*ldo_prepare)(const struct lu_env *, struct lu_device *,
                   struct lu_device *);

ldo_object_alloc
        The method is called when a high-level service wants to access an
        object not found in the local Lustre cache (see struct lu_site).
        The OSD should allocate a structure, initialize the object’s methods
        and return a pointer to the struct lu_object which is embedded into
        the OSD object structure.
ldo_process_config
        The method is called in case of configuration changes. It is mostly
        used by high-level services to update local tunables. It’s also
        possible to let the MGS store OSD tunables and set them properly on
        every server mount or when tunables change at run time.
ldo_recovery_complete
        The method is called when the recovery procedure between a server and
        its clients is completed. This method is mostly used by high-level
        devices (like OSP to clean up OST orphans, MDD to clean up open
        unlinked files left by a missing client, etc).
ldo_prepare
        The method is called when all the devices belonging to the stack are
        configured and set up properly. At this point the server becomes ready
        to handle RPCs and start the recovery procedure.
        In the current implementation the OSD uses this method to initialize
        local quota management.

iv.  OBD Methods
----------------
Although the LU infrastructure aims at replacing the storage operations of the
legacy OBD API (see struct obd_ops in lustre/include/obd.h), the OBD API is
still used in several places for device configuration and on the Lustre client
(e.g. it’s still used on the client for LDLM locking). The OBD API storage
operations are not needed for server components, and should be ignored.

As far as the OSD layer is concerned, upper layers still connect/disconnect
to/from the OSD instance via obd_ops::o_connect/disconnect. As a result, each
OSD should implement those two operations:

int (*o_connect)(const struct lu_env *, struct obd_export **,
                 struct obd_device *, struct obd_uuid *,
                 struct obd_connect_data *, void *);
int (*o_disconnect)(struct obd_export *);

o_connect
        The method should track the number of connections made (i.e. the number
        of active users of this OSD), call class_connect() and return a struct
        obd_export via class_conn2export(); see osd_obd_connect(). The
        structure holds a reference on the device, preventing it from being
        released early.
o_disconnect
        The method is called when someone using this OSD does not need its
        service any more (i.e. at unmount). For every struct obd_export passed
        in, the method should call class_disconnect(export). Once the last
        user has gone, the OSD should call class_manual_cleanup() to schedule
        the device removal.
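
The user tracking expected from o_connect and o_disconnect boils down to a
counter with a "last user gone" action. The sketch below shows only that
bookkeeping, in user-space C with hypothetical toy_* names; it does not call
class_connect(), class_disconnect() or class_manual_cleanup(), and a real OSD
would also need to protect the counter with a lock.

```c
#include <assert.h>

/* Hypothetical per-OSD bookkeeping. */
struct toy_osd {
	int users;		/* number of active connections */
	int cleanup_sched;	/* set once device removal was scheduled */
};

/* Same role as o_connect: account for one more user of the OSD. */
static int toy_connect(struct toy_osd *o)
{
	if (o->cleanup_sched)
		return -1;	/* the device is already going away */
	o->users++;
	return 0;
}

/* Same role as o_disconnect: drop one user, schedule cleanup on the last. */
static int toy_disconnect(struct toy_osd *o)
{
	if (o->users == 0)
		return -1;	/* unbalanced disconnect */
	if (--o->users == 0)
		o->cleanup_sched = 1;	/* a real OSD schedules removal here */
	return 0;
}
```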

2. Objects
==========

i. Object Overview
------------------
Lustre identifies objects in the underlying OSD storage by a unique 128-bit
File IDentifier (FID) that is specified by Lustre and is the only identifier
that Lustre is aware of for this object. The FID is known to Lustre before any
access to the object is done (even before it is created), using
lu_object_find(). Since Lustre only uses the FID to identify an object, if the
underlying OSD storage cannot directly use the Lustre-specified FID to retrieve
the object at a later time, it must create a table or index object (normally
called the Object Index (OI)) to map Lustre FIDs to an internal object
identifier. Lustre does not need to understand the format or value of the
internal object identifier at any time outside of the OSD.

The FID itself is composed of 3 members:
struct lu_fid {
        __u64   f_seq;
        __u32   f_oid;
        __u32   f_ver;
};

While the OSD itself should typically not interpret the FID, it may be possible
to optimize the OSD performance by understanding the properties of a FID.

The f_seq (sequence) component is allocated in a piecewise (though not
contiguous) manner to different nodes, and each sequence forms a “group” of
related objects. The sequence number may be any value in the range [1, 2^63],
but there are typically not a huge number of sequences in use at one time
(typically less than one million at the maximum). Within a single sequence, it
is likely that tens to thousands (and less commonly millions) of
mostly-sequential f_oid values will be allocated. In order to efficiently map
FIDs into objects, it is desirable to also be able to associate the
OSD-internal index with key-value pairs.
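
To make the FID-to-internal-identifier mapping concrete, here is a toy Object
Index in user-space C. It is an assumption-laden sketch: the toy_* names, the
fixed-size open-addressing table and the hash function are all hypothetical,
and real OSDs keep the OI in a persistent on-disk index instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same three members as struct lu_fid, with fixed-width user-space types. */
struct toy_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

struct toy_oi_entry { struct toy_fid fid; uint64_t local_id; int used; };

#define TOY_OI_SIZE 64
struct toy_oi { struct toy_oi_entry e[TOY_OI_SIZE]; };

static void toy_oi_init(struct toy_oi *oi)
{
	memset(oi, 0, sizeof(*oi));
}

static int toy_fid_eq(const struct toy_fid *a, const struct toy_fid *b)
{
	return a->f_seq == b->f_seq && a->f_oid == b->f_oid &&
	       a->f_ver == b->f_ver;
}

static size_t toy_fid_hash(const struct toy_fid *f)
{
	return (size_t)((f->f_seq * 31 + f->f_oid) % TOY_OI_SIZE);
}

/* Map fid to local_id; fails if the FID is already mapped or no room. */
static int toy_oi_insert(struct toy_oi *oi, const struct toy_fid *f,
			 uint64_t local_id)
{
	size_t i, h = toy_fid_hash(f);

	for (i = 0; i < TOY_OI_SIZE; i++) {
		struct toy_oi_entry *e = &oi->e[(h + i) % TOY_OI_SIZE];

		if (e->used && toy_fid_eq(&e->fid, f))
			return -1;
		if (!e->used) {
			e->fid = *f;
			e->local_id = local_id;
			e->used = 1;
			return 0;
		}
	}
	return -1;
}

/* Look up the internal identifier for fid; returns 0 on success. */
static int toy_oi_lookup(const struct toy_oi *oi, const struct toy_fid *f,
			 uint64_t *local_id)
{
	size_t i, h = toy_fid_hash(f);

	for (i = 0; i < TOY_OI_SIZE; i++) {
		const struct toy_oi_entry *e = &oi->e[(h + i) % TOY_OI_SIZE];

		if (!e->used)
			return -1;	/* end of probe chain: not mapped */
		if (toy_fid_eq(&e->fid, f)) {
			*local_id = e->local_id;
			return 0;
		}
	}
	return -1;
}
```

The caller only ever handles FIDs; the local identifier never leaves the
lookup helper's owner, which matches the rule that Lustre does not interpret
it outside of the OSD.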

Every object is represented by a header (struct lu_object_header) and a
so-called slice on every layer of the stack. Core Lustre code maintains a cache
of objects (the so-called lu-site, see struct lu_site), which is very similar
to the Linux inode cache.

ii. Object Lifecycle
--------------------
An in-core object is created when a high-level service needs it to process an
RPC or to perform some background job like LFSCK. The FID of the object is
supposed to be known before the object is created. The FID can come from an RPC
or from disk. Given the FID, the lu_object_find() function is called; it
searches for the object in the cache (see struct lu_site) and, if the object is
not found, creates it using the ->ldo_object_alloc(), ->loo_object_init() and
->loo_object_start() methods.

Objects are referenced and tracked by Lustre core. If an object is not in use,
it’s put on an LRU list and at some point (subject to the internal caching
policy or memory pressure callbacks from the kernel) Lustre schedules such an
object for removal from the cache. To do so, Lustre core marks the object as
going away and calls ->loo_object_delete() and ->loo_object_free(), iterating
over all the layers involved.
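
The find-or-create and eviction flow can be sketched with a toy object cache.
This is a user-space illustration with hypothetical toy_* names: objects are
keyed by a plain integer instead of a FID, the cache is a simple linked list,
and there is a single layer rather than a stack of slices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_obj { unsigned long id; int ref; struct toy_obj *next; };
struct toy_site { struct toy_obj *head; int nr_alloc; };

/* Return a referenced object for id, creating it on a cache miss. */
static struct toy_obj *toy_obj_find(struct toy_site *s, unsigned long id)
{
	struct toy_obj *o;

	for (o = s->head; o != NULL; o = o->next) {
		if (o->id == id) {
			o->ref++;
			return o;
		}
	}
	o = calloc(1, sizeof(*o));	/* the alloc/init/start steps go here */
	if (o == NULL)
		return NULL;
	o->id = id;
	o->ref = 1;
	o->next = s->head;
	s->head = o;
	s->nr_alloc++;
	return o;
}

/* Drop one reference; the object stays cached until the site is shrunk. */
static void toy_obj_put(struct toy_obj *o)
{
	o->ref--;
}

/* Free every unreferenced object; returns how many were freed. */
static int toy_site_shrink(struct toy_site *s)
{
	struct toy_obj **pp = &s->head;
	int freed = 0;

	while (*pp != NULL) {
		struct toy_obj *o = *pp;

		if (o->ref == 0) {
			*pp = o->next;	/* the delete/free steps go here */
			free(o);
			freed++;
		} else {
			pp = &o->next;
		}
	}
	return freed;
}
```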

iii. Special Objects
--------------------
Lustre uses a set of special objects in the FID_SEQ_LOCAL_FILE sequence.
All these objects are listed in the local_oid enum, which includes:
- OTABLE_IT_OID, which is an index object providing a list of all existing
  objects on this storage. The key is an opaque string and the record is a FID.
  This object is used by high-level components like LFSCK to iterate over
  objects.
- ACCT_USER_OID/ACCT_GROUP_OID, which are used for accessing space accounting
  information for users and groups respectively.
- LAST_RECV_OID, which is the last_rcvd file, used on both the MDT and the
  OST.

iv. Object Operations
---------------------
Object management methods are called by Lustre to manipulate OSD-specific
(private) data associated with a specific object during the lifetime of an
object. All the object operations are described in struct lu_object_operations:

int (*loo_object_init)(const struct lu_env *, struct lu_object *,
                       const struct lu_object_conf *);
int (*loo_object_start)(const struct lu_env *, struct lu_object *);
void (*loo_object_delete)(const struct lu_env *, struct lu_object *);
void (*loo_object_free)(const struct lu_env *, struct lu_object *);
void (*loo_object_release)(const struct lu_env *, struct lu_object *);
int (*loo_object_print)(const struct lu_env *, void *, lu_printer_t,
                        const struct lu_object *);
int (*loo_object_invariant)(const struct lu_object *);

loo_object_init
        This method is called when a new object is being created (see
        lu_object_alloc()). Its purpose is to initialize the object’s
        internals; usually the OSD looks up the object on disk (notice that a
        header storing the FID has already been created by the top device)
        using the Object Index, which maps the FID to a local object id such
        as a dnode. LOC_F_NEW can be passed to the method when the caller
        knows the object is new, so the OSD can skip the OI lookup to improve
        performance. If the object exists, then the LOHA_EXISTS flag in
        loh_flags (struct lu_object_header) is set.
loo_object_start
        The method is called when all the structures and the header are
        initialized. Currently it is used by high-level services as a
        post-init procedure (i.e. to set up their own methods depending on
        the object type, which is brought into the header by the OSD’s
        ->loo_object_init()).
loo_object_delete
        is called to let the OSD release the resources behind an object
        (except the memory allocated for the object), such as the file
        system’s inode. It is separated from ->loo_object_free() so that
        still-existing objects can be iterated over. The main purpose of
        separating ->loo_object_delete() from ->loo_object_free() is to avoid
        recursion during potentially stack-consuming resource release.
loo_object_free
        is called to actually release the memory allocated by
        ->ldo_object_alloc(). If the object contains a struct
        lu_object_header, then it must be freed by call_rcu() or kfree_rcu().
loo_object_release
        is called when the object loses its last user and moves onto the LRU
        list of unused objects. Implementing this method is optional for an
        OSD.
loo_object_print
        is used for debugging purposes; it should output details of an object
        in a human-readable format. Details usually include information like
        the address of the object, the local object number (dnode/inode), the
        type of the object, etc.
loo_object_invariant
        is another optional method for debugging purposes, which is called to
        verify the internal consistency of an object.

3. Lustre Environment
=====================

Many functions and methods take an environment, represented by struct lu_env.
In practice this is Thread Local Storage (TLS): it is bound to a service thread
and used by that thread exclusively, so there is no need to serialize access to
the data stored there.
The original purpose of the environment was to work around the small Linux
kernel stack (4-8K). A component (like a device or a library) can register its
own descriptor (see the LU_KEY_INIT macro), and every new thread will then
populate its environment with the buffers described.

=====================
= IV. Data (DT) API =
=====================

The previous section listed all the methods that have to be provided by an OSD
module in order to fit in the LU stack. In addition to those generic functions,
each layer should implement a different class of operations depending on its
nature. There are currently 3 classes of devices:
- LU_DEVICE_DT: DaTa device (e.g. lod, osp, osd, ofd),
- LU_DEVICE_MD: MetaData device (e.g. mdt, mdd),
- LU_DEVICE_CL: CLient I/O device (e.g. vvp, lov, lovsub, osc).

The purpose of this section is to document the DT API (used for devices and
objects) which has to be implemented by each OSD module. The DT API is most
commonly called the OSD API.

1. Data Device
==============

To access the disk file system, Lustre defines a new device type called
dt_device, which is a sub-class of the generic lu_device. It includes a new
operation vector (namely the dt_device_operations structure) defining all the
actions that can be performed against a dt_device. Here are the operation
prototypes:

int   (*dt_statfs)(const struct lu_env *, struct dt_device *,
                   struct obd_statfs *);
struct thandle *(*dt_trans_create)(const struct lu_env *, struct dt_device *);
int   (*dt_trans_start)(const struct lu_env *, struct dt_device *,
                        struct thandle *th);
int   (*dt_trans_stop)(const struct lu_env *, struct thandle *);
int   (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *);
int   (*dt_root_get)(const struct lu_env *, struct dt_device *,
                     struct lu_fid *);
void  (*dt_conf_get)(const struct lu_env *, const struct dt_device *,
                     struct dt_device_param *);
int   (*dt_sync)(const struct lu_env *, struct dt_device *);
int   (*dt_ro)(const struct lu_env *, struct dt_device *);
int   (*dt_commit_async)(const struct lu_env *, struct dt_device *);

dt_trans_create
dt_trans_start
dt_trans_stop
dt_trans_cb_add
        please refer to IV.3
dt_statfs
        called to report current file system usage information: total, free
        and available blocks/objects.
dt_root_get
        called to get the FID of the root object. Used to follow the rules of
        the backend file system and to keep it in a state where users can
        mount it directly (with ldiskfs/zfs/etc).
dt_sync
        called to flush all completed but not yet written transactions. Should
        block until the flush is completed.
dt_ro
        called to turn the backend into read-only mode.
        Used by the testing infrastructure to simulate recovery cases.
dt_commit_async
        called to notify the OSD/backend that a higher level needs
        transactions to be flushed as soon as possible. Used by the
        Commit-on-Share feature. Should return immediately and not block for
        long.
 603
 604 2. Data Objects
 605 ===============
 606
 607 There are two types of DT objects:
 608 1) regular objects, storing unstructured data (e.g. flat files, OST objects,
 609    llog objects)
 610 2) index objects, storing key=value pairs (e.g. directories, quota indexes,
 611    FLDB)
 612
 613 As a result, there are 3 sets of methods that should be implemented by the OSD
 614 layer:
 615 - core methods used to create/destroy/manipulate attributes of objects
 616 - data methods used to access the object body as a flat address space
 617   (read/write/truncate/punch) for regular objects
 618 - index operations to access index objects as a key-value association
 619
 620 A data object is represented by the dt_object structure which is defined as
 621 a sub-class of lu_object, plus operation vectors for the core, data and index
 622 methods as listed above.
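
The layout described above can be pictured with a small C sketch. It is only an
illustration under stated assumptions: all toy_* names and fields are
hypothetical stand-ins, not the real lu_object/dt_object definitions, and each
of the three method tables is reduced to a single function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real structures carry many more fields. */
struct toy_lu_object { int lo_flags; };

struct toy_dt_object;

struct toy_object_ops { int (*attr_get)(struct toy_dt_object *); };
struct toy_body_ops   { long (*read)(struct toy_dt_object *, char *, long); };
struct toy_index_ops  { int (*lookup)(struct toy_dt_object *, int, int *); };

/* "Sub-class" by embedding the base object plus three method tables. */
struct toy_dt_object {
    struct toy_lu_object         base;
    const struct toy_object_ops *core_ops;   /* create/destroy/attributes */
    const struct toy_body_ops   *body_ops;   /* flat address space access */
    const struct toy_index_ops  *index_ops;  /* key-value access */
};

/* Recover the enclosing dt object from a pointer to its embedded base. */
static struct toy_dt_object *toy_lu2dt(struct toy_lu_object *lo)
{
    assert(lo != NULL);
    return (struct toy_dt_object *)
        ((char *)lo - offsetof(struct toy_dt_object, base));
}

/* In this sketch an object is an index only if it has an index table. */
static int toy_is_index(const struct toy_dt_object *o)
{
    return o->index_ops != NULL;
}
```

Generic code can then hold only a pointer to the embedded base object and call
toy_lu2dt() to reach the method tables when it needs them.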
 623
 624 i. Common Storage Operations
 625 ----------------------------
 626 The core methods are defined in dt_object_operations as follows:
 627
 628 void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned);
 629 void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned);
 630 void (*do_read_unlock)(const struct lu_env *, struct dt_object *);
 631 void (*do_write_unlock)(const struct lu_env *, struct dt_object *);
 632 int  (*do_write_locked)(const struct lu_env *, struct dt_object *);
 633 int  (*do_attr_get)(const struct lu_env *, struct dt_object *,
 634                      struct lu_attr *);
 635 int  (*do_declare_attr_set)(const struct lu_env *, struct dt_object *,
 636                             const struct lu_attr *, struct thandle *);
 637 int  (*do_attr_set)(const struct lu_env *, struct dt_object *,
 638                     const struct lu_attr *, struct thandle *);
 639 int  (*do_xattr_get)(const struct lu_env *, struct dt_object *,
 640                       struct lu_buf *, const char *);
 641 int  (*do_declare_xattr_set)(const struct lu_env *, struct dt_object *,
 642                              const struct lu_buf *, const char *, int,
 643                              struct thandle *);
 644 int  (*do_xattr_set)(const struct lu_env *, struct dt_object *,
 645                       const struct lu_buf *, const char *, int,
 646                       struct thandle *);
 647 int  (*do_declare_xattr_del)(const struct lu_env *, struct dt_object *,
 648                               const char *, struct thandle *);
 649 int  (*do_xattr_del)(const struct lu_env *, struct dt_object *, const char *,
 650                       struct thandle *);
 651 int  (*do_xattr_list)(const struct lu_env *, struct dt_object *,
 652                        struct lu_buf *);
 653 void (*do_ah_init)(const struct lu_env *, struct dt_allocation_hint *,
 654                     struct dt_object *, struct dt_object *, cfs_umode_t);
 655 int  (*do_declare_create)(const struct lu_env *, struct dt_object *,
 656                            struct lu_attr *, struct dt_allocation_hint *,
 657                            struct dt_object_format *, struct thandle *);
 658 int  (*do_create)(const struct lu_env *, struct dt_object *, struct lu_attr *,
 659                    struct dt_allocation_hint *, struct dt_object_format *,
 660                    struct thandle *);
 661 int  (*do_declare_destroy)(const struct lu_env *, struct dt_object *,
 662                            struct thandle *);
 663 int  (*do_destroy)(const struct lu_env *, struct dt_object *, struct thandle *);
 664 int  (*do_index_try)(const struct lu_env *, struct dt_object *,
 665                      const struct dt_index_features *);
 666 int  (*do_declare_ref_add)(const struct lu_env *, struct dt_object *,
 667                            struct thandle *);
 668 int  (*do_ref_add)(const struct lu_env *, struct dt_object *, struct thandle *);
 669 int  (*do_declare_ref_del)(const struct lu_env *, struct dt_object *,
 670                            struct thandle *);
 671 int  (*do_ref_del)(const struct lu_env *, struct dt_object *, struct thandle *);
 672 int  (*do_object_sync)(const struct lu_env *, struct dt_object *);
 673
 674 do_read_lock
 675 do_write_lock
 676 do_read_unlock
 677 do_write_unlock
 678 do_write_locked
 679         please refer to IV.4
 680 do_attr_get
        called to get the regular attributes an object stores. The lu_attr
        fields map the usual Unix file attributes, like ownership or size.
        The object must exist.
 684 do_declare_attr_set
        called to notify the OSD that the caller is going to modify regular
        attributes of an object in a specified transaction. The OSD should use
        this method to reserve resources needed to change attributes. Can be
        called on a non-existing object.
do_attr_set
        called to change regular attributes of an object in a specified
        transaction. The object must exist.
do_xattr_get
        called when the caller needs to get an extended attribute with a
        specified name. If the struct lu_buf argument has a null lb_buf, the
        size of the extended attribute should be returned. If the requested
        extended attribute does not exist, -ENODATA should be returned.
        The object must exist. If buffer space (specified in lu_buf.lb_len) is
        not enough to fit the value, then return -ERANGE.
do_declare_xattr_set
        called to notify OSD the caller is going to set/change an extended
        attribute on an object. OSD should use this method to reserve resources
        needed to change an attribute.
do_xattr_set
        called when the caller needs to change an extended attribute with a
        specified name. If the fl argument has LU_XATTR_CREATE, the extended
        attribute must not exist, otherwise -EEXIST should be returned.
        If the fl argument has LU_XATTR_REPLACE, the extended attribute must
        exist, otherwise -ENODATA should be returned. The object must exist.
        The maximum size of extended attribute supported by OSD should be
        present in struct dt_device_param the caller can get with
        ->dt_conf_get() method.
 712 do_declare_xattr_del
        called to notify OSD the caller is going to remove an extended attribute
        with a specified name.
do_xattr_del
        called when the caller needs to remove an extended attribute with a
        specified name. The object must exist. Deleting a nonexistent extended
        attribute is allowed: the method returns 0 in that case.
 720 do_xattr_list
 721         called when the caller needs to get a list of existing extended
 722         attributes (only names of attributes are returned). The size of the list
 723         is returned, including the string terminator. If the lu_buf argument has
 724         a null lb_buf, how many bytes the list would require is returned to help
 725         the caller to allocate a buffer of an appropriate size.
 726         The object must exist.
 727 do_ah_init
        called to let the OSD prepare an allocation hint, which stores
        information about object locality and type. Later this allocation hint
        is passed to the ->do_create() method so the OSD can use this
        information to optimize on-disk object location. The allocation hint is
        opaque for the caller and can contain OSD-specific information.
 733 do_declare_create
 734         called to notify OSD the caller is going to create a new object in a
 735         specified transaction.
 736 do_create
 737         called to create an object on the OSD in a specified transaction.
 738         For index objects the caller can request a set of index properties (like
 739         key/value size). If OSD can not support requested properties, it should
 740         return an error. The object shouldn't exist already (i.e.
 741         dt_object_exist() should return false).
 742 do_declare_destroy
 743         called to notify OSD the caller is going to destroy an object in a
 744         specified transaction.
 745 do_destroy
 746         called to destroy an object in a specified transaction. Semantically,
 747         it’s dual to object creation and does not care about on-disk reference
 748         to the object (in contrast with POSIX unlink operation).
 749         The object must exist (i.e. dt_object_exist() must return true).
 750 do_index_try
        called when the caller needs to use an object as an index (the object
        should have been created as an index before). The caller also specifies
        the set of properties it expects the index to support.
 754 do_declare_ref_add
 755         called to notify OSD the caller is going to increment nlink attribute
 756         in a specified transaction.
 757 do_ref_add
 758         called to increment nlink attribute in a specified transaction.
 759         The object must exist.
 760 do_declare_ref_del
 761         called to notify OSD the caller is going to decrement nlink attribute
 762         in a specified transaction.
 763 do_ref_del
 764         called to decrement nlink attribute in a specified transaction.
 765         This is typically done on an object when a record referring to it is
 766         deleted from an index object. The object must exist.
 767 do_object_sync
        called to flush a given object to disk. It’s a fine-grained version of
        the ->dt_sync() method which should make sure an object is stored
        on disk. The OSD (or backend file system) can track the status of every
        object, and if an object is already flushed the method can simply
        return immediately. The method is used on OSS now, but can also be used
        on MDS at some point to improve performance of COS.
do_data_get
        the method is not used any more and is planned for removal.
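
The extended attribute rules above (the size probe with a null lb_buf, -ERANGE
for a short buffer, -ENODATA for a missing attribute, the LU_XATTR_CREATE and
LU_XATTR_REPLACE checks, and deletion of a missing attribute returning 0) can
be sketched with a toy single-attribute store. Everything named toy_* or TOY_*
is hypothetical. The sketch also makes two assumptions that the text does not
state: a successful get returns the number of bytes copied, and an oversized
value is rejected with -E2BIG.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_XATTR_CREATE  0x1   /* stand-in for LU_XATTR_CREATE */
#define TOY_XATTR_REPLACE 0x2   /* stand-in for LU_XATTR_REPLACE */
#define TOY_XATTR_MAX     64    /* hypothetical maximum value size */

struct toy_buf { void *lb_buf; size_t lb_len; };

/* One named attribute slot on one object. */
struct toy_xattr { int present; size_t len; char value[TOY_XATTR_MAX]; };

static int toy_xattr_set(struct toy_xattr *xa, const struct toy_buf *buf,
                         int fl)
{
    assert(xa != NULL && buf != NULL && buf->lb_buf != NULL);
    if ((fl & TOY_XATTR_CREATE) && xa->present)
        return -EEXIST;
    if ((fl & TOY_XATTR_REPLACE) && !xa->present)
        return -ENODATA;
    if (buf->lb_len > TOY_XATTR_MAX)
        return -E2BIG;          /* assumed error code for oversized values */
    memcpy(xa->value, buf->lb_buf, buf->lb_len);
    xa->len = buf->lb_len;
    xa->present = 1;
    return 0;
}

static int toy_xattr_get(const struct toy_xattr *xa, const struct toy_buf *buf)
{
    if (!xa->present)
        return -ENODATA;
    if (buf->lb_buf == NULL)
        return (int)xa->len;    /* size probe: no data is copied */
    if (buf->lb_len < xa->len)
        return -ERANGE;
    memcpy(buf->lb_buf, xa->value, xa->len);
    return (int)xa->len;        /* assumed: bytes copied on success */
}

/* Deleting a missing attribute is not an error. */
static int toy_xattr_del(struct toy_xattr *xa)
{
    xa->present = 0;
    xa->len = 0;
    return 0;
}
```

A caller would typically probe the size first with a null lb_buf, allocate a
buffer of that size, and then call the get method again.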
 776
 777 ii. Data Object Operations
 778 --------------------------
The set of methods described in struct dt_body_operations should be used with
regular objects storing unstructured data:
 781
 782 ssize_t (*dbo_read)(const struct lu_env *, struct dt_object *, struct lu_buf *,
 783                     loff_t *pos);
 784 ssize_t (*dbo_declare_write)(const struct lu_env *, struct dt_object *,
 785                              const loff_t, loff_t, struct thandle *);
ssize_t (*dbo_write)(const struct lu_env *, struct dt_object *,
                     const struct lu_buf *, loff_t *, struct thandle *, int);
 788 int (*dbo_bufs_get)(const struct lu_env *, struct dt_object *, loff_t,
 789                     ssize_t, struct niobuf_local *, int);
 790 int (*dbo_bufs_put)(const struct lu_env *, struct dt_object *,
 791                     struct niobuf_local *, int);
 792 int (*dbo_write_prep)(const struct lu_env *, struct dt_object *,
 793                       struct niobuf_local *, int);
int (*dbo_declare_write_commit)(const struct lu_env *, struct dt_object *,
                                struct niobuf_local *, int, struct thandle *);
 796 int (*dbo_write_commit)(const struct lu_env *, struct dt_object *,
 797                         struct niobuf_local *, int, struct thandle *);
 798 int (*dbo_read_prep)(const struct lu_env *, struct dt_object *,
 799                      struct niobuf_local *, int);
 800 int (*dbo_fiemap_get)(const struct lu_env *, struct dt_object *,
 801                       struct ll_user_fiemap *);
int (*dbo_declare_punch)(const struct lu_env *, struct dt_object *, __u64,
                         __u64, struct thandle *);
int (*dbo_punch)(const struct lu_env *, struct dt_object *, __u64, __u64,
                 struct thandle *);
 806
 807 dbo_read
        is called to read raw unstructured data from a specified range of an
        object. It returns the number of bytes read or an error. Usually the OSD
        implements this method using internal buffering (to be able to put data
        at a non-aligned address), so this method should not be used to move a
        lot of data. Lustre services use it to read small internal data like
        the last_rcvd file and llog files. It's also used to fetch the body of
        symlinks.
 814 dbo_declare_write
 815         is called to notify OSD the caller will be writing data to a specific
 816         range of an object in a specified transaction.
 817 dbo_write
        is called to write raw unstructured data to a specified range of an
        object in a specified transaction. Data should be written atomically
        with the other changes in the transaction. The method is used by Lustre
        services to update small portions on a disk. The OSD should keep the
        size attribute consistent with the data written.
 823 dbo_bufs_get
        is called to fill memory with buffer descriptors (see struct
        niobuf_local) for a specified range of an object. Memory for the set is
        provided by the caller; no concurrent access to this memory is allowed.
        The OSD can fill all fields of the descriptor except lnb_grant_used.
        The caller specifies whether the buffers will be used to read or write
        data. This method is used to access the file system's internal buffers
        for zero-copy IO. Internal buffers referenced by descriptors are
        supposed to be pinned in memory.
 832 dbo_bufs_put
 833         is called to unpin/release internal buffers referenced by the
 834         descriptors dbo_bufs_get returns. After this point pointers in the
 835         descriptors are not valid.
 836 dbo_write_prep
        is called to fill internal buffers with actual data. This is required
        for buffers which do not match the filesystem blocksize, as later the
        buffer is supposed to be written as a whole. For example, ldiskfs uses
        4k blocks, but the caller wants to update just a half of one. To prevent
        data corruption, when this method is called the OSD compares the range
        to be written with the 4k block: if they do not match, the OSD fetches
        the data from disk first. If they do match, then all the data will be
        overwritten and there is no need to fetch data from disk.
 845 dbo_declare_write_commit
        is called to notify OSD the caller is going to write internal buffers
        and OSD needs to reserve enough resources in a transaction.
 848 dbo_write_commit
 849         is called to actually make data in internal buffers part of a specified
 850         transaction. Data is supposed to be written by the moment the
 851         transaction is considered committed. This is slightly different from
 852         generic transaction model because in this case it's allowed to have
 853         data written, but not have transaction committed.
 854         If no dbo_write_commit is called, then dbo_bufs_put should discard
 855         internal buffers and possible changes made to internal buffers should
 856         not be visible.
 857 dbo_read_prep
        is called to fill all internal buffers referenced by descriptors with
        actual data. Buffers may already contain valid data (be cached), so the
        OSD can just verify the data is valid and return immediately.
dbo_fiemap_get
        is called to map a logical range of an object to the physical blocks
        where the corresponding range of data is actually stored.
 864 dbo_declare_punch
        is called to notify OSD the caller is going to punch (deallocate)
        a specified range in a transaction.
dbo_punch
        is called to punch (deallocate) a specified range of data in a
        transaction. This method is allowed to use several disk file system
        transactions (within the same Lustre transaction handle).
        Currently Lustre calls the method only in the form of truncate, where
        the end offset is always EOF.
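
The block-size comparison described for dbo_write_prep can be sketched as a
pure function. This is a hedged illustration, not the code of any real OSD:
toy_needs_read() and its parameters are hypothetical, and refinements such as
skipping the read for blocks beyond the end of the object are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A write covers bytes [off, off + len). Block number blk covers bytes
 * [blk * blksz, (blk + 1) * blksz). The block has to be read from disk
 * before the write only if the write touches it without covering it
 * completely; a fully overwritten block needs no read.
 */
static bool toy_needs_read(uint64_t off, uint64_t len,
                           uint64_t blk, uint64_t blksz)
{
    uint64_t bstart = blk * blksz;
    uint64_t bend = bstart + blksz;
    uint64_t wend = off + len;
    bool touches, covers;

    assert(blksz > 0);
    touches = off < bend && wend > bstart;
    covers = off <= bstart && wend >= bend;
    return touches && !covers;
}
```

With 4k blocks, a 2k write at offset 0 requires block 0 to be read first,
while an aligned 4k write at offset 0 does not.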
 873
 874 iii. Indice Operations
 875 ----------------------
In contrast with raw unstructured data, indices are collections of key=value
pairs. The OSD should provide a few methods to look up, insert, delete and scan
pairs. Indices may have different properties like key/value size, string/binary
keys, etc. When a user needs to use an index, it must first check whether the
index has the required properties with a special method (->do_index_try()).
Indices are used by Lustre services to maintain the user-visible namespace, FLD,
the index of unlinked files, etc.
 882
 883 The method prototypes are defined in dt_index_operations as follows:
 884
 885 int (*dio_lookup)(const struct lu_env *, struct dt_object *, struct dt_rec *,
 886                   const struct dt_key *);
 887 int (*dio_declare_insert)(const struct lu_env *, struct dt_object *,
 888                           const struct dt_rec *, const struct dt_key *,
 889                           struct thandle *);
 890 int (*dio_insert)(const struct lu_env *, struct dt_object *,
 891                   const struct dt_rec *, const struct dt_key *,
 892                   struct thandle *, int);
 893 int (*dio_declare_delete)(const struct lu_env *, struct dt_object *,
 894                           const struct dt_key *, struct thandle *);
 895 int (*dio_delete)(const struct lu_env *, struct dt_object *,
 896                   const struct dt_key *, struct thandle *);
 897
 898 dio_lookup
        is called to look up an exact key=value pair. The value is copied into
        a buffer provided by the caller, so the caller should make sure the
        buffer is big enough. This should be done with the ->do_index_try()
        method.
dio_declare_insert
        is called to notify the OSD the caller is going to insert a key=value
        pair in a transaction. The exact key is specified by the caller so the
        OSD can use it to make the reservation better (i.e. smaller).
dio_insert
        is called to insert a key/value pair into an index object. It's up to
        the OSD whether to allow concurrent inserts or not. The caller is not
        required to serialize access to an index.
dio_declare_delete
        is called to notify the OSD the caller is going to remove a specified
        key in a transaction. The exact key is specified by the caller so the
        OSD can use it to make the reservation better.
dio_delete
        is called to remove a key/value pair specified by the caller.
 917
To iterate over all key=value pairs stored in an index, the OSD should provide
the following set of methods:
 920
 921 struct dt_it *(*init)(const struct lu_env *, struct dt_object *, __u32);
 922 void  (*fini)(const struct lu_env *, struct dt_it *);
 923 int   (*get)(const struct lu_env *, struct dt_it *, const struct dt_key *);
 924 void  (*put)(const struct lu_env *, struct dt_it *);
 925 int   (*next)(const struct lu_env *, struct dt_it *);
 926 struct dt_key *(*key)(const struct lu_env *, const struct dt_it *);
 927 int   (*key_size)(const struct lu_env *, const struct dt_it *);
 928 int   (*rec)(const struct lu_env *, const struct dt_it *, struct dt_rec *,
 929              __u32);
 930 __u64 (*store)(const struct lu_env *, const struct dt_it *);
 931 int   (*load)(const struct lu_env *, const struct dt_it *, __u64);
 932 int   (*key_rec)(const struct lu_env *, const struct dt_it *, void *);
 933
 934 init
        is called to allocate and initialize an instance of "iterator" which
        is passed to the subsequent methods. The structure is not accessed by
        Lustre and its content is totally internal to the OSD. Usually it
        contains a reference to the index and the current position in the
        index. It may contain prefetched key/value pairs. It's not required to
        keep this cache up-to-date: if the index changes, this is not required
        to be reflected by an already initialized iterator. In the extreme case
        ->init() can prefetch all existing pairs to be returned by subsequent
        calls to the iterator.
fini
        is called to release an iterator and all its resources.
        For example, the iterator can unpin an index, free prefetched pairs,
        etc.
get
        is called to move an iterator to a specified key. If the key does not
        exist then it should be the closest position from the beginning of
        iteration.
put
        is called to release an iterator.
next
        is called to move an iterator to the next item.
key
        is called to fill a specified buffer with the key at the current
        position of an iterator. It’s the caller’s responsibility to pass a big
        enough buffer. In turn the OSD should not exceed the sizes negotiated
        with the ->do_index_try() method.
key_size
        is called to learn the size of the key at the current position of an
        iterator.
rec
        is called to fill a specified buffer with the value at the current
        position of an iterator. It’s the caller’s responsibility to pass a big
        enough buffer. In turn the OSD should not exceed the sizes negotiated
        with the ->do_index_try() method.
store
        is called to get a 64-bit cookie for the current position of an
        iterator.
load
        is called to reset the current position of an iterator to match the
        64-bit cookie the ->store() method returns. These two methods make it
        possible to implement functionality like POSIX readdir where the
        current position is stored as an integer.
key_rec
        is not used currently.
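
The way ->store() and ->load() support a readdir-like scan can be sketched over
a toy in-memory index. This is an illustration only: the toy_* types and
functions are hypothetical, the cookie is simply the array position, and the
return convention of toy_it_next() (1 at the end, 0 otherwise) is an assumption
of this sketch, not something the text above specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_pair  { uint32_t key; uint32_t val; };
struct toy_index { const struct toy_pair *pairs; size_t n; };
struct toy_it    { const struct toy_index *idx; size_t pos; }; /* pos == n: end */

static void toy_it_init(struct toy_it *it, const struct toy_index *idx)
{
    it->idx = idx;
    it->pos = 0;
}

/* Advance by one item; returns 1 once the end is reached, 0 otherwise. */
static int toy_it_next(struct toy_it *it)
{
    if (it->pos < it->idx->n)
        it->pos++;
    return it->pos >= it->idx->n;
}

static uint32_t toy_it_key(const struct toy_it *it)
{
    assert(it->pos < it->idx->n);
    return it->idx->pairs[it->pos].key;
}

/* The cookie is just the position in this toy index. */
static uint64_t toy_it_store(const struct toy_it *it)
{
    return (uint64_t)it->pos;
}

static int toy_it_load(struct toy_it *it, uint64_t cookie)
{
    if (cookie > it->idx->n)
        return -1;
    it->pos = (size_t)cookie;
    return 0;
}

/* Return up to max keys starting at cookie; *next gets the resume cookie. */
static size_t toy_scan(const struct toy_index *idx, uint64_t cookie,
                       uint32_t *keys, size_t max, uint64_t *next)
{
    struct toy_it it;
    size_t got = 0;

    toy_it_init(&it, idx);
    *next = cookie;
    if (toy_it_load(&it, cookie) != 0)
        return 0;
    while (got < max && it.pos < idx->n) {
        keys[got++] = toy_it_key(&it);
        toy_it_next(&it);
    }
    *next = toy_it_store(&it);
    return got;
}
```

Each call to toy_scan() creates a fresh iterator, so the only state kept
between calls is the 64-bit cookie, which is the property readdir relies on.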
 975
 976 3. Transactions
 977 ===============
 978
 979 i. Description
 980 --------------
Transactions are used by Lustre to implement the recovery protocol and support
failover. The main purpose of transactions is to atomically update the backend
file system. This includes both regular changes (file creation, for example) and
special Lustre changes (last_rcvd file, lastid, llogs). The OSD is supposed to
provide the transactional mechanism and let Lustre control what specific updates
to put into transactions.
 987
Lustre relies on the following rule for transaction order: if transaction T1
starts before transaction T2 starts, then the commit of T2 means that T1 is
committed at the same time or earlier. Notice that the creation of a transaction
does not imply the immediate start of the updates on storage: do not confuse
the creation of a transaction with the start of a transaction.
 993
It’s up to the OSD and backend file system to group several transactions for
better performance, provided they still follow the rule above.
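
The ordering rule, including grouped commits, can be stated as a small check.
This is a sketch under assumptions: toy_commit_order_ok() is hypothetical, and
each transaction is modelled only by the number of the on-disk commit (batch)
that made it durable, listed in start order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * commit_batch[i] is the number of the commit that made the i-th started
 * transaction durable (index i follows start order). The rule holds when
 * a later-started transaction never commits in an earlier batch. Equal
 * batch numbers model transactions grouped into one commit.
 */
static bool toy_commit_order_ok(const uint64_t *commit_batch, size_t n)
{
    size_t i;

    assert(commit_batch != NULL || n == 0);
    for (i = 1; i < n; i++)
        if (commit_batch[i] < commit_batch[i - 1])
            return false;
    return true;
}
```

For example, batches {1, 1, 2, 3} satisfy the rule (the first two transactions
were grouped), while {1, 3, 2} does not.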
 996
 997 Transactions are identified in the OSD API by an opaque transaction handle,
 998 which is a pointer to an OSD-private data structure that it can use to track
 999 (and optionally verify) the updates done within that transaction. This handle is
1000 returned by the OSD to the caller when the transaction is first created.
1001 Any potential updates (modifications to the underlying storage) must be declared
1002 as part of a transaction, after the transaction has been created, and before the
1003 transaction is started. The transaction handle is passed when declaring all
1004 updates. If any part of the declaration should fail, the transaction is aborted
1005 without having modified the storage.
1006
1007 After all updates have been declared, and have completed successfully, the
1008 handle is passed to the transaction start. After the transaction has started,
1009 the handle will be passed to every update that is done as part of that
1010 transaction. All updates done under the transaction must previously have been
1011 declared. Once the transaction has started, it is not permitted to add new
1012 updates to the transaction, nor is it possible to roll back the transaction
1013 after this point. Should some update to the storage fail, the caller will try
1014 to undo the previous updates within the context of the transaction itself, to
1015 ensure that the resulting OSD state is correct.
1016
1017 Any update that was not previously declared is an implementation error in the
1018 caller. Not all declared updates need to be executed, as they form a worst-case
1019 superset of the possible updates that may be required in order to complete the
1020 desired operation in a consistent manner.
1021
OSD should let a caller register callback function(s) to be called on
transaction commit to disk. Also, OSD should be able to call a special set of
transaction hooks at all the stages (creation, start, stop, commit) on a
per-device basis so that high-level services (like MDT) which are not directly
involved in controlling transactions can still take part.
Every commit callback gets the result of the transaction commit: if the disk
filesystem was not able to commit the transaction, then an appropriate error
code will be passed.
1029
1030 It’s important to note that OSD and disk file system should use asynchronous IO
1031 to implement transactions, otherwise the performance is expected to be bad.
1032
1033 The maximum number of updates that make up a single transaction is OSD-specific,
1034 but is expected to be at least in the tens of updates to multiple objects in the
1035 OSD (extending writes of multiple MB of data, modifying or adding attributes,
1036 extended attributes, references, etc). For example, in ext4, each update to the
1037 filesystem will modify one or more blocks of storage. Since one transaction is
1038 limited to one quarter of the journal size, if the caller declares a series of
1039 updates that modify more than this number of blocks, the declaration must fail
1040 or it could not be committed atomically.
1041 In general, every constraint must be checked here to ensure that all changes
1042 that must commit atomically can complete successfully.
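
The journal-size constraint from the ext4 example can be sketched as a running
reservation check at declaration time. The names below are hypothetical, the
unit is "blocks that may be modified", and -ENOSPC is an assumed error code:
the text above only says that the declaration must fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct toy_txn {
    uint32_t limit;      /* most blocks one transaction may modify */
    uint32_t declared;   /* blocks reserved by declarations so far */
};

/* One transaction is limited to one quarter of the journal. */
static void toy_txn_init(struct toy_txn *t, uint32_t journal_blocks)
{
    t->limit = journal_blocks / 4;
    t->declared = 0;
}

/* Reserve blocks for one declared update, or fail without side effects. */
static int toy_declare_blocks(struct toy_txn *t, uint32_t blocks)
{
    assert(t->declared <= t->limit);
    if (blocks > t->limit - t->declared)
        return -ENOSPC;   /* assumed error code */
    t->declared += blocks;
    return 0;
}
```

Because the check runs while updates are being declared, the failure is
reported before the transaction starts and before any storage is modified.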
1043
1044 ii. Lifetime
1045 ------------
From Lustre's point of view a transaction goes through the following steps:
1047 1. creation
1048 2. declaration of all possible changes planned in transaction
1049 3. transaction start
1050 4. execution of planned and declared changes
1051 5. transaction stop
1052 6. commit callback(s)
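
Steps 1 to 5 can be sketched as a small state machine that rejects calls made
in the wrong phase. This is an illustration under assumptions: the toy_* names
are hypothetical, updates are only counted rather than matched to individual
declarations, -EINVAL is an arbitrary error code, and the commit callbacks of
step 6 are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_state { TOY_CREATED, TOY_STARTED, TOY_STOPPED };

struct toy_thandle {
    enum toy_state state;
    int declared;   /* updates declared before start */
    int executed;   /* updates executed after start */
};

static void toy_trans_create(struct toy_thandle *th)
{
    assert(th != NULL);
    th->state = TOY_CREATED;
    th->declared = 0;
    th->executed = 0;
}

/* Declarations are only accepted between creation and start. */
static int toy_declare(struct toy_thandle *th)
{
    if (th->state != TOY_CREATED)
        return -EINVAL;
    th->declared++;
    return 0;
}

static int toy_trans_start(struct toy_thandle *th)
{
    if (th->state != TOY_CREATED)
        return -EINVAL;
    th->state = TOY_STARTED;
    return 0;
}

/* An update needs a started transaction and an unused declaration. */
static int toy_update(struct toy_thandle *th)
{
    if (th->state != TOY_STARTED)
        return -EINVAL;
    if (th->executed >= th->declared)
        return -EINVAL;   /* undeclared update: a bug in the caller */
    th->executed++;
    return 0;
}

/* Stopping with unused declarations is fine: they are a worst case. */
static int toy_trans_stop(struct toy_thandle *th)
{
    if (th->state != TOY_STARTED)
        return -EINVAL;
    th->state = TOY_STOPPED;
    return 0;
}
```

Note that the sketch lets a transaction stop with fewer executed than declared
updates, since declarations form a superset of what may actually be needed.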
1053
1054 iii. Methods
1055 ------------
1056 OSD should implement the following methods to let Lustre control transactions:
1057
1058 struct thandle *(*dt_trans_create)(const struct lu_env *, struct dt_device *);
1059 int (*dt_trans_start)(const struct lu_env *, struct dt_device *,
1060                       struct thandle *);
1061 int   (*dt_trans_stop)(const struct lu_env *, struct thandle *);
1062 int   (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *);
1063
1064 dt_trans_create
        is called to allocate and initialize a transaction handle (see struct
        thandle). This structure has no pointer to private data, so it should
        be embedded into the private representation of a transaction at the OSD
        layer. This method can block.
dt_trans_start
        is called to notify the OSD that a specified transaction has got all
        the declarations; the OSD should now tell whether it has enough
        resources to proceed with the declared changes or return an error to
        the caller. This method can block. The OSD should call the
        dt_txn_hook_start() function before the underlying file system’s
        transaction starts to support per-device transaction hooks. If the OSD
        (or disk file system) cannot start the transaction, then an error is
        returned and the transaction handle is destroyed; no commit callbacks
        are called.
dt_trans_stop
        is called to notify the OSD that a specified transaction has been
        executed and no more changes are expected in its context. Usually this
        means that at this point the OSD is free to start writeout, preserving
        the all-or-nothing notion. This method can block.
        If the th_sync flag is set at this point, then the OSD should start to
        commit this transaction and block until the transaction is committed.
        The order of the unblock event and the transaction’s commit callback
        functions is not defined by the API. The OSD should call the
        dt_txn_hook_stop() function once the underlying file system’s
        transaction is stopped to support per-device transaction hooks.
dt_trans_cb_add
        is called to register commit callback function(s), which the OSD will
        call upon transaction commit to storage. When all the callback
        functions are processed, the transaction handle can be freed by the OSD.
        There are no constraints on how many callback functions can be running
        concurrently. They should not be running in an interrupt context.
        Usually this method should not block and should use spinlocks. As part
        of commit callback function processing, the dt_txn_hook_commit()
        function should be called to support per-device transaction hooks.
1098
The callback mechanism lets layers that do not command transactions be involved.
For example, MDT registers its set of hooks, and from then on every transaction
happening on the corresponding OSD will be seen by MDT, which adds recovery
information to the transactions: it generates a transaction number and puts it
into a special file. All this happens within the context of the transaction, so
atomically. Similarly, the VBR functionality in MDT updates object versions.
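
How commit callbacks receive the result of a commit can be sketched as follows.
This is a minimal single-threaded illustration with hypothetical toy_* names:
it is not the real struct thandle or struct dt_txn_commit_cb, the fixed-size
callback array is an arbitrary choice, and no locking or per-device hooks are
modelled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CB 4

struct toy_thandle;
typedef void (*toy_commit_cb_t)(struct toy_thandle *th, void *data, int err);

struct toy_cb { toy_commit_cb_t func; void *data; };
struct toy_thandle { struct toy_cb cbs[TOY_MAX_CB]; size_t ncb; };

/* Register one callback; fails when the toy table is full. */
static int toy_trans_cb_add(struct toy_thandle *th, toy_commit_cb_t func,
                            void *data)
{
    assert(func != NULL);
    if (th->ncb == TOY_MAX_CB)
        return -1;
    th->cbs[th->ncb].func = func;
    th->cbs[th->ncb].data = data;
    th->ncb++;
    return 0;
}

/*
 * Invoked by the toy backend when the commit finishes: err is 0 on
 * success or a negative error code if the commit failed. Every
 * registered callback sees the same result.
 */
static void toy_commit(struct toy_thandle *th, int err)
{
    size_t i;

    for (i = 0; i < th->ncb; i++)
        th->cbs[i].func(th, th->cbs[i].data, err);
    th->ncb = 0;
}

/* Example callback: store the commit result where the caller can see it. */
static void toy_record_result(struct toy_thandle *th, void *data, int err)
{
    (void)th;
    *(int *)data = err;
}
```

A caller registers its callbacks while the transaction is open and learns
through err whether the data actually reached storage.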
1105
1106 4. Locking
1107 ==========
1108
1109 i. Description
1110 --------------
OSD is expected to maintain internal consistency of the file system and its
objects on its own, requiring no additional locking or serialization from higher
levels. This lets the OSD control how fine the locking is depending on the
internal structuring of a specific file system. If several updates conflict,
the result is not defined by the OSD API and is left to the OSD.
1116
OSD should provide the caller with a few methods to serialize access to an
object in shared and exclusive mode. It’s up to the caller how to use them and
to define the order of locking. In general the locks provided by OSD are used to
group complex updates so that other threads do not see intermediate results of
operations.
1121
1122 ii. Methods
1123 -----------
The set of methods exported by each OSD to lock/unlock an object is the
following:

1126 void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned);
1127 void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned);
1128 void (*do_read_unlock)(const struct lu_env *, struct dt_object *);
1129 void (*do_write_unlock)(const struct lu_env *, struct dt_object *);
1130 int  (*do_write_locked)(const struct lu_env *, struct dt_object *);
1131
1132 do_read_lock
1133         get a shared lock on the object, this is a blocking lock.
1134 do_write_lock
1135         get an exclusive lock on the object, this is a blocking lock.
1136 do_read_unlock
        release a shared lock on the object.
do_write_unlock
        release an exclusive lock on the object.
1140 do_write_locked
1141         check whether an object is exclusive-locked.
1142
1143 It is highly desirable that an OSD object can be accessed and modified by
1144 multiple threads concurrently.
1145
1146 For regular objects, the preferred implementation allows an object to be read
1147 concurrently at overlapping offsets, and written by multiple threads at
1148 non-overlapping offsets with the minimum amount of contention possible, or any
1149 combination of concurrent read/write operations. Lustre will not itself perform
1150 concurrent overlapping writes to a single region of the object, due to
1151 serialization at a higher level.
1152
For index objects, the preferred implementation allows key/value pairs to be
looked up concurrently, allows non-conflicting keys to be inserted or removed
concurrently, or any combination of concurrent lookup, insertion, or removal.
1156 Lustre does not require the storage of multiple identical keys. Operations on
1157 the same key should be serialized.
1158
1159 ========================
1160 = V. Quota Enforcement =
1161 ========================
1162
1163 1. Overview
1164 ===========
1165
The OSD layer is in charge of setting up a Quota Slave Device (aka QSD) to
manage quota enforcement for a specific OSD device. The QSD is implemented in
the form of a library. Each OSD device should create a QSD instance which will
be used to manage quota enforcement for this device. This implies:
- completing the reintegration procedure with the quota master (aka QMT) to
  retrieve the latest quota settings and quota space distribution for each
  UID/GID.
1173 - managing quota locks in order to be notified of configuration changes.
1174 - acquiring space from the QMT when quota space for a given user/group is
1175   close to exhaustion.
1176 - allocating quota space to service threads for local request processing.
1177
1178 The reintegration procedure allows a disconnected slave to re-synchronize with
1179 the quota master, which means:
1180 - re-acquiring quota locks,
1181 - fetching up-to-date quota settings (e.g. list of UIDs with quota enforced),
1182 - reporting space usage to master for newly (e.g. setquota was run while the
1183   slave wasn't connected) enforced UID/GID,
1184 - adjusting spare quota space (e.g. slave hold a large amount of unused quota
1185   space for a user which ran out of quota space on the master while the slave
1186   was disconnected).
1187
1188 The latter two actions are known as reconciliation.
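
The two reconciliation actions can be pictured with a toy model. This is a
sketch only: the structures, the function name and above all the policy of
handing back every unused byte are assumptions made for illustration and are
not taken from the QSD implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-ID state kept by a slave; not a Lustre structure. */
struct toy_slave_id {
	unsigned int       id;		/* UID or GID */
	bool               enforced;	/* enforcement state last seen */
	unsigned long long usage;	/* space used locally */
	unsigned long long granted;	/* quota space held by the slave */
};

/* Hypothetical outcome of reconciling one ID with the master. */
struct toy_reconcile {
	bool               report_usage;	/* usage must be sent to master */
	unsigned long long release;		/* spare space handed back */
};

struct toy_reconcile toy_reconcile_id(struct toy_slave_id *s,
				      bool master_enforced)
{
	struct toy_reconcile r = { false, 0 };

	/* enforcement was turned on while the slave was disconnected */
	if (master_enforced && !s->enforced)
		r.report_usage = true;
	s->enforced = master_enforced;

	/* assumed policy: give back everything that is not in use */
	if (s->enforced && s->granted > s->usage) {
		r.release = s->granted - s->usage;
		s->granted = s->usage;
	}
	return r;
}
```

Running the function a second time for the same ID reports nothing and
releases nothing, which matches the idea that reconciliation brings the slave
back in sync once.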

2. QSD API
==========

The QSD API is defined in lustre/include/lustre_quota.h as follows:

struct qsd_instance *qsd_init(const struct lu_env *, char *, struct dt_device *,
                              struct proc_dir_entry *);
int qsd_prepare(const struct lu_env *, struct qsd_instance *);
int qsd_start(const struct lu_env *, struct qsd_instance *);
void qsd_fini(const struct lu_env *, struct qsd_instance *);
int qsd_op_begin(const struct lu_env *, struct qsd_instance *,
                 struct lquota_trans *, struct lquota_id_info *, int *);
void qsd_op_end(const struct lu_env *, struct qsd_instance *,
                struct lquota_trans *);
void qsd_op_adjust(const struct lu_env *, struct qsd_instance *,
                   union lquota_id *, int);

qsd_init
        The OSD module should first allocate a qsd instance via qsd_init.
        This creates all required structures to manage quota enforcement for
        this target and performs all low-level initialization which does not
        involve any Lustre object. qsd_init should typically be called when
        the OSD is being set up.

qsd_prepare
        This sets up on-disk objects associated with the quota slave feature
        and initiates the quota reintegration procedure if needed.
        qsd_prepare should typically be called when ->ldo_prepare is invoked.

qsd_start
        A qsd instance should be started once recovery is completed (i.e. when
        ->ldo_recovery_complete is called). This notifies the qsd layer that
        quota should now be enforced again via the qsd_op_begin/end
        functions. The last step of the reintegration procedure (namely usage
        reconciliation) is completed during start.

qsd_fini
        Releases a qsd_instance structure allocated with qsd_init.
        This releases all quota slave objects and frees the structures
        associated with the qsd_instance.

qsd_op_begin
        Enforces quota. It must be called in the declaration of each
        operation. qsd_op_end should then be invoked once all operations
        have been completed in order to release/adjust the quota space.
        Running qsd_op_begin before qsd_start isn't fatal and will return
        success. Once qsd_start has been run, qsd_op_begin will block until the
        reintegration procedure is completed.

qsd_op_end
        Performs the post-operation quota processing. This must be called
        after the transaction for the operation has stopped. While
        qsd_op_begin must be invoked each time a new operation is declared,
        qsd_op_end should be called only once for the whole transaction.

qsd_op_adjust
        Triggers pre-acquire/release if necessary. It is only used by the
        ldiskfs OSD so far. When a file is unlinked in ldiskfs, the quota
        accounting isn't updated when the transaction stops. Instead, it is
        updated on the final iput, so qsd_op_adjust() is called then (in
        osd_object_delete()) to trigger quota release if necessary.
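
The call ordering described above can be summarized with a small user-space
mock. It is a sketch, not the QSD library: the mock_* names, the simplified
signatures, the single space counter, and the -EINVAL returned for
out-of-order setup calls are all assumptions made for illustration. The mock
also does not model qsd_op_begin blocking while reintegration is in progress.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mock of the QSD lifecycle; these are not Lustre types. */
enum mock_state { MOCK_INIT, MOCK_PREPARED, MOCK_STARTED };

struct mock_qsd {
	enum mock_state state;
	long limit;	/* space this slave may hand out */
	long reserved;	/* space reserved by in-flight transactions */
};

struct mock_trans {
	long reserved;	/* sum of all declarations in this transaction */
};

void mock_qsd_init(struct mock_qsd *q, long limit)
{
	q->state = MOCK_INIT;
	q->limit = limit;
	q->reserved = 0;
}

int mock_qsd_prepare(struct mock_qsd *q)
{
	if (q->state != MOCK_INIT)
		return -EINVAL;		/* assumed error for wrong order */
	q->state = MOCK_PREPARED;
	return 0;
}

int mock_qsd_start(struct mock_qsd *q)
{
	if (q->state != MOCK_PREPARED)
		return -EINVAL;		/* assumed error for wrong order */
	q->state = MOCK_STARTED;
	return 0;
}

/* Called once per declared operation. */
int mock_qsd_op_begin(struct mock_qsd *q, struct mock_trans *t, long space)
{
	if (q->state != MOCK_STARTED)
		return 0;		/* not enforced yet: succeed */
	if (q->reserved + space > q->limit)
		return -EDQUOT;
	q->reserved += space;
	t->reserved += space;
	return 0;
}

/* Called once per transaction, after the transaction has stopped. */
void mock_qsd_op_end(struct mock_qsd *q, struct mock_trans *t)
{
	q->reserved -= t->reserved;
	t->reserved = 0;
}

/* One transaction that declares two operations of sizes a and b. */
int mock_two_op_transaction(struct mock_qsd *q, long a, long b)
{
	struct mock_trans t = { 0 };
	int rc;

	rc = mock_qsd_op_begin(q, &t, a);
	if (rc == 0)
		rc = mock_qsd_op_begin(q, &t, b);
	/* ... the transaction would start, execute and stop here ... */
	mock_qsd_op_end(q, &t);
	return rc;
}
```

The helper mock_two_op_transaction() shows the asymmetry between the two
calls: one mock_qsd_op_begin() per declared operation, followed by a single
mock_qsd_op_end() for the whole transaction.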

Appendix 1. A brief note on Lustre configuration.
=================================================

In the current versions (1.8, 2.x), the MGS is used to store the configuration
of the servers, the so-called profile. The profile stores the configuration
commands and arguments needed to set up a specific stack. To see exactly what
it looks like, you can fetch the MDT profile with
debugfs -R "dump /CONFIGS/lustre-MDT0000 <tempfile>", then parse it with
llog_reader <tempfile>. Here is a short extract:

#02 (136)attach    0:lustre-MDT0000-mdtlov  1:lov  2:lustre-MDT0000-mdtlov_UUID
#03 (176)lov_setup 0:lustre-MDT0000-mdtlov  1:(struct lov_desc)
                uuid=lustre-MDT0000-mdtlov_UUID  stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
#06 (120)attach    0:lustre-MDT0000  1:mdt  2:lustre-MDT0000_UUID
#07 (112)mount_option 0:  1:lustre-MDT0000  2:lustre-MDT0000-mdtlov
#08 (160)setup     0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f
#23 (080)add_uuid  nid=10.0.2.15@tcp(0x200000a00020f)  0:  1:10.0.2.15@tcp
#24 (144)attach    0:lustre-OST0000-osc-MDT0000  1:osc  2:lustre-MDT0000-mdtlov_UUID
#25 (144)setup     0:lustre-OST0000-osc-MDT0000  1:lustre-OST0000_UUID  2:10.0.2.15@tcp
#26 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov  1:lustre-OST0000_UUID  2:0  3:1
#32 (080)add_uuid  nid=10.0.2.15@tcp(0x200000a00020f)  0:  1:10.0.2.15@tcp
#33 (144)attach    0:lustre-OST0001-osc-MDT0000  1:osc  2:lustre-MDT0000-mdtlov_UUID
#34 (144)setup     0:lustre-OST0001-osc-MDT0000  1:lustre-OST0001_UUID  2:10.0.2.15@tcp
#35 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov  1:lustre-OST0001_UUID  2:1  3:1
#41 (120)param 0:  1:sys.jobid_var=procname_uid  2:procname_uid
#44 (080)set_timeout=20
#48 (112)param 0:lustre-MDT0000-mdtlov  1:lov.stripesize=1048576
#51 (112)param 0:lustre-MDT0000-mdtlov  1:lov.stripecount=-1
#54 (160)param 0:lustre-MDT0000  1:mdt.identity_upcall=/work/lustre/head/lustre-release/lustre/utils/l_getidentity

Every line starts with a specific command (attach, lov_setup, setup, etc.)
that performs a specific configuration action. The arguments follow. Often the
first argument is a device name. For example,
#02 (136)attach    0:lustre-MDT0000-mdtlov  1:lov  2:lustre-MDT0000-mdtlov_UUID

This command sets up the device "lustre-MDT0000-mdtlov" of type "lov" with the
additional argument "lustre-MDT0000-mdtlov_UUID". All these arguments are
packed into Lustre configuration buffers (struct lustre_cfg).

Other commands (like setup and lov_modify_tgts) attach the device into the
stack.
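
To make the record layout concrete, here is a minimal sketch that splits one
line of the text extract above into its record number, the number in
parentheses, the command and the "N:value" arguments. It works on the printed
form only, not on the binary struct lustre_cfg, and struct cfg_rec,
cfg_rec_parse() and the size limits are hypothetical names chosen for the
example. Tokens that are not of the "N:value" form (such as nid=...) are
ignored, and the meaning of the parenthesized number is not interpreted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC_MAX_ARGS 8
#define REC_STRLEN   128

/* Hypothetical parsed form of one printed record; not a Lustre type. */
struct cfg_rec {
	int  index;				/* "#02"   -> 2 */
	int  length;				/* "(136)" -> 136 */
	char cmd[REC_STRLEN];			/* e.g. "attach" */
	char arg[REC_MAX_ARGS][REC_STRLEN];	/* arg[N]: text after "N:" */
};

/* Returns 0 on success, -1 if the line does not look like a record. */
int cfg_rec_parse(const char *line, struct cfg_rec *rec)
{
	const char *p;
	int pos = 0;

	memset(rec, 0, sizeof(*rec));
	if (sscanf(line, "#%d (%d)%127s%n", &rec->index, &rec->length,
		   rec->cmd, &pos) != 3)
		return -1;

	p = line + pos;
	for (;;) {
		char *end;
		size_t len;
		long n;

		p += strspn(p, " \t\n");	/* skip separators */
		if (*p == '\0')
			break;
		len = strcspn(p, " \t\n");	/* length of this token */
		n = strtol(p, &end, 10);
		/* keep only "N:value" tokens with a usable slot number */
		if (end != p && *end == ':' && n >= 0 && n < REC_MAX_ARGS) {
			size_t vlen = len - (size_t)(end + 1 - p);

			if (vlen >= REC_STRLEN)
				vlen = REC_STRLEN - 1;
			memcpy(rec->arg[n], end + 1, vlen);
			rec->arg[n][vlen] = '\0';
		}
		p += len;
	}
	return 0;
}
```

Applied to record #02 of the extract, the sketch yields the command "attach"
with the device name in arg[0], the device type in arg[1] and the UUID in
arg[2].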

Appendix 2. Sample Code
=======================

Lustre currently has 2 different OSD implementations:
- ldiskfs OSD under lustre/osd-ldiskfs
  http://git.whamcloud.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-ldiskfs;hb=HEAD
- ZFS OSD under lustre/osd-zfs
  http://git.whamcloud.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-zfs;hb=HEAD