LU-1981 doc: reorganize osd API documentation

author Johann Lombardi <johann.lombardi@intel.com>

Tue, 9 Oct 2012 22:32:16 +0000 (00:32 +0200)

committer Oleg Drokin <green@whamcloud.com>

Fri, 12 Oct 2012 03:38:55 +0000 (23:38 -0400)
diff --git a/doc/osd-api.txt b/doc/osd-api.txt

deleted file mode 100644 (file)

index 8a20376..0000000
--- a/doc/osd-api.txt
+++ /dev/null
@@ -1,1059 +0,0 @@
-Overview of the Lustre Object Storage Device API
-
-Original Authors: 
-Johann Lombardi (johann.lombardi@intel.com)\r
-Alex Zhuravlev (alexey.zhuravlev@intel.com)
-Li Wei (wei.g.li@intel.com)
-Andreas Dilger (andreas.dilger@intel.com)
-Niu Yawei (yawei.niu@intel.com)
-
-Last Updated: September 28, 2012\r
-
-Copyright © 2012 Intel, Corp.
-
-This file is released under the GPLv2.
-\r
-\r
-Introduction
-=============\r
-\r
-What OSD API is
----------------\r
-OSD API is the interface used to access and modify data that is meant to be stored persistently. This API layer is the interface to the code that bridges individual file systems, such as ext4 or ZFS, to Lustre. The API is a generic interface to transaction- and journaling-based file systems, so many backend file systems can be supported in a Lustre implementation. Data can be cached within the OSD or backend target and could be destroyed before hitting storage, but in general the final target is persistent storage. This API creates many possibilities, including using object-storage devices or other new persistent storage technologies.
-\r
-What OSD API is Not
--------------------\r
-OSD API should not be used to control in-core-only state (like ldlm locking), configuration, etc.  The upper layers of the IO/metadata stack should not be involved with the underlying layout or allocation in the OSD storage.\r
-\r
-Audience/Goal
--------------\r
-The goal of this document is to provide the reader with the information necessary to accurately construct a new Object Storage Device (OSD) module interface layer for Lustre in order to use a new backend file system with Lustre 2.4 and greater.
-\r
-Guidance for New OSD Implementers
-=================================\r
-\r
-LU Infrastructure Overview\r
---------------------------\r
-Lustre is composed of several kernel modules, each implementing a different layer of the software stack in an object-oriented fashion. Generally, each layer builds (or stacks) upon another, and each object is a child of the generic LU object class. Hence the term "LU stack" is often used to refer to this hierarchy of Lustre modules and objects. Each layer (i.e. mdt/mdd/lod/osp/ofd/osd) defines its own generic items (lu_object/lu_device), which are gathered into compound items (lu_site/lu_object_layer) representing the multi-layered stack.
-Different classes of operations can then be implemented by each layer, depending on its nature. There are currently 3 classes of devices:
-- LU_DEVICE_DT: data device (e.g. lod, osp, osd, ofd), see dt_device_operations\r
-- LU_DEVICE_MD: metadata device (e.g. mdt, mdd), see md_device_operations\r
-- LU_DEVICE_CL: client I/O device (e.g. vvp, lov, lovsub, osc), see cl_device_operations\r
-\r
-\r
-As a member of the LU stack, the OSD should define its own object and device structures as well as the associated methods. It is up to the OSD layer to host the lu_site instance, which is usually defined in the osd_device structure.
-\r
-\r
-The first thing to do when developing a new OSD is to define a lu_device_type structure in order to register the new OSD type. The following fields of lu_device_type need to be filled in appropriately:
-- ldt_tags: is the type of device, typically data, metadata or client (see lu_device_tag). An OSD device is of the data type and should always register as such by setting this field to LU_DEVICE_DT.
-- ldt_name: is the name associated with the new OSD type. See LUSTRE_OSD_{LDISKFS,ZFS}_NAME for reference.\r
-- ldt_ops: is the vector of lu_device_type operations, please see below for further details\r
-- ldt_ctxt_type: is the lu_context_tag to be used for operations. This should be set to LCT_LOCAL for OSDs.\r
-\r
-\r
-In the original 2.0 MDS stack the devices were built from the top down and the OSD was the final device to be set up. This scheme does not work well when on-disk data has to be accessed early, or when an OSD is shared among several services (e.g. MDS + MGS on the same storage). So the scheme was reversed: the mount procedure sets up the correct OSD, then the stack is built from the bottom up. Instead of introducing another set of methods, we decided to use the existing obd_connect() and obd_disconnect(), given that many existing devices are already configured this way by the configuration component. Notice also that configuration profiles are organized in this order (LOV/LOD go first, then MDT). Given that the device "below" is ready at every step, there is no point in calling a separate init method.
-\r
-Because of complexity in other modules, where the device itself can be referenced by a number of entities (exports, RPCs, transactions, callbacks, access via procfs), the notion of precleanup was introduced to be able to stop all activity safely before the actual cleanup takes place. Similarly, ->ldto_device_fini() and ->ldto_device_free() were introduced. The former should be used to break any interaction with the outside, the latter to actually free the device.
-\r
-So, when the configuration component meets a SETUP command in the configuration profile (see Appendix), it finds the appropriate device and calls ->ldto_device_alloc() to set it up as an LU device.
-\r
-The osd_device_type_ops defines methods that will be called in order to create/destroy a new OSD instance:\r
-->ldto_device_alloc(): is called to allocate a new device instance. A pointer to the Lustre configuration buffer is supplied to identify the backend device to be configured. More details about configuration buffers can be found in the Appendix XX.
-
-->ldto_device_init(): is used to perform additional device initialization with the next device in the stack passed as a parameter. Not used on the servers since Orion, see the explanation below.
-
-->ldto_device_fini(): is the companion of ->ldto_device_init() and is used to finalize the device before freeing it.
-
-->ldto_device_free(): is the companion of ->ldto_device_alloc() and is in charge of releasing the osd device. It is called when the last reference to the device is gone.
-\r
-Now that the osd device can be set up, we need to export methods to handle device-level operations. All those methods are documented in the lu_device_operations structure. They include:
-\r
-->ldo_object_alloc(): this is called to allocate an osd_object for the given osd device.  Allocates memory, semaphores etc associated with the osd object.\r
-\r
-->ldo_process_config(): is invoked to process the Lustre configuration log specific to this device. It is usually called by the configuration component of Lustre to notify the device about configuration changes, such as changed tunables.
-\r
-->ldo_start(): is called once all the layers of the stack have been successfully initialized (after the LCFG_SETUP stage) and before serving any client requests. This method is required because the stack is built from a number of devices (i.e. MDT->MDD->LOD->OSD + a number of OSPs). While MDT is the top device, its completeness is not enough, as OSP devices are set up later (see the example in the Appendix). So an additional method is needed to notify the stack when the full configuration is over and the stack is complete.
-\r
-->ldo_recovery_complete(): is used to notify all layers in the stack that recovery is completed and new requests are going to be served. The recovery process is driven by the top service (like MDT). Once MDT recovery is over (the clients have reconnected/replayed their requests, locks are recovered, etc), MDT tells the other layers and then additional processes start. For example, in order to improve the create rate, OSP (OSC in 1.8 and 2.[0123], the pre-Orion era) pre-creates objects on the OST, so that the MDS can consume them in a non-blocking (RPC-free) manner most of the time. But this can lead to leaked objects (so called OST orphans) when the MDS crashes. To prevent this, OSP tracks all objects being used and, once MDT recovery is over, destroys all pre-created but unused OST objects (the so called orphan cleanup procedure). Similarly, MDD tracks all open files and, when MDT recovery is over, finds all unlinked but not-destroyed files and removes them (these are usually the result of missing clients).
-
-Object Lifecycle
-----------------
-\r
-When a user wants to access an object, it calls lu_object_find() with the already known FID. This generic function looks up the object in the site (the collection of objects associated with a specific OSD and the stack above it) and, if the object is not found, calls lu_object_alloc(). The top layer for this object is called first (usually a top service like MDT or OFD), then ->ldo_object_alloc() and ->loo_object_init() are called for each layer in turn to prepare a full representation (see the picture above).
-\r
-Every object being accessed is supposed to be represented by one or more in-core structures in the site, indexed by FID. Given that the FID is known before the actual creation, the in-core representation is needed to serialize creation and make sure that no more than one object with this FID is created.
-\r
- ->loo_object_init(): initializes the structure specific to this OSD layer. As part of the initialization, the OSD is supposed to search for the on-disk representation of the object with its FID (zfs-osd and ldiskfs-osd use an internal Object Index to map the FID to a dnode/inode). If such an object exists, then the LOHA_EXISTS flag is set in loh_attr (struct lu_object_header). An additional struct lu_object_conf can be passed to the method. Currently it is used to tell the OSD that the object is known not to exist, so there is no need to search for it on disk.
-\r
- ->loo_object_delete(): called before lu_object_operations::loo_object_free() to signal that object is being destroyed. Dual to lu_object_operations::loo_object_init().\r
-\r
-\r
- ->loo_object_free(): called to release memory
-\r
-\r
- ->loo_object_release(): called when last active reference to the object is released (and object returns to the cache). This method is optional.\r
-\r
-OBD Methods
------------\r
-\r
-The LU infrastructure aims at replacing the storage operations of the legacy OBD API (see struct obd_ops in lustre/include/obd.h). However, the OBD API is still used in several places for device configuration and on the Lustre client (e.g. for LDLM locking). The OBD API storage operations are not needed for server components and should be ignored. As far as the OSD layer is concerned, upper layers still connect to and disconnect from the OSD instance via obd_ops::o_connect/o_disconnect. As a result, each OSD should implement those two operations:
-- obd_ops::o_connect should just call class_connect() and return a struct obd_export via class_conn2export(), see osd_obd_connect(). The structure holds a reference on the device, preventing it from early release.\r
-- obd_ops::o_disconnect() should just invoke class_disconnect() and then call class_manual_cleanup() when the last export has disconnected, see osd_obd_disconnect(). class_manual_cleanup() schedules the device for a final cleanup.\r
-\r
-\r
-Transaction Overview\r
---------------------\r
-\r
-The transaction methods specified by the OSD API must be mapped to transaction methods of the underlying OSD persistent storage.  All updates to the underlying storage will be done in the context of a transaction.  It is required that all updates in a single transaction are atomic (either all updates committed to stable storage as one group, or all discarded in case of a fatal system error).  Lustre does not require that transactions be rolled back, though this may happen as a consequence of the server or storage on which the OSD is running suffering a catastrophic failure.  It is also not required that each transaction be committed individually to storage.  It is possible to aggregate multiple transaction requests at the OSD layer to a single larger transaction at the storage layer for improved efficiency and reduced overhead.\r
-\r
-Transactions are identified in the OSD API by an opaque transaction handle, which is a pointer to an OSD-private data structure that it can use to track (and optionally verify) the updates done within that transaction.  This handle is returned by the OSD to the caller when the transaction is first created.  Any potential updates (modifications to the underlying storage) must be declared as part of a transaction, after the transaction has been created, and before the transaction is started. The transaction handle is passed when declaring all updates.  If any part of the declaration should fail, the transaction is aborted without having modified the storage.\r
-\r
-After all updates have been declared, and have completed successfully, the handle is passed to the transaction start.  After the transaction has started, the handle will be passed to every update that is done as part of that transaction.  All updates done under the transaction must previously have been declared. Once the transaction has started, it is not permitted to add new updates to the transaction, nor is it possible to roll back the transaction after this point.  Should some update to the storage fail, the caller will try to undo the previous updates within the context of the transaction itself, to ensure that the resulting OSD state is correct.\r
-\r
-Any update that was not previously declared is an implementation error in the caller.  Not all declared updates need to be executed, as they form a worst-case superset of the possible updates that may be required in order to complete the desired operation in a consistent manner.\r
-\r
-Upper layers of the stack may register callback(s) for any open transaction.  These callbacks are to be called after the transaction has committed to stable storage.  The purpose of this callback is so the upper layers can do cleanup or other tasks when the transaction has safely committed to stable storage, and also notify the Lustre client that the request which generated this transaction does not need to be replayed and can be discarded from its cache.  In the case of catastrophic failure of the OSD, the Lustre client will replay any transactions that were completed by the OSD, but which had not yet committed to persistent storage, in the order that they were originally performed by the OSD.  By using an asynchronous request and notification method for modifying operations, the Lustre client/server can avoid waiting for synchronous operations to complete.  Supporting commit callbacks is a requirement of any storage used with the OSD API.
-
-Once all of the actual updates in that transaction are complete, the transaction is stopped.  After this point, no more updates can be done using this transaction handle.  It is possible to mark a transaction handle to be completed synchronously.  In this case, when the transaction is stopped, the ->dt_trans_stop() method should not return until all of the updates have committed to stable storage.  If there is an error committing the updates to storage, the OSD must abort all operations and discard any in-flight transactions, returning to a consistent transaction state. For some backends such a rollback can be non-trivial, so they may switch to read-only mode instead to prevent further corruption. The problem should then be solved with help from an administrator.
-
-To let users register a per-transaction callback, the OSD should export the method ->dt_trans_cb_add(), which takes the following descriptor:
-\r
-\r
-struct dt_txn_commit_cb {\r
-        cfs_list_t        dcb_linkage;                /* used internally */\r
-        dt_cb_t                dcb_func;                /* user’s function to be called upon commit */\r
-        __u32                dcb_magic;                /* used internally */\r
-        char                dcb_name[MAX_COMMIT_CB_STR_LEN];\r
-};\r
-\r
-\r
-Another set of callbacks can be registered on a per-device basis with dt_txn_callback_add() using the following descriptor:
-\r
-\r
-struct dt_txn_callback {\r
-        int (*dtc_txn_start)(const struct lu_env *env, struct thandle *txn, void *cookie);\r
-        int (*dtc_txn_stop)(const struct lu_env *env, struct thandle *txn, void *cookie);\r
-        void (*dtc_txn_commit)(struct thandle *txn, void *cookie);\r
-        void                *dtc_cookie;\r
-        __u32                dtc_tag;\r
-        cfs_list_t           dtc_linkage;\r
-};\r
-\r
-\r
-These callbacks let layers that do not drive transactions themselves take part in them. For example, MDT registers its set of callbacks, and from then on every transaction happening on the corresponding OSD is seen by MDT, which adds recovery information to the transaction: it generates a transaction number and puts it into a special file. All this happens within the context of the transaction, and therefore atomically. Similarly, the VBR functionality in MDT updates object versions.
-\r
-\r
-Transactions, or groups of transactions, should be committed sequentially. If transaction T1 starts before transaction T2 starts, then the commit of T2 means that T1 is committed at the same time or earlier. Notice that the creation of a transaction does not imply the immediate start of the updates on storage. \r
-The declaration stage is used in order to calculate credits needed by the underlying filesystem in order to perform the specified updates in an atomic manner.  For example, for a write operation the amount of space required can be calculated at the declaration stage, thus allowing the file system to ensure that enough space is reserved to complete the transaction atomically without failure once it has started.\r
-\r
-\r
-Every transaction is done in a few steps:
-1) creation of the transaction handle -- ->dt_trans_create()
-2) declaration of one or more updates that move the file system from one consistent state to another
-   - this makes sure there are enough resources to commit the requested changes atomically
-3) transaction start -- ->dt_trans_start()
-4) execution
-   - perform the operations declared in step 2)
-   - fewer operations may be performed at this stage than were declared in step 2)
-   - operations that were not declared in step 2) may not be executed
-5) transaction stop -- ->dt_trans_stop()
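
The five steps above can be illustrated with a small, self-contained mock. This is only a sketch under stated assumptions: struct mock_txn, the mock_* functions, the integer "credits" and the error codes are hypothetical stand-ins chosen for illustration, not the real struct thandle or dt_device methods. The sketch only enforces the ordering rules described in the text: declarations happen before the start, and executed updates must fit within what was declared.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins; not the real Lustre thandle or dt_device API. */
enum mock_txn_state { MOCK_TXN_CREATED, MOCK_TXN_STARTED, MOCK_TXN_STOPPED };

struct mock_txn {
        enum mock_txn_state state;
        int declared;   /* credits reserved during declaration */
        int used;       /* credits consumed during execution */
};

/* Step 1: create the handle; nothing is reserved yet. */
void mock_trans_create(struct mock_txn *th)
{
        th->state = MOCK_TXN_CREATED;
        th->declared = 0;
        th->used = 0;
}

/* Step 2: declare an update; only legal before the transaction starts. */
int mock_declare(struct mock_txn *th, int credits)
{
        assert(credits >= 0);
        if (th->state != MOCK_TXN_CREATED)
                return -EINVAL;
        th->declared += credits;
        return 0;
}

/* Step 3: start; the reservation is frozen from here on. */
int mock_trans_start(struct mock_txn *th)
{
        if (th->state != MOCK_TXN_CREATED)
                return -EINVAL;
        th->state = MOCK_TXN_STARTED;
        return 0;
}

/* Step 4: execute an update; it must fit within what was declared. */
int mock_execute(struct mock_txn *th, int credits)
{
        assert(credits >= 0);
        if (th->state != MOCK_TXN_STARTED)
                return -EINVAL;
        if (th->used + credits > th->declared)
                return -ENOSPC; /* undeclared update: a bug in the caller */
        th->used += credits;
        return 0;
}

/* Step 5: stop; no more updates may use this handle. */
int mock_trans_stop(struct mock_txn *th)
{
        if (th->state != MOCK_TXN_STARTED)
                return -EINVAL;
        th->state = MOCK_TXN_STOPPED;
        return 0;
}
```

Using fewer credits than declared is accepted, which matches the rule that the declarations form a worst-case superset of the updates actually performed.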
-\r
-\r
-Setting thandle::th_sync to 1 requests ->dt_trans_stop() to commit the transaction to persistent storage as soon as possible; the caller does not get control back before the transaction is committed.
-The OSD should also provide the caller with a way to start committing as soon as possible without blocking on it: ->dt_commit_async().
-\r
-\r
-The transaction handle is created by ->dt_trans_create() and is usually destroyed upon commit (as it holds the list of callbacks and their private data). The only exception is when ->dt_trans_start() cannot actually start the transaction due to problems with the file system or a lack of resources.
-\r
-\r
-The maximum number of updates that make up a single transaction is OSD-specific, but is expected to be at least in the tens of updates to multiple objects in the OSD (extending writes of multiple MB of data, modifying or adding attributes, extended attributes, references, etc).     For example, in ext4, each update to the filesystem will modify one or more blocks of storage.  Since one transaction is limited to one quarter of the journal size, if the caller declares a series of updates that modify more than this number of blocks, the declaration must fail or it could not be committed atomically. In general, every constraint must be checked here to ensure that all changes that must commit atomically can complete successfully.\r
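
The ext4 example above amounts to a simple check at declaration time. The following is a minimal sketch of that arithmetic, assuming a journal measured in blocks; mock_declare_blocks() is a hypothetical helper and -ENOSPC is an error code picked for illustration, neither is taken from ldiskfs or jbd2.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: account for nblocks more modified blocks in a
 * transaction that has already declared *declared blocks. The declaration
 * fails once the total would exceed one quarter of the journal.
 */
int mock_declare_blocks(unsigned long journal_blocks,
                        unsigned long *declared, unsigned long nblocks)
{
        unsigned long limit = journal_blocks / 4;

        assert(*declared <= limit);
        /* Written as a subtraction to avoid overflowing the sum. */
        if (nblocks > limit - *declared)
                return -ENOSPC;
        *declared += nblocks;
        return 0;
}
```

With a 1024-block journal the limit is 256 blocks, so declarations of 200 and 56 blocks succeed and any further declaration of one or more blocks fails.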
-\r
-\r
-Objects Overview\r
-----------------\r
-\r
-Lustre identifies objects in the underlying OSD storage by a unique 128-bit File IDentifier (FID) that is specified by Lustre and is the only identifier that Lustre is aware of for this object.  The FID is known to Lustre before any access to the object is done (even before it is created), using lu_object_find(). Since Lustre only uses the FID to identify an object, if the underlying OSD storage cannot directly use the Lustre-specified FID to retrieve the object at a later time, it must create a table or index object (normally called the Object Index (OI)) to map Lustre FIDs to an internal object identifier.  Lustre does not need to understand the format or value of the internal object identifier at any time outside of the OSD.\r
-\r
-\r
-The FID itself is composed of 3 members:\r
-\r
-\r
-struct lu_fid {
-        __u64        f_seq;
-        __u32        f_oid;
-        __u32        f_ver;
-};
-\r
-\r
-While the OSD itself should typically not interpret the FID, it may be possible to optimize the OSD performance by understanding the properties of a FID.  The f_seq (sequence) component is allocated in a piecewise (though not contiguous) manner to different nodes, and each sequence forms a "group" of related objects.  The sequence number may be any value in the range [1, 2^63], but there are typically not a huge number of sequences in use at one time (typically less than one million at the maximum). Within a single sequence, it is likely that tens to thousands (and less commonly millions) of mostly-sequential f_oid values will be allocated. In order to efficiently map FIDs into objects, it is desirable to also be able to associate the OSD-internal index with key-value pairs.
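
To make the FID-to-internal-identifier mapping concrete, here is a minimal sketch of an Object Index as a tiny in-memory table. Everything in it is an assumption made for illustration: struct mock_fid only mirrors the three fields of struct lu_fid with standard C types, the fixed-size array and linear search stand in for whatever on-disk index a real OSD would use, and the error codes are placeholders.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the three lu_fid fields using standard C types. */
struct mock_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

#define MOCK_OI_SLOTS 8

struct mock_oi_ent {
        struct mock_fid fid;
        uint64_t ino;           /* OSD-internal identifier (inode/dnode) */
};

/* Hypothetical Object Index: a small unsorted table. */
struct mock_oi {
        struct mock_oi_ent ent[MOCK_OI_SLOTS];
        int count;
};

int mock_fid_eq(const struct mock_fid *a, const struct mock_fid *b)
{
        return a->f_seq == b->f_seq && a->f_oid == b->f_oid &&
               a->f_ver == b->f_ver;
}

/* Map a FID to the internal identifier; -ENOENT if it is not indexed. */
int mock_oi_lookup(const struct mock_oi *oi, const struct mock_fid *fid,
                   uint64_t *ino)
{
        int i;

        assert(oi->count >= 0 && oi->count <= MOCK_OI_SLOTS);
        for (i = 0; i < oi->count; i++) {
                if (mock_fid_eq(&oi->ent[i].fid, fid)) {
                        *ino = oi->ent[i].ino;
                        return 0;
                }
        }
        return -ENOENT;
}

/* Add a mapping; at most one object may exist for a given FID. */
int mock_oi_insert(struct mock_oi *oi, const struct mock_fid *fid,
                   uint64_t ino)
{
        uint64_t tmp;

        if (mock_oi_lookup(oi, fid, &tmp) == 0)
                return -EEXIST;
        if (oi->count == MOCK_OI_SLOTS)
                return -ENOSPC;
        oi->ent[oi->count].fid = *fid;
        oi->ent[oi->count].ino = ino;
        oi->count++;
        return 0;
}
```

Because f_oid values within one sequence are mostly sequential, a real index would likely group entries by f_seq rather than scan a flat table as this sketch does.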
-\r
-\r
-There are two major types of objects:
-1) regular, storing unstructured data (e.g. flat files, OST objects, llog objects)\r
-2) index, storing key=value pairs (e.g. directories, quota indexes, FLDB)\r
-\r
-\r
-There are 3 sets of methods that should be implemented by the OSD layer:\r
-1. core methods (i.e. dt_object_operations) used to create/destroy/manipulate attributes of objects\r
-\r
-2. data methods (i.e. dt_body_operations) used to access the object body as a flat address space (read/write/truncate/punch) for regular objects\r
-\r
-3. index operations (i.e. dt_index_operations) to access index objects as a key-value association\r
-\r
-== common methods (dt_object_operations) ==\r
-\r
-These methods must exist in the OSD and be mapped to the appropriate functions in the backend file system.
-\r
-\r
-** Preconditions (locking requirements), mandatory parameters, parameter ranges, pre-calls, post-calls, error codes to be returned, etc.\r
-\r
-\r
- ->do_ah_init(): is called to initialize an allocation hint from the parent and child objects. The OSD can fill struct dt_allocation_hint with information that helps to allocate objects in an optimal way. The OSD can also pass additional information on to a child object that is about to be created.
-\r
- ->do_declare_create(): is called to reserve resources (on-disk, in-memory) to create the object, including all internal resources like OI, accounting, etc. The object shouldn’t exist already (i.e. dt_object_exist() should return false)\r
-\r
-->do_create(): is called to perform the actual object creation, including the OI update and accounting, if necessary. Along with the allocation hint (see ->do_ah_init()), the method takes a struct dt_object_format, which can specify the format of an index (dt_object_format.u.dof_idx).
-\r
-->do_declare_destroy(): is called to reserve resource for object deletion. Semantically it’s dual to object creation and does not care about on-disk reference to the object (in contrast with POSIX unlink operation).\r
-\r
-->do_destroy(): is used to execute the object destruction, including OI update. The object must exist (i.e. dt_object_exist() must return true)\r
-\r
-->do_attr_get(): is called to fetch the regular attributes (i.e. the lu_attr structure) associated with an object. The lu_attr fields map to the usual unix file attributes, like ownership or size. The object must exist.
-
-->do_declare_attr_set(): is used to reserve resources in a transaction in order to modify some attributes. Can be called on a non-existing object.
-\r
-->do_attr_set(): is called to perform the actual attribute changes. The object must exist.\r
-\r
-->do_xattr_get(): is called to fetch the extended attribute of an object with a certain name.  If the struct lu_buf argument has a null lb_buf, the size of the extended attribute should be returned. If the requested extended attribute does not exist, -ENODATA should be returned.  The object must exist. If buffer space (specified in lu_buf.lb_len) is not enough to fit the value, then return -ERANGE. \r
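
The return-value rules for ->do_xattr_get() can be sketched with a mock over an in-memory attribute table. The names struct mock_buf, struct mock_xattr and mock_xattr_get() are hypothetical; struct mock_buf only mirrors the two lu_buf fields named in the text. Returning the value length after a successful copy is an assumption borrowed from getxattr(2); the text above only specifies the size probe and the two error cases.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the two lu_buf fields mentioned in the text. */
struct mock_buf {
        void   *lb_buf;
        size_t  lb_len;
};

/* Hypothetical in-memory extended attribute. */
struct mock_xattr {
        const char *name;
        const char *value;
        size_t      len;
};

int mock_xattr_get(const struct mock_xattr *tbl, size_t n,
                   const char *name, struct mock_buf *buf)
{
        size_t i;

        assert(name != NULL && buf != NULL);
        for (i = 0; i < n; i++) {
                if (strcmp(tbl[i].name, name) != 0)
                        continue;
                if (buf->lb_buf == NULL)
                        return (int)tbl[i].len; /* size probe only */
                if (buf->lb_len < tbl[i].len)
                        return -ERANGE;         /* buffer too small */
                memcpy(buf->lb_buf, tbl[i].value, tbl[i].len);
                return (int)tbl[i].len;         /* assumption, see lead-in */
        }
        return -ENODATA;                        /* no such attribute */
}
```

A caller would typically probe with a null lb_buf first, allocate a buffer of the returned size, and then call again to fetch the value.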
-\r
-->do_declare_xattr_set(): is called to reserve resources in a transaction in order to set an extended attribute of an object. Can be called on a non-existing object.
-\r
-->do_xattr_set(): is called to create or update an extended attribute of an object.  If the fl argument has LU_XATTR_CREATE, the extended attribute must not exist, otherwise -EEXIST should be returned.  If the fl argument has LU_XATTR_REPLACE, the extended attribute must exist, otherwise -ENODATA should be returned.  The object must exist. The maximum size of an extended attribute supported by the OSD should be reported in struct dt_device_param, which the caller can get with the ->dt_conf_get() method.
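
The flag handling described for ->do_xattr_set() reduces to a small decision function. The sketch below is hypothetical: MOCK_XATTR_CREATE and MOCK_XATTR_REPLACE stand in for LU_XATTR_CREATE and LU_XATTR_REPLACE, and their numeric values are placeholders rather than the values used by Lustre.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder flag values; the real LU_XATTR_* values may differ. */
enum {
        MOCK_XATTR_REPLACE = 1 << 0,
        MOCK_XATTR_CREATE  = 1 << 1,
};

/*
 * Decide whether a set operation may proceed, given whether the
 * attribute currently exists and which flags the caller passed.
 */
int mock_xattr_set_check(int exists, int fl)
{
        assert((fl & ~(MOCK_XATTR_REPLACE | MOCK_XATTR_CREATE)) == 0);
        if ((fl & MOCK_XATTR_CREATE) && exists)
                return -EEXIST;         /* create-only, but already there */
        if ((fl & MOCK_XATTR_REPLACE) && !exists)
                return -ENODATA;        /* replace-only, but missing */
        return 0;                       /* create or update is allowed */
}
```

With no flags set, the operation succeeds whether or not the attribute exists, which corresponds to the plain "create or update" behaviour.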
-\r
-->do_declare_xattr_del(): is called to reserve resources in the transaction in order to delete an extended attribute of an object.\r
-\r
-->do_xattr_del(): is called to delete an extended attribute of an object.  The object must exist. Deleting a nonexistent extended attribute is allowed, in which case the method returns 0.
-\r
-->do_xattr_list(): is called to get a list of the names of existing extended attributes.  The size of the list is returned, including the string terminator.  If the lu_buf argument has a null lb_buf, how many bytes the list would require is returned to help the caller to allocate a buffer of an appropriate size.  The object must exist.\r
-\r
-->do_declare_ref_add(): is called to reserve resources in a transaction in order to increment the object’s nlink.\r
-\r
-->do_ref_add(): is called to increment the nlink of an object. This is typically done on an object when a record referring to it is added to an index object.  The object must exist.\r
-\r
-->do_declare_ref_del(): is called to reserve credits in a transaction in order to decrement the object’s nlink.\r
-\r
-->do_ref_del():  is called to decrement the nlink of an object.  This is typically done on an object when a record referring to it is deleted from an index object.  The object must exist.\r
-\r
-== data methods (dt_body_operations) ==\r
-\r
-Set of methods described in struct dt_body_operations which should be used with regular objects storing unstructured data.\r
-\r
-->dbo_read(): is called to read data from an object.\r
-\r
-->dbo_declare_write(): is called to reserve resources in a transaction in order to write data into an object.\r
-\r
-->dbo_write(): is called to write data into an object.  This is mostly used to update symbolic links and objects used for internal purposes by Lustre.  Data written with this method is subject to regular transactional rules: commit with other changes in the transaction or discarded together.\r
-\r
-->dbo_bufs_get(): is called to get an array of buffers for the range described by (offset, length). Each buffer contains a pointer to a Linux page, the offset within this page and the size:
-\r
-struct niobuf_local {
-        /* fields filled by OSD */
-        __u64                lnb_file_offset;        /* offset within object */
-        __u32                lnb_page_offset;        /* offset within page */
-        __u32                len;                    /* actual data stored in this buffer */
-        cfs_page_t          *page;
-        cfs_dentry_t        *dentry;
-
-        /* internal fields used by obdfilter/ofd */
-        __u32                flags;
-        int                  lnb_grant_used;
-        int                  rc;
-};
-\r
-\r
-The size of the array should be PTLRPC_MAX_BRW_PAGES.\r
-\r
-->dbo_bufs_put(): is called to release buffers obtained by ->dbo_bufs_get(). Methods operating on struct niobuf_local (buffers) are used to implement zero-copy IO.
-
-->dbo_write_prep(): is called before doing a bulk transfer from the network to the buffers.  The purpose of the method is to let the OSD fill partial buffers with actual data. If the whole buffer is supposed to be overwritten, then the OSD can skip this buffer.
-\r
-\r
-->dbo_declare_write_commit(): is called to reserve resources in a transaction in order to write data described by the array of buffers into an object with ->dbo_write_commit(). The transactional rules for \r
-\r
-\r
-->dbo_write_commit(): is called to write out the data in the buffers.  The transactional rules are the same: by the time the transaction is reported committed all the data written with the method should be stored persistently as well.\r
-\r
-\r
-->dbo_read_prep(): is called to fetch data into the buffers prepared by ->dbo_bufs_get()\r
-\r
-\r
-->dbo_fiemap_get(): is called to get logical -> physical mapping information for given range in the object:\r
-\r
-\r
-\r
-\r
-struct ll_user_fiemap {\r
-        __u64        fm_start;        /* logical offset (inclusive) */\r
-        __u64        fm_length;        /* logical length of the range */
-        __u32        fm_flags;  /* FIEMAP_FLAG_* flags for request (in/out) */\r
-        __u32        fm_mapped_extents;/* number of extents that were mapped (out) */\r
-        __u32        fm_extent_count;  /* size of fm_extents array (in) */\r
-        __u32        fm_reserved;\r
-        struct ll_fiemap_extent fm_extents[0]; /* array of mapped extents (out) */\r
-};\r
-\r
-\r
-->do_declare_punch(): is called to reserve resources in a transaction in order to release (deallocate) specified range of data in an object.\r
-\r
-\r
-->do_punch(): is called to release (deallocate) specified range of data in an object. Currently used only in form of truncate where the range is [offset; EOF].\r
-\r
-\r
-== index methods (dt_index_operations) ==\r
-\r
-\r
-To be used with index objects storing key=value pairs\r
-\r
-\r
- ->do_index_try(): announces that an object is going to be used as an index. This operation checks that the object supports indexing operations and the features described in the struct dt_index_features passed in.
-\r
-\r
-struct dt_index_features {\r
-        __u32        dif_flags;                /** required feature flags from enum dt_index_flags */\r
-        size_t        dif_keysize_min;        /** minimal required key size */\r
-        size_t        dif_keysize_max;        /** maximal required key size, 0 if no limit */\r
-        size_t        dif_recsize_min;                /** minimal required record size */\r
-        size_t        dif_recsize_max;        /** maximal required record size, 0 if no limit */\r
-        size_t        dif_ptrsize;                /** pointer size for record */\r
-};\r
-\r
-\r
-enum dt_index_flags {\r
-        DT_IND_VARKEY = 1 << 0,        /** index supports variable sized keys */\r
-        DT_IND_VARREC = 1 << 1,        /** index supports variable sized records */\r
-        DT_IND_UPDATE = 1 << 2,        /** index can be modified */\r
-        DT_IND_NONUNQ = 1 << 3,        /** index supports records with non-unique (duplicate) keys */\r
-        /**\r
-         * index support fixed-size keys sorted with natural numerical way\r
-         * and is able to return left-side value if no exact value found\r
-         */\r
-        DT_IND_RANGE = 1 << 4,\r
-};\r
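
The kind of check ->do_index_try() performs can be sketched as a comparison between what the caller requires and what the backend supports. This is a hedged illustration only: struct mock_index_caps, mock_index_try(), the reduced struct mock_index_features (flags and maximum key size only), the interpretation of a zero limit on the backend side, and the -EOPNOTSUPP error code are all assumptions, not the logic of any existing OSD.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit layout as enum dt_index_flags above, under mock names. */
enum {
        MOCK_IND_VARKEY = 1 << 0,
        MOCK_IND_VARREC = 1 << 1,
        MOCK_IND_UPDATE = 1 << 2,
        MOCK_IND_NONUNQ = 1 << 3,
        MOCK_IND_RANGE  = 1 << 4,
};

/* Reduced version of dt_index_features: only two fields are checked. */
struct mock_index_features {
        uint32_t dif_flags;
        size_t   dif_keysize_max;       /* 0 if no limit */
};

/* Hypothetical description of what the backend index can do. */
struct mock_index_caps {
        uint32_t flags;
        size_t   keysize_max;           /* 0 if no limit */
};

int mock_index_try(const struct mock_index_caps *caps,
                   const struct mock_index_features *feat)
{
        assert(caps != NULL && feat != NULL);
        /* Every required feature flag must be supported. */
        if (feat->dif_flags & ~caps->flags)
                return -EOPNOTSUPP;
        /* A bounded backend cannot serve unbounded or larger keys. */
        if (caps->keysize_max != 0 &&
            (feat->dif_keysize_max == 0 ||
             feat->dif_keysize_max > caps->keysize_max))
                return -EOPNOTSUPP;
        return 0;
}
```

A complete check would treat the minimum key size, the record sizes and dif_ptrsize in the same way.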
-\r
-\r
- ->dio_lookup(): look up a record associated with a key in a given index object. \r
-\r
-\r
- ->dio_declare_insert(): reserve resources for inserting a key/record pair in an index object\r
-\r
-\r
- ->dio_insert(): insert key/record pair in an index object\r
-\r
-\r
- ->dio_declare_delete(): reserve resources for deleting a key/record pair in an index object\r
-\r
-\r
- ->dio_delete(): delete a key/record pair in an index object\r
-\r
-\r
-To let users fetch all or a subset of key/record pairs, the OSD should provide iterator methods:\r
-\r
-\r
-1. ->init(): allocate and initialize the iterator (defined within the OSD implementation)\r
-2. ->fini(): release the iterator returned by ->init()\r
-3. ->get(): try to set the iterator to the closest position that is <= the key\r
-4. ->next(): move the iterator by one record\r
-5. ->key(): return a pointer to the key at the iterator's current position\r
-6. ->key_size(): return the size of the key at the iterator's current position\r
-7. ->rec(): return a pointer to the buffer holding the record at the iterator's current position\r
-8. ->store(): return the current position of the iterator\r
-9. ->load(): set the iterator to the position matching the specified hash\r
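The iterator methods listed above can be sketched against a toy in-memory index. Everything here is hypothetical: the mem_* types, the it_* function names, the use of fixed 64-bit keys and records returned by value (the API above returns pointers), and the return conventions (it_get() returns 1 for an exact match, 0 for a smaller key, -1 if nothing is <= the key; it_next() returns 1 at the end). It only shows how a caller might position, walk, and resume an iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory index: records sorted by ascending key. */
struct mem_rec   { uint64_t key; uint64_t rec; };
struct mem_index { const struct mem_rec *recs; size_t count; };

/* Iterator state; a real OSD defines its own private structure. */
struct mem_it    { const struct mem_index *idx; size_t pos; };

struct mem_it *it_init(const struct mem_index *idx)
{
        struct mem_it *it = malloc(sizeof(*it));

        if (it != NULL) {
                it->idx = idx;
                it->pos = 0;
        }
        return it;
}

void it_fini(struct mem_it *it)
{
        free(it);
}

/* Position on the last record whose key is <= key. */
int it_get(struct mem_it *it, uint64_t key)
{
        size_t i, found = it->idx->count;

        for (i = 0; i < it->idx->count && it->idx->recs[i].key <= key; i++)
                found = i;
        if (found == it->idx->count)
                return -1;              /* no record with key <= requested key */
        it->pos = found;
        return it->idx->recs[found].key == key;
}

/* Move forward by one record: 0 on success, 1 when already on the last one. */
int it_next(struct mem_it *it)
{
        if (it->pos + 1 >= it->idx->count)
                return 1;
        it->pos++;
        return 0;
}

uint64_t it_key(const struct mem_it *it) { return it->idx->recs[it->pos].key; }
uint64_t it_rec(const struct mem_it *it) { return it->idx->recs[it->pos].rec; }

/* In this sketch the cookie is simply the array position. */
uint64_t it_store(const struct mem_it *it) { return it->pos; }

int it_load(struct mem_it *it, uint64_t cookie)
{
        if (cookie >= it->idx->count)
                return -1;
        it->pos = (size_t)cookie;
        return 0;
}

/* Typical walk: position with it_get(), then loop with it_next(). */
uint64_t sum_from(const struct mem_index *idx, uint64_t start)
{
        struct mem_it *it = it_init(idx);
        uint64_t sum = 0;

        assert(it != NULL);
        if (it_get(it, start) >= 0) {
                do {
                        sum += it_rec(it);
                } while (it_next(it) == 0);
        }
        it_fini(it);
        return sum;
}
```

The store/load pair is what lets a caller drop the iterator between requests and continue later from the saved cookie.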
-\r
-\r
-** Add iterator example here\r
-\r
-\r
-Special objects\r
-\r
-\r
-A special object with fid [ FID_SEQ_LOCAL_FILE; OTABLE_OT_OID, 0 ] should be accessible via OSD: this is an index object providing a list of all existing objects on this storage. The key is an opaque string and the record is a FID. This object is used by high-level components like LFSCK to iterate over objects.\r
-\r
-\r
-Locking Description\r
-\r
-\r
-Internal object consistency is maintained by the OSD implementation and/or the backend. This allows the OSD implementation to define how fine-grained the locking is. The API does not define the result of conflicting updates.\r
-\r
-Locking provided by the OSD is used to support atomic in-core changes, so that the state visible to other threads accessing the OSD concurrently is consistent.\r
-\r
-\r
-The OSD provides methods to lock/unlock objects in shared and exclusive modes. This locking is not used by the OSD internally; rather, it is a means to let the user group and serialize accesses/updates. The OSD API puts no constraints on the locking order; it’s up to the caller.\r
-\r
-\r
-Methods to lock/unlock object\r
-\r
-\r
- * ->do_read_lock() - used to get shared lock on the object\r
-\r
-\r
- * ->do_read_unlock() - used to release shared lock on the object\r
-\r
-\r
- * ->do_write_lock() - used to get exclusive lock on the object\r
-\r
-\r
- * ->do_write_unlock() - used to release exclusive lock on the object\r
-\r
-\r
-Quota Management\r
-\r
-\r
-The OSD layer is in charge of setting up a Quota Slave Device (aka QSD) to manage quota enforcement for a specific OSD device. The QSD instance is in charge of:\r
-- completing the reintegration procedure with the quota master (aka QMT, see qmt_dev.c) to retrieve the latest quota settings and space distribution.\r
-- managing quota locks in order to be notified of configuration changes.\r
-- acquiring space from the QMT when quota space for a given user/group is close to exhaustion.\r
-- allocating quota space to service threads for local request processing.\r
-\r
-\r
-The QSD API is the following: \r
-- qsd_init()\r
-Initialize the quota slave instance. It should be called when starting the osd device: osd_start().\r
-- qsd_fini()\r
-Finalize the quota slave instance. It should be called when shutting down the osd device: osd_shutdown().\r
-- qsd_start()\r
-Mark the qsd slave instance as 'started' and trigger the 3rd step of quota reintegration: acquire/release quota up/down to usage or acquire per-ID lock if necessary. This function should be called after the osd device has completed recovery: osd_recovery_complete().\r
-- qsd_op_begin()\r
-This function is used to enforce quota, and should be called in the declaration of each operation subject to quota enforcement:\r
-* osd_declare_attr_set()\r
-* osd_declare_object_create()\r
-* osd_declare_object_destroy()\r
-* osd_declare_punch()\r
-* osd_declare_write()\r
-* osd_declare_write_commit()\r
-- qsd_op_end()\r
-Perform post-operation quota processing: pre-acquire/release quota from/to the master. It should be called after the transaction has stopped: osd_trans_stop().\r
-- qsd_adjust_quota()\r
-Trigger pre-acquire/release if necessary; it's only used for the ldiskfs osd so far. When a file is unlinked in ldiskfs, the quota accounting isn't updated when the transaction stops; instead, it is updated on the final iput, so qsd_adjust_quota() is called then (in osd_object_delete()) to trigger quota release if necessary.\r
-\r
-It is highly desirable that an OSD object can be accessed and modified by multiple threads concurrently.\r
-\r
-For regular objects, the preferred implementation allows an object to be read concurrently at overlapping offsets, and written by multiple threads at non-overlapping offsets with the minimum amount of contention possible, or any combination of concurrent read/write operations.  Lustre will not itself perform concurrent overlapping writes to a single region of the object, due to serialization at a higher layer.\r
-\r
-For index objects, the preferred implementation allows key/value pairs to be looked up concurrently, allows non-conflicting keys to be inserted or removed concurrently, or any combination of concurrent lookup, insertion, or removal.  Lustre does not require the storage of multiple identical keys.  Operations on the same key should be serialized.\r
- \r
-Requirements for Storage Subsystems Below the OSD API\r
-As previously discussed, the underlying OSD storage must be able to provide some form of atomic commit for multiple arbitrary updates to OSD storage within a single transaction.  It will always know in advance of the transaction starting which objects will be modified, and how they will be modified.\r
-\r
-Atomicity of Updates\r
-If any of the updates associated with a transaction are stored persistently (i.e. some state in the OSD is modified), then all of the updates in that transaction must also be stored persistently (Atomic).  If the OSD should fail in some manner that prevents all the updates of a transaction from being completed, then none of the updates shall be completed (Consistent).  Once the updates have been reported committed to the caller (i.e. commit callbacks have been run), they cannot be rolled back for any reason (Durable).\r
-\r
-\r
-Object Attributes\r
-The OSD object should be able to store normal POSIX attributes on each object as specified by Lustre:\r
-- user ID (32 bits)\r
-- group ID (32 bits)\r
-- object type (16 bits)\r
-- access mode (16 bits)\r
-- metadata change time (96 bits, 64-bit seconds, 32-bit nanoseconds)\r
-- data modification time (96 bits, 64-bit seconds, 32-bit nanoseconds)\r
-- data access time (96 bits, 64-bit seconds, 32-bit nanoseconds)\r
-- creation time (96 bits, 64-bit seconds, 32-bit nanoseconds, optional)\r
-- object size (64 bits)\r
-- link count (32 bits)\r
-- flags (32 bits)\r
-- object version (64 bits)\r
-\r
-\r
-The OSD object shall not modify these attributes itself.\r
-\r
-In addition, it is desirable to track the object allocation size (“blocks”), which the OSD manages itself.  Lustre will query the object allocation size, but will never modify it.  If these attributes are not managed by the OSD natively as part of the object itself, they can be stored in an extended attribute associated with the object.\r
-\r
-Extended Attributes\r
-The OSD should have an efficient mechanism for storing small extended attributes with each object.  This implies that the extended attributes can be accessed at the same time as the object (without extra seek/read operations). There is also a requirement to store larger extended attributes in some cases (over 1kB in size), but access to such attributes may be slower, in proportion to the attribute size.\r
-\r
-Efficient Index\r
-The OSD must provide a mechanism for efficient key=value retrieval, for both fixed-length and variable length keys and values.  It is expected that an index may hold tens of millions of keys, and must be able to do random key lookups in an efficient manner. It must also provide a mechanism for iterating over all of the keys in a particular index and returning these to the caller in a consistent order across multiple calls.  It must be able to provide a cookie that defines the current index at which the iteration is positioned, and must be able to continue iteration at this index at a later time.\r
-\r
-Commit Callbacks\r
-The OSD must provide some mechanism to register multiple arbitrary callback functions for each transaction, and call these functions after the transaction with which they are associated has committed to persistent storage.  It is not required that they be called immediately at transaction commit time, but they cannot be delayed an arbitrarily long time, or other parts of the system may suffer resource exhaustion.  If this mechanism is not implemented by the underlying storage, then it needs to be provided in some manner by the OSD implementation itself.\r
-\r
-\r
-Optional\r
-In order to provide quota functionality for the OSD, it must be able to track the object allocation size against at least two different keys (typically User ID and Group ID).  The actual mechanism of tracking this allocation is internal to the OSD.  Lustre will specify the owners of the object against which to track this space. Space accounting information will be accessed by Lustre via the index API on special objects dedicated to space allocation management.\r
-
-Sample Code\r
-- http://git.whamcloud.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-zfs;\r
-\r
-Appendix 1. A brief note on Lustre configuration.\r
-=================================================\r
-\r
-In the current versions (1.8, 2.x) the MGS is used to store the configuration of the servers, the so-called profile. The profile stores the configuration commands and arguments needed to set up a specific stack. To see exactly how it looks, you can fetch the MDT profile with debugfs -R "dump /CONFIGS/lustre-MDT0000 <tempfile>", then parse it with: llog_reader <tempfile>. Here is a short extract:\r
-#02 (136)attach    0:lustre-MDT0000-mdtlov  1:lov  2:lustre-MDT0000-mdtlov_UUID  \r
-#03 (176)lov_setup 0:lustre-MDT0000-mdtlov  1:(struct lov_desc)\r
-                uuid=lustre-MDT0000-mdtlov_UUID  stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1\r
-#06 (120)attach    0:lustre-MDT0000  1:mdt  2:lustre-MDT0000_UUID  \r
-#07 (112)mount_option 0:  1:lustre-MDT0000  2:lustre-MDT0000-mdtlov  \r
-#08 (160)setup     0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  \r
-#23 (080)add_uuid  nid=10.0.2.15@tcp(0x200000a00020f)  0:  1:10.0.2.15@tcp  \r
-#24 (144)attach    0:lustre-OST0000-osc-MDT0000  1:osc  2:lustre-MDT0000-mdtlov_UUID  \r
-#25 (144)setup     0:lustre-OST0000-osc-MDT0000  1:lustre-OST0000_UUID  2:10.0.2.15@tcp  \r
-#26 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov  1:lustre-OST0000_UUID  2:0  3:1  \r
-#32 (080)add_uuid  nid=10.0.2.15@tcp(0x200000a00020f)  0:  1:10.0.2.15@tcp  \r
-#33 (144)attach    0:lustre-OST0001-osc-MDT0000  1:osc  2:lustre-MDT0000-mdtlov_UUID  \r
-#34 (144)setup     0:lustre-OST0001-osc-MDT0000  1:lustre-OST0001_UUID  2:10.0.2.15@tcp  \r
-#35 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov  1:lustre-OST0001_UUID  2:1  3:1  \r
-#41 (120)param 0:  1:sys.jobid_var=procname_uid  2:procname_uid  \r
-#44 (080)set_timeout=20 \r
-#48 (112)param 0:lustre-MDT0000-mdtlov  1:lov.stripesize=1048576  \r
-#51 (112)param 0:lustre-MDT0000-mdtlov  1:lov.stripecount=-1  \r
-#54 (160)param 0:lustre-MDT0000  1:mdt.identity_upcall=/work/lustre/head/lustre-release/lustre/utils/l_getidentity  \r
-\r
-\r
-Every line starts with a specific command (attach, lov_setup, set, etc) that performs a specific configuration action, followed by its arguments. Often the first argument is a device name. For example,\r
-\r
-\r
-#02 (136)attach    0:lustre-MDT0000-mdtlov  1:lov  2:lustre-MDT0000-mdtlov_UUID  \r
-\r
-\r
-This command sets up the device “lustre-MDT0000-mdtlov” of type “lov” with the additional argument “lustre-MDT0000-mdtlov_UUID”. All these arguments are packed into Lustre configuration buffers (struct lustre_cfg).\r
-\r
-\r
-Other commands (like setup and lov_modify_tgts) attach the device into the stack.\r
-\r
-\r
-=====================================================================\r
-\r
-\r
-Lustre Environment\r
-\r
-\r
-Many functions and methods take an environment, represented by struct lu_env. Essentially this is Thread Local Storage (TLS) that is bound to each service thread and used by that thread exclusively, so there is no need to serialize access to the data stored there.\r
-The original purpose of the environment was to work around the small Linux kernel stack (4-8K). A component (like a device or library) can register its own descriptor (see the LU_KEY_INIT macro), and every new thread then populates its environment with the buffers described.\r
-Devices\r
-\r
-\r
-To access a disk file system, Lustre uses the notion of a device, represented by struct dt_device (which in turn is a sub-class of the generic lu_device). This structure holds very basic data such as a reference counter, a reference to the site (the in-core Lustre object collection, very similar to the inode cache), and a reference to struct lu_device_type, which in turn describes this specific type of device (type name, operations, etc).\r
-\r
-\r
-The OSD device is created and initialized at mount time to let the configuration component access the data it needs before the whole Lustre stack is ready. The OSD device is destroyed once all the devices using it are destroyed. Usually this happens when the server stack shuts down at umount time.\r
-\r
-\r
-There might be several OSD devices of a given type (say, several zfs-osd and ldiskfs-osd). The type stores the methods common to all OSD instances of that type (below they start with the ldto_ prefix). Every instance of an OSD device can then have a few instance-specific methods (below they start with the ldo_ prefix).\r
-\r
-\r
-To connect devices into a stack, the ->o_connect() method is used (see struct obd_ops). Currently the OSD should implement this method to track all its users. To disconnect, the ->o_disconnect() method is used. The OSD should implement this method, track the remaining users and, once no users are left, call the class_manual_cleanup() function, which initiates removal of the OSD.\r
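The user tracking described above amounts to reference counting. The sketch below uses a hypothetical struct mock_osd with mock_connect() and mock_disconnect() helpers; none of these names come from Lustre, and setting a flag merely stands in for the call to class_manual_cleanup(). A real implementation would also need locking or atomic counters, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified OSD device state. */
struct mock_osd {
        int users;              /* number of devices currently connected */
        int cleanup_scheduled;  /* set where class_manual_cleanup() would be called */
};

/* Called where ->o_connect() would run: account for one more user. */
int mock_connect(struct mock_osd *osd)
{
        assert(osd != NULL);
        osd->users++;
        return 0;
}

/* Called where ->o_disconnect() would run: drop one user and, once the
 * last one has gone, schedule removal of the device. */
int mock_disconnect(struct mock_osd *osd)
{
        assert(osd != NULL);
        if (osd->users <= 0)
                return -1;      /* disconnect without a matching connect */
        if (--osd->users == 0)
                osd->cleanup_scheduled = 1;
        return 0;
}
```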
-\r
-\r
-As the stack involves many devices and there may be cross-references between them, it’s easier to break the whole shutdown procedure into two steps and not impose a specific order in which different devices shut down: in the first step the devices release all the resources they use internally (the so-called pre-cleanup procedure); in the second step they are actually destroyed.\r
-Device Management Operations\r
-\r
-\r
-struct lu_device *(*ldto_device_alloc)(const struct lu_env *env, struct lu_device_type *t,\r
-                                     struct lustre_cfg *lcfg);\r
-struct lu_device *(*ldto_device_free)(const struct lu_env *, struct lu_device *);\r
-int  (*ldto_device_init)(const struct lu_env *env, struct lu_device *, const char *,\r
-                                    struct lu_device *);\r
-struct lu_device *(*ldto_device_fini)(const struct lu_env *env, struct lu_device *);\r
-int  (*ldto_init)(struct lu_device_type *t);\r
-void (*ldto_fini)(struct lu_device_type *t);\r
-void (*ldto_start)(struct lu_device_type *t);\r
-void (*ldto_stop)(struct lu_device_type *t);\r
-\r
-\r
-\r
-\r
-struct lu_object *(*ldo_object_alloc)(const struct lu_env *env, const struct lu_object_header *h,\r
-                                                          struct lu_device *d);\r
-int (*ldo_process_config)(const struct lu_env *env, struct lu_device *, struct lustre_cfg *);\r
-int (*ldo_recovery_complete)(const struct lu_env *, struct lu_device *);\r
-int (*ldo_prepare)(const struct lu_env *, struct lu_device *parent, struct lu_device *dev);\r
-int (*o_connect)(const struct lu_env *env, struct obd_export **exp, struct obd_device *src,\r
-  struct obd_uuid *cluuid, struct obd_connect_data *ocd, void *localdata);\r
-int (*o_reconnect)(const struct lu_env *env, struct obd_export *exp, struct obd_device *src,\r
-      struct obd_uuid *cluuid, struct obd_connect_data *ocd, void *localdata);\r
-int (*o_disconnect)(struct obd_export *exp);\r
-\r
-\r
-\r
-\r
-\r
-\r
-\r
-\r
-ldto_device_alloc\r
-       The method is called by the configuration component (in the case of a disk file system OSD, this is lustre/obdclass/obd_mount.c) to allocate a device. Notice that the generic struct lu_device does not hold a pointer to private data. Instead, the OSD should embed struct lu_device into its own structure (like struct osd_device) and return the address of the lu_device in that structure.\r
-ldto_device_fini\r
-       The method is called when the OSD is about to be released. The OSD should detach from resources like the disk file system and procfs, release objects it holds internally, etc. This is the so-called precleanup procedure.\r
-ldto_device_free\r
-       The method is called to actually release memory allocated in ->ldto_device_alloc().\r
-ldto_device_init\r
-       The method is not used by OSD currently.\r
-ldto_init\r
-       The method is called when a specific type of OSD is registered in the system. Currently the method is used to register OSD-specific data for environments (see Lustre Environment). See the LU_TYPE_INIT_FINI() macro as an example.\r
-ldto_fini\r
-       The method is called when a specific type of OSD unregisters. Currently used to unregister OSD-specific data from environments.\r
-ldto_start\r
-       The method is called when the first device of this type is being instantiated. Currently used to fill existing environments with OSD-specific data.\r
-ldto_stop\r
-       The method is called when the last instance of a specific OSD has gone. Currently used to release OSD-specific data from environments.\r
-ldo_object_alloc\r
-       The method is called when a high-level service wants to access an object not found in the local Lustre cache (see struct lu_site). The OSD should allocate a structure, initialize the object’s methods and return a pointer to the struct lu_object which is embedded into the OSD object structure.\r
-ldo_process_config\r
-       The method is called in case of configuration changes. Mostly used by high-level services to update local tunables. It’s also possible to let the MGS store OSD tunables and set them properly on every server mount or when tunables change at run time.\r
-ldo_recovery_complete\r
-       The method is called when the recovery procedure between a server and clients is completed. This method is used mostly by high-level devices (like OSP to clean up OST orphans, MDD to clean up open unlinked files left by a missing client, etc).\r
-\r
-\r
-ldo_prepare\r
-       The method is called when all the devices belonging to the stack are configured and set up properly. At this point the server becomes ready to handle RPCs and start the recovery procedure. In the current implementation the OSD uses this method to initialize local quota management.\r
-o_connect\r
-       The method is called to connect another device to this OSD when devices are assembled into a stack. The method should also track the number of connections made (i.e. the number of active users of this OSD).\r
-o_disconnect\r
-       The method is called when someone using this OSD no longer needs its service (i.e. at umount). For every passed struct obd_export the method should call class_disconnect(export). Once the last user has gone, the OSD should call class_manual_cleanup() to schedule the device removal.\r
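The embedding technique mentioned for ldto_device_alloc can be sketched as follows. This is a minimal illustration, not the Lustre code: struct lu_device is reduced to a one-field stand-in, and struct my_osd_device, lu2osd(), my_device_alloc() and my_device_free() are hypothetical names. The point is only that the private structure is recovered from the generic pointer by subtracting the member offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the generic device; it has no private pointer. */
struct lu_device {
        int ld_ref;
};

/* Hypothetical OSD-private device embedding the generic one. */
struct my_osd_device {
        struct lu_device od_lu_dev;   /* generic part handed to upper layers */
        int              od_private;  /* example of backend-specific state */
};

/* Recover the private structure from a pointer to the embedded member. */
struct my_osd_device *lu2osd(struct lu_device *d)
{
        assert(d != NULL);
        return (struct my_osd_device *)
                ((char *)d - offsetof(struct my_osd_device, od_lu_dev));
}

/* Plays the role of ->ldto_device_alloc(): allocate the private structure
 * and return the address of the generic device inside it. */
struct lu_device *my_device_alloc(int private_value)
{
        struct my_osd_device *osd = calloc(1, sizeof(*osd));

        if (osd == NULL)
                return NULL;
        osd->od_private = private_value;
        return &osd->od_lu_dev;
}

/* Plays the role of ->ldto_device_free(): release the whole structure. */
void my_device_free(struct lu_device *d)
{
        free(lu2osd(d));
}
```

The same pattern applies to objects, where the generic object is embedded into the OSD object structure.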
-       \r
-\r
-Device Storage Operations\r
-\r
-\r
-int   (*dt_statfs)(const struct lu_env *env, struct dt_device *dev, struct obd_statfs *osfs);\r
-struct thandle *(*dt_trans_create)(const struct lu_env *env, struct dt_device *dev);\r
-int   (*dt_trans_start)(const struct lu_env *env, struct dt_device *dev, struct thandle *th);\r
-int   (*dt_trans_stop)(const struct lu_env *env, struct thandle *th);\r
-int   (*dt_trans_cb_add)(struct thandle *th, struct dt_txn_commit_cb *dcb);\r
-int   (*dt_root_get)(const struct lu_env *env, struct dt_device *dev, struct lu_fid *f);\r
-void  (*dt_conf_get)(const struct lu_env *env, const struct dt_device *dev,\r
-                             struct dt_device_param *param);\r
-int   (*dt_sync)(const struct lu_env *env, struct dt_device *dev);\r
-int   (*dt_ro)(const struct lu_env *env, struct dt_device *dev);\r
-int   (*dt_commit_async)(const struct lu_env *env, struct dt_device *dev);\r
-int   (*dt_init_capa_ctxt)(const struct lu_env *env, struct dt_device *dev,\r
-                                       int mode, unsigned long timeout,\r
-                                       __u32 alg, struct lustre_capa_key *keys);\r
-\r
-\r
-\r
-\r
-dt_statfs\r
-       called to report current file system usage information: total, free and available blocks/objects.\r
-dt_trans_create\r
-       called to allocate/initialize a transaction handle.\r
-dt_trans_start\r
-       called to start a transaction with a specific transaction handle.\r
-dt_trans_stop\r
-       called to stop a transaction; at this point the transaction is considered complete and the OSD/backend can start writeout, preserving atomicity.\r
-dt_trans_cb_add\r
-       called to assign a commit callback to the specified transaction handle. Used by recovery functionality and the sequence manager.\r
-dt_root_get\r
-       called to get the FID of the root object. Used to follow backend file system rules and to keep the backend file system in a state where users can mount it directly (with ldiskfs/zfs/etc).\r
-dt_sync\r
-       called to flush all completed but not yet written transactions. Should block until the flush is completed.\r
-dt_ro\r
-       called to turn the backend into read-only mode. Used by the testing infrastructure to simulate recovery cases.\r
-dt_commit_async\r
-       called to notify the OSD/backend that a higher level needs transactions to be flushed as soon as possible. Used by the Commit-on-Share feature. Should return immediately and not block for long.\r
-dt_init_capa_ctxt\r
-       called to initialize the context for capabilities. Not in use currently.\r
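The transaction methods above are used in a fixed order: create a handle, register commit callbacks, start, perform the updates, stop. The sketch below mimics that order with a hypothetical struct mock_thandle and mock_trans_* helpers that are not Lustre functions. As a simplifying assumption it "commits" synchronously inside mock_trans_stop(); a real backend commits later and runs the callbacks only once the transaction is on persistent storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mock_thandle;
typedef void (*commit_cb_t)(struct mock_thandle *th, void *data);

#define MOCK_MAX_CB 4

/* Hypothetical transaction handle holding its commit callbacks. */
struct mock_thandle {
        int         started;
        int         ncb;
        commit_cb_t cb[MOCK_MAX_CB];
        void       *cb_data[MOCK_MAX_CB];
};

/* Like dt_trans_create: allocate and zero a handle. */
struct mock_thandle *mock_trans_create(void)
{
        return calloc(1, sizeof(struct mock_thandle));
}

/* Like dt_trans_cb_add: remember a callback to run after commit. */
int mock_trans_cb_add(struct mock_thandle *th, commit_cb_t cb, void *data)
{
        assert(th != NULL && cb != NULL);
        if (th->ncb >= MOCK_MAX_CB)
                return -1;
        th->cb[th->ncb] = cb;
        th->cb_data[th->ncb] = data;
        th->ncb++;
        return 0;
}

/* Like dt_trans_start: after this point updates may be executed. */
int mock_trans_start(struct mock_thandle *th)
{
        assert(th != NULL);
        if (th->started)
                return -1;
        th->started = 1;
        return 0;
}

/* Like dt_trans_stop, except that the commit happens immediately here,
 * so the callbacks run before the handle is released. */
int mock_trans_stop(struct mock_thandle *th)
{
        int i, rc = 0;

        assert(th != NULL);
        if (!th->started)
                rc = -1;        /* never started: nothing was committed */
        else
                for (i = 0; i < th->ncb; i++)
                        th->cb[i](th, th->cb_data[i]);
        free(th);
        return rc;
}

/* Example callback: count how many transactions have committed. */
void count_commit(struct mock_thandle *th, void *data)
{
        (void)th;
        (*(int *)data)++;
}
```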
-       \r
-\r
-\r
-\r
-Objects\r
-\r
-\r
-Every object is represented by a header (struct lu_object_header) and a so-called slice on every layer of the stack. Core Lustre code maintains a cache of objects (the so-called lu-site, see struct lu_site), which is very similar to the Linux inode cache.\r
-\r
-\r
-Object lifetime\r
-\r
-\r
-An in-core object is created when a high-level service needs it to process an RPC or do some background job like LFSCK. The FID of the object is supposed to be known before the object is created. The FID can come from an RPC or from disk. Given the FID, the lu_object_find() function is called; it searches for the object in the cache (see struct lu_site) and, if the object is not found, creates it using the ->ldo_object_alloc(), ->loo_object_init() and ->loo_object_start() methods.\r
-Objects are referenced and tracked by the Lustre core. If an object is not in use, it’s put on an LRU list and at some point (subject to internal caching policy or memory pressure callbacks from the kernel) Lustre schedules the object for removal from the cache. To do so, the Lustre core marks the object as going away and calls ->loo_object_release() and ->loo_object_free(), iterating over all the layers involved.\r
-\r
-\r
-Locking on the objects and consistency\r
-\r
-\r
-The OSD is expected to maintain the internal consistency of the file system and its objects on its own, requiring no additional locking or serialization from higher levels. This lets the OSD control how fine-grained the locking is, depending on the internal structure of a specific file system.  If several updates conflict, the result is not defined by the OSD API and is left to the OSD.\r
-\r
-\r
-The OSD should provide the caller with a few methods to serialize access to an object in shared and exclusive mode. It’s up to the caller how to use them and to define the locking order. In general the locks provided by the OSD are used to group complex updates so that other threads do not see intermediate results of operations.\r
-Object Management Methods\r
-\r
-\r
-Object management methods are called by Lustre to manipulate OSD-specific (private) data associated with a specific object during the lifetime of the object. They are described in struct lu_object_operations. To allocate an object, the device’s ->ldo_object_alloc() method is used. It should allocate the object and initialize its methods.\r
-\r
-\r
-\r
-\r
-int (*loo_object_init)(const struct lu_env *env, struct lu_object *o,const struct lu_object_conf *);\r
-int (*loo_object_start)(const struct lu_env *env, struct lu_object *o);\r
-void (*loo_object_delete)(const struct lu_env *env, struct lu_object *o);\r
-void (*loo_object_free)(const struct lu_env *env, struct lu_object *o);\r
-void (*loo_object_release)(const struct lu_env *env, struct lu_object *o);\r
-int (*loo_object_print)(const struct lu_env *env, void *, lu_printer_t p, const struct lu_object *o);\r
-int (*loo_object_invariant)(const struct lu_object *o);\r
-\r
-\r
-\r
-\r
-loo_object_init\r
-       This method is called when a new object is being created (see lu_object_alloc()). Its purpose is to initialize the object’s internals; usually the file system looks up the object on disk (notice that a header storing the FID has already been created by a top device) using the Object Index (OI), which maps a FID to a local object id like a dnode. LOC_F_NEW can be passed to the method when the caller knows the object is new, so that the OSD can skip the OI lookup to improve performance.\r
-loo_object_start\r
-       The method is called when all the structures and the header are initialized. Currently used by high-level services as a post-init procedure (i.e. to set up their own methods depending on the object type, which is brought into the header by the OSD’s ->loo_object_init()).\r
-loo_object_delete\r
-       is called to let the OSD release resources behind an object (except memory allocated for the object), like releasing the file system’s inode. It’s separated from ->loo_object_free() to be able to iterate over still-existing objects. The main purpose of separating ->loo_object_delete() and ->loo_object_free() is to avoid recursion during potentially stack-consuming resource release.\r
-loo_object_free\r
-       is called to actually release memory allocated by ->ldo_object_alloc().\r
-loo_object_release\r
-       is called when an object loses its last user and moves onto the LRU list of unused objects. Implementation of this method is optional for the OSD.\r
-loo_object_print\r
-       is used for debugging purposes; it should output details of an object in human-readable format. Details usually include information like the address of the object, the local object number (dnode/inode), the type of the object, etc.\r
-loo_object_invariant\r
-       another optional method for debugging purposes, which is called to verify the internal consistency of an object.\r
-       \r
-\r
-Object attributes\r
-\r
-\r
-The OSD object should be able to store normal POSIX attributes with every object as specified by Lustre:\r
-- user ID (32 bits)\r
-- group ID (32 bits)\r
-- object type (16 bits)\r
-- access mode (16 bits)\r
-- metadata change time (96 bits, 64-bit seconds, 32-bit nanoseconds)\r
-- data modification time (96 bits, 64-bit seconds, 32-bit nanoseconds)\r
-- data access time (96 bits, 64-bit seconds, 32-bit nanoseconds)\r
-- creation time (96 bits, 64-bit seconds, 32-bit nanoseconds, optional)\r
-- object size (64 bits)\r
-- link count (32 bits)\r
-- flags (32 bits)\r
-- object version (64 bits)\r
-\r
-\r
-It’s up to OSD and disk file system where to store these attributes. Regular disk file systems usually provide a space for that (inode/dnode).\r
-\r
-\r
-The OSD object shall not modify these attributes itself; all the attributes are controlled by the caller. The only exception is the attribute storing the space occupied by the object, its data and metadata. Lustre cannot track this properly, so it’s the responsibility of the OSD or the disk file system to maintain this attribute. This is required for the Lustre quota mechanism. The OSD should be able to disable or work around the quota enforcement of the disk file system.\r
-\r
-\r
-The OSD should provide a mechanism to store extended named attributes. Limits on the size of names and values should be provided by the ->dt_conf_get() method, as they depend on the specific file system. To perform well, it’s recommended that the OSD store frequently used attributes in the object (or close to it) so that they can be accessed with a single disk I/O. At the moment the frequently used attributes are:\r
-1. "system.posix_acl_access" - stores ACLs\r
-2. "trusted.lov"             - stores striping information\r
-3. "trusted.version"         - stores version of object\r
-\r
-\r
-\r
-\r
-Object Common Storage Methods\r
-\r
-\r
-\r
-\r
-void  (*do_read_lock)(const struct lu_env *env, struct dt_object *dt, unsigned role);\r
-void  (*do_write_lock)(const struct lu_env *env, struct dt_object *dt, unsigned role);\r
-void  (*do_read_unlock)(const struct lu_env *env, struct dt_object *dt);\r
-void  (*do_write_unlock)(const struct lu_env *env, struct dt_object *dt);\r
-int  (*do_write_locked)(const struct lu_env *env, struct dt_object *dt);\r
-int   (*do_attr_get)(const struct lu_env *env, struct dt_object *dt, struct lu_attr *attr,\r
-                             struct lustre_capa *capa);\r
-int   (*do_declare_attr_set)(const struct lu_env *env, struct dt_object *dt,\r
-                                     const struct lu_attr *attr, struct thandle *handle);\r
-int   (*do_attr_set)(const struct lu_env *env, struct dt_object *dt, const struct lu_attr *attr,\r
-                              struct thandle *handle, struct lustre_capa *capa);\r
-int   (*do_xattr_get)(const struct lu_env *env, struct dt_object *dt,\r
-                              struct lu_buf *buf, const char *name, struct lustre_capa *capa);\r
-int   (*do_declare_xattr_set)(const struct lu_env *env, struct dt_object *dt,\r
-          const struct lu_buf *buf, const char *name, int fl,\r
-                                              struct thandle *handle);\r
-int   (*do_xattr_set)(const struct lu_env *env, struct dt_object *dt, const struct lu_buf *buf,\r
-                              const char *name, int fl, struct thandle *handle, struct lustre_capa *capa);\r
-int   (*do_declare_xattr_del)(const struct lu_env *env, struct dt_object *dt,\r
-                                             const char *name, struct thandle *handle);\r
-int   (*do_xattr_del)(const struct lu_env *env, struct dt_object *dt, const char *name,\r
-                               struct thandle *handle, struct lustre_capa *capa);\r
-int   (*do_xattr_list)(const struct lu_env *env, struct dt_object *dt, struct lu_buf *buf,\r
-                                struct lustre_capa *capa);\r
-void  (*do_ah_init)(const struct lu_env *env, struct dt_allocation_hint *ah,\r
-                               struct dt_object *parent, struct dt_object *child, cfs_umode_t child_mode);\r
-int   (*do_declare_create)(const struct lu_env *env, struct dt_object *dt, struct lu_attr *attr,\r
-                                          struct dt_allocation_hint *hint,  struct dt_object_format *dof,\r
-                                          struct thandle *th);\r
-int   (*do_create)(const struct lu_env *env, struct dt_object *dt, struct lu_attr *attr,\r
-                            struct dt_allocation_hint *hint, struct dt_object_format *dof,\r
-                            struct thandle *th);\r
-int   (*do_declare_destroy)(const struct lu_env *env, struct dt_object *dt, struct thandle *th);\r
-int   (*do_destroy)(const struct lu_env *env, struct dt_object *dt, struct thandle *th);\r
-int   (*do_index_try)(const struct lu_env *env, struct dt_object *dt, \r
-                                 const struct dt_index_features *feat);\r
-int   (*do_declare_ref_add)(const struct lu_env *env, struct dt_object *dt, struct thandle *th);\r
-int   (*do_ref_add)(const struct lu_env *env, struct dt_object *dt, struct thandle *th);\r
-int   (*do_declare_ref_del)(const struct lu_env *env, struct dt_object *dt, struct thandle *th);\r
-int   (*do_ref_del)(const struct lu_env *env, struct dt_object *dt, struct thandle *th);\r
-struct obd_capa *(*do_capa_get)(const struct lu_env *env, struct dt_object *dt,\r
-                              struct lustre_capa *old, __u64 opc);\r
-int (*do_object_sync)(const struct lu_env *, struct dt_object *);\r
-\r
-\r
-\r
-\r
-\r
-\r
-do_read_lock\r
-       get a shared lock on the object, this is a blocking lock.\r
-       do_write_lock\r
-       get an exclusive lock on the object, this is a blocking lock.\r
-       do_read_unlock\r
-       release a shared lock on an object.\r
-       do_write_unlock\r
-       release an exclusive lock on an object.\r
-       do_write_locked\r
-       check whether an object is exclusive-locked.\r
-       do_attr_get\r
-       the method is called to get regular attributes an object stores.\r
-       do_declare_attr_set\r
-       the method is called to notify OSD the caller is going to modify regular attributes of an object in specified transaction. OSD should use this method to reserve resources needed to change attributes.\r
-       do_attr_set\r
-       the method is called to change attributes of an object.\r
-       do_xattr_get\r
-       called when the caller needs to get an extended attribute with a specified name\r
-       do_declare_xattr_set\r
-       called to notify OSD the caller is going to set/change an extended attribute on an object. OSD should use this method to reserve resources needed to change an attribute.\r
-       do_xattr_set\r
-       called when the caller needs to change an extended attribute with specified name.\r
-       do_declare_xattr_del\r
-       called to notify OSD the caller is going to remove an extended attribute with a specified name\r
-       do_xattr_del\r
-       called when the caller needs to remove an extended attribute with a specified name\r
-       do_xattr_list\r
-       called when the caller needs to get a list of existing extended attributes (only names of attributes are returned).\r
-       do_ah_init\r
-       called to let OSD prepare an allocation hint which stores information about object locality and type. Later this allocation hint is passed to the ->do_create() method and OSD can use this information to optimize on-disk object location. The allocation hint is opaque to the caller and can contain OSD-specific information.\r
-       do_declare_create\r
-       called to notify OSD the caller is going to create a new object in a specified transaction.\r
-       do_create\r
-       called to create an object on the OSD in a specified transaction. for index objects the caller can request a set of index properties (like key/value size). if OSD can not support requested properties, it should return an error.\r
-       do_declare_destroy\r
-       called to notify OSD the caller is going to destroy an object in a specified transaction.\r
-       do_destroy\r
-       called to destroy an object in a specified transaction.\r
-       do_index_try\r
-       called when the caller needs to use an object as an index (the object must have been created as an index beforehand). The caller also specifies the set of properties it expects the index to support.\r
-       do_declare_ref_add\r
-       called to notify OSD the caller is going to increment nlink attribute in a specified transaction.\r
-       do_ref_add\r
-       called to increment nlink attribute in a specified transaction.\r
-       do_declare_ref_del\r
-       called to notify OSD the caller is going to decrement nlink attribute in a specified transaction.\r
-       do_ref_del\r
-       called to decrement nlink attribute in a specified transaction.\r
-       do_capa_get\r
-       called to get a capability for a specified object. not used currently.\r
-       do_object_sync\r
-       called to flush a given object to disk. It’s a fine-grained version of the ->do_sync() method which should make sure an object is stored on disk. OSD (or the backend file system) can track the status of every object and, if an object is already flushed, the method can simply return immediately. The method is used on OSS now, but can also be used on MDS at some point to improve performance of COS.\r
-       do_data_get\r
-       the method is not used any more and planned for removal.\r
-       \r
-\r
-\r
-\r
-\r
-\r
-Data object operation\r
-\r
-\r
-ssize_t (*dbo_read)(const struct lu_env *env, struct dt_object *dt, struct lu_buf *buf, loff_t *pos,\r
-                                     struct lustre_capa *capa);\r
-ssize_t (*dbo_declare_write)(const struct lu_env *env, struct dt_object *dt,\r
-                                              const loff_t size, loff_t pos,struct thandle *handle);\r
-ssize_t (*dbo_write)(const struct lu_env *env, struct dt_object *dt, const struct lu_buf *buf,\r
-         loff_t *pos, struct thandle *handle, struct lustre_capa *capa, int ign_quota);\r
-int (*dbo_bufs_get)(const struct lu_env *env, struct dt_object *dt, loff_t pos, ssize_t len,\r
-       struct niobuf_local *lb, int rw, struct lustre_capa *capa);\r
-int (*dbo_bufs_put)(const struct lu_env *env, struct dt_object *dt,struct niobuf_local *lb, int nr);\r
-int (*dbo_write_prep)(const struct lu_env *env, struct dt_object *dt,struct niobuf_local *lb, int nr);\r
-int (*dbo_declare_write_commit)(const struct lu_env *env, struct dt_object *dt,\r
-     struct niobuf_local *,int, struct thandle *);\r
-int (*dbo_write_commit)(const struct lu_env *env, struct dt_object *dt, struct niobuf_local *, \r
-   int nr, struct thandle *);\r
-int (*dbo_read_prep)(const struct lu_env *env, struct dt_object *dt,  struct niobuf_local *, int nr);\r
-int (*dbo_fiemap_get)(const struct lu_env *env, struct dt_object *dt, struct ll_user_fiemap *fm);\r
-int   (*do_declare_punch)(const struct lu_env*,struct dt_object *,__u64,__u64,struct thandle *th);\r
-int   (*do_punch)(const struct lu_env *env, struct dt_object *dt,__u64 start, __u64 end, struct\r
-                            thandle *th, struct lustre_capa *capa);\r
-\r
-\r
-\r
-\r
-\r
-\r
-dbo_read\r
-       is called to read raw unstructured data from a specified range of an object. Returns the number of bytes read or an error. Usually OSD implements this method using internal buffering (to be able to put data at a non-aligned address), so this method should not be used to move a lot of data. Lustre services use it to read small internal data like the last_rcvd file and llog files. It's also used to fetch symlink bodies.\r
-       dbo_declare_write\r
-       is called to notify OSD the caller will be writing data to a specific range of an object in a specified transaction.\r
-       dbo_write\r
-       is called to write raw unstructured data to a specified range of an object in a specified transaction. data should be written atomically with another change in the transaction. the method is used by Lustre services to update small portions on a disk. OSD should maintain size attribute consistent with data written. \r
-       dbo_bufs_get\r
-       is called to fill memory with buffer descriptors (see struct niobuf_local) for a specified range of an object. Memory for the set is provided by the caller; no concurrent access to this memory is allowed. OSD can fill all fields of the descriptor except lnb_grant_used. The caller specifies whether the buffers will be used to read or write data. This method is used to access the file system's internal buffers for zero-copy IO. Internal buffers referenced by descriptors are supposed to be pinned in memory.\r
-       dbo_bufs_put\r
-       is called to unpin/release internal buffers referenced by the descriptors dbo_bufs_get returns. after this point pointers in the descriptors are not valid.\r
-       dbo_write_prep\r
-       is called to fill internal buffers with actual data. This is required for buffers which do not match the filesystem block size, as the buffer is later supposed to be written as a whole. For example, ldiskfs uses 4k blocks, but the caller wants to update just half of one. To prevent data corruption, when this method is called OSD compares the range to be written with the 4k block: if they do not match, OSD fetches the existing data from disk; if they do match, all the data will be overwritten and there is no need to fetch anything from disk.\r
-       dbo_declare_write_commit\r
-       is called to notify OSD the caller is going to write internal buffers and OSD needs to reserve enough resource in a transaction. \r
-       dbo_write_commit\r
-       is called to actually make data in internal buffers part of a specified transaction. Data is supposed to be written by the moment the transaction is considered committed. This is slightly different from the generic transaction model because in this case it's allowed to have data written but the transaction not committed. If dbo_write_commit is not called, then dbo_bufs_put should discard the internal buffers, and any changes made to them should not be visible.\r
-       dbo_read_prep\r
-       is called to fill all internal buffers referenced by descriptors with actual data. buffers may already contain valid data (be cached), so OSD can just verify the data is valid and return immediately. \r
-       dbo_fiemap_get\r
-       is called to map a logical range of an object to the physical blocks where the corresponding range of data is actually stored.\r
-       dbo_declare_punch\r
-       is called to notify OSD the caller is going to punch (deallocate) specified range in a transaction.\r
-       dbo_punch\r
-       is called to punch (deallocate) a specified range of data in a transaction. This method is allowed to use a few disk file system transactions (within the same Lustre transaction handle). Currently Lustre calls the method only in the form of a truncate, where the end offset is always EOF.\r
-       \r
-\r
-Indices\r
-\r
-\r
-Another set of objects in Lustre is indices.\r
-\r
-\r
-In contrast with raw unstructured data, they are collections of key=value pairs. OSD should provide a few methods to look up, insert, delete and scan pairs. Indices may have different properties like key/value size, string/binary keys, etc. When a user needs to use an index, it must first check with a special method whether the index has the required properties. Indices are used by Lustre services to maintain the user-visible namespace, FLD, the index of unlinked files, etc.\r
-\r
-\r
-int (*dio_lookup)(const struct lu_env *env, struct dt_object *dt, struct dt_rec *rec,\r
-   const struct dt_key *key, struct lustre_capa *capa);\r
-int (*dio_declare_insert)(const struct lu_env *env, struct dt_object *dt, const struct dt_rec *rec,\r
-    const struct dt_key *key, struct thandle *handle);\r
-int (*dio_insert)(const struct lu_env *env, struct dt_object *dt, const struct dt_rec *rec,\r
- const struct dt_key *key, struct thandle *handle, struct lustre_capa *capa,\r
-int ignore_quota);\r
-int (*dio_declare_delete)(const struct lu_env *env, struct dt_object *dt,\r
-    const struct dt_key *key, struct thandle *handle);\r
-int (*dio_delete)(const struct lu_env *env, struct dt_object *dt, const struct dt_key *key,\r
-  struct thandle *handle, struct lustre_capa *capa); \r
-\r
-\r
-\r
-\r
-\r
-\r
-dio_lookup\r
-       is called to look up an exact key=value pair. The value is copied into a buffer provided by the caller, so the caller should make sure the buffer is big enough. This should be done with the ->do_index_try() method.\r
-       dio_declare_insert\r
-       is called to notify OSD the caller is going to insert a key=value pair in a transaction. The exact key is specified by the caller so OSD can use this to make the reservation better (i.e. smaller).\r
-       dio_insert\r
-       is called to insert key/value pair into an index object. it's up to OSD whether to allow concurrent inserts or not. the caller is not required to serialize access to an index\r
-       dio_declare_delete\r
-       is called to notify OSD the caller is going to remove a specified key\r
-in a transaction. exact key is specified by a caller so OSD can use this\r
-to make reservation better.\r
-       dio_delete\r
-       is called to remove a key/value pair specified by a caller.\r
-       \r
-\r
-To iterate over all key=value pairs stored in an index, OSD should provide the following set of methods:\r
-\r
-\r
-struct dt_it *(*init)(const struct lu_env *env, struct dt_object *dt, __u32 attr,\r
-     struct lustre_capa *capa);\r
-void    (*fini)(const struct lu_env *env, struct dt_it *di);\r
-int       (*get)(const struct lu_env *env, struct dt_it *di, const struct dt_key *key);\r
-void    (*put)(const struct lu_env *env, struct dt_it *di);\r
-int       (*next)(const struct lu_env *env, struct dt_it *di);\r
-struct dt_key *(*key)(const struct lu_env *env, const struct dt_it *di);\r
-int       (*key_size)(const struct lu_env *env, const struct dt_it *di);\r
-int       (*rec)(const struct lu_env *env, const struct dt_it *di, struct dt_rec *rec, __u32 attr);\r
-__u64 (*store)(const struct lu_env *env, const struct dt_it *di);\r
-int       (*load)(const struct lu_env *env, const struct dt_it *di, __u64 hash);\r
-int       (*key_rec)(const struct lu_env *env, const struct dt_it *di, void* key_rec);\r
-\r
-\r
-init\r
-       is called to allocate and initialize an instance of "iterator" which subsequent methods will be passed in. the structure is not accessed by Lustre and its content is totally internal to OSD. Usually it contains a reference to index, current position in an index. It may contain prefetched key/value pairs. It's not required to maintain this cache up-to-date, if index changes this is not required to be reflected by an already initialized iterator. In the extreme case ->init() can prefetch all existing pairs to be returned by subsequent calls to an iterator.\r
-       fini\r
-       is called to release an iterator and all its resources. for example, iterator can unpin an index, free prefetched pairs, etc.\r
-       get\r
-       is called to move an iterator to a specified key. if key does not exist then it should be the closest position from the beginning of iteration.\r
-       put\r
-       \r
-\r
-       next\r
-       is called to move an iterator to a next item\r
-       key\r
-       is called to fill the specified buffer with the key at the current position of the iterator. It’s the caller’s responsibility to pass a big enough buffer. In turn, OSD should not exceed the sizes negotiated with the ->do_index_try() method.\r
-       key_size\r
-       is called to learn size of a key at current position of an iterator\r
-       rec\r
-       is called to fill the specified buffer with the value at the current position of the iterator. It’s the caller’s responsibility to pass a big enough buffer. In turn, OSD should not exceed the sizes negotiated with the ->do_index_try() method.\r
-       store\r
-       is called to get a 64bit cookie of a current position of an iterator.\r
-       load\r
-       is called to reset the current position of an iterator to match the 64-bit cookie that the ->store() method returns. These two methods make it possible to implement functionality like POSIX readdir, where the current position is stored as an integer.\r
-       key_rec\r
-       is not used currently\r
-       \r
-\r
-\r
-\r
-Transactions\r
-\r
-\r
-Transactions are used by Lustre to implement the recovery protocol and support failover.  The main purpose of transactions is to atomically update the backend file system. This includes both regular changes (file creation, for example) and special Lustre changes (last_rcvd file, lastid, llogs). OSD is supposed to provide the transactional mechanism and let Lustre control which specific updates to put into transactions.\r
-\r
-\r
-Lustre relies on the following rule for transaction ordering:  If transaction T1 starts before transaction T2 starts, then the commit of T2 means that T1 is committed at the same time or earlier. Notice that the creation of a transaction does not imply the immediate start of the updates on storage; do not confuse creation of a transaction with start of a transaction.\r
-\r
-\r
-It’s up to OSD and the backend file system to group several transactions together for better performance, provided the rule above is still followed.\r
-\r
-\r
-Transactions are identified in the OSD API by an opaque transaction handle, which is a pointer to an OSD-private data structure that it can use to track (and optionally verify) the updates done within that transaction.  This handle is returned by the OSD to the caller when the transaction is first created.  Any potential updates (modifications to the underlying storage) must be declared as part of a transaction, after the transaction has been created, and before the transaction is started. The transaction handle is passed when declaring all updates.  If any part of the declaration should fail, the transaction is aborted without having modified the storage.\r
-\r
-\r
-After all updates have been declared, and have completed successfully, the handle is passed to the transaction start.  After the transaction has started, the handle will be passed to every update that is done as part of that transaction.  All updates done under the transaction must previously have been declared. Once the transaction has started, it is not permitted to add new updates to the transaction, nor is it possible to roll back the transaction after this point.  Should some update to the storage fail, the caller will try to undo the previous updates within the context of the transaction itself, to ensure that the resulting OSD state is correct.\r
-\r
-\r
-Any update that was not previously declared is an implementation error in the caller.  Not all declared updates need to be executed, as they form a worst-case superset of the possible updates that may be required in order to complete the desired operation in a consistent manner.\r
-\r
-\r
-OSD should let a caller register callback function(s) to be called on transaction commit to disk. OSD should also be able to call a special set of transaction hooks at every stage (creation, start, stop, commit) on a per-device basis, so that high-level services (like MDT) which do not directly control transactions can still take part. Every commit callback gets the result of the transaction commit: if the disk filesystem was not able to commit the transaction, an appropriate error code is passed.\r
-\r
-\r
-It’s important to note that OSD and disk file system should use asynchronous IO to implement transactions, otherwise the performance is expected to be bad.\r
-\r
-\r
-The maximum number of updates that make up a single transaction is OSD-specific, but is expected to be at least in the tens of updates to multiple objects in the OSD (extending writes of multiple MB of data, modifying or adding attributes, extended attributes, references, etc).     For example, in ext4, each update to the filesystem will modify one or more blocks of storage.  Since one transaction is limited to one quarter of the journal size, if the caller declares a series of updates that modify more than this number of blocks, the declaration must fail or it could not be committed atomically. In general, every constraint must be checked here to ensure that all changes that must commit atomically can complete successfully.\r
-\r
-\r
-Lifetime of a transaction\r
-\r
-\r
-From Lustre point of view a transaction goes through the following steps:\r
-1. creation\r
-2. declaration of all possible changes planned in transaction\r
-3. transaction start\r
-4. execution of planned and declared changes\r
-5. transaction stop\r
-6. commit callback(s) \r
-\r
-\r
-Methods to manage transactions\r
-\r
-\r
-OSD should implement the following methods to let Lustre control transactions:\r
-\r
-struct thandle *(*dt_trans_create)(const struct lu_env *env, struct dt_device *dev);\r
-int   (*dt_trans_start)(const struct lu_env *env, struct dt_device *dev, struct thandle *th);\r
-int   (*dt_trans_stop)(const struct lu_env *env, struct thandle *th);\r
-int   (*dt_trans_cb_add)(struct thandle *th, struct dt_txn_commit_cb *dcb);\r
-\r
-\r
-dt_trans_create\r
-       is called to allocate and initialize transaction handle (see struct thandle). this structure has no pointer to a private data so, it should be embedded into private representation of transaction at OSD layer. this method can block.\r
-       dt_trans_start\r
-       is called to notify OSD that a specified transaction has received all its declarations; OSD should now tell whether it has enough resources to proceed with the declared changes, or return an error to the caller. This method can block. OSD should call the dt_txn_hook_start() function before the underlying file system’s transaction starts, to support per-device transaction hooks. If OSD (or the disk file system) cannot start the transaction, an error is returned and the transaction handle is destroyed; no commit callbacks are called.\r
-       dt_trans_stop\r
-       is called to notify OSD that a specified transaction has been executed and no more changes are expected in its context. Usually this means that at this point OSD is free to start writeout, preserving the all-or-nothing notion. This method can block. If the th_sync flag is set at this point, OSD should start to commit this transaction and block until the transaction is committed. The order of the unblock event and the transaction’s commit callback functions is not defined by the API. OSD should call the dt_txn_hook_stop() function once the underlying file system’s transaction is stopped, to support per-device transaction hooks.\r
-       dt_trans_cb_add\r
-       is called to register commit callback function(s), which OSD will call upon transaction commit to storage. When all the callback functions have been processed, the transaction handle can be freed by OSD. There are no constraints on how many callback functions can be running concurrently. They should not run in interrupt context. Usually this method should not block and should use spinlocks. As part of commit callback processing, the dt_txn_hook_commit() function should be called to support per-device transaction hooks.
-\ No newline at end of file
diff --git a/lustre/doc/osd-api.txt b/lustre/doc/osd-api.txt

new file mode 100644 (file)

index 0000000..a039e45
--- /dev/null
+++ b/lustre/doc/osd-api.txt
@@ -0,0 +1,1310 @@
+               ****************************************************
+               * Overview of the Lustre Object Storage Device API *
+               ****************************************************
+
+Original Authors:
+=================
+Alex    Zhuravlev <alexey.zhuravlev@intel.com>
+Andreas Dilger    <andreas.dilger@intel.com>
+Johann  Lombardi  <johann.lombardi@intel.com>
+Li      Wei       <wei.g.li@intel.com>
+Niu     Yawei     <yawei.niu@intel.com>
+
+Last Updated: October 9, 2012
+
+Copyright © 2012 Intel, Corp.
+
+This file is released under the GPLv2.
+
+Topics
+======
+
+I.   Introduction
+       1. What OSD API is
+       2. What OSD API is Not
+       3. Layering
+       4. Audience/Goal
+II.  Backend Storage Subsystem Requirements
+       1. Atomicity of Updates
+       2. Object Attributes
+               i.  Standard POSIX Attributes
+               ii. Extended Attributes
+       3. Efficient Index
+       4. Commit Callbacks
+       5. Space Accounting
+III. OSD & LU Infrastructure
+       1. Devices
+               i.   Device Overview
+               ii.  Device Type & Operations
+               iii. Device Operations
+               iv.  OBD Methods
+       2. Objects
+               i.   Object Overview
+               ii.  Object Lifecycle
+               iii. Special Objects
+               iv.  Object Operations
+       3. Lustre Environment
+IV.  Data (DT) API
+       1. Data Device
+       2. Data Objects
+               i.   Common Storage Operations
+               ii.  Data Object Operations
+               iii. Indice Operations
+       3. Transactions
+               i.   Description
+               ii.  Lifetime
+               iii. Methods
+       4. Locking
+               i.   Description
+               ii.  Methods
+V.   Quota Enforcement
+       1. Overview
+       2. QSD API
+Appendix 1. A brief note on Lustre configuration.
+Appendix 2. Sample Code
+
+===================
+= I. Introduction =
+===================
+
+1. What OSD API is
+==================
+
+OSD API is the interface to access and modify data that is supposed to be stored
+persistently. This API layer is the interface to code that bridges individual
+file systems such as ext4 or ZFS to Lustre.
+The API is a generic interface to transaction and journaling based file systems
+so many backend file systems can be supported in a Lustre implementation.
+Data can be cached within the OSD or backend target and could be destroyed
+before hitting storage, but in general the final target is persistent storage.
+This API creates many possibilities, including using object-storage devices or
+other new persistent storage technologies.
+
+2. What OSD API is Not
+======================
+
+OSD API should not be used to control in-core-only state (like ldlm locking),
+configuration, etc. The upper layers of the IO/metadata stack should not be
+involved with the underlying layout or allocation in the OSD storage.
+
+3. Layering
+===========
+
+Lustre is composed of different kernel modules, each implementing different
+layers in the software stack in an object-oriented approach. Generally, each
+layer builds (or stacks) upon another, and each object is a child of the
+generic LU object class. Hence the term "LU stack" is often used to reference
+this hierarchy of Lustre modules and objects.
+
+Each layer (i.e. mdt/mdd/lod/osp/ofd/osd) defines its own generic items
+(lu_object/lu_device), which are gathered into compound items (lu_site/
+lu_object_layer) representing the multi-layered stack. Different classes of
+operations can then be implemented by each layer, depending on its nature.
+
+As a result, each OSD is expected to implement:
+- the generic LU API used to manage the device stack and objects (see chapter
+  III)
+- the DT API (most commonly called OSD API) used to manipulate on-disk
+  structures (see chapter IV).
+
+4. Audience/Goal
+================
+
+The goal of this document is to provide the reader with the information
+necessary to accurately construct a new Object Storage Device (OSD) module
+interface layer for Lustre in order to use a new backend file system with
+Lustre 2.4 and greater.
+
+==============================================
+= II. Backend Storage Subsystem Requirements =
+==============================================
+
+The purpose of this section is to gather the requirements for the storage
+subsystems below the OSD API.
+
+1. Atomicity of Updates
+=======================
+
+The underlying OSD storage must be able to provide some form of atomic commit
+for multiple arbitrary updates to OSD storage within a single transaction.
+The OSD storage will always know, before the transaction starts, which objects
+will be modified, and how they will be modified.
+
+If any of the updates associated with a transaction are stored persistently
+(i.e. some state in the OSD is modified), then all of the updates in that
+transaction must also be stored persistently (Atomic). If the OSD should fail
+in some manner that prevents all the updates of a transaction from being
+completed, then none of the updates shall be completed (Consistent).
+Once the updates have been reported committed to the caller (i.e. commit
+callbacks have been run), they cannot be rolled back for any reason (Durable).
+
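The all-or-nothing requirement above can be illustrated with a minimal sketch.
It assumes a toy in-memory store; every name in it (toy_store, toy_txn,
toy_txn_add, toy_txn_commit) is hypothetical and not part of Lustre. Updates
are queued in the transaction and built up in a private copy, which is
published in a single step only if every update succeeded. A real backend
would typically rely on a journal or copy-on-write rather than a full copy.

```c
#include <assert.h>

#define TOY_NOBJ   4    /* objects in the toy store */
#define TOY_MAXUPD 8    /* most updates one transaction may carry */

struct toy_store { long val[TOY_NOBJ]; };

struct toy_update { int obj; long newval; };

struct toy_txn {
        struct toy_update upd[TOY_MAXUPD];
        int               nr;
};

/* Queue an update; nothing reaches the store yet. */
int toy_txn_add(struct toy_txn *t, int obj, long newval)
{
        if (t->nr >= TOY_MAXUPD || obj < 0 || obj >= TOY_NOBJ)
                return -1;
        t->upd[t->nr].obj = obj;
        t->upd[t->nr].newval = newval;
        t->nr++;
        return 0;
}

/*
 * Commit: apply the updates to a private copy, then publish the copy in
 * one step. If fail_at names an update index, a failure is simulated
 * there and the store is left untouched. Pass -1 for no failure.
 */
int toy_txn_commit(struct toy_store *s, const struct toy_txn *t, int fail_at)
{
        struct toy_store shadow = *s;
        int i;

        for (i = 0; i < t->nr; i++) {
                if (i == fail_at)
                        return -1;      /* none of the updates is visible */
                shadow.val[t->upd[i].obj] = t->upd[i].newval;
        }
        *s = shadow;                    /* all of the updates are visible */
        return 0;
}
```

A failed commit leaves every object at its old value, while a successful one
exposes all queued updates together.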
+2. Object Attributes
+====================
+
+i. Standard POSIX Attributes
+----------------------------
+The OSD object should be able to store normal POSIX attributes on each object
+as specified by Lustre:
+- user ID (32 bits)
+- group ID (32 bits)
+- object type (16 bits)
+- access mode (16 bits)
+- metadata change time (96 bits, 64-bit seconds, 32-bit nanoseconds)
+- data modification time (96 bits, 64-bit seconds, 32-bit nanoseconds)
+- data access time (96 bits, 64-bit seconds, 32-bit nanoseconds)
+- creation time (96 bits, 64-bit seconds, 32-bit nanoseconds, optional)
+- object size (64 bits)
+- link count (32 bits)
+- flags (32 bits)
+- object version (64 bits)
+
+The OSD object shall not modify these attributes itself.
+
+In addition, it is desirable to track the object allocation size (“blocks”),
+which the OSD manages itself. Lustre will query the object allocation size, but
+will never modify it. If these attributes are not managed by the OSD natively as
+part of the object itself, they can be stored in an extended attribute
+associated with the object.
+
+ii. Extended Attributes
+------------------------
+The OSD should have an efficient mechanism for storing small extended attributes
+with each object. This implies that the extended attributes can be accessed at
+the same time as the object (without extra seek/read operations). There is also
+a requirement to store larger extended attributes in some cases (over 1kB in
+size), but access to such attributes may be slower, in proportion to the
+attribute size.
+
+3. Efficient Index
+==================
+
+The OSD must provide a mechanism for efficient key=value retrieval, for both
+fixed-length and variable-length keys and values. It is expected that an index
+may hold tens of millions of keys, and must be able to do random key lookups
+in an efficient manner. It must also provide a mechanism for iterating over all
+of the keys in a particular index and returning these to the caller in a
+consistent order across multiple calls. It must be able to provide a cookie that
+identifies the current position of the iteration within the index, and must be
+able to continue iteration from that position at a later time.
+
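The cookie requirement can be sketched as follows. This is a toy model under
stated assumptions, not the OSD iterator API: the index is a sorted array of
64-bit keys, the cookie is simply the key at the current position, and keys
are assumed to be smaller than UINT64_MAX, which is reserved here to mean
"past the end". All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy index: keys kept sorted, so iteration order is stable across calls. */
struct toy_index {
        const uint64_t *keys;   /* sorted ascending, no duplicates */
        size_t          nr;
};

struct toy_it {
        const struct toy_index *idx;
        size_t                  pos;    /* idx->nr means "past the end" */
};

void toy_it_init(struct toy_it *it, const struct toy_index *idx)
{
        it->idx = idx;
        it->pos = 0;
}

/* Return 1 and the current key, or 0 at the end of the index. */
int toy_it_key(const struct toy_it *it, uint64_t *key)
{
        if (it->pos >= it->idx->nr)
                return 0;
        *key = it->idx->keys[it->pos];
        return 1;
}

void toy_it_next(struct toy_it *it)
{
        if (it->pos < it->idx->nr)
                it->pos++;
}

/* Cookie = key at the current position (UINT64_MAX once past the end). */
uint64_t toy_it_store(const struct toy_it *it)
{
        uint64_t key;

        return toy_it_key(it, &key) ? key : UINT64_MAX;
}

/* Reposition at the first key >= cookie, using a binary search. */
void toy_it_load(struct toy_it *it, uint64_t cookie)
{
        size_t lo = 0, hi = it->idx->nr;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (it->idx->keys[mid] < cookie)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        it->pos = lo;
}
```

Because the cookie encodes a key rather than an array offset, a second
iterator can resume where the first one stopped even if the key the cookie
named has since been removed: it lands on the next larger key.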
+4. Commit Callbacks
+===================
+
+The OSD must provide some mechanism to register multiple arbitrary callback
+functions for each transaction, and call these functions after the transaction
+with which they are associated has committed to persistent storage.
+It is not required that they be called immediately at transaction commit time,
+but they cannot be delayed an arbitrarily long time, or other parts of the
+system may suffer resource exhaustion. If this mechanism is not implemented by
+the underlying storage, then it needs to be provided in some manner by the OSD
+implementation itself.
+
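One way an OSD might provide this mechanism itself is sketched below, assuming
a fixed-size callback table per transaction. The names (toy_txn_cbs,
toy_cb_add, toy_txn_committed, toy_record_result) are hypothetical, and
passing the commit result to each callback is an assumption of this sketch
rather than something this section requires.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CB 8

typedef void (*toy_commit_cb_t)(void *data, int err);

struct toy_txn_cbs {
        toy_commit_cb_t func[TOY_MAX_CB];
        void           *data[TOY_MAX_CB];
        int             nr;
        int             committed;
};

/* Register a callback; refused once the transaction has committed. */
int toy_cb_add(struct toy_txn_cbs *t, toy_commit_cb_t func, void *data)
{
        if (t->committed || t->nr >= TOY_MAX_CB || func == NULL)
                return -1;
        t->func[t->nr] = func;
        t->data[t->nr] = data;
        t->nr++;
        return 0;
}

/*
 * Called by the toy backend once the transaction is on stable storage
 * (err == 0) or could not be committed (err != 0). Each registered
 * callback runs exactly once.
 */
void toy_txn_committed(struct toy_txn_cbs *t, int err)
{
        int i;

        if (t->committed)
                return;
        t->committed = 1;
        for (i = 0; i < t->nr; i++)
                t->func[i](t->data[i], err);
}

/* Example callback: records the commit result in an int owned by the caller. */
void toy_record_result(void *data, int err)
{
        *(int *)data = err;
}
```

Nothing runs at registration time; the callbacks fire only when the backend
reports the commit, which is what lets callers release resources safely.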
+5. Space Accounting
+===================
+
+In order to provide quota functionality for the OSD, it must be able to track
+the object allocation size against at least two different keys (typically User
+ID and Group ID). The actual mechanism of tracking this allocation is internal
+to the OSD. Lustre will specify the owners of the object against which to track
+this space. Space accounting information will be accessed by Lustre via the
+index API on special objects dedicated to space allocation management.
+
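As a rough illustration of tracking allocation against two keys, the sketch
below keeps one small table per ID type and charges every allocation change
to both the owning user and the owning group. The table layout and all toy_*
names are hypothetical; a real OSD would keep this information in its own
on-disk accounting objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ACCT_SLOTS 16

enum toy_acct_type { TOY_ACCT_USER = 0, TOY_ACCT_GROUP = 1, TOY_ACCT_NR };

struct toy_acct_rec {
        uint32_t id;
        int      used;          /* slot holds a valid record */
        int64_t  bytes;         /* space currently charged to this ID */
};

struct toy_acct {
        struct toy_acct_rec rec[TOY_ACCT_NR][TOY_ACCT_SLOTS];
};

/* Find the record for (type, id); optionally claim a free slot for it. */
static struct toy_acct_rec *toy_acct_find(struct toy_acct *a,
                                          enum toy_acct_type type,
                                          uint32_t id, int create)
{
        struct toy_acct_rec *free_slot = NULL;
        int i;

        for (i = 0; i < TOY_ACCT_SLOTS; i++) {
                struct toy_acct_rec *r = &a->rec[type][i];

                if (r->used && r->id == id)
                        return r;
                if (!r->used && free_slot == NULL)
                        free_slot = r;
        }
        if (create && free_slot != NULL) {
                free_slot->used = 1;
                free_slot->id = id;
                free_slot->bytes = 0;
                return free_slot;
        }
        return NULL;
}

/* Charge (delta > 0) or release (delta < 0) space for both owners. */
int toy_acct_charge(struct toy_acct *a, uint32_t uid, uint32_t gid,
                    int64_t delta)
{
        struct toy_acct_rec *u = toy_acct_find(a, TOY_ACCT_USER, uid, 1);
        struct toy_acct_rec *g = toy_acct_find(a, TOY_ACCT_GROUP, gid, 1);

        if (u == NULL || g == NULL)
                return -1;
        u->bytes += delta;
        g->bytes += delta;
        return 0;
}

/* Index-style lookup: key is the ID, value is the space in use. */
int64_t toy_acct_lookup(struct toy_acct *a, enum toy_acct_type type,
                        uint32_t id)
{
        struct toy_acct_rec *r = toy_acct_find(a, type, id, 0);

        return r != NULL ? r->bytes : 0;
}
```

The lookup function plays the role of the index lookup mentioned above: the
caller asks for an ID and gets back the space charged to it.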
+================================
+= III. OSD & LU Infrastructure =
+================================
+
+As a member of the LU stack, each OSD module is expected to implement the
+generic LU API used to manage devices and objects.
+
+1. Devices
+==========
+
+i. Device Overview
+------------------
+Each layer in the stack is represented by a lu_device structure which holds
+very basic data such as a reference counter, a reference to the site (the
+in-core collection of Lustre objects, very similar to an inode cache) and a
+reference to struct lu_device_type, which in turn describes this specific type
+of device (type name, operations, etc.).
+
+An OSD device is created and initialized at mount time to let the configuration
+component access the data it needs before the whole Lustre stack is ready.
+An OSD device is destroyed once all the devices using it have been destroyed.
+Usually this happens when the server stack shuts down at unmount time.
+
+There might be several OSD devices of a given type (say, a few zfs-osd and
+ldiskfs-osd). The type stores the methods common to all OSD instances of that
+type (below they start with the ldto_ prefix). Every instance of an OSD device
+can then get a few specific methods (below they start with the ldo_ prefix).
+
+To connect devices into a stack, the ->o_connect() method is used (see struct
+obd_ops). Currently an OSD should implement this method to track all its users.
+To disconnect, the ->o_disconnect() method is used. An OSD should implement
+this method, track the remaining users and, once no users are left, call the
+class_manual_cleanup() function, which initiates removal of the OSD.
+
+As the stack involves many devices and there may be cross-references between
+them, it is easier to break the whole shutdown procedure into two steps and
+not impose a specific order in which the devices shut down: in the first step
+the devices release all the resources they use internally (the so-called
+pre-cleanup procedure), and in the second step they are actually destroyed.
+
+ii. Device Type & Operations
+----------------------------
+The first thing to do when developing a new OSD is to define a lu_device_type
+structure to define and register the new OSD type. The following fields of
+lu_device_type need to be filled in appropriately:
+ldt_tags
+       is the type of device, typically data, metadata or client (see
+       lu_device_tag). An OSD device is of data type and should always
+       register as such by setting this field to LU_DEVICE_DT.
+ldt_name
+       is the name associated with the new OSD type.
+       See LUSTRE_OSD_{LDISKFS,ZFS}_NAME for reference.
+ldt_ops
+       is the vector of lu_device_type operations; see below for further
+       details.
+ldt_ctxt_type
+       is the lu_context_tag to be used for operations.
+       This should be set to LCT_LOCAL for OSDs.
+
+In the original 2.0 MDS stack the devices were built from the top down and the
+OSD was the last device to be set up. This scheme does not work well when
+on-disk data has to be accessed early or when an OSD is shared among several
+services (e.g. MDS + MGS on the same storage). So the scheme was reversed: the
+mount procedure sets up the correct OSD, then the stack is built from the
+bottom up. Instead of introducing another set of methods, the existing
+obd_connect() and obd_disconnect() are used, given that many existing devices
+are already configured this way by the configuration component. Notice also
+that configuration profiles are organized in this order (LOV/LOD go first, then
+MDT). Given that the device “below” is ready at every step, there is no point
+in calling a separate init method.
+
+Because a device can be referenced by a number of entities in other modules
+(exports, RPCs, transactions, callbacks, access via procfs), the notion of
+precleanup was introduced to be able to stop all that activity safely before
+the actual cleanup takes place. For this purpose ->ldto_device_fini() and
+->ldto_device_free() were introduced: the former should be used to break any
+interaction with the outside, the latter to actually free the device.
+
+When the configuration component meets a SETUP command in the configuration
+profile (see Appendix 1), it finds the appropriate device and calls
+->ldto_device_alloc() to set it up as an LU device.
+
+The prototypes of device type operations are the following:
+
+struct lu_device *(*ldto_device_alloc)(const struct lu_env *,
+                                       struct lu_device_type *,
+                                       struct lustre_cfg *);
+struct lu_device *(*ldto_device_free)(const struct lu_env *,
+                                      struct lu_device *);
+int  (*ldto_device_init)(const struct lu_env *, struct lu_device *,
+                         const char *, struct lu_device *);
+struct lu_device *(*ldto_device_fini)(const struct lu_env *env,
+                                      struct lu_device *);
+int  (*ldto_init)(struct lu_device_type *t);
+void (*ldto_fini)(struct lu_device_type *t);
+void (*ldto_start)(struct lu_device_type *t);
+void (*ldto_stop)(struct lu_device_type *t);
+
+ldto_device_alloc
+       The method is called by the configuration component (in the case of a
+       disk file system OSD, this is lustre/obdclass/obd_mount.c) to allocate
+       the device. Notice that the generic struct lu_device does not hold a
+       pointer to private data. Instead the OSD should embed struct lu_device
+       into its own structure (like struct osd_device) and return the address
+       of the lu_device in that structure.
+ldto_device_fini
+       The method is called when the OSD is about to be released. The OSD
+       should detach from resources like the disk file system and procfs,
+       release objects it holds internally, etc. This is the so-called
+       precleanup procedure.
+ldto_device_free
+       The method is called to actually release memory allocated in
+       ->ldto_device_alloc().
+ldto_device_init
+       The method is not used by OSD currently.
+ldto_init
+       The method is called when specific type of OSD is registered in the
+       system. Currently the method is used to register OSD-specific data for
+       environments (see Lustre environment in section 3).
+       See LU_TYPE_INIT_FINI() macro as an example.
+ldto_fini
+       The method is called when specific type of OSD unregisters.
+       Currently used to unregister OSD-specific data from environment.
+ldto_start
+       The method is called when the first device of this type is being
+       instantiated. Currently used to fill existing environments with
+       OSD-specific data.
+ldto_stop
+       This method is called when the last instance of specific OSD has gone.
+       Currently used to release OSD-specific data from environments.
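+
+The embedding pattern described for ->ldto_device_alloc(), together with the
+split between ->ldto_device_fini() and ->ldto_device_free(), can be sketched
+as below. The demo_* types and functions are hypothetical, heavily reduced
+stand-ins and do not match the real structures or prototypes.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for the generic device. */
struct demo_lu_device {
	int ld_ref; /* reference counter */
};

/* Hypothetical OSD-private device with the generic part embedded. */
struct demo_osd_device {
	int                   od_mounted; /* OSD-private state */
	struct demo_lu_device od_lu;      /* generic part, embedded */
};

/* Recover the private structure from a pointer to the embedded member. */
static struct demo_osd_device *demo_osd_dev(struct demo_lu_device *d)
{
	return (struct demo_osd_device *)
		((char *)d - offsetof(struct demo_osd_device, od_lu));
}

/* alloc: allocate the private structure, hand back the generic part. */
static struct demo_lu_device *demo_device_alloc(void)
{
	struct demo_osd_device *o = calloc(1, sizeof(*o));

	if (o == NULL)
		return NULL;
	o->od_mounted = 1;
	return &o->od_lu;
}

/* fini (precleanup): detach from external resources, keep the memory. */
static void demo_device_fini(struct demo_lu_device *d)
{
	demo_osd_dev(d)->od_mounted = 0;
}

/* free: release the memory obtained in alloc; fini must have run first. */
static void demo_device_free(struct demo_lu_device *d)
{
	assert(demo_osd_dev(d)->od_mounted == 0);
	free(demo_osd_dev(d));
}
```

+
+Because the generic structure carries no private pointer, the offset
+arithmetic is the only link between the two views of the device.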
+
+iii. Device Operations
+----------------------
+Now that the OSD device can be set up, we need to export methods to handle
+device-level operations. All those methods are listed in the
+lu_device_operations structure, which includes:
+
+struct lu_object *(*ldo_object_alloc)(const struct lu_env *,
+                                     const struct lu_object_header *,
+                                     struct lu_device *);
+int (*ldo_process_config)(const struct lu_env *, struct lu_device *,
+                         struct lustre_cfg *);
+int (*ldo_recovery_complete)(const struct lu_env *, struct lu_device *);
+int (*ldo_prepare)(const struct lu_env *, struct lu_device *,
+                  struct lu_device *);
+
+ldo_object_alloc
+       The method is called when a high-level service wants to access an
+       object not found in the local Lustre cache (see struct lu_site).
+       The OSD should allocate a structure, initialize the object’s methods
+       and return a pointer to the struct lu_object which is embedded into the
+       OSD object structure.
+ldo_process_config
+       The method is called in case of configuration changes. Mostly used by
+       high-level services to update local tunables. It’s also possible to let
+       the MGS store OSD tunables and set them properly on every server mount
+       or when tunables change at run time.
+ldo_recovery_complete
+       The method is called when the recovery procedure between a server and
+       its clients is completed. This method is mostly used by high-level
+       devices (like OSP to clean up OST orphans, MDD to clean up open
+       unlinked files left by a missing client, etc).
+ldo_prepare
+       The method is called when all the devices belonging to the stack are
+       configured and setup properly. At this point the server becomes ready
+       to handle RPCs and start recovery procedure.
+       In current implementation OSD uses this method to initialize local quota
+       management.
+
+iv.  OBD Methods
+----------------
+Although the LU infrastructure aims at replacing the storage operations of the
+legacy OBD API (see struct obd_ops in lustre/include/obd.h), the OBD API is
+still used in several places for device configuration and on the Lustre client
+(e.g. it’s still used on the client for LDLM locking). The OBD API storage
+operations are not needed for server components, and should be ignored.
+
+As far as the OSD layer is concerned, upper layers still connect/disconnect
+to/from the OSD instance via obd_ops::o_connect/disconnect. As a result, each
+OSD should implement those two operations:
+
+int (*o_connect)(const struct lu_env *, struct obd_export **,
+                struct obd_device *, struct obd_uuid *,
+                struct obd_connect_data *, void *);
+int (*o_disconnect)(struct obd_export *);
+
+o_connect
+       The method should track the number of connections made (i.e. the number
+       of active users of this OSD), call class_connect() and return a struct
+       obd_export via class_conn2export(), see osd_obd_connect(). The structure
+       holds a reference on the device, preventing it from early release.
+o_disconnect
+       The method is called when someone using this OSD no longer needs its
+       service (i.e. at unmount). For every struct obd_export passed in, the
+       method should call class_disconnect(export). Once the last user has
+       gone, the OSD should call class_manual_cleanup() to schedule the device
+       removal.
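+
+The user tracking expected from ->o_connect() and ->o_disconnect() can be
+sketched as a simple counter. Everything named demo_* is hypothetical:
+demo_schedule_cleanup() merely stands in for the class_manual_cleanup() call,
+and refusing new connections once cleanup is scheduled is an assumption of
+this sketch rather than documented behaviour.
+

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical OSD state, reduced to what user tracking needs. */
struct demo_osd {
	int users;             /* number of active connections */
	int cleanup_scheduled; /* set once removal has been requested */
};

/* Stand-in for asking the core to remove the device. */
static void demo_schedule_cleanup(struct demo_osd *o)
{
	o->cleanup_scheduled = 1;
}

/* Connect: account for one more user of this OSD. */
static int demo_connect(struct demo_osd *o)
{
	if (o->cleanup_scheduled)
		return -ENODEV; /* assumption: a dying device accepts no users */
	o->users++;
	return 0;
}

/* Disconnect: drop one user; the last one out schedules the cleanup. */
static int demo_disconnect(struct demo_osd *o)
{
	assert(o->users > 0);
	if (--o->users == 0)
		demo_schedule_cleanup(o);
	return 0;
}
```
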
+
+2. Objects
+==========
+
+i. Object Overview
+------------------
+Lustre identifies objects in the underlying OSD storage by a unique 128-bit
+File IDentifier (FID) that is specified by Lustre and is the only identifier
+that Lustre is aware of for this object. The FID is known to Lustre before any
+access to the object is done (even before it is created), using
+lu_object_find(). Since Lustre only uses the FID to identify an object, if the
+underlying OSD storage cannot directly use the Lustre-specified FID to retrieve
+the object at a later time, it must create a table or index object (normally
+called the Object Index (OI)) to map Lustre FIDs to an internal object
+identifier. Lustre does not need to understand the format or value of the
+internal object identifier at any time outside of the OSD.
+
+The FID itself is composed of 3 members:
+struct lu_fid {
+       __u64   f_seq;
+       __u32   f_oid;
+       __u32   f_ver;
+};
+
+While the OSD itself should typically not interpret the FID, it may be possible
+to optimize the OSD performance by understanding the properties of a FID.
+
+The f_seq (sequence) component is allocated in piecewise (though not contiguous)
+manner to different nodes, and each sequence forms a “group” of related objects.
+The sequence number may be any value in the range [1, 2^63], but there are
+typically not a huge number of sequences in use at one time (typically less than
+one million at the maximum). Within a single sequence, it is likely that tens to
+thousands (and less commonly millions) of mostly-sequential f_oid values will be
+allocated. In order to efficiently map FIDs into objects, it is desirable to
+also be able to associate the OSD-internal index with key-value pairs.
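+
+A minimal sketch of an Object Index follows, assuming a small sorted array is
+an acceptable stand-in for the on-disk index. Ordering entries by sequence
+first and object id second keeps the mostly-sequential f_oid values of one
+sequence next to each other. All demo_* names, the table size and the choice
+of error codes are illustrative assumptions.
+

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

struct demo_oi_ent {
	struct demo_fid fid;
	uint64_t ino; /* OSD-internal object identifier */
};

#define DEMO_OI_MAX 32

struct demo_oi {
	struct demo_oi_ent ent[DEMO_OI_MAX]; /* kept sorted by FID */
	size_t count;
};

/* Order by sequence first, then object id, then version. */
static int demo_fid_cmp(const struct demo_fid *a, const struct demo_fid *b)
{
	if (a->f_seq != b->f_seq)
		return a->f_seq < b->f_seq ? -1 : 1;
	if (a->f_oid != b->f_oid)
		return a->f_oid < b->f_oid ? -1 : 1;
	if (a->f_ver != b->f_ver)
		return a->f_ver < b->f_ver ? -1 : 1;
	return 0;
}

/* Binary search: index of the match, or the insertion point if not found. */
static size_t demo_oi_search(const struct demo_oi *oi,
			     const struct demo_fid *fid, int *found)
{
	size_t lo = 0, hi = oi->count;

	assert(oi->count <= DEMO_OI_MAX);
	*found = 0;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = demo_fid_cmp(&oi->ent[mid].fid, fid);

		if (c == 0) {
			*found = 1;
			return mid;
		}
		if (c < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Add a FID -> internal id mapping, keeping the array sorted. */
static int demo_oi_insert(struct demo_oi *oi, const struct demo_fid *fid,
			  uint64_t ino)
{
	int found;
	size_t pos = demo_oi_search(oi, fid, &found);

	if (found)
		return -EEXIST;
	if (oi->count == DEMO_OI_MAX)
		return -ENOSPC;
	for (size_t i = oi->count; i > pos; i--)
		oi->ent[i] = oi->ent[i - 1];
	oi->ent[pos].fid = *fid;
	oi->ent[pos].ino = ino;
	oi->count++;
	return 0;
}

/* Translate a FID into the internal identifier. */
static int demo_oi_lookup(const struct demo_oi *oi,
			  const struct demo_fid *fid, uint64_t *ino)
{
	int found;
	size_t pos = demo_oi_search(oi, fid, &found);

	if (!found)
		return -ENOENT;
	*ino = oi->ent[pos].ino;
	return 0;
}
```
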
+
+Every object is represented by a header (struct lu_object_header) and a
+so-called slice on every layer of the stack. Core Lustre code maintains a cache
+of objects (the so-called lu-site, see struct lu_site), which is very similar
+to the Linux inode cache.
+
+ii. Object Lifecycle
+--------------------
+An in-core object is created when a high-level service needs it to process an
+RPC or to perform some background job like LFSCK. The FID of the object is
+supposed to be known before the object is created. The FID can come from an
+RPC or from disk. Given the FID, lu_object_find() is called; it searches for
+the object in the cache (see struct lu_site) and, if the object is not found,
+creates it using the ->ldo_object_alloc(), ->loo_object_init() and
+->loo_object_start() methods.
+
+Objects are referenced and tracked by Lustre core. If an object is not in use,
+it’s put on an LRU list and at some point (subject to internal caching policy
+or memory pressure callbacks from the kernel) Lustre schedules such an object
+for removal from the cache. To do so, Lustre core marks the object as going
+out and calls ->loo_object_release() and ->loo_object_free(), iterating over
+all the layers involved.
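+
+The lifecycle above can be illustrated with a toy cache: a lookup that creates
+the object on a miss, a reference drop that leaves idle objects cached, and a
+shrinker that frees idle objects. This is a sketch under simplifying
+assumptions (a single layer, an integer key instead of a FID, a plain list
+instead of a hash and LRU); every demo_* name is hypothetical.
+

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_obj {
	uint64_t key;          /* stands in for the FID */
	int refs;              /* users currently holding the object */
	struct demo_obj *next; /* cache chain */
};

struct demo_site {
	struct demo_obj *head; /* all cached objects, in use or idle */
	int allocs;            /* how many times the allocation path ran */
};

/* Find the object in the cache or create it; returns it referenced. */
static struct demo_obj *demo_find(struct demo_site *s, uint64_t key)
{
	struct demo_obj *o;

	for (o = s->head; o != NULL; o = o->next) {
		if (o->key == key) {
			o->refs++;
			return o;
		}
	}
	/* Cache miss: the alloc, init and start steps collapsed into one. */
	o = calloc(1, sizeof(*o));
	if (o == NULL)
		return NULL;
	o->key = key;
	o->refs = 1;
	o->next = s->head;
	s->head = o;
	s->allocs++;
	return o;
}

/* Drop a reference; an idle object stays cached until the next shrink. */
static void demo_put(struct demo_obj *o)
{
	assert(o->refs > 0);
	o->refs--;
}

/* Free every idle object, as a cache shrinker would. Returns the count. */
static int demo_shrink(struct demo_site *s)
{
	struct demo_obj **pp = &s->head;
	int freed = 0;

	while (*pp != NULL) {
		struct demo_obj *o = *pp;

		if (o->refs == 0) {
			*pp = o->next;
			free(o);
			freed++;
		} else {
			pp = &o->next;
		}
	}
	return freed;
}
```
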
+
+iii. Special Objects
+--------------------
+Lustre uses a set of special objects using the FID_SEQ_LOCAL_FILE sequence.
+All the objects are listed in the local_oid enum, which includes:
+- OTABLE_OT_OID, which is an index object providing a list of all existing
+  objects on this storage. The key is an opaque string and the record is a
+  FID. This object is used by high-level components like LFSCK to iterate over
+  objects.
+- ACCT_USER_OID/ACCT_GROUP_OID, which are used for accessing space accounting
+  information for users and groups respectively.
+- MDT_LAST_RECV_OID/OFD_LAST_RECV_OID, which are the last_rcvd files for the
+  MDT and OST respectively.
+
+iv. Object Operations
+---------------------
+Object management methods are called by Lustre to manipulate OSD-specific
+(private) data associated with a specific object during the lifetime of an
+object. All the object operations are described in struct lu_object_operations:
+
+int (*loo_object_init)(const struct lu_env *, struct lu_object *,
+                      const struct lu_object_conf *);
+int (*loo_object_start)(const struct lu_env *, struct lu_object *);
+void (*loo_object_delete)(const struct lu_env *, struct lu_object *);
+void (*loo_object_free)(const struct lu_env *, struct lu_object *);
+void (*loo_object_release)(const struct lu_env *, struct lu_object *);
+int (*loo_object_print)(const struct lu_env *, void *, lu_printer_t,
+                       const struct lu_object *);
+int (*loo_object_invariant)(const struct lu_object *);
+
+loo_object_init
+       This method is called when a new object is being created (see
+       lu_object_alloc()). Its purpose is to initialize the object’s
+       internals; usually the OSD looks up the object on disk (notice that a
+       header storing the FID is already created by a top device) using the
+       Object Index, which maps the FID to a local object id like a dnode.
+       LOC_F_NEW can be passed to the method when the caller knows the object
+       is new, so that the OSD can skip the OI lookup to
+       improve performance. If the object exists, then the LOHA_EXISTS flag in
+       loh_attr (struct lu_object_header) is set.
+loo_object_start
+       The method is called when all the structures and the header are
+       initialized. Currently used by high-level services as a post-init
+       procedure (i.e. to set up their own methods depending on the object
+       type, which is brought into the header by the OSD’s
+       ->loo_object_init()).
+loo_object_delete
+       is called to let the OSD release the resources behind an object (except
+       the memory allocated for the object), like releasing the file system’s
+       inode. It’s separated from ->loo_object_free() to be able to iterate
+       over still-existing objects. The main purpose of separating
+       ->loo_object_delete() and ->loo_object_free() is to avoid recursion
+       during potentially stack-consuming resource release.
+loo_object_free
+       is called to actually release the memory allocated by
+       ->ldo_object_alloc().
+loo_object_release
+       is called when the object loses its last user and moves onto the LRU
+       list of unused objects. Implementation of this method is optional for
+       an OSD.
+loo_object_print
+       is used for debugging purpose, it should output details of an object in
+       human-readable format. Details usually include information like address
+       of an object, local object number (dnode/inode), type of an object, etc.
+loo_object_invariant
+       another optional method for debugging purposes which is called to verify
+       internal consistency of object.
+
+3. Lustre Environment
+=====================
+
+There is a notion of an environment represented by struct lu_env in many
+functions and methods. Essentially this is Thread Local Storage (TLS), which
+is bound to every service thread and used by that thread exclusively, so there
+is no need to serialize access to the data stored there.
+The original purpose of the environment was to work around the small Linux
+kernel stack (4-8K). A component (like a device or library) can register its
+own descriptor (see the LU_KEY_INIT macro) and then every new thread will
+populate its environment with the buffers described.
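+
+The idea of registering a descriptor and having each thread's environment
+populated from it can be sketched as below. This is not the LU_KEY_INIT
+machinery: the demo_* names, the fixed key table and the use of calloc() are
+assumptions chosen to keep the sketch self-contained in user space.
+

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_KEYS 8

/* Descriptor a component registers: how much scratch space it wants. */
struct demo_key {
	size_t size; /* bytes of per-thread scratch space */
	int index;   /* slot assigned at registration */
};

static struct demo_key *demo_keys[DEMO_MAX_KEYS];
static int demo_nr_keys;

/* Per-thread environment: one buffer per registered key. */
struct demo_env {
	void *value[DEMO_MAX_KEYS];
};

/* Register a descriptor; environments created later will honour it. */
static int demo_key_register(struct demo_key *k)
{
	if (demo_nr_keys == DEMO_MAX_KEYS)
		return -1;
	k->index = demo_nr_keys;
	demo_keys[demo_nr_keys++] = k;
	return 0;
}

/* Called once per service thread to populate its environment. */
static int demo_env_init(struct demo_env *env)
{
	for (int i = 0; i < demo_nr_keys; i++) {
		env->value[i] = calloc(1, demo_keys[i]->size);
		if (env->value[i] == NULL) {
			while (i-- > 0)
				free(env->value[i]);
			return -1;
		}
	}
	return 0;
}

/* Fetch the calling thread's buffer for a key; no locking is needed. */
static void *demo_env_get(struct demo_env *env, const struct demo_key *k)
{
	assert(k->index >= 0 && k->index < demo_nr_keys);
	return env->value[k->index];
}

/* Release all buffers when the thread exits. */
static void demo_env_fini(struct demo_env *env)
{
	for (int i = 0; i < demo_nr_keys; i++) {
		free(env->value[i]);
		env->value[i] = NULL;
	}
}
```

+
+Since each environment owns its buffers, two threads using the same key never
+touch the same memory, which is why no serialization is required.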
+
+=====================
+= IV. Data (DT) API =
+=====================
+
+The previous section listed all the methods that have to be provided by an OSD
+module in order to fit in the LU stack. In addition to those generic functions,
+each layer should implement a different class of operations depending on its
+nature. There are currently 3 classes of devices:
+- LU_DEVICE_DT: DaTa device (e.g. lod, osp, osd, ofd),
+- LU_DEVICE_MD: MetaData device (e.g. mdt, mdd),
+- LU_DEVICE_CL: CLient I/O device (e.g. vvp, lov, lovsub, osc).
+
+The purpose of this section is to document the DT API (used for devices and
+objects) which has to be implemented by each OSD module. The DT API is most
+commonly called the OSD API.
+
+1. Data Device
+==============
+
+To access a disk file system, Lustre defines a new device type called dt_device
+which is a sub-class of generic lu_device. It includes a new operation vector
+(namely dt_device_operations structure) defining all the actions that can be
+performed against a dt_device. Here are the operation prototypes:
+
+int   (*dt_statfs)(const struct lu_env *, struct dt_device *,
+                  struct obd_statfs *);
+struct thandle *(*dt_trans_create)(const struct lu_env *, struct dt_device *);
+int   (*dt_trans_start)(const struct lu_env *, struct dt_device *,
+                       struct thandle *th);
+int   (*dt_trans_stop)(const struct lu_env *, struct thandle *);
+int   (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *);
+int   (*dt_root_get)(const struct lu_env *, struct dt_device *,
+                    struct lu_fid *);
+void  (*dt_conf_get)(const struct lu_env *, const struct dt_device *,
+                     struct dt_device_param *);
+int   (*dt_sync)(const struct lu_env *, struct dt_device *);
+int   (*dt_ro)(const struct lu_env *, struct dt_device *);
+int   (*dt_commit_async)(const struct lu_env *, struct dt_device *);
+int   (*dt_init_capa_ctxt)(const struct lu_env *, struct dt_device *, int,
+                          unsigned long, __u32, struct lustre_capa_key *);
+
+dt_trans_create
+dt_trans_start
+dt_trans_stop
+dt_trans_cb_add
+       please refer to IV.3
+dt_statfs
+       called to report current file system usage information: all, free and
+       available blocks/objects.
+dt_root_get
+       called to get FID of the root object. Used to follow backend filesystem
+       rules and support backend file system in a state where users can mount
+       it directly (with ldiskfs/zfs/etc).
+dt_sync
+       called to flush all complete but not written transactions. Should block
+       until the flush is completed.
+dt_ro
+       called to turn backend into read-only mode.
+       Used by testing infrastructure to simulate recovery cases.
+dt_commit_async
+       called to notify the OSD/backend that a higher level needs transactions
+       to be flushed as soon as possible. Used by the Commit-on-Share feature.
+       Should return immediately and not block for long.
+dt_init_capa_ctxt
+       called to initialize the context for capabilities. Not in use currently.
+
+2. Data Objects
+===============
+
+There are two types of DT objects:
+1) regular objects, storing unstructured data (e.g. flat files, OST objects,
+   llog objects)
+2) index objects, storing key=value pairs (e.g. directories, quota indexes,
+   FLDB)
+
+As a result, there are 3 sets of methods that should be implemented by the OSD
+layer:
+- core methods used to create/destroy/manipulate attributes of objects
+- data methods used to access the object body as a flat address space
+  (read/write/truncate/punch) for regular objects
+- index operations to access index objects as a key-value association
+
+A data object is represented by the dt_object structure which is defined as
+a sub-class of lu_object, plus operation vectors for the core, data and index
+methods as listed above.
+
+i. Common Storage Operations
+----------------------------
+The core methods are defined in dt_object_operations as follows:
+
+void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned);
+void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned);
+void (*do_read_unlock)(const struct lu_env *, struct dt_object *);
+void (*do_write_unlock)(const struct lu_env *, struct dt_object *);
+int  (*do_write_locked)(const struct lu_env *, struct dt_object *);
+int  (*do_attr_get)(const struct lu_env *, struct dt_object *,
+                    struct lu_attr *, struct lustre_capa *);
+int  (*do_declare_attr_set)(const struct lu_env *, struct dt_object *,
+                            const struct lu_attr *, struct thandle *);
+int  (*do_attr_set)(const struct lu_env *, struct dt_object *,
+                   const struct lu_attr *, struct thandle *,
+                   struct lustre_capa *);
+int  (*do_xattr_get)(const struct lu_env *, struct dt_object *,
+                     struct lu_buf *, const char *, struct lustre_capa *);
+int  (*do_declare_xattr_set)(const struct lu_env *, struct dt_object *,
+                             const struct lu_buf *, const char *, int,
+                            struct thandle *);
+int  (*do_xattr_set)(const struct lu_env *, struct dt_object *,
+                     const struct lu_buf *, const char *, int,
+                     struct thandle *, struct lustre_capa *);
+int  (*do_declare_xattr_del)(const struct lu_env *, struct dt_object *,
+                             const char *, struct thandle *);
+int  (*do_xattr_del)(const struct lu_env *, struct dt_object *, const char *,
+                     struct thandle *, struct lustre_capa *);
+int  (*do_xattr_list)(const struct lu_env *, struct dt_object *,
+                       struct lu_buf *, struct lustre_capa *);
+void (*do_ah_init)(const struct lu_env *, struct dt_allocation_hint *,
+                    struct dt_object *, struct dt_object *, cfs_umode_t);
+int  (*do_declare_create)(const struct lu_env *, struct dt_object *,
+                          struct lu_attr *, struct dt_allocation_hint *,
+                          struct dt_object_format *, struct thandle *);
+int  (*do_create)(const struct lu_env *, struct dt_object *, struct lu_attr *,
+                  struct dt_allocation_hint *, struct dt_object_format *,
+                  struct thandle *);
+int  (*do_declare_destroy)(const struct lu_env *, struct dt_object *,
+                          struct thandle *);
+int  (*do_destroy)(const struct lu_env *, struct dt_object *, struct thandle *);
+int  (*do_index_try)(const struct lu_env *, struct dt_object *, 
+                    const struct dt_index_features *);
+int  (*do_declare_ref_add)(const struct lu_env *, struct dt_object *,
+                          struct thandle *);
+int  (*do_ref_add)(const struct lu_env *, struct dt_object *, struct thandle *);
+int  (*do_declare_ref_del)(const struct lu_env *, struct dt_object *,
+                          struct thandle *);
+int  (*do_ref_del)(const struct lu_env *, struct dt_object *, struct thandle *);
+struct obd_capa *(*do_capa_get)(const struct lu_env *, struct dt_object *,
+                                struct lustre_capa *, __u64);
+int  (*do_object_sync)(const struct lu_env *, struct dt_object *);
+
+do_read_lock
+do_write_lock
+do_read_unlock
+do_write_unlock
+do_write_locked
+       please refer to IV.4
+do_attr_get
+       The method is called to get the regular attributes an object stores.
+       The lu_attr fields map the usual unix file attributes, like ownership
+       or size. The object must exist.
+do_declare_attr_set
+       the method is called to notify OSD the caller is going to modify regular
+       attributes of an object in specified transaction. OSD should use this
+       method to reserve resources needed to change attributes. Can be called
+       on a non-existing object.
+do_attr_set
+       the method is called to change the attributes of an object. The object
+       must exist.
+do_xattr_get
+       called when the caller needs to get an extended attribute with a
+       specified name. If the struct lu_buf argument has a null lb_buf, the
+       size of the extended attribute should be returned. If the requested
+       extended attribute does not exist, -ENODATA should be returned.
+       The object must exist. If buffer space (specified in lu_buf.lb_len) is
+       not enough to fit the value, then return -ERANGE.
+do_declare_xattr_set
+       called to notify OSD the caller is going to set/change an extended
+       attribute on an object. OSD should use this method to reserve resources
+       needed to change an attribute.
+do_xattr_set
+       called when the caller needs to change an extended attribute with a
+       specified name. If the fl argument has LU_XATTR_CREATE, the extended
+       attribute must not exist, otherwise -EEXIST should be returned.
+       If the fl argument has LU_XATTR_REPLACE, the extended attribute must
+       exist, otherwise -ENODATA should be returned. The object must exist.
+       The maximum size of extended attribute supported by OSD should be
+       present in struct dt_device_param the caller can get with
+       ->dt_conf_get() method.
+do_declare_xattr_del
+       called to notify OSD the caller is going to remove an extended attribute
+       with a specified name
+do_xattr_del
+       called when the caller needs to remove an extended attribute with a
+       specified name. The object must exist. Deleting a nonexistent extended
+       attribute is allowed; in that case the method returns 0.
+do_xattr_list
+       called when the caller needs to get a list of existing extended
+       attributes (only names of attributes are returned). The size of the list
+       is returned, including the string terminator. If the lu_buf argument has
+       a null lb_buf, how many bytes the list would require is returned to help
+       the caller to allocate a buffer of an appropriate size.
+       The object must exist.
+do_ah_init
+       called to let the OSD prepare an allocation hint, which stores
+       information about object locality and type. Later this allocation hint
+       is passed to the ->do_create() method and the OSD can use this
+       information to optimize on-disk object location. The allocation hint is
+       opaque to the caller and can contain OSD-specific information.
+do_declare_create
+       called to notify OSD the caller is going to create a new object in a
+       specified transaction.
+do_create
+       called to create an object on the OSD in a specified transaction.
+       For index objects the caller can request a set of index properties (like
+       key/value size). If OSD can not support requested properties, it should
+       return an error. The object shouldn't exist already (i.e.
+       dt_object_exist() should return false).
+do_declare_destroy
+       called to notify OSD the caller is going to destroy an object in a
+       specified transaction.
+do_destroy
+       called to destroy an object in a specified transaction. Semantically,
+       it’s dual to object creation and does not care about on-disk reference
+       to the object (in contrast with POSIX unlink operation).
+       The object must exist (i.e. dt_object_exist() must return true).
+do_index_try
+       called when the caller needs to use an object as an index (the object
+       should have been created as an index before). The caller also specifies
+       the set of properties it expects the index to support.
+do_declare_ref_add
+       called to notify OSD the caller is going to increment nlink attribute
+       in a specified transaction.
+do_ref_add
+       called to increment nlink attribute in a specified transaction.
+       The object must exist.
+do_declare_ref_del
+       called to notify OSD the caller is going to decrement nlink attribute
+       in a specified transaction.
+do_ref_del
+       called to decrement nlink attribute in a specified transaction.
+       This is typically done on an object when a record referring to it is
+       deleted from an index object. The object must exist.
+do_capa_get
+       called to get a capability for a specified object. not used currently.
+do_object_sync
+       called to flush a given object to disk. It’s a fine-grained version of
+       the ->dt_sync() method which should make sure an object is stored on
+       disk. The OSD (or backend file system) can track the status of every
+       object and, if an object is already flushed, the method can return
+       immediately. The method is used on OSS now, but can also be used on MDS
+       at some point to improve performance of COS.
+do_data_get
+       the method is not used any more and planned for removal.
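+
+The return-value rules stated above for extended attributes (size query with
+a null buffer, -ENODATA, -ERANGE, -EEXIST, and deletion of a missing attribute
+returning 0) can be exercised with a small in-memory model. The demo_* names
+and the DEMO_XATTR_* flag values are hypothetical stand-ins for the real
+methods and for LU_XATTR_CREATE/LU_XATTR_REPLACE; the -E2BIG and -ENOSPC
+results are assumptions of this sketch, not part of the API description.
+

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_XATTR_CREATE  1 /* stand-in flag: attribute must not exist */
#define DEMO_XATTR_REPLACE 2 /* stand-in flag: attribute must exist */
#define DEMO_NAME_MAX 32
#define DEMO_VAL_MAX  64
#define DEMO_NR       8

struct demo_xattr {
	char name[DEMO_NAME_MAX];
	char val[DEMO_VAL_MAX];
	size_t len;
	int used;
};

struct demo_xobj {
	struct demo_xattr xa[DEMO_NR];
};

static struct demo_xattr *demo_xa_find(struct demo_xobj *o, const char *name)
{
	assert(name != NULL);
	for (int i = 0; i < DEMO_NR; i++)
		if (o->xa[i].used && strcmp(o->xa[i].name, name) == 0)
			return &o->xa[i];
	return NULL;
}

/* Set or change an attribute, honouring the create/replace flags. */
static int demo_xattr_set(struct demo_xobj *o, const char *name,
			  const void *val, size_t len, int fl)
{
	struct demo_xattr *x;

	if (strlen(name) >= DEMO_NAME_MAX || len > DEMO_VAL_MAX)
		return -E2BIG; /* assumption: sketch-specific limit */
	x = demo_xa_find(o, name);
	if (x != NULL && (fl & DEMO_XATTR_CREATE))
		return -EEXIST;
	if (x == NULL && (fl & DEMO_XATTR_REPLACE))
		return -ENODATA;
	for (int i = 0; x == NULL && i < DEMO_NR; i++)
		if (!o->xa[i].used)
			x = &o->xa[i];
	if (x == NULL)
		return -ENOSPC;
	strcpy(x->name, name);
	memcpy(x->val, val, len);
	x->len = len;
	x->used = 1;
	return 0;
}

/* Returns the value size or a negative errno; buf == NULL queries the size. */
static int demo_xattr_get(struct demo_xobj *o, const char *name,
			  void *buf, size_t buflen)
{
	struct demo_xattr *x = demo_xa_find(o, name);

	if (x == NULL)
		return -ENODATA;
	if (buf == NULL)
		return (int)x->len;
	if (buflen < x->len)
		return -ERANGE;
	memcpy(buf, x->val, x->len);
	return (int)x->len;
}

/* Removing a missing attribute is not an error. */
static int demo_xattr_del(struct demo_xobj *o, const char *name)
{
	struct demo_xattr *x = demo_xa_find(o, name);

	if (x != NULL)
		x->used = 0;
	return 0;
}
```
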
+
+ii. Data Object Operations
+--------------------------
+Set of methods described in struct dt_body_operations which should be used with
+regular objects storing unstructured data:
+
+ssize_t (*dbo_read)(const struct lu_env *, struct dt_object *, struct lu_buf *,
+                   loff_t *pos, struct lustre_capa *);
+ssize_t (*dbo_declare_write)(const struct lu_env *, struct dt_object *,
+                            const loff_t, loff_t, struct thandle *);
+ssize_t (*dbo_write)(const struct lu_env *, struct dt_object *,
+                    const struct lu_buf *, loff_t *, struct thandle *,
+                    struct lustre_capa *, int);
+int (*dbo_bufs_get)(const struct lu_env *, struct dt_object *, loff_t,
+                   ssize_t, struct niobuf_local *, int, struct lustre_capa *);
+int (*dbo_bufs_put)(const struct lu_env *, struct dt_object *,
+                   struct niobuf_local *, int);
+int (*dbo_write_prep)(const struct lu_env *, struct dt_object *,
+                     struct niobuf_local *, int);
+int (*dbo_declare_write_commit)(const struct lu_env *, struct dt_object *,
+                                struct niobuf_local *, int, struct thandle *);
+int (*dbo_write_commit)(const struct lu_env *, struct dt_object *,
+                       struct niobuf_local *, int, struct thandle *);
+int (*dbo_read_prep)(const struct lu_env *, struct dt_object *,
+                    struct niobuf_local *, int);
+int (*dbo_fiemap_get)(const struct lu_env *, struct dt_object *,
+                     struct ll_user_fiemap *);
+int (*do_declare_punch)(const struct lu_env *, struct dt_object *, __u64,
+                         __u64, struct thandle *);
+int (*do_punch)(const struct lu_env *, struct dt_object *, __u64, __u64,
+               struct thandle *, struct lustre_capa *);
+
+dbo_read
+       is called to read raw unstructured data from a specified range of an
+       object. It returns the number of bytes read or an error. Usually the
+       OSD implements this method using internal buffering (to be able to put
+       data at a non-aligned address), so this method should not be used to
+       move a lot of data. Lustre services use it to read small internal data
+       like the last_rcvd file and llog files. It's also used to fetch the
+       body of symlinks.
+dbo_declare_write
+       is called to notify OSD the caller will be writing data to a specific
+       range of an object in a specified transaction.
+dbo_write
+       is called to write raw unstructured data to a specified range of an
+       object in a specified transaction. Data should be written atomically
+       with the other changes in the transaction. The method is used by Lustre
+       services to update small portions of data on disk. The OSD should keep
+       the size attribute consistent with the data written.
+dbo_bufs_get
+       is called to fill memory with buffer descriptors (see struct
+       niobuf_local) for a specified range of an object. Memory for the set is
+       provided by the caller; no concurrent access to this memory is allowed.
+       The OSD can fill all fields of the descriptor except lnb_grant_used.
+       The caller specifies whether the buffers will be used to read or write
+       data. This method is used to access the file system's internal buffers
+       for zero-copy IO. Internal buffers referenced by descriptors are
+       supposed to be pinned in memory.
+dbo_bufs_put
+       is called to unpin/release internal buffers referenced by the
+       descriptors dbo_bufs_get returns. After this point pointers in the
+       descriptors are not valid.
+dbo_write_prep
+       is called to fill internal buffers with actual data. This is required
+       for buffers which do not match the filesystem block size, as the buffer
+       is later supposed to be written as a whole. For example, ldiskfs uses
+       4k blocks, but the caller may want to update just half of a block. To
+       prevent data corruption, when this method is called the OSD compares
+       the range to be written with the 4k block. If they do not match, the
+       OSD fetches the data from disk. If they do match, all the data will be
+       overwritten and there is no need to fetch data from disk.
+dbo_declare_write_commit
+       is called to notify the OSD that the caller is going to write internal
+       buffers and that the OSD needs to reserve enough resources in a
+       transaction.
+dbo_write_commit
+       is called to actually make the data in internal buffers part of a
+       specified transaction. The data is supposed to be written by the time
+       the transaction is considered committed. This is slightly different
+       from the generic transaction model because in this case it is allowed
+       to have the data written but the transaction not committed.
+       If dbo_write_commit is not called, then dbo_bufs_put should discard the
+       internal buffers, and any changes made to the internal buffers should
+       not be visible.
+dbo_read_prep
+       is called to fill all internal buffers referenced by the descriptors
+       with actual data. Buffers may already contain valid data (be cached),
+       in which case the OSD can just verify that the data is valid and return
+       immediately.
+dbo_fiemap_get
+       is called to map a logical range of an object to the physical blocks
+       where the corresponding range of data is actually stored.
+dbo_declare_punch
+       is called to notify the OSD that the caller is going to punch
+       (deallocate) a specified range in a transaction.
+dbo_punch
+       is called to punch (deallocate) a specified range of data in a
+       transaction. This method is allowed to use several disk file system
+       transactions (within the same Lustre transaction handle).
+       Currently Lustre calls the method only in the form of truncate, where
+       the end offset is always EOF.
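+
+The block-alignment check described for dbo_write_prep can be sketched as
+follows. This is only an illustration, not OSD code: blk_needs_read() and its
+parameters are hypothetical names, and the block size is passed in by the
+caller (4096 would match the ldiskfs example above).
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the OSD API: returns true when a write
 * of len bytes at offset off leaves part of a backend block untouched, so
 * the old block contents must be read before the buffer is written back.
 */
bool blk_needs_read(uint64_t off, uint64_t len, uint64_t blksz)
{
	bool partial_head, partial_tail;

	assert(blksz > 0 && len > 0);

	partial_head = (off % blksz) != 0;          /* starts inside a block */
	partial_tail = ((off + len) % blksz) != 0;  /* ends inside a block */

	return partial_head || partial_tail;
}
```

+
+With a 4096-byte block size, a 2048-byte write at offset 0 needs the old block
+contents, while an aligned 4096-byte write does not.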
+
+iii. Index Operations
+---------------------
+In contrast with raw unstructured data, indices are collections of key=value
+pairs. The OSD should provide a few methods to look up, insert, delete and
+scan pairs. Indices may have different properties such as key/value size,
+string/binary keys, etc. When a user needs to use an index, it must check
+whether the index has the required properties with a special method
+(->do_index_try()). Indices are used by Lustre services to maintain the
+user-visible namespace, the FLD, the index of unlinked files, etc.
+
+The method prototypes are defined in dt_index_operations as follows:
+
+int (*dio_lookup)(const struct lu_env *, struct dt_object *, struct dt_rec *,
+                 const struct dt_key *, struct lustre_capa *);
+int (*dio_declare_insert)(const struct lu_env *, struct dt_object *,
+                         const struct dt_rec *, const struct dt_key *,
+                         struct thandle *);
+int (*dio_insert)(const struct lu_env *, struct dt_object *,
+                 const struct dt_rec *, const struct dt_key *,
+                 struct thandle *, struct lustre_capa *, int);
+int (*dio_declare_delete)(const struct lu_env *, struct dt_object *,
+                          const struct dt_key *, struct thandle *);
+int (*dio_delete)(const struct lu_env *, struct dt_object *,
+                 const struct dt_key *, struct thandle *,
+                 struct lustre_capa *);
+
+dio_lookup
+       is called to look up an exact key=value pair. The value is copied into
+       a buffer provided by the caller, so the caller should make sure the
+       buffer is big enough. This should be done with the ->do_index_try()
+       method.
+dio_declare_insert
+       is called to notify the OSD that the caller is going to insert a
+       key=value pair in a transaction. The exact key is specified by the
+       caller so the OSD can use it to make the reservation tighter (i.e.
+       smaller).
+dio_insert
+       is called to insert a key/value pair into an index object. It is up to
+       the OSD whether to allow concurrent inserts or not. The caller is not
+       required to serialize access to an index.
+dio_declare_delete
+       is called to notify the OSD that the caller is going to remove a
+       specified key in a transaction. The exact key is specified by the
+       caller so the OSD can use it to make the reservation tighter.
+dio_delete
+       is called to remove a key/value pair specified by the caller.
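+
+The lookup/insert/delete semantics above can be modelled with a small
+in-memory table. This is a sketch under stated assumptions, not an OSD
+implementation: the toy_* names, the string keys, the 64-bit records and the
+errno-style return values are all hypothetical and chosen for illustration.
+

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_PAIRS 16
#define TOY_KEY_LEN   32

/* One key=value pair; string key and 64-bit record are assumptions. */
struct toy_pair {
	char     key[TOY_KEY_LEN];
	uint64_t rec;
	int      used;
};

struct toy_index {
	struct toy_pair pairs[TOY_MAX_PAIRS];
};

/* Return the slot holding key, or NULL if the key is absent. */
static struct toy_pair *toy_find(struct toy_index *idx, const char *key)
{
	for (int i = 0; i < TOY_MAX_PAIRS; i++)
		if (idx->pairs[i].used && strcmp(idx->pairs[i].key, key) == 0)
			return &idx->pairs[i];
	return NULL;
}

/* Copy the record for key into *rec; the caller provides the buffer. */
int toy_lookup(struct toy_index *idx, const char *key, uint64_t *rec)
{
	struct toy_pair *p = toy_find(idx, key);

	if (p == NULL)
		return -ENOENT;
	*rec = p->rec;
	return 0;
}

/* Insert a new pair; identical keys are rejected. */
int toy_insert(struct toy_index *idx, const char *key, uint64_t rec)
{
	if (strlen(key) >= TOY_KEY_LEN)
		return -ENAMETOOLONG;
	if (toy_find(idx, key) != NULL)
		return -EEXIST;
	for (int i = 0; i < TOY_MAX_PAIRS; i++) {
		if (!idx->pairs[i].used) {
			strcpy(idx->pairs[i].key, key);
			idx->pairs[i].rec = rec;
			idx->pairs[i].used = 1;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Remove the pair for key. */
int toy_delete(struct toy_index *idx, const char *key)
{
	struct toy_pair *p = toy_find(idx, key);

	if (p == NULL)
		return -ENOENT;
	p->used = 0;
	return 0;
}
```

+
+The table must be zeroed before use. A real OSD would additionally take part
+in a transaction for insert and delete, which this model leaves out.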
+
+To iterate over all key=value pairs stored in an index, the OSD should provide
+the following set of methods:
+
+struct dt_it *(*init)(const struct lu_env *, struct dt_object *, __u32,
+                      struct lustre_capa *);
+void  (*fini)(const struct lu_env *, struct dt_it *);
+int   (*get)(const struct lu_env *, struct dt_it *, const struct dt_key *);
+void  (*put)(const struct lu_env *, struct dt_it *);
+int   (*next)(const struct lu_env *, struct dt_it *);
+struct dt_key *(*key)(const struct lu_env *, const struct dt_it *);
+int   (*key_size)(const struct lu_env *, const struct dt_it *);
+int   (*rec)(const struct lu_env *, const struct dt_it *, struct dt_rec *,
+            __u32);
+__u64 (*store)(const struct lu_env *, const struct dt_it *);
+int   (*load)(const struct lu_env *, const struct dt_it *, __u64);
+int   (*key_rec)(const struct lu_env *, const struct dt_it *, void *);
+
+init
+       is called to allocate and initialize an instance of an "iterator",
+       which is passed to the subsequent methods. The structure is not
+       accessed by Lustre and its content is entirely internal to the OSD.
+       Usually it contains a reference to the index and the current position
+       in the index.
+       It may contain prefetched key/value pairs. The OSD is not required to
+       keep this cache up-to-date: if the index changes, the change need not
+       be reflected by an already initialized iterator. In the extreme case
+       ->init() can prefetch all existing pairs to be returned by subsequent
+       calls to the iterator.
+fini
+       is called to release an iterator and all its resources.
+       For example, the iterator can unpin an index, free prefetched pairs,
+       etc.
+get
+       is called to move an iterator to a specified key. If the key does not
+       exist, then the iterator should be moved to the closest position from
+       the beginning of the iteration.
+put
+       is called to release an iterator.
+next
+       is called to move an iterator to the next item.
+key
+       is called to fill the specified buffer with the key at the current
+       position of an iterator. It is the caller's responsibility to pass a
+       big enough buffer. In turn, the OSD should not exceed the sizes
+       negotiated with the ->do_index_try() method.
+key_size
+       is called to learn the size of the key at the current position of an
+       iterator.
+rec
+       is called to fill the specified buffer with the value at the current
+       position of an iterator. It is the caller's responsibility to pass a
+       big enough buffer. In turn, the OSD should not exceed the sizes
+       negotiated with the ->do_index_try() method.
+store
+       is called to get a 64-bit cookie for the current position of an
+       iterator.
+load
+       is called to reset the current position of an iterator to match a
+       64-bit cookie returned by the ->store() method. These two methods make
+       it possible to implement functionality like POSIX readdir, where the
+       current position is stored as an integer.
+key_rec
+       is not used currently.
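+
+The store/load pair can be illustrated with a toy iterator over an array of
+keys. This is a minimal sketch, not the OSD iterator: the toy_it_* names, the
+use of the array position as the cookie, and the return values (0 for
+success, 1 for end of iteration) are assumptions made for the example.
+

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy iterator over a fixed array of keys; all names are hypothetical. */
struct toy_it {
	const uint64_t *keys;
	size_t          nr;
	size_t          pos;
};

void toy_it_init(struct toy_it *it, const uint64_t *keys, size_t nr)
{
	it->keys = keys;
	it->nr = nr;
	it->pos = 0;
}

/* Move to the next item: 0 on success, 1 when no item is left. */
int toy_it_next(struct toy_it *it)
{
	if (it->pos + 1 >= it->nr)
		return 1;
	it->pos++;
	return 0;
}

/* Key at the current position (the index must not be empty). */
uint64_t toy_it_key(const struct toy_it *it)
{
	assert(it->pos < it->nr);
	return it->keys[it->pos];
}

/* Export the current position as a 64-bit cookie. */
uint64_t toy_it_store(const struct toy_it *it)
{
	return (uint64_t)it->pos;
}

/* Reposition the iterator from a cookie returned by toy_it_store(). */
int toy_it_load(struct toy_it *it, uint64_t cookie)
{
	if (cookie >= it->nr)
		return -ENOENT;
	it->pos = (size_t)cookie;
	return 0;
}
```

+
+A readdir-like caller would save the cookie from toy_it_store() between calls
+and hand it to toy_it_load() on a fresh iterator to resume where it stopped.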
+
+3. Transactions
+===============
+
+i. Description
+--------------
+Transactions are used by Lustre to implement the recovery protocol and support
+failover. The main purpose of transactions is to update the backend file
+system atomically. This includes both regular changes (file creation, for
+example) and special Lustre changes (last_rcvd file, lastid, llogs). The OSD is
+supposed to provide the transactional mechanism and let Lustre control which
+specific updates to put into transactions.
+
+Lustre relies on the following rule for transaction ordering: if transaction T1
+starts before transaction T2 starts, then the commit of T2 means that T1 is
+committed at the same time or earlier. Notice that the creation of a transaction
+does not imply the immediate start of the updates on storage; do not confuse
+the creation of a transaction with the start of a transaction.
+
+It is up to the OSD and backend file system to group several transactions for
+better performance, provided they still follow the rule above.
+
+Transactions are identified in the OSD API by an opaque transaction handle,
+which is a pointer to an OSD-private data structure that it can use to track
+(and optionally verify) the updates done within that transaction. This handle is
+returned by the OSD to the caller when the transaction is first created.
+Any potential updates (modifications to the underlying storage) must be declared
+as part of a transaction, after the transaction has been created, and before the
+transaction is started. The transaction handle is passed when declaring all
+updates. If any part of the declaration should fail, the transaction is aborted
+without having modified the storage.
+
+After all updates have been declared, and the declarations have completed
+successfully, the handle is passed to the transaction start. After the
+transaction has started, the handle will be passed to every update that is done
+as part of that transaction. All updates done under the transaction must
+previously have been declared. Once the transaction has started, it is not
+permitted to add new updates to the transaction, nor is it possible to roll back
+the transaction after this point. Should some update to the storage fail, the
+caller will try to undo the previous updates within the context of the
+transaction itself, to ensure that the resulting OSD state is correct.
+
+Any update that was not previously declared is an implementation error in the
+caller. Not all declared updates need to be executed, as they form a worst-case
+superset of the possible updates that may be required in order to complete the
+desired operation in a consistent manner.
+
+The OSD should let a caller register callback function(s) to be called on
+transaction commit to disk. The OSD should also be able to call a special set
+of transaction hooks at all stages (creation, start, stop, commit) on a
+per-device basis, so that high-level services (like MDT) which are not directly
+involved in controlling transactions can still take part.
+Every commit callback gets the result of the transaction commit: if the disk
+file system was not able to commit the transaction, then an appropriate error
+code is passed.
+
+It is important to note that the OSD and disk file system should use
+asynchronous IO to implement transactions, otherwise performance is expected to
+be poor.
+
+The maximum number of updates that make up a single transaction is OSD-specific,
+but is expected to be at least in the tens of updates to multiple objects in the
+OSD (extending writes of multiple MB of data, modifying or adding attributes,
+extended attributes, references, etc). For example, in ext4, each update to the
+filesystem will modify one or more blocks of storage. Since one transaction is
+limited to one quarter of the journal size, if the caller declares a series of
+updates that modify more than this number of blocks, the declaration must fail,
+because otherwise the transaction could not be committed atomically.
+In general, every constraint must be checked here to ensure that all changes
+that must commit atomically can complete successfully.
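+
+The ext4 example can be sketched as a running total of declared blocks that is
+checked against one quarter of the journal. This is an illustrative model
+only: struct toy_txn, toy_txn_init(), toy_declare() and the choice of -ENOSPC
+as the error code are assumptions, not the accounting used by any real OSD.
+

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy transaction tracking declared block credits; names are hypothetical. */
struct toy_txn {
	uint64_t credits;      /* blocks reserved by declarations so far */
	uint64_t max_credits;  /* per-transaction limit */
};

/* One transaction may use at most one quarter of the journal. */
void toy_txn_init(struct toy_txn *t, uint64_t journal_blocks)
{
	t->credits = 0;
	t->max_credits = journal_blocks / 4;
}

/* Declare an update touching nblocks; fail if the limit would be exceeded. */
int toy_declare(struct toy_txn *t, uint64_t nblocks)
{
	/* credits never exceeds max_credits, so the subtraction is safe */
	if (nblocks > t->max_credits - t->credits)
		return -ENOSPC;
	t->credits += nblocks;
	return 0;
}
```

+
+With a 1024-block journal the limit is 256 blocks, so a caller that has
+already declared 200 blocks can declare 56 more but not 57.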
+
+ii. Lifetime
+------------
+From the Lustre point of view, a transaction goes through the following steps:
+1. creation
+2. declaration of all possible changes planned in transaction
+3. transaction start
+4. execution of planned and declared changes
+5. transaction stop
+6. commit callback(s)
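+
+The ordering of these steps can be modelled as a small state machine. The
+sketch below is an assumption-laden illustration: the toy_* names are
+hypothetical, and counting declarations against executions is a simplification
+of the rule that every executed update must have been declared.
+

```c
#include <assert.h>
#include <errno.h>

enum toy_state { TOY_CREATED, TOY_STARTED, TOY_STOPPED };

struct toy_txn {
	enum toy_state state;
	int declared;  /* updates declared before start */
	int executed;  /* updates executed after start */
};

/* Step 1: creation. */
void toy_create(struct toy_txn *t)
{
	t->state = TOY_CREATED;
	t->declared = 0;
	t->executed = 0;
}

/* Step 2: declarations are only accepted before the start. */
int toy_declare(struct toy_txn *t)
{
	if (t->state != TOY_CREATED)
		return -EINVAL;
	t->declared++;
	return 0;
}

/* Step 3: start; no new declarations are accepted afterwards. */
int toy_start(struct toy_txn *t)
{
	if (t->state != TOY_CREATED)
		return -EINVAL;
	t->state = TOY_STARTED;
	return 0;
}

/* Step 4: execute one update; it must be covered by a declaration. */
int toy_execute(struct toy_txn *t)
{
	if (t->state != TOY_STARTED || t->executed >= t->declared)
		return -EINVAL;
	t->executed++;
	return 0;
}

/* Step 5: stop; commit callbacks (step 6) would run some time later. */
int toy_stop(struct toy_txn *t)
{
	if (t->state != TOY_STARTED)
		return -EINVAL;
	t->state = TOY_STOPPED;
	return 0;
}
```

+
+Note that executing fewer updates than were declared is accepted, since the
+declarations form a worst-case superset of the updates actually performed.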
+
+iii. Methods
+------------
+OSD should implement the following methods to let Lustre control transactions:
+
+struct thandle *(*dt_trans_create)(const struct lu_env *, struct dt_device *);
+int (*dt_trans_start)(const struct lu_env *, struct dt_device *,
+                     struct thandle *);
+int   (*dt_trans_stop)(const struct lu_env *, struct thandle *);
+int   (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *);
+
+dt_trans_create
+       is called to allocate and initialize a transaction handle (see struct
+       thandle). This structure has no pointer to private data, so it should
+       be embedded into the private representation of the transaction at the
+       OSD layer. This method can block.
+dt_trans_start
+       is called to notify the OSD that a specified transaction has received
+       all its declarations. The OSD should now determine whether it has
+       enough resources to proceed with the declared changes, or return an
+       error to the caller. This method can block. The OSD should call the
+       dt_txn_hook_start() function before the underlying file system’s
+       transaction starts to support per-device transaction hooks. If the OSD
+       (or disk file system) cannot start the transaction, then an error is
+       returned and the transaction handle is destroyed; no commit callbacks
+       are called.
+dt_trans_stop
+       is called to notify the OSD that a specified transaction has been
+       executed and no more changes are expected in its context. Usually this
+       means that at this point the OSD is free to start writeout while
+       preserving the all-or-nothing notion. This method can block.
+       If the th_sync flag is set at this point, then the OSD should start to
+       commit this transaction and block until the transaction is committed.
+       The order of the unblock event and the transaction’s commit callback
+       functions is not defined by the API. The OSD should call the
+       dt_txn_hook_stop() function once the underlying file system’s
+       transaction is stopped to support per-device transaction hooks.
+dt_trans_cb_add
+       is called to register commit callback function(s), which the OSD will
+       call upon transaction commit to storage. When all the callback
+       functions have been processed, the transaction handle can be freed by
+       the OSD. There are no constraints on how many callback functions can
+       be running concurrently. They should not be running in an interrupt
+       context. Usually this method should not block and should use
+       spinlocks. As part of commit callback processing, the
+       dt_txn_hook_commit() function should be called to support per-device
+       transaction hooks.
+
+The callback mechanism lets layers that do not control transactions take part
+in them. For example, MDT registers its set of hooks, and from then on every
+transaction happening on the corresponding OSD is seen by MDT, which adds
+recovery information to the transaction: it generates a transaction number and
+puts it into a special file. All this happens within the context of the
+transaction, and therefore atomically.
+Similarly, the VBR functionality in MDT updates object versions.
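+
+Commit callback registration and delivery of the commit result can be
+sketched as below. This is a simplified, single-threaded model: the toy_*
+names and the callback signature are hypothetical and do not match struct
+dt_txn_commit_cb, whose layout is not shown in this document.
+

```c
#include <assert.h>
#include <errno.h>

#define TOY_MAX_CB 4

struct toy_txn;
typedef void (*toy_commit_cb_t)(struct toy_txn *t, void *data, int err);

/* Toy transaction handle carrying its registered commit callbacks. */
struct toy_txn {
	toy_commit_cb_t cb[TOY_MAX_CB];
	void           *data[TOY_MAX_CB];
	int             nr_cb;
};

/* Register a callback to run when the transaction commits. */
int toy_cb_add(struct toy_txn *t, toy_commit_cb_t cb, void *data)
{
	if (t->nr_cb >= TOY_MAX_CB)
		return -ENOSPC;
	t->cb[t->nr_cb] = cb;
	t->data[t->nr_cb] = data;
	t->nr_cb++;
	return 0;
}

/* Called once the backend reports the commit result (0 or an error). */
void toy_commit(struct toy_txn *t, int err)
{
	for (int i = 0; i < t->nr_cb; i++)
		t->cb[i](t, t->data[i], err);
	t->nr_cb = 0;  /* the handle could be freed from here on */
}

/* Example callback: store the commit result where the caller can see it. */
void toy_record_result(struct toy_txn *t, void *data, int err)
{
	(void)t;
	*(int *)data = err;
}
```

+
+Every registered callback receives the same result code, so a failed commit
+(for example -EIO) is visible to all interested layers.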
+
+4. Locking
+==========
+
+i. Description
+--------------
+The OSD is expected to maintain the internal consistency of the file system and
+its objects on its own, requiring no additional locking or serialization from
+higher levels. This lets the OSD control how fine-grained the locking is,
+depending on the internal structure of a specific file system. If several
+updates conflict, then the result is not defined by the OSD API and is left to
+the OSD.
+
+The OSD should provide the caller with a few methods to serialize access to an
+object in shared and exclusive mode. It is up to the caller how to use them and
+how to define the locking order. In general the locks provided by the OSD are
+used to group complex updates so that other threads do not see intermediate
+results of operations.
+
+ii. Methods
+-----------
+The set of methods exported by each OSD to lock and unlock an object is the
+following:
+
+void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned);
+void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned);
+void (*do_read_unlock)(const struct lu_env *, struct dt_object *);
+void (*do_write_unlock)(const struct lu_env *, struct dt_object *);
+int  (*do_write_locked)(const struct lu_env *, struct dt_object *);
+
+do_read_lock
+       gets a shared lock on the object. This is a blocking lock.
+do_write_lock
+       gets an exclusive lock on the object. This is a blocking lock.
+do_read_unlock
+       releases a shared lock on an object.
+do_write_unlock
+       releases an exclusive lock on an object.
+do_write_locked
+       checks whether an object is exclusive-locked.
+
+It is highly desirable that an OSD object can be accessed and modified by
+multiple threads concurrently.
+
+For regular objects, the preferred implementation allows an object to be read
+concurrently at overlapping offsets, and written by multiple threads at
+non-overlapping offsets with the minimum amount of contention possible, or any
+combination of concurrent read/write operations. Lustre will not itself perform
+concurrent overlapping writes to a single region of the object, due to
+serialization at a higher level.
+
+For index objects, the preferred implementation allows key/value pairs to be
+looked up concurrently, allows non-conflicting keys to be inserted or removed
+concurrently, or any combination of concurrent lookup, insertion, or removal.
+Lustre does not require the storage of multiple identical keys. Operations on
+the same key should be serialized.
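+
+The compatibility rules between shared and exclusive locks can be modelled
+with two counters. This is deliberately not a real lock: the toy_* names are
+hypothetical, the code is not thread-safe, and it uses try-lock semantics
+(returning false) where the OSD methods would block instead.
+

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-object lock state; a model of the rules, not a usable lock. */
struct toy_lock {
	int  readers;  /* number of shared holders */
	bool writer;   /* true while held exclusively */
};

/* Shared lock: compatible with other readers, not with a writer. */
bool toy_read_trylock(struct toy_lock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

/* Exclusive lock: incompatible with any other holder. */
bool toy_write_trylock(struct toy_lock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

void toy_read_unlock(struct toy_lock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

void toy_write_unlock(struct toy_lock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Counterpart of do_write_locked: is the object exclusively locked? */
bool toy_write_locked(const struct toy_lock *l)
{
	return l->writer;
}
```

+
+A caller grouping a multi-step update would hold the exclusive lock across all
+steps, so that readers never observe the intermediate state.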
+
+========================
+= V. Quota Enforcement =
+========================
+
+1. Overview
+===========
+
+The OSD layer is in charge of setting up a Quota Slave Device (aka QSD) to
+manage quota enforcement for a specific OSD device. The QSD is implemented as a
+library. Each OSD device should create a QSD instance which will be used to
+manage quota enforcement for this device. This implies:
+- completing the reintegration procedure with the quota master (aka QMT) to
+  retrieve the latest quota settings and quota space distribution for each
+  UID/GID.
+- managing quota locks in order to be notified of configuration changes.
+- acquiring space from the QMT when quota space for a given user/group is
+  close to exhaustion.
+- allocating quota space to service threads for local request processing.
+
+The reintegration procedure allows a disconnected slave to re-synchronize with
+the quota master, which means:
+- re-acquiring quota locks,
+- fetching up-to-date quota settings (e.g. list of UIDs with quota enforced),
+- reporting space usage to the master for newly enforced UID/GID (e.g.
+  setquota was run while the slave wasn't connected),
+- adjusting spare quota space (e.g. the slave holds a large amount of unused
+  quota space for a user who ran out of quota space on the master while the
+  slave was disconnected).
+
+The latter two actions are known as reconciliation.
+
+2. QSD API
+==========
+
+The QSD API is defined in lustre/include/lustre_quota.h as follows:
+
+struct qsd_instance *qsd_init(const struct lu_env *, char *, struct dt_device *,
+                              cfs_proc_dir_entry_t *);
+int qsd_prepare(const struct lu_env *, struct qsd_instance *);
+int qsd_start(const struct lu_env *, struct qsd_instance *);
+void qsd_fini(const struct lu_env *, struct qsd_instance *);
+int qsd_op_begin(const struct lu_env *, struct qsd_instance *,
+                 struct lquota_trans *, struct lquota_id_info *, int *);
+void qsd_op_end(const struct lu_env *, struct qsd_instance *,
+                struct lquota_trans *);
+void qsd_adjust_quota(const struct lu_env *, struct qsd_instance *,
+                      union lquota_id *, int);
+
+qsd_init
+       The OSD module should first allocate a qsd instance via qsd_init.
+       This creates all required structures to manage quota enforcement for
+       this target and performs all low-level initialization which does not
+       involve any lustre object. qsd_init should typically be called when
+       the OSD is being set up.
+
+qsd_prepare
+       This sets up on-disk objects associated with the quota slave feature
+       and initiates the quota reintegration procedure if needed.
+       qsd_prepare should typically be called when ->ldo_prepare is invoked.
+
+qsd_start
+       A qsd instance should be started once recovery is completed (i.e. when
+       ->ldo_recovery_complete is called). This is used to notify the qsd layer
+       that quota should now be enforced again via the qsd_op_begin/end
+       functions. The last step of the reintegration procedure (namely usage
+       reconciliation) will be completed during start.
+
+qsd_fini
+       is used to release a qsd_instance structure allocated with qsd_init.
+       This releases all quota slave objects and frees the structures
+       associated with the qsd_instance.
+
+qsd_op_begin
+       is used to enforce quota. It must be called in the declaration of each
+       operation. qsd_op_end should then be invoked later, once all operations
+       have been completed, in order to release/adjust the quota space.
+       Running qsd_op_begin before qsd_start isn't fatal and will return
+       success. Once qsd_start has been run, qsd_op_begin will block until the
+       reintegration procedure is completed.
+
+qsd_op_end
+       performs the post-operation quota processing. This must be called after
+       the operation's transaction has stopped. While qsd_op_begin must be
+       invoked each time a new operation is declared, qsd_op_end should be
+       called only once for the whole transaction.
+
+qsd_adjust_quota
+       triggers pre-acquire/release if necessary. It is only used for the
+       ldiskfs OSD so far. When a file is unlinked in ldiskfs, the quota
+       accounting isn't updated when the transaction stops. Instead, it is
+       updated on the final iput, so qsd_adjust_quota() is called then (in
+       osd_object_delete()) to trigger quota release if necessary.
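+
+The calling pattern for qsd_op_begin and qsd_op_end (one begin per declared
+operation, one end per transaction) can be modelled as follows. This is a
+sketch with hypothetical toy_* names and a single quota ID. It fails with
+-EDQUOT as soon as the locally granted space is exhausted, whereas the QSD
+described above may first try to acquire more space from the QMT.
+

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy quota slave state for one ID; names and fields are hypothetical. */
struct toy_qsd {
	uint64_t granted;  /* space granted to this slave */
	uint64_t used;     /* space accounted on disk */
	uint64_t pending;  /* space reserved by running transactions */
};

/* Per-transaction quota context. */
struct toy_qtrans {
	uint64_t reserved;
};

/* Called in the declaration of each operation: reserve space or fail. */
int toy_op_begin(struct toy_qsd *q, struct toy_qtrans *qt, uint64_t space)
{
	if (q->used + q->pending + space > q->granted)
		return -EDQUOT;
	q->pending += space;
	qt->reserved += space;
	return 0;
}

/* Called once after the transaction stopped: release the reservation. */
void toy_op_end(struct toy_qsd *q, struct toy_qtrans *qt, uint64_t consumed)
{
	assert(consumed <= qt->reserved);
	q->pending -= qt->reserved;
	q->used += consumed;
	qt->reserved = 0;
}
```

+
+Because the reservation is a worst-case estimate made at declaration time,
+toy_op_end() returns whatever part of it the transaction did not consume.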
+
+Appendix 1. A brief note on Lustre configuration.
+=================================================
+
+In the current versions (1.8, 2.x) the MGS is used to store the configuration
+of the servers, the so-called profile. The profile stores configuration
+commands and arguments to set up a specific stack. To see exactly how it looks
+you can fetch the MDT profile with
+debugfs -R "dump /CONFIGS/lustre-MDT0000 <tempfile>", then parse it with
+llog_reader <tempfile>. Here is a short extract:
+
+#02 (136)attach    0:lustre-MDT0000-mdtlov  1:lov  2:lustre-MDT0000-mdtlov_UUID
+#03 (176)lov_setup 0:lustre-MDT0000-mdtlov  1:(struct lov_desc)
+                uuid=lustre-MDT0000-mdtlov_UUID  stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
+#06 (120)attach    0:lustre-MDT0000  1:mdt  2:lustre-MDT0000_UUID
+#07 (112)mount_option 0:  1:lustre-MDT0000  2:lustre-MDT0000-mdtlov
+#08 (160)setup     0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f
+#23 (080)add_uuid  nid=10.0.2.15@tcp(0x200000a00020f)  0:  1:10.0.2.15@tcp
+#24 (144)attach    0:lustre-OST0000-osc-MDT0000  1:osc  2:lustre-MDT0000-mdtlov_UUID
+#25 (144)setup     0:lustre-OST0000-osc-MDT0000  1:lustre-OST0000_UUID  2:10.0.2.15@tcp
+#26 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov  1:lustre-OST0000_UUID  2:0  3:1
+#32 (080)add_uuid  nid=10.0.2.15@tcp(0x200000a00020f)  0:  1:10.0.2.15@tcp
+#33 (144)attach    0:lustre-OST0001-osc-MDT0000  1:osc  2:lustre-MDT0000-mdtlov_UUID
+#34 (144)setup     0:lustre-OST0001-osc-MDT0000  1:lustre-OST0001_UUID  2:10.0.2.15@tcp
+#35 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov  1:lustre-OST0001_UUID  2:1  3:1
+#41 (120)param 0:  1:sys.jobid_var=procname_uid  2:procname_uid
+#44 (080)set_timeout=20
+#48 (112)param 0:lustre-MDT0000-mdtlov  1:lov.stripesize=1048576
+#51 (112)param 0:lustre-MDT0000-mdtlov  1:lov.stripecount=-1
+#54 (160)param 0:lustre-MDT0000  1:mdt.identity_upcall=/work/lustre/head/lustre-release/lustre/utils/l_getidentity
+
+Every line starts with a specific command (attach, lov_setup, set, etc) that
+performs a specific configuration action, followed by its arguments. Often the
+first argument is a device name. For example,
+#02 (136)attach    0:lustre-MDT0000-mdtlov  1:lov  2:lustre-MDT0000-mdtlov_UUID
+
+This command sets up device “lustre-MDT0000-mdtlov” of type “lov” with the
+additional argument “lustre-MDT0000-mdtlov_UUID”. All these arguments are
+packed into Lustre configuration buffers (struct lustre_cfg).
+
+Other commands (like setup and lov_modify_tgts) attach the device into the
+stack.
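+
+The "command followed by indexed arguments" layout of these records can be
+illustrated with a small parser for the text printed by llog_reader. This is
+a sketch only: struct toy_cfg and toy_cfg_parse() are hypothetical, the
+"#NN (LLL)" prefix is assumed to have been stripped already, and only the
+simple "index:value" argument form is handled (records such as lov_setup or
+add_uuid are rejected). It does not reflect the layout of struct lustre_cfg.
+

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_ARGS 8
#define TOY_ARG_LEN  64

/* Parsed record: a command name plus its indexed arguments. */
struct toy_cfg {
	char cmd[TOY_ARG_LEN];
	char args[TOY_MAX_ARGS][TOY_ARG_LEN];
	int  nr_args;
};

/* Parse "cmd 0:arg0 1:arg1 ..."; returns 0 on success, -1 on bad input. */
int toy_cfg_parse(const char *line, struct toy_cfg *cfg)
{
	char buf[256];
	char *tok;

	if (strlen(line) >= sizeof(buf))
		return -1;
	strcpy(buf, line);
	memset(cfg, 0, sizeof(*cfg));

	/* The first token is the command name. */
	tok = strtok(buf, " \t");
	if (tok == NULL || strlen(tok) >= TOY_ARG_LEN)
		return -1;
	strcpy(cfg->cmd, tok);

	/* Every following token must look like "<index>:<value>". */
	while ((tok = strtok(NULL, " \t")) != NULL) {
		char *colon = strchr(tok, ':');
		char *end;
		long idx;

		if (colon == NULL)
			return -1;
		idx = strtol(tok, &end, 10);
		if (end == tok || end != colon || idx < 0 || idx >= TOY_MAX_ARGS)
			return -1;
		if (strlen(colon + 1) >= TOY_ARG_LEN)
			return -1;
		strcpy(cfg->args[idx], colon + 1);
		if (idx + 1 > cfg->nr_args)
			cfg->nr_args = (int)idx + 1;
	}
	return 0;
}
```

+
+Applied to the attach record above, the parser yields the command "attach",
+the device name as argument 0, the type "lov" as argument 1 and the UUID as
+argument 2.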
+
+Appendix 2. Sample Code
+=======================
+
+Lustre currently has 2 different OSD implementations:
+- ldiskfs OSD under lustre/osd-ldiskfs
+  http://git.whamcloud.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-ldiskfs;hb=HEAD
+- ZFS OSD under lustre/osd-zfs
+  http://git.whamcloud.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-zfs;hb=HEAD