X-Git-Url: https://git.whamcloud.com/?a=blobdiff_plain;f=Documentation%2Fosd-api.txt;fp=Documentation%2Fosd-api.txt;h=e8439f461faf3276f63a8a4dcd01652611390102;hb=204b492ce0856cc03c6e8bf88e925c8c18bc3304;hp=0000000000000000000000000000000000000000;hpb=5ebc00ec79565ad62e978af65b023343ad360675;p=fs%2Flustre-release.git diff --git a/Documentation/osd-api.txt b/Documentation/osd-api.txt new file mode 100644 index 0000000..e8439f4 --- /dev/null +++ b/Documentation/osd-api.txt @@ -0,0 +1,1298 @@ + **************************************************** + * Overview of the Lustre Object Storage Device API * + **************************************************** + +Original Authors: +================= +Alex Zhuravlev +Andreas Dilger +Johann Lombardi +Li Wei +Niu Yawei + +Last Updated: October 9, 2012 + + Copyright (c) 2012, 2013, Intel Corporation. + +This file is released under the GPLv2. + +Topics +====== + +I. Introduction + 1. What OSD API is + 2. What OSD API is Not + 3. Layering + 4. Audience/Goal +II. Backend Storage Subsystem Requirements + 1. Atomicity of Updates + 2. Object Attributes + i. Standard POSIX Attributes + ii. Extended Attributes + 3. Efficient Index + 4. Commit Callbacks + 5. Space Accounting +III. OSD & LU Infrastructure + 1. Devices + i. Device Overview + ii. Device Type & Operations + iii. Device Operations + iv. OBD Methods + 2. Objects + i. Object Overview + ii. Object Lifecycle + iii. Special Objects + iv. Object Operations + 3. Lustre Environment +IV. Data (DT) API + 1. Data Device + 2. Data Objects + i. Common Storage Operations + ii. Data Object Operations + iii. Indice Operations + 3. Transactions + i. Description + ii. Lifetime + iii. Methods + 4. Locking + i. Description + ii. Methods +V. Quota Enforcement + 1. Overview + 2. QSD API +Appendix 1. A brief note on Lustre configuration. +Appendix 2. Sample Code + +=================== += I. Introduction = +=================== + +1. 
What OSD API is +================== + +OSD API is the interface to access and modify data that is supposed to be stored +persistently. This API layer is the interface to code that bridges individual +file systems such as ext4 or ZFS to Lustre. +The API is a generic interface to transaction and journaling based file systems +so many backend file systems can be supported in a Lustre implementation. +Data can be cached within the OSD or backend target and could be destroyed +before hitting storage, but in general the final target is a persistent storage. +This API creates many possibilities, including using object-storage devices or +other new persistent storage technologies. + +2. What OSD API is Not +====================== + +OSD API should not be used to control in-core-only state (like ldlm locking), +configuration, etc. The upper layers of the IO/metadata stack should not be +involved with the underlying layout or allocation in the OSD storage. + +3. Layering +=========== + +Lustre is composed of different kernel modules, each implementing different +layers in the software stack in an object-oriented approach. Generally, each +layer builds (or stacks) upon another, and each object is a child of the +generic LU object class. Hence the term "LU stack" is often used to reference +this hierarchy of lustre modules and objects. + +Each layer (i.e. mdt/mdd/lod/osp/ofd/osd) defines its own generic item +(lu_object/lu_device) which are thus gathered in a compound item (lu_site/ +lu_object_layer) representing the multi-layered stacks. Different classes of +operations can then be implemented by each layer, depending on its natures. + +As a result, each OSD is expected to implement: +- the generic LU API used to manage the device stack and objects (see chapter + III) +- the DT API (most commonly called OSD API) used to manipulate on-disk + structures (see chapter IV). + +4. 
Audience/Goal +================ + +The goal of this document is to provide the reader with the information +necessary to accurately construct a new Object Storage Device (OSD) module +interface layer for Lustre in order to use a new backend file system with +Lustre 2.4 and greater. + +============================================== += II. Backend Storage Subsystem Requirements = +============================================== + +The purpose of this section is to gather the requirements for the storage +subsystems below the OSD API. + +1. Atomicity of Updates +======================= + +The underlying OSD storage must be able to provide some form of atomic commit +for multiple arbitrary updates to OSD storage within a single transaction. +It will always know in advance of the transaction starting which objects will +be modified, and how they will be modified. + +If any of the updates associated with a transaction are stored persistently +(i.e. some state in the OSD is modified), then all of the updates in that +transaction must also be stored persistently (Atomic). If the OSD should fail +in some manner that prevents all the updates of a transaction from being +completed, then none of the updates shall be completed (Consistent). +Once the updates have been reported committed to the caller (i.e. commit +callbacks have been run), they cannot be rolled back for any reason (Durable). + +2. Object Attributes +==================== + +i. 
Standard POSIX Attributes +---------------------------- +The OSD object should be able to store normal POSIX attributes on each object +as specified by Lustre: +- user ID (32 bits) +- group ID (32 bits) +- object type (16 bits) +- access mode (16 bits) +- metadata change time (96 bits, 64-bit seconds, 32-bit nanoseconds) +- data modification time (96 bits, 64-bit seconds, 32-bit nanoseconds) +- data access time (96 bits, 64-bit seconds, 32-bit nanoseconds) +- creation time (96 bits, 64-bit seconds, 32-bit nanoseconds, optional) +- object size (64 bits) +- link count (32 bits) +- flags (32 bits) +- object version (64 bits) + +The OSD object shall not modify these attributes itself. + +In addition, it is desirable to track the object allocation size (“blocks”), which +the OSD manages itself. Lustre will query the object allocation size, but will +never modify it. If these attributes are not managed by the OSD natively as part +of the object itself, they can be stored in an extended attribute associated +with the object. + +ii. Extended Attributes +------------------------ +The OSD should have an efficient mechanism for storing small extended attributes +with each object. This implies that the extended attributes can be accessed at +the same time as the object (without extra seek/read operations). There is also +a requirement to store larger extended attributes in some cases (over 1kB in +size), but access to such attributes may be slower in proportion to the +attribute size. + +3. Efficient Index +================== + +The OSD must provide a mechanism for efficient key=value retrieval, for both +fixed-length and variable-length keys and values. It is expected that an index +may hold tens of millions of keys, and must be able to do random key lookups +in an efficient manner. It must also provide a mechanism for iterating over all +of the keys in a particular index and returning these to the caller in a +consistent order across multiple calls.
It must be able to provide a cookie that +identifies the position in the index at which the iteration currently stands, and +must be able to resume iteration from this position at a later time. + +4. Commit Callbacks +=================== + +The OSD must provide some mechanism to register multiple arbitrary callback +functions for each transaction, and call these functions after the transaction +with which they are associated has committed to persistent storage. +It is not required that they be called immediately at transaction commit time, +but they cannot be delayed an arbitrarily long time, or other parts of the +system may suffer resource exhaustion. If this mechanism is not implemented by +the underlying storage, then it needs to be provided in some manner by the OSD +implementation itself. + +5. Space Accounting +=================== + +In order to provide quota functionality for the OSD, it must be able to track +the object allocation size against at least two different keys (typically User +ID and Group ID). The actual mechanism of tracking this allocation is internal +to the OSD. Lustre will specify the owners of the object against which to track +this space. Space accounting information will be accessed by Lustre via the +index API on special objects dedicated to space allocation management. + +================================ += III. OSD & LU Infrastructure = +================================ + +As a member of the LU stack, each OSD module is expected to implement the +generic LU API used to manage devices and objects. + +1. Devices +========== + +i. Device Overview +------------------ +Each layer in the stack is represented by a lu_device structure which holds +the very basic data: a reference counter, a reference to the site (the in-core +collection of Lustre objects, very similar to the inode cache) and a reference to +struct lu_device_type, which in turn describes this specific type of device +(type name, operations, etc).
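As a rough illustration of the bookkeeping described above, the following
userspace sketch models a device that holds a reference counter, a pointer to
its site and a pointer to its type. All demo_* names are hypothetical stand-ins
invented for this example; they are not the real struct lu_device, struct
lu_site or struct lu_device_type definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not the real Lustre structures. */
struct demo_site {
	int ds_nr_objects;		/* objects currently cached in-core */
};

struct demo_device_type {
	const char *ddt_name;		/* type name shared by all instances */
	int	    ddt_nr_devices;	/* live devices of this type */
};

struct demo_device {
	int			 dd_ref;	/* reference counter */
	struct demo_site	*dd_site;	/* in-core object collection */
	struct demo_device_type	*dd_type;	/* describes this kind of device */
};

/* Bind a device to its type and site; the device starts unreferenced. */
void demo_device_init(struct demo_device *d, struct demo_device_type *t,
		      struct demo_site *s)
{
	d->dd_ref = 0;
	d->dd_site = s;
	d->dd_type = t;
	t->ddt_nr_devices++;
}

/* Take a reference so the device cannot go away while in use. */
void demo_device_get(struct demo_device *d)
{
	d->dd_ref++;
}

/* Drop a reference; returns 1 when the last reference is gone. */
int demo_device_put(struct demo_device *d)
{
	assert(d->dd_ref > 0);
	return --d->dd_ref == 0;
}
```

Two devices initialized against the same demo_device_type share one type
descriptor, which is the property that lets per-type methods be stored once.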
+ +The OSD device is created and initialized at mount time to let the configuration +component access the data it needs before the whole Lustre stack is ready. +The OSD device is destroyed once all the devices using it have been destroyed. +Usually this happens when the server stack shuts down at unmount time. + +There might be several OSD devices of a given type (say, a few zfs-osd and +ldiskfs-osd instances). The type stores the methods common to all OSD instances of +that type (below they start with the ldto_ prefix). Every instance of an OSD device +can then get a few specific methods (below they start with the ldo_ prefix). + +To connect devices into a stack, the ->o_connect() method is used (see struct +obd_ops). Currently an OSD should implement this method to track all its users. +To disconnect, the ->o_disconnect() method is used. An OSD should implement this +method, track the remaining users and, once no users are left, call the +class_manual_cleanup() function, which initiates removal of the OSD. + +As the stack involves many devices and there may be cross-references between +them, it’s easier to break the whole shutdown procedure into two steps and +not impose a specific order in which the different devices shut down: in the first +step the devices release all the resources they use internally +(the so-called pre-cleanup procedure), in the second step they are actually +destroyed. + +ii. Device Type & Operations +---------------------------- +The first thing to do when developing a new OSD is to define a lu_device_type +structure to describe and register the new OSD type. The following fields of +lu_device_type need to be filled in appropriately: +ldt_tags + is the type of device, typically data, metadata or client (see + lu_device_tag). An OSD device is of data type and should always + register as such by setting this field to LU_DEVICE_DT. +ldt_name + is the name associated with the new OSD type. + See LUSTRE_OSD_{LDISKFS,ZFS}_NAME for reference.
+ldt_ops + is the vector of lu_device_type operations, please see below for + further details +ldt_ctxt_type + is the lu_context_tag to be used for operations. + This should be set to LCT_LOCAL for OSDs. + +In the original 2.0 MDS stack the devices were built from the top down and the OSD +was the final device to set up. This scheme does not work very well when you have +to access on-disk data early and when you have an OSD shared among several services +(e.g. MDS + MGS on the same storage). So the scheme has changed to the reverse one: +the mount procedure sets up the correct OSD, then the stack is built from the bottom up. +And instead of introducing another set of methods we decided to use the existing +obd_connect() and obd_disconnect() given that many existing devices have +already been configured this way by the configuration component. Notice also that +configuration profiles are organized in this order (LOV/LOD go first, then MDT). +Given that the device “below” is ready at every step, there is no point in calling +a separate init method. + +Because of complexity in other modules, where the device itself can be referenced by +a number of entities like exports, RPCs, transactions, callbacks and access via +procfs, the notion of precleanup was introduced to be able to stop all the activity +safely before the actual cleanup takes place. Similarly ->ldto_device_fini() +and ->ldto_device_free() were introduced. So, the former should be used to break +any interaction with the outside, the latter - to actually free the device. + +So, the configuration component meets the SETUP command in the configuration profile +(see Appendix 1), finds the appropriate device and calls ->ldto_device_alloc() to +set it up as an LU device.
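The split between precleanup and the final free can be sketched in userspace C
as follows. This is a minimal model under stated assumptions, not the Lustre
implementation: the demo_* structures and functions are hypothetical stand-ins
for ->ldto_device_alloc(), ->ldto_device_fini() and ->ldto_device_free(), and
embedding the generic device inside a private structure is assumed here as the
way to reach private data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic device; not the real struct lu_device. */
struct demo_lu_device {
	int ld_ref;
};

/* Private device embedding the generic one. */
struct demo_osd_device {
	struct demo_lu_device od_lu;		/* generic part handed to the stack */
	int		      od_attached;	/* 1 while backend resources are held */
};

/* Recover the private structure from the embedded generic device. */
struct demo_osd_device *demo_osd_dev(struct demo_lu_device *ld)
{
	return (struct demo_osd_device *)
		((char *)ld - offsetof(struct demo_osd_device, od_lu));
}

/* Role of ->ldto_device_alloc(): allocate, attach, return the generic part. */
struct demo_lu_device *demo_device_alloc(void)
{
	struct demo_osd_device *od = calloc(1, sizeof(*od));

	if (od == NULL)
		return NULL;
	od->od_attached = 1;
	return &od->od_lu;
}

/* Role of ->ldto_device_fini(): precleanup, break interaction with the outside. */
void demo_device_fini(struct demo_lu_device *ld)
{
	demo_osd_dev(ld)->od_attached = 0;
}

/* Role of ->ldto_device_free(): release the memory taken by the alloc step. */
void demo_device_free(struct demo_lu_device *ld)
{
	struct demo_osd_device *od = demo_osd_dev(ld);

	assert(od->od_attached == 0);	/* precleanup must have run first */
	free(od);
}
```

A caller would allocate with demo_device_alloc(), run demo_device_fini() once
no outside entity may touch the device any more, and only then call
demo_device_free(); the assertion in the last function documents that ordering.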
+ +The prototypes of device type operations are the following: + +struct lu_device *(*ldto_device_alloc)(const struct lu_env *, + struct lu_device_type *, + struct lustre_cfg *); +struct lu_device *(*ldto_device_free)(const struct lu_env *, + struct lu_device *); +int (*ldto_device_init)(const struct lu_env *, struct lu_device *, + const char *, struct lu_device *); +struct lu_device *(*ldto_device_fini)(const struct lu_env *env, struct lu_device *); +int (*ldto_init)(struct lu_device_type *t); +void (*ldto_fini)(struct lu_device_type *t); +void (*ldto_start)(struct lu_device_type *t); +void (*ldto_stop)(struct lu_device_type *t); + +ldto_device_alloc + The method is called by the configuration component (in case of a disk file + system OSD, this is lustre/obdclass/obd_mount.c) to allocate a device. + Notice the generic struct lu_device does not hold a pointer to private data. + Instead the OSD should embed struct lu_device into its own structure (like + struct osd_device) and return the address of lu_device in that structure. +ldto_device_fini + The method is called when the OSD is about to be released. The OSD should detach + from resources like the disk file system and procfs, release objects it holds + internally, etc. This is the so-called precleanup procedure. +ldto_device_free + The method is called to actually release memory allocated in + ->ldto_device_alloc(). +ldto_device_init + The method is not used by OSD currently. +ldto_init + The method is called when a specific type of OSD is registered in the + system. Currently the method is used to register OSD-specific data for + environments (see Lustre Environment in III.3). + See LU_TYPE_INIT_FINI() macro as an example. +ldto_fini + The method is called when a specific type of OSD unregisters. + Currently used to unregister OSD-specific data from environments. +ldto_start + The method is called when the first device of this type is being + instantiated. Currently used to fill existing environments with + OSD-specific data.
+ldto_stop + This method is called when the last instance of a specific OSD has gone. + Currently used to release OSD-specific data from environments. + +iii. Device Operations +---------------------- +Now that the OSD device can be set up, we need to export methods to handle +device-level operations. All those methods are listed in the lu_device_operations +structure, which includes: + +struct lu_object *(*ldo_object_alloc)(const struct lu_env *, + const struct lu_object_header *, + struct lu_device *); +int (*ldo_process_config)(const struct lu_env *, struct lu_device *, + struct lustre_cfg *); +int (*ldo_recovery_complete)(const struct lu_env *, struct lu_device *); +int (*ldo_prepare)(const struct lu_env *, struct lu_device *, + struct lu_device *); + +ldo_object_alloc + The method is called when a high-level service wants to access an + object not found in the local Lustre cache (see struct lu_site). + The OSD should allocate a structure, initialize the object’s methods and return + a pointer to the struct lu_object which is embedded into the OSD object + structure. +ldo_process_config + The method is called in case of configuration changes. Mostly used by + high-level services to update local tunables. It’s also possible to let + the MGS store OSD tunables and set them properly on every server mount or + when tunables change at run time. +ldo_recovery_complete + The method is called when the recovery procedure between a server and + clients is completed. This method is used by high-level devices mostly + (like OSP to clean up OST orphans, MDD to clean up open unlinked files + left by a missing client, etc). +ldo_prepare + The method is called when all the devices belonging to the stack are + configured and set up properly. At this point the server becomes ready + to handle RPCs and start the recovery procedure. + In the current implementation the OSD uses this method to initialize local quota + management. + +iv.
OBD Methods +---------------- +Although the LU infrastructure aims at replacing the storage operations of the +legacy OBD API (see struct obd_ops in lustre/include/obd.h), the OBD API is +still used in several places for device configuration and on the Lustre client +(e.g. it’s still used on the client for LDLM locking). The OBD API storage +operations are not needed for server components, and should be ignored. + +As far as the OSD layer is concerned, upper layers still connect/disconnect +to/from the OSD instance via obd_ops::o_connect/disconnect. As a result, each +OSD should implement those two operations: + +int (*o_connect)(const struct lu_env *, struct obd_export **, + struct obd_device *, struct obd_uuid *, + struct obd_connect_data *, void *); +int (*o_disconnect)(struct obd_export *); + +o_connect + The method should track the number of connections made (i.e. the number of + active users of this OSD), call class_connect() and return a struct + obd_export via class_conn2export(), see osd_obd_connect(). The structure + holds a reference on the device, preventing it from early release. +o_disconnect + The method is called when someone using this OSD does not need its + service any more (i.e. at unmount). For every struct obd_export passed, the + method should call class_disconnect(export). Once the last user has + gone, the OSD should call class_manual_cleanup() to schedule the device + removal. + +2. Objects +========== + +i. Object Overview +------------------ +Lustre identifies objects in the underlying OSD storage by a unique 128-bit +File IDentifier (FID) that is specified by Lustre and is the only identifier +that Lustre is aware of for this object. The FID is known to Lustre before any +access to the object is done (even before it is created), using +lu_object_find().
Since Lustre only uses the FID to identify an object, if the +underlying OSD storage cannot directly use the Lustre-specified FID to retrieve +the object at a later time, it must create a table or index object (normally +called the Object Index (OI)) to map Lustre FIDs to an internal object +identifier. Lustre does not need to understand the format or value of the +internal object identifier at any time outside of the OSD. + +The FID itself is composed of 3 members: +struct lu_fid { + __u64 f_seq; + __u32 f_oid; + __u32 f_ver; +}; + +While the OSD itself should typically not interpret the FID, it may be possible +to optimize the OSD performance by understanding the properties of a FID. + +The f_seq (sequence) component is allocated in piecewise (though not contiguous) +manner to different nodes, and each sequence forms a “group” of related objects. +The sequence number may be any value in the range [1, 2^63], but there are +typically not a huge number of sequences in use at one time (typically less than +one million at the maximum). Within a single sequence, it is likely that tens to +thousands (and less commonly millions) of mostly-sequential f_oid values will be +allocated. In order to efficiently map FIDs into objects, it is desirable to +also be able to associate the OSD-internal index with key-value pairs. + +Every object is represented by a header (struct lu_object_header) and a so-called +slice on every layer of the stack. Core Lustre code maintains a cache of objects +(the so-called lu-site, see struct lu_site), which is very similar to the Linux inode +cache. + +ii. Object Lifecycle +-------------------- +An in-core object is created when a high-level service needs it to process an RPC or +perform some background job like LFSCK. The FID of the object is supposed to be +known before the object is created. The FID can come from an RPC or from disk.
+With the FID in hand, the lu_object_find() function is called; it searches for the +object in the cache (see struct lu_site) and, if the object is not found, creates it +using the ->ldo_object_alloc(), ->loo_object_init() and ->loo_object_start() +methods. + +Objects are referenced and tracked by the Lustre core. If an object is not in use, +it’s put on an LRU list and at some point (subject to internal caching policy or +memory pressure callbacks from the kernel) Lustre schedules such an object for +removal from the cache. To do so the Lustre core marks the object as going out and +calls ->loo_object_release() and ->loo_object_free(), iterating over all the +layers involved. + +iii. Special Objects +-------------------- +Lustre uses a set of special objects using the FID_SEQ_LOCAL_FILE sequence. +All the objects are listed in the local_oid enum, which includes: +- OTABLE_IT_OID which is an index object providing a list of all existing + objects on this storage. The key is an opaque string and the record is a FID. + This object is used by high-level components like LFSCK to iterate over + objects. +- ACCT_USER_OID/ACCT_GROUP_OID are used for accessing space accounting + information for users and groups respectively. +- LAST_RECV_OID is the last_rcvd file of + the MDT and the OST. + +iv. Object Operations +--------------------- +Object management methods are called by Lustre to manipulate OSD-specific +(private) data associated with a specific object during the lifetime of an +object.
All the object operations are described in struct lu_object_operations: + +int (*loo_object_init)(const struct lu_env *, struct lu_object *, + const struct lu_object_conf *); +int (*loo_object_start)(const struct lu_env *, struct lu_object *); +void (*loo_object_delete)(const struct lu_env *, struct lu_object *); +void (*loo_object_free)(const struct lu_env *, struct lu_object *); +void (*loo_object_release)(const struct lu_env *, struct lu_object *); +int (*loo_object_print)(const struct lu_env *, void *, lu_printer_t, + const struct lu_object *); +int (*loo_object_invariant)(const struct lu_object *); + +loo_object_init + This method is called when a new object is being created (see + lu_object_alloc()). Its purpose is to initialize the object’s internals: + usually the file system looks up the object on disk (notice a header storing + the FID is already created by a top device) using the Object Index, which maps + the FID to a local object id like a dnode. LOC_F_NEW can be passed to the method + when the caller knows the object is new, so the OSD can skip the OI lookup to + improve performance. If the object exists, then the LOHA_EXISTS flag in + loh_flags (struct lu_object_header) is set. +loo_object_start + The method is called when all the structures and the header are + initialized. Currently used by high-level services as a post-init + procedure (i.e. to set up their own methods depending on the object type which is + brought into the header by the OSD’s ->loo_object_init()). +loo_object_delete + is called to let the OSD release resources behind an object (except memory + allocated for the object), like releasing the file system’s inode. + It’s separated from ->loo_object_free() to be able to iterate over + still-existing objects. The main purpose of separating + ->loo_object_delete() and ->loo_object_free() is to avoid recursion + during a potentially stack-consuming resource release.
+loo_object_free + is called to actually release memory allocated by ->ldo_object_alloc() +loo_object_release + is called when the object loses its last user and moves onto the LRU list of + unused objects. Implementation of this method is optional for an OSD. +loo_object_print + is used for debugging purposes; it should output details of an object in + human-readable format. Details usually include information like the address + of an object, local object number (dnode/inode), type of an object, etc. +loo_object_invariant + another optional method for debugging purposes which is called to verify + the internal consistency of an object. + +3. Lustre Environment +===================== + +There is a notion of an environment represented by struct lu_env in many +functions and methods. Literally this is Thread Local Storage (TLS), which is +bound to every service thread and used by that thread exclusively, so there is no +need to serialize access to the data stored here. +The original purpose of the environment was to work around the small Linux stack +(4-8K). A component (like a device or library) can register its own descriptor +(see LU_KEY_INIT macro) and then every new thread will populate the +environment with the buffers described. + +===================== += IV. Data (DT) API = +===================== + +The previous section listed all the methods that have to be provided by an OSD +module in order to fit in the LU stack. In addition to those generic functions, +each layer should implement a different class of operations depending on its +nature. There are currently 3 classes of devices: +- LU_DEVICE_DT: DaTa device (e.g. lod, osp, osd, ofd), +- LU_DEVICE_MD: MetaData device (e.g. mdt, mdd), +- LU_DEVICE_CL: CLient I/O device (e.g. vvp, lov, lovsub, osc). + +The purpose of this section is to document the DT API (used for devices and +objects) which has to be implemented by each OSD module. The DT API is most +commonly called the OSD API. + +1.
Data Device +============== + +To access disk file system, Lustre defines a new device type called dt_device +which is a sub-class of generic lu_device. It includes a new operation vector +(namely dt_device_operations structure) defining all the actions that can be +performed against a dt_device. Here are the operation prototypes: + +int (*dt_statfs)(const struct lu_env *, struct dt_device *, + struct obd_statfs *); +struct thandle *(*dt_trans_create)(const struct lu_env *, struct dt_device *); +int (*dt_trans_start)(const struct lu_env *, struct dt_device *, + struct thandle *th); +int (*dt_trans_stop)(const struct lu_env *, struct thandle *); +int (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *); +int (*dt_root_get)(const struct lu_env *, struct dt_device *, + struct lu_fid *); +void (*dt_conf_get)(const struct lu_env *, const struct dt_device *, + struct dt_device_param *); +int (*dt_sync)(const struct lu_env *, struct dt_device *); +int (*dt_ro)(const struct lu_env *, struct dt_device *); +int (*dt_commit_async)(const struct lu_env *, struct dt_device *); + +dt_trans_create +dt_trans_start +dt_trans_stop +dt_trans_cb_add + please refer to IV.3 +dt_statfs + called to report current file system usage information: all, free and + available blocks/objects. +dt_root_get + called to get FID of the root object. Used to follow backend filesystem + rules and support backend file system in a state where users can mount + it directly (with ldiskfs/zfs/etc). +dt_sync + called to flush all complete but not written transactions. Should block + until the flush is completed. +dt_ro + called to turn backend into read-only mode. + Used by testing infrastructure to simulate recovery cases. +dt_commit_async + called to notify OSD/backend that higher level need transaction to be + flushed as soon as possible. Used by Commit-on-Share feature. + Should return immediately and not block for long. + +2. 
Data Objects +=============== + +There are two types of DT objects: +1) regular objects, storing unstructured data (e.g. flat files, OST objects, + llog objects) +2) index objects, storing key=value pairs (e.g. directories, quota indexes, + FLDB) + +As a result, there are 3 sets of methods that should be implemented by the OSD +layer: +- core methods used to create/destroy/manipulate attributes of objects +- data methods used to access the object body as a flat address space + (read/write/truncate/punch) for regular objects +- index operations to access index objects as a key-value association + +A data object is represented by the dt_object structure which is defined as +a sub-class of lu_object, plus operation vectors for the core, data and index +methods as listed above. + +i. Common Storage Operations +---------------------------- +The core methods are defined in dt_object_operations as follows: + +void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned); +void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned); +void (*do_read_unlock)(const struct lu_env *, struct dt_object *); +void (*do_write_unlock)(const struct lu_env *, struct dt_object *); +int (*do_write_locked)(const struct lu_env *, struct dt_object *); +int (*do_attr_get)(const struct lu_env *, struct dt_object *, + struct lu_attr *); +int (*do_declare_attr_set)(const struct lu_env *, struct dt_object *, + const struct lu_attr *, struct thandle *); +int (*do_attr_set)(const struct lu_env *, struct dt_object *, + const struct lu_attr *, struct thandle *); +int (*do_xattr_get)(const struct lu_env *, struct dt_object *, + struct lu_buf *, const char *); +int (*do_declare_xattr_set)(const struct lu_env *, struct dt_object *, + const struct lu_buf *, const char *, int, + struct thandle *); +int (*do_xattr_set)(const struct lu_env *, struct dt_object *, + const struct lu_buf *, const char *, int, + struct thandle *); +int (*do_declare_xattr_del)(const struct lu_env *, 
struct dt_object *, + const char *, struct thandle *); +int (*do_xattr_del)(const struct lu_env *, struct dt_object *, const char *, + struct thandle *); +int (*do_xattr_list)(const struct lu_env *, struct dt_object *, + struct lu_buf *); +void (*do_ah_init)(const struct lu_env *, struct dt_allocation_hint *, + struct dt_object *, struct dt_object *, cfs_umode_t); +int (*do_declare_create)(const struct lu_env *, struct dt_object *, + struct lu_attr *, struct dt_allocation_hint *, + struct dt_object_format *, struct thandle *); +int (*do_create)(const struct lu_env *, struct dt_object *, struct lu_attr *, + struct dt_allocation_hint *, struct dt_object_format *, + struct thandle *); +int (*do_declare_destroy)(const struct lu_env *, struct dt_object *, + struct thandle *); +int (*do_destroy)(const struct lu_env *, struct dt_object *, struct thandle *); +int (*do_index_try)(const struct lu_env *, struct dt_object *, + const struct dt_index_features *); +int (*do_declare_ref_add)(const struct lu_env *, struct dt_object *, + struct thandle *); +int (*do_ref_add)(const struct lu_env *, struct dt_object *, struct thandle *); +int (*do_declare_ref_del)(const struct lu_env *, struct dt_object *, + struct thandle *); +int (*do_ref_del)(const struct lu_env *, struct dt_object *, struct thandle *); +int (*do_object_sync)(const struct lu_env *, struct dt_object *); + +do_read_lock +do_write_lock +do_read_unlock +do_write_unlock +do_write_locked + please refer to IV.4 +do_attr_get + The method is called to get regular attributes an object stores. + The lu_attr fields maps the usual unix file attributes, like ownership + or size. The object must exist. +do_declare_attr_set + the method is called to notify OSD the caller is going to modify regular + attributes of an object in specified transaction. OSD should use this + method to reserve resources needed to change attributes. Can be called + on an non-existing object. 
+do_attr_set + the method is called to change attributes of an object. The object + must exist. +do_xattr_get + called when the caller needs to get an extended attribute with a + specified name. If the struct lu_buf argument has a null lb_buf, the + size of the extended attribute should be returned. If the requested + extended attribute does not exist, -ENODATA should be returned. + The object must exist. If buffer space (specified in lu_buf.lb_len) is + not enough to fit the value, then return -ERANGE. +do_declare_xattr_set + called to notify OSD the caller is going to set/change an extended + attribute on an object. OSD should use this method to reserve resources + needed to change an attribute. +do_xattr_set + called when the caller needs to change an extended attribute with + specified name. If the fl argument has LU_XATTR_CREATE, the extended + attribute must not exist, otherwise -EEXIST should be returned. + If the fl argument has LU_XATTR_REPLACE, the extended attribute must + exist, otherwise -ENODATA should be returned. The object must exist. + The maximum size of extended attribute supported by OSD should be + present in struct dt_device_param the caller can get with + ->dt_conf_get() method. +do_declare_xattr_del + called to notify OSD the caller is going to remove an extended attribute + with a specified name +do_xattr_del + called when the caller needs to remove an extended attribute with a + specified name. Deleting a nonexistent extended attribute is allowed: + the method called on a non-existing attribute returns 0. + The object must exist. +do_xattr_list + called when the caller needs to get a list of existing extended + attributes (only names of attributes are returned). The size of the list + is returned, including the string terminator. If the lu_buf argument has + a null lb_buf, how many bytes the list would require is returned to help + the caller to allocate a buffer of an appropriate size. + The object must exist.
+do_ah_init + called to let OSD prepare an allocation hint, which stores information + about object locality and type. Later this allocation hint is passed to + the ->do_create() method, and OSD can use this information to optimize + on-disk object location. The allocation hint is opaque to the caller and + can contain OSD-specific information. +do_declare_create + called to notify OSD the caller is going to create a new object in a + specified transaction. +do_create + called to create an object on the OSD in a specified transaction. + For index objects the caller can request a set of index properties (like + key/value size). If OSD cannot support the requested properties, it should + return an error. The object must not exist already (i.e. + dt_object_exist() should return false). +do_declare_destroy + called to notify OSD the caller is going to destroy an object in a + specified transaction. +do_destroy + called to destroy an object in a specified transaction. Semantically, + it is the dual of object creation and does not care about on-disk references + to the object (in contrast with the POSIX unlink operation). + The object must exist (i.e. dt_object_exist() must return true). +do_index_try + called when the caller needs to use an object as an index (the object + should have been created as an index before). The caller also specifies + the set of properties it expects the index to support. +do_declare_ref_add + called to notify OSD the caller is going to increment the nlink attribute + in a specified transaction. +do_ref_add + called to increment the nlink attribute in a specified transaction. + The object must exist. +do_declare_ref_del + called to notify OSD the caller is going to decrement the nlink attribute + in a specified transaction. +do_ref_del + called to decrement the nlink attribute in a specified transaction. + This is typically done on an object when a record referring to it is + deleted from an index object. The object must exist. +do_object_sync + called to flush a given object to disk.
It is a fine-grained version of the + ->do_sync() method, which should make sure an object is stored on disk. + OSD (or the backend file system) can track the status of every object and, + if an object is already flushed, the method can just return + immediately. The method is used on OSS now, but can also be used on MDS + at some point to improve performance of COS. +do_data_get + the method is not used any more and is planned for removal. + +ii. Data Object Operations +-------------------------- +The set of methods described in struct dt_body_operations should be used with +regular objects storing unstructured data: + +ssize_t (*dbo_read)(const struct lu_env *, struct dt_object *, struct lu_buf *, + loff_t *pos); +ssize_t (*dbo_declare_write)(const struct lu_env *, struct dt_object *, + const loff_t, loff_t, struct thandle *); +ssize_t (*dbo_write)(const struct lu_env *, struct dt_object *, + const struct lu_buf *, loff_t *, struct thandle *, int); +int (*dbo_bufs_get)(const struct lu_env *, struct dt_object *, loff_t, + ssize_t, struct niobuf_local *, int); +int (*dbo_bufs_put)(const struct lu_env *, struct dt_object *, + struct niobuf_local *, int); +int (*dbo_write_prep)(const struct lu_env *, struct dt_object *, + struct niobuf_local *, int); +int (*dbo_declare_write_commit)(const struct lu_env *, struct dt_object *, + struct niobuf_local *, int, struct thandle *); +int (*dbo_write_commit)(const struct lu_env *, struct dt_object *, + struct niobuf_local *, int, struct thandle *); +int (*dbo_read_prep)(const struct lu_env *, struct dt_object *, + struct niobuf_local *, int); +int (*dbo_fiemap_get)(const struct lu_env *, struct dt_object *, + struct ll_user_fiemap *); +int (*dbo_declare_punch)(const struct lu_env *, struct dt_object *, __u64, + __u64, struct thandle *); +int (*dbo_punch)(const struct lu_env *, struct dt_object *, __u64, __u64, + struct thandle *); + +dbo_read + is called to read raw unstructured data from a specified range of an + object.
It returns the number of bytes read or an error. Usually OSD + implements this method using internal buffering (to be able to put data + at a non-aligned address). So this method should not be used to move a + lot of data. Lustre services use it to read small internal data + like the last_rcvd file and llog files. It's also used to fetch symlink bodies. +dbo_declare_write + is called to notify OSD the caller will be writing data to a specific + range of an object in a specified transaction. +dbo_write + is called to write raw unstructured data to a specified range of an + object in a specified transaction. Data should be written atomically + with the other changes in the transaction. The method is used by Lustre + services to update small portions on a disk. OSD should keep the size + attribute consistent with the data written. +dbo_bufs_get + is called to fill memory with buffer descriptors (see struct + niobuf_local) for a specified range of an object. Memory for the set is + provided by the caller; no concurrent access to this memory is allowed. + OSD can fill all fields of the descriptor except lnb_grant_used. + The caller specifies whether buffers will be used to read or write data. + This method is used to access the file system's internal buffers for + zero-copy IO. Internal buffers referenced by descriptors are supposed to + be pinned in memory. +dbo_bufs_put + is called to unpin/release internal buffers referenced by the + descriptors dbo_bufs_get returns. After this point pointers in the + descriptors are not valid. +dbo_write_prep + is called to fill internal buffers with actual data. This is required + for buffers which do not match the filesystem block size, as later the buffer + is supposed to be written as a whole. For example, ldiskfs uses 4k + blocks, but the caller wants to update just half of one. To prevent + data corruption, when this method is called OSD compares the range to be + written with 4k; if they do not match, then OSD fetches data from a disk.
+ If they do match, then all the data will be overwritten and there is no + need to fetch data from a disk. +dbo_declare_write_commit + is called to notify OSD the caller is going to write internal buffers + and OSD needs to reserve enough resources in a transaction. +dbo_write_commit + is called to actually make data in internal buffers part of a specified + transaction. Data is supposed to be written by the moment the + transaction is considered committed. This is slightly different from + the generic transaction model because in this case it's allowed to have + data written, but not have the transaction committed. + If no dbo_write_commit is called, then dbo_bufs_put should discard + internal buffers and possible changes made to internal buffers should + not be visible. +dbo_read_prep + is called to fill all internal buffers referenced by descriptors with + actual data. Buffers may already contain valid data (be cached), so OSD + can just verify the data is valid and return immediately. +dbo_fiemap_get + is called to map a logical range of an object to the physical blocks where + the corresponding range of data is actually stored. +dbo_declare_punch + is called to notify OSD the caller is going to punch (deallocate) + a specified range in a transaction. +dbo_punch + is called to punch (deallocate) a specified range of data in a + transaction. This method is allowed to use a few disk file system + transactions (within the same Lustre transaction handle). + Currently Lustre calls the method only in the form of a truncate, where + the end offset is always EOF. + +iii. Indice Operations +---------------------- +In contrast with raw unstructured data, indices are collections of key=value pairs. +OSD should provide a few methods to look up, insert, delete and scan pairs. +Indices may have different properties like key/value size, string/binary keys, +etc. When a user needs to use an index, it needs to check whether the index has the +required properties with a special method.
Indices are used by Lustre services +to maintain the user-visible namespace, FLD, the index of unlinked files, etc. + +The method prototypes are defined in dt_index_operations as follows: + +int (*dio_lookup)(const struct lu_env *, struct dt_object *, struct dt_rec *, + const struct dt_key *); +int (*dio_declare_insert)(const struct lu_env *, struct dt_object *, + const struct dt_rec *, const struct dt_key *, + struct thandle *); +int (*dio_insert)(const struct lu_env *, struct dt_object *, + const struct dt_rec *, const struct dt_key *, + struct thandle *, int); +int (*dio_declare_delete)(const struct lu_env *, struct dt_object *, + const struct dt_key *, struct thandle *); +int (*dio_delete)(const struct lu_env *, struct dt_object *, + const struct dt_key *, struct thandle *); + +dio_lookup + is called to look up an exact key=value pair. The value is copied into a + buffer provided by the caller, so the caller should make sure the + buffer's size is big enough. This should be done with the ->do_index_try() + method. +dio_declare_insert + is called to notify OSD the caller is going to insert a key=value pair in + a transaction. The exact key is specified by the caller so OSD can use this to + make the reservation better (i.e. smaller). +dio_insert + is called to insert a key/value pair into an index object. It's up to OSD + whether to allow concurrent inserts or not. The caller is not required + to serialize access to an index. +dio_declare_delete + is called to notify OSD the caller is going to remove a specified key + in a transaction. The exact key is specified by the caller so OSD can use this + to make the reservation better. +dio_delete + is called to remove a key/value pair specified by the caller.
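The lookup/insert/delete contract above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every mock_* name, the fixed key and record sizes, and the error codes chosen (-ENOENT, -EEXIST, -ENOSPC, -ERANGE) are hypothetical and are not part of the Lustre sources. Transactions and the declare steps are left out to keep the focus on the size negotiation and on the caller-provided record buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins; none of these names exist in Lustre. */
#define MOCK_KEY_SIZE  16	/* fixed key size agreed on up front */
#define MOCK_REC_SIZE   8	/* fixed record (value) size */
#define MOCK_MAX_PAIRS 32

struct mock_pair {
	char key[MOCK_KEY_SIZE];
	char rec[MOCK_REC_SIZE];
	int  used;
};

struct mock_index {
	struct mock_pair pairs[MOCK_MAX_PAIRS];
};

/* Build a fixed-size, zero-padded binary key from a C string. */
void mock_key_set(char key[MOCK_KEY_SIZE], const char *name)
{
	size_t len = strlen(name);

	if (len > MOCK_KEY_SIZE)
		len = MOCK_KEY_SIZE;
	memset(key, 0, MOCK_KEY_SIZE);
	memcpy(key, name, len);
}

/* Plays the role of ->do_index_try(): accept the index only if its key and
 * record sizes match what the caller is prepared to handle. */
int mock_index_try(const struct mock_index *idx, size_t key_size,
		   size_t rec_size)
{
	assert(idx != NULL);
	return (key_size == MOCK_KEY_SIZE && rec_size == MOCK_REC_SIZE) ?
		0 : -ERANGE;
}

static struct mock_pair *mock_find(struct mock_index *idx, const char *key)
{
	for (int i = 0; i < MOCK_MAX_PAIRS; i++)
		if (idx->pairs[i].used &&
		    memcmp(idx->pairs[i].key, key, MOCK_KEY_SIZE) == 0)
			return &idx->pairs[i];
	return NULL;
}

/* Copy the record for an exact key into the caller's buffer. */
int mock_lookup(struct mock_index *idx, const char *key, char *rec)
{
	struct mock_pair *p = mock_find(idx, key);

	if (p == NULL)
		return -ENOENT;
	memcpy(rec, p->rec, MOCK_REC_SIZE);
	return 0;
}

/* Insert a pair; identical keys are rejected in this model. */
int mock_insert(struct mock_index *idx, const char *key, const char *rec)
{
	if (mock_find(idx, key) != NULL)
		return -EEXIST;
	for (int i = 0; i < MOCK_MAX_PAIRS; i++) {
		if (!idx->pairs[i].used) {
			memcpy(idx->pairs[i].key, key, MOCK_KEY_SIZE);
			memcpy(idx->pairs[i].rec, rec, MOCK_REC_SIZE);
			idx->pairs[i].used = 1;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Remove the pair with the given key. */
int mock_delete(struct mock_index *idx, const char *key)
{
	struct mock_pair *p = mock_find(idx, key);

	if (p == NULL)
		return -ENOENT;
	p->used = 0;
	return 0;
}
```

A caller of this model would first check the sizes with mock_index_try(), then size its record buffer as MOCK_REC_SIZE bytes before calling mock_lookup(), mirroring the way the text asks callers to rely on ->do_index_try() before ->dio_lookup().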
+ +To iterate over all key=value pairs stored in an index, OSD should provide the +following set of methods: + +struct dt_it *(*init)(const struct lu_env *, struct dt_object *, __u32); +void (*fini)(const struct lu_env *, struct dt_it *); +int (*get)(const struct lu_env *, struct dt_it *, const struct dt_key *); +void (*put)(const struct lu_env *, struct dt_it *); +int (*next)(const struct lu_env *, struct dt_it *); +struct dt_key *(*key)(const struct lu_env *, const struct dt_it *); +int (*key_size)(const struct lu_env *, const struct dt_it *); +int (*rec)(const struct lu_env *, const struct dt_it *, struct dt_rec *, + __u32); +__u64 (*store)(const struct lu_env *, const struct dt_it *); +int (*load)(const struct lu_env *, const struct dt_it *, __u64); +int (*key_rec)(const struct lu_env *, const struct dt_it *, void *); + +init + is called to allocate and initialize an instance of an "iterator", which + is passed to subsequent methods. The structure is not accessed by + Lustre and its content is totally internal to OSD. Usually it contains a + reference to the index and the current position in the index. + It may contain prefetched key/value pairs. It's not required to keep + this cache up to date: if the index changes, this is not required to be + reflected by an already initialized iterator. In the extreme case + ->init() can prefetch all existing pairs to be returned by subsequent + calls to the iterator. +fini + is called to release an iterator and all its resources. + For example, the iterator can unpin an index, free prefetched pairs, etc. +get + is called to move an iterator to a specified key. If the key does not exist, + then it should be moved to the closest position from the beginning of iteration. +put + is called to release an iterator. +next + is called to move an iterator to the next item. +key + is called to fill a specified buffer with the key at the current position of + an iterator. It’s the caller’s responsibility to pass a big enough buffer.
+ In turn, OSD should not exceed the sizes negotiated with the ->do_index_try() + method. +key_size + is called to learn the size of the key at the current position of an iterator. +rec + is called to fill a specified buffer with the value at the current position of + an iterator. It’s the caller’s responsibility to pass a big enough buffer. + In turn, OSD should not exceed the sizes negotiated with the ->do_index_try() + method. +store + is called to get a 64-bit cookie for the current position of an iterator. +load + is called to reset the current position of an iterator to match a 64-bit + cookie that the ->store() method returned. These two methods make it possible to implement + functionality like POSIX readdir, where the current position is stored as an + integer. +key_rec + is not used currently. + +3. Transactions +=============== + +i. Description +-------------- +Transactions are used by Lustre to implement the recovery protocol and support +failover. The main purpose of transactions is to atomically update the backend file +system. This includes both regular changes (file creation, for example) and special +Lustre changes (last_rcvd file, lastid, llogs). OSD is supposed to provide the +transactional mechanism and let Lustre control what specific updates to put +into transactions. + +Lustre relies on the following rule for transaction ordering: if transaction T1 +starts before transaction T2 starts, then the commit of T2 means that T1 is +committed at the same time or earlier. Notice that the creation of a transaction +does not imply the immediate start of the updates on storage; do not confuse +the creation of a transaction with the start of a transaction. + +It’s up to OSD and the backend file system to group a few transactions for better +performance, provided they still follow the rule above. + +Transactions are identified in the OSD API by an opaque transaction handle, +which is a pointer to an OSD-private data structure that it can use to track +(and optionally verify) the updates done within that transaction.
This handle is +returned by the OSD to the caller when the transaction is first created. +Any potential updates (modifications to the underlying storage) must be declared +as part of a transaction, after the transaction has been created, and before the +transaction is started. The transaction handle is passed when declaring all +updates. If any part of the declaration should fail, the transaction is aborted +without having modified the storage. + +After all updates have been declared, and the declarations have completed successfully, the +handle is passed to the transaction start. After the transaction has started, +the handle will be passed to every update that is done as part of that +transaction. All updates done under the transaction must previously have been +declared. Once the transaction has started, it is not permitted to add new +updates to the transaction, nor is it possible to roll back the transaction +after this point. Should some update to the storage fail, the caller will try +to undo the previous updates within the context of the transaction itself, to +ensure that the resulting OSD state is correct. + +Any update that was not previously declared is an implementation error in the +caller. Not all declared updates need to be executed, as they form a worst-case +superset of the possible updates that may be required in order to complete the +desired operation in a consistent manner. + +OSD should let a caller register callback function(s) to be called on +transaction commit to disk. OSD should also be able to call a special set of +transaction hooks at all the stages (creation, start, stop, commit) on a +per-device basis so that high-level services (like MDT) which are not +directly involved in controlling transactions can still take part. +Every commit callback gets the result of the transaction commit; if the disk filesystem was +not able to commit the transaction, then an appropriate error code will be passed.
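The create, declare, start, update, stop sequence and its rules can be modelled in a few lines of user-space C. This is a minimal sketch, not OSD code: the mock_* names, the fixed credit limit, the error codes, and the choice to run commit callbacks immediately at stop time are all assumptions made for illustration. A real OSD commits asynchronously and calls the callbacks later.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model; these names do not exist in Lustre. */
#define MOCK_MAX_CREDITS 8	/* assumed per-transaction limit of the backend */
#define MOCK_MAX_CBS     4

enum mock_state { MOCK_CREATED, MOCK_STARTED, MOCK_STOPPED };

typedef void (*mock_commit_cb)(int err, void *data);

struct mock_thandle {
	enum mock_state state;
	int declared;		/* worst-case number of updates reserved */
	int executed;		/* updates actually performed */
	int ncb;
	mock_commit_cb cb[MOCK_MAX_CBS];
	void *cb_data[MOCK_MAX_CBS];
};

void mock_trans_create(struct mock_thandle *th)
{
	th->state = MOCK_CREATED;
	th->declared = th->executed = th->ncb = 0;
}

/* Declaration phase: only legal before start; fails if the reservation
 * would exceed what the backend could commit atomically. */
int mock_declare_update(struct mock_thandle *th)
{
	if (th->state != MOCK_CREATED)
		return -EINVAL;
	if (th->declared == MOCK_MAX_CREDITS)
		return -ENOSPC;
	th->declared++;
	return 0;
}

int mock_trans_cb_add(struct mock_thandle *th, mock_commit_cb cb, void *data)
{
	if (th->state == MOCK_STOPPED)
		return -EINVAL;
	if (th->ncb == MOCK_MAX_CBS)
		return -ENOMEM;
	th->cb[th->ncb] = cb;
	th->cb_data[th->ncb] = data;
	th->ncb++;
	return 0;
}

int mock_trans_start(struct mock_thandle *th)
{
	if (th->state != MOCK_CREATED)
		return -EINVAL;
	th->state = MOCK_STARTED;
	return 0;
}

/* Execution phase: an update is only allowed inside a started transaction
 * and must be covered by an earlier declaration. */
int mock_update(struct mock_thandle *th)
{
	if (th->state != MOCK_STARTED)
		return -EINVAL;
	if (th->executed == th->declared)
		return -E2BIG;	/* undeclared update: a caller bug */
	th->executed++;
	return 0;
}

/* Stop the transaction. In this simplified model the commit happens
 * immediately and always succeeds, so callbacks run here with err == 0. */
int mock_trans_stop(struct mock_thandle *th)
{
	if (th->state != MOCK_STARTED)
		return -EINVAL;
	assert(th->executed <= th->declared);
	th->state = MOCK_STOPPED;
	for (int i = 0; i < th->ncb; i++)
		th->cb[i](0, th->cb_data[i]);
	return 0;
}

/* Example commit callback: records whether the commit succeeded. */
void mock_record_commit(int err, void *data)
{
	*(int *)data = err ? -1 : 1;
}
```

In this model a caller may declare two updates and execute only one, which matches the rule that declarations form a worst-case superset. Declaring after start, or executing more updates than were declared, is reported as an error rather than silently accepted.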
+ +It’s important to note that OSD and disk file system should use asynchronous IO +to implement transactions, otherwise the performance is expected to be bad. + +The maximum number of updates that make up a single transaction is OSD-specific, +but is expected to be at least in the tens of updates to multiple objects in the +OSD (extending writes of multiple MB of data, modifying or adding attributes, +extended attributes, references, etc). For example, in ext4, each update to the +filesystem will modify one or more blocks of storage. Since one transaction is +limited to one quarter of the journal size, if the caller declares a series of +updates that modify more than this number of blocks, the declaration must fail +or it could not be committed atomically. +In general, every constraint must be checked here to ensure that all changes +that must commit atomically can complete successfully. + +ii. Lifetime +------------ +From Lustre point of view a transaction goes through the following steps: +1. creation +2. declaration of all possible changes planned in transaction +3. transaction start +4. execution of planned and declared changes +5. transaction stop +6. commit callback(s) + +iii. Methods +------------ +OSD should implement the following methods to let Lustre control transactions: + +struct thandle *(*dt_trans_create)(const struct lu_env *, struct dt_device *); +int (*dt_trans_start)(const struct lu_env *, struct dt_device *, + struct thandle *); +int (*dt_trans_stop)(const struct lu_env *, struct thandle *); +int (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *); + +dt_trans_create + is called to allocate and initialize transaction handle (see struct + thandle). This structure has no pointer to a private data so, it should + be embedded into private representation of transaction at OSD layer. + This method can block. 
+dt_trans_start + is called to notify OSD that a specified transaction has received all its + declarations; OSD should now tell whether it has enough resources to + proceed with the declared changes or return an error to the caller. + This method can block. OSD should call the dt_txn_hook_start() function + before the underlying file system’s transaction starts to support per-device + transaction hooks. If OSD (or the disk file system) cannot start the + transaction, then an error is returned and the transaction handle is + destroyed; no commit callbacks are called. +dt_trans_stop + is called to notify OSD that a specified transaction has been executed and + no more changes are expected in its context. Usually this means that at + this point OSD is free to start writeout while preserving the + all-or-nothing notion. This method can block. + If the th_sync flag is set at this point, then OSD should start to commit + this transaction and block until the transaction is committed. The order + of the unblock event and the transaction’s commit callback functions is not + defined by the API. OSD should call the dt_txn_hook_stop() function once + the underlying file system’s transaction is stopped to support per-device + transaction hooks. +dt_trans_cb_add + is called to register commit callback function(s), which OSD will + call upon transaction commit to storage. When all the callback + functions are processed, the transaction handle can be freed by OSD. + There are no constraints on how many callback functions can be running + concurrently. They should not be running in an interrupt context. + Usually this method should not block and should use spinlocks. As part of + commit callback processing, the dt_txn_hook_commit() function + should be called to support per-device transaction hooks. + +The callback mechanism lets layers that do not control transactions be involved.
+For example, MDT registers its set of hooks, and every transaction happening on +the corresponding OSD will then be seen by MDT, which adds recovery information to the +transactions: it generates a transaction number and puts it into a special file. All +this happens within the context of the transaction, so atomically. +Similarly, VBR functionality in MDT updates object versions. + +4. Locking +========== + +i. Description +-------------- +OSD is expected to maintain internal consistency of the file system and its +objects on its own, requiring no additional locking or serialization from higher +levels. This lets OSD control how fine the locking is depending on the +internal structuring of a specific file system. If a few updates conflict, then the +result is not defined by the OSD API and is left to OSD. + +OSD should provide the caller with a few methods to serialize access to an object +in shared and exclusive mode. It’s up to the caller how to use them and to define the order +of locking. In general the locks provided by OSD are used to group complex +updates so that other threads do not see intermediate results of operations. + +ii. Methods +----------- +The set of methods exported by each OSD to lock/unlock an object is the following: +void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned); +void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned); +void (*do_read_unlock)(const struct lu_env *, struct dt_object *); +void (*do_write_unlock)(const struct lu_env *, struct dt_object *); +int (*do_write_locked)(const struct lu_env *, struct dt_object *); + +do_read_lock + get a shared lock on the object; this is a blocking lock. +do_write_lock + get an exclusive lock on the object; this is a blocking lock. +do_read_unlock + release a shared lock on an object. +do_write_unlock + release an exclusive lock on an object. +do_write_locked + check whether an object is exclusive-locked.
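How a caller might use these locks to group a multi-object update is sketched below. This is a single-threaded user-space model that only checks the locking protocol with assertions and does not block, so it is not a real lock implementation. The mock_* names, the nlink field, and the address-based lock ordering are assumptions chosen for illustration; the API itself leaves the ordering to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an object with a shared/exclusive lock. */
struct mock_object {
	int readers;		/* number of shared holders */
	int write_locked;	/* 1 while held exclusively */
	int nlink;
};

void mock_read_lock(struct mock_object *o)
{
	assert(!o->write_locked);	/* a real lock would block here */
	o->readers++;
}

void mock_read_unlock(struct mock_object *o)
{
	assert(o->readers > 0);
	o->readers--;
}

void mock_write_lock(struct mock_object *o)
{
	assert(!o->write_locked && o->readers == 0);
	o->write_locked = 1;
}

void mock_write_unlock(struct mock_object *o)
{
	assert(o->write_locked);
	o->write_locked = 0;
}

int mock_write_locked(const struct mock_object *o)
{
	return o->write_locked;
}

/* Move one reference from src to dst. Both objects stay write-locked for
 * the whole update, so no reader can observe the intermediate state where
 * the reference has left src but not yet reached dst. Locks are taken in
 * address order, a caller-chosen convention that avoids ABBA deadlocks. */
void mock_move_ref(struct mock_object *src, struct mock_object *dst)
{
	struct mock_object *first, *second;

	assert(src != dst);
	first = (uintptr_t)src < (uintptr_t)dst ? src : dst;
	second = first == src ? dst : src;

	mock_write_lock(first);
	mock_write_lock(second);
	src->nlink--;
	dst->nlink++;
	mock_write_unlock(second);
	mock_write_unlock(first);
}

/* Read nlink under a shared lock. */
int mock_read_nlink(struct mock_object *o)
{
	int n;

	mock_read_lock(o);
	n = o->nlink;
	mock_read_unlock(o);
	return n;
}
```

The design point is that the ordering rule lives entirely in the caller (mock_move_ref here), which is consistent with the statement that OSD only provides the lock primitives.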
+ +It is highly desirable that an OSD object can be accessed and modified by +multiple threads concurrently. + +For regular objects, the preferred implementation allows an object to be read +concurrently at overlapping offsets, and written by multiple threads at +non-overlapping offsets with the minimum amount of contention possible, or any +combination of concurrent read/write operations. Lustre will not itself perform +concurrent overlapping writes to a single region of the object, due to +serialization at a higher level. + +For index objects, the preferred implementation allows key/value pairs to be +looked up concurrently, allows non-conflicting keys to be inserted or removed +concurrently, or any combination of concurrent lookup, insertion, or removal. +Lustre does not require the storage of multiple identical keys. Operations on +the same key should be serialized. + +======================== += V. Quota Enforcement = +======================== + +1. Overview +=========== + +The OSD layer is in charge of setting up a Quota Slave Device (aka QSD) to +manage quota enforcement for a specific OSD device. The QSD is implemented +as a library. Each OSD device should create a QSD instance which will +be used to manage quota enforcement for this device. This implies: +- completing the reintegration procedure with the quota master (aka QMT) to + retrieve the latest quota settings and quota space distribution for each + UID/GID. +- managing quota locks in order to be notified of configuration changes. +- acquiring space from the QMT when quota space for a given user/group is + close to exhaustion. +- allocating quota space to service threads for local request processing. + +The reintegration procedure allows a disconnected slave to re-synchronize with +the quota master, which means: +- re-acquiring quota locks, +- fetching up-to-date quota settings (e.g. list of UIDs with quota enforced), +- reporting space usage to master for newly (e.g.
setquota was run while the + slave wasn't connected) enforced UID/GID, +- adjusting spare quota space (e.g. slave hold a large amount of unused quota + space for a user which ran out of quota space on the master while the slave + was disconnected). + +The latter two actions are known as reconciliation. + +2. QSD API +========== + +The QSD API is defined in lustre/include/lustre_quota.h as follows: + +struct qsd_instance *qsd_init(const struct lu_env *, char *, struct dt_device *, + struct proc_dir_entry *); +int qsd_prepare(const struct lu_env *, struct qsd_instance *); +int qsd_start(const struct lu_env *, struct qsd_instance *); +void qsd_fini(const struct lu_env *, struct qsd_instance *); +int qsd_op_begin(const struct lu_env *, struct qsd_instance *, + struct lquota_trans *, struct lquota_id_info *, int *); +void qsd_op_end(const struct lu_env *, struct qsd_instance *, + struct lquota_trans *); +void qsd_op_adjust(const struct lu_env *, struct qsd_instance *, + union lquota_id *, int); + +qsd_init + The OSD module should first allocate a qsd instance via qsd_init. + This creates all required structures to manage quota enforcement for + this target and performs all low-level initialization which does not + involve any lustre object. qsd_init should typically be called when + the OSD is being set up. + +qsd_prepare + This sets up on-disk objects associated with the quota slave feature + and initiates the quota reintegration procedure if needed. + qsd_prepare should typically be called when ->ldo_prepare is invoked. + +qsd_start + a qsd instance should be started once recovery is completed (i.e. when + ->ldo_recovery_complete is called). This is used to notify the qsd layer + that quota should now be enforced again via the qsd_op_begin/end + functions. The last step of the reintegration procedure (namely usage + reconciliation) will be completed during start. + +qsd_fini + is used to release a qsd_instance structure allocated with qsd_init. 
+ This releases all quota slave objects and frees the structures + associated with the qsd_instance. + +qsd_op_begin + is used to enforce quota; it must be called in the declaration of each + operation. qsd_op_end should then be invoked later once all operations + have been completed in order to release/adjust the quota space. + Running qsd_op_begin before qsd_start isn't fatal and will return + success. Once qsd_start has been run, qsd_op_begin will block until the + reintegration procedure is completed. + +qsd_op_end + performs the post-operation quota processing. This must be called after + the operation's transaction has stopped. While qsd_op_begin must be invoked + each time a new operation is declared, qsd_op_end should be called only + once for the whole transaction. + +qsd_op_adjust + triggers pre-acquire/release if necessary; it's only used for the ldiskfs osd + so far. When a file is unlinked in ldiskfs, the quota accounting isn't + updated when the transaction stops. Instead, it'll be updated on the + final iput, so qsd_op_adjust() will be called then (in + osd_object_delete()) to trigger quota release if necessary. + +Appendix 1. A brief note on Lustre configuration. +================================================= + +In the current versions (1.8, 2.x) MGS is used to store the configuration of the +servers, the so-called profile. The profile stores configuration commands and +arguments to set up a specific stack. To see how it looks exactly you can fetch +the MDT profile with debugfs -R "dump /CONFIGS/lustre-MDT0000 ", then +parse it with: llog_reader .
Here is a short extract: + +#02 (136)attach 0:lustre-MDT0000-mdtlov 1:lov 2:lustre-MDT0000-mdtlov_UUID +#03 (176)lov_setup 0:lustre-MDT0000-mdtlov 1:(struct lov_desc) + uuid=lustre-MDT0000-mdtlov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1 +#06 (120)attach 0:lustre-MDT0000 1:mdt 2:lustre-MDT0000_UUID +#07 (112)mount_option 0: 1:lustre-MDT0000 2:lustre-MDT0000-mdtlov +#08 (160)setup 0:lustre-MDT0000 1:lustre-MDT0000_UUID 2:0 3:lustre-MDT0000-mdtlov 4:f +#23 (080)add_uuid nid=10.0.2.15@tcp(0x200000a00020f) 0: 1:10.0.2.15@tcp +#24 (144)attach 0:lustre-OST0000-osc-MDT0000 1:osc 2:lustre-MDT0000-mdtlov_UUID +#25 (144)setup 0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:10.0.2.15@tcp +#26 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov 1:lustre-OST0000_UUID 2:0 3:1 +#32 (080)add_uuid nid=10.0.2.15@tcp(0x200000a00020f) 0: 1:10.0.2.15@tcp +#33 (144)attach 0:lustre-OST0001-osc-MDT0000 1:osc 2:lustre-MDT0000-mdtlov_UUID +#34 (144)setup 0:lustre-OST0001-osc-MDT0000 1:lustre-OST0001_UUID 2:10.0.2.15@tcp +#35 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov 1:lustre-OST0001_UUID 2:1 3:1 +#41 (120)param 0: 1:sys.jobid_var=procname_uid 2:procname_uid +#44 (080)set_timeout=20 +#48 (112)param 0:lustre-MDT0000-mdtlov 1:lov.stripesize=1048576 +#51 (112)param 0:lustre-MDT0000-mdtlov 1:lov.stripecount=-1 +#54 (160)param 0:lustre-MDT0000 1:mdt.identity_upcall=/work/lustre/head/lustre-release/lustre/utils/l_getidentity + +Every line starts with a specific command (attach, lov_setup, set, etc) to do +specific configuration action. Then arguments follow. Often the first argument +is a device name. For example, +#02 (136)attach 0:lustre-MDT0000-mdtlov 1:lov 2:lustre-MDT0000-mdtlov_UUID + +This command will be setting up device “lustre-MDT0000-mdtlov” of type “lov” +with additional argument “lustre-MDT0000-mdtlov_UUID”. All these arguments are +packed into lustre configuration buffers ( struct lustre_cfg). 
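The record layout described above (a command followed by numbered arguments) can be picked apart with a few lines of C. The parser below is only a sketch for the simple records in the extract: cfg_record and cfg_parse_line are hypothetical names, the number in parentheses is assumed to be the record length, and tokens that are not of the form "N:value" (such as nid=... or the multi-word lov_desc argument) are skipped rather than interpreted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CFG_MAX_ARGS 8
#define CFG_STR_LEN  64

/* Hypothetical in-memory form of one line of the extract. */
struct cfg_record {
	int  index;		/* record number after '#' */
	int  length;		/* number in parentheses; assumed record length */
	char command[CFG_STR_LEN];
	int  nargs;		/* highest argument slot seen, plus one */
	char args[CFG_MAX_ARGS][CFG_STR_LEN];	/* args[n] is the text after "n:" */
};

/* Parse a line such as
 *   #02 (136)attach 0:lustre-MDT0000-mdtlov 1:lov 2:lustre-MDT0000-mdtlov_UUID
 * Returns 0 on success, -1 if the line does not start like a record. */
int cfg_parse_line(const char *line, struct cfg_record *rec)
{
	char token[CFG_STR_LEN];
	const char *p;
	int consumed = 0;

	assert(line != NULL && rec != NULL);
	memset(rec, 0, sizeof(*rec));
	if (sscanf(line, "#%d (%d)%63s%n", &rec->index, &rec->length,
		   rec->command, &consumed) != 3)
		return -1;

	p = line + consumed;
	for (;;) {
		char *colon, *end;
		long slot;
		int used = 0;

		/* Take the next whitespace-separated token, if any. */
		if (sscanf(p, " %63s%n", token, &used) != 1)
			break;
		p += used;

		colon = strchr(token, ':');
		if (colon == NULL)
			continue;	/* not "N:value", e.g. nid=... */
		*colon = '\0';
		slot = strtol(token, &end, 10);
		if (end == token || *end != '\0' ||
		    slot < 0 || slot >= CFG_MAX_ARGS)
			continue;	/* prefix is not a small number */

		/* token holds at most 63 characters, so the value fits. */
		strcpy(rec->args[slot], colon + 1);
		if (slot + 1 > rec->nargs)
			rec->nargs = (int)slot + 1;
	}
	return 0;
}
```

For the attach record quoted above, this model yields command "attach", with args[0] holding the device name, args[1] the device type and args[2] the UUID.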
+ +Other commands (like setup and +lov_modify_tgts) attach the device into the stack. + +Appendix 2. Sample Code +======================= + +Lustre currently has 2 different OSD implementations: +- ldiskfs OSD under lustre/osd-ldiskfs + http://git.hpdd.intel.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-ldiskfs;hb=HEAD +- ZFS OSD under lustre/osd-zfs + http://git.hpdd.intel.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-zfs;hb=HEAD