****************************************************
* Overview of the Lustre Object Storage Device API *
****************************************************

Alex Zhuravlev <alexey.zhuravlev@intel.com>
Andreas Dilger <andreas.dilger@intel.com>
Johann Lombardi <johann.lombardi@intel.com>
Li Wei <wei.g.li@intel.com>
Niu Yawei <yawei.niu@intel.com>

Last Updated: October 9, 2012

Copyright (c) 2012, 2013, Intel Corporation.

This file is released under the GPLv2.

  2. What OSD API is Not
II. Backend Storage Subsystem Requirements
  1. Atomicity of Updates
    i. Standard POSIX Attributes
    ii. Extended Attributes
III. OSD & LU Infrastructure
    ii. Device Type & Operations
    iii. Device Operations
    i. Common Storage Operations
    ii. Data Object Operations
    iii. Indice Operations
Appendix 1. A brief note on Lustre configuration.
Appendix 2. Sample Code

OSD API is the interface to access and modify data that is supposed to be stored
persistently. This API layer is the interface to code that bridges individual
file systems such as ext4 or ZFS to Lustre.
The API is a generic interface to transaction- and journaling-based file systems
so many backend file systems can be supported in a Lustre implementation.
Data can be cached within the OSD or backend target and could be destroyed
before hitting storage, but in general the final target is persistent storage.
This API creates many possibilities, including using object-storage devices or
other new persistent storage technologies.

2. What OSD API is Not
======================

OSD API should not be used to control in-core-only state (like ldlm locking),
configuration, etc. The upper layers of the IO/metadata stack should not be
involved with the underlying layout or allocation in the OSD storage.

Lustre is composed of different kernel modules, each implementing different
layers in the software stack in an object-oriented approach. Generally, each
layer builds (or stacks) upon another, and each object is a child of the
generic LU object class. Hence the term "LU stack" is often used to reference
this hierarchy of Lustre modules and objects.

Each layer (i.e. mdt/mdd/lod/osp/ofd/osd) defines its own generic items
(lu_object/lu_device), which are gathered in a compound item (lu_site/
lu_object_layer) representing the multi-layered stack. Different classes of
operations can then be implemented by each layer, depending on its nature.

As a result, each OSD is expected to implement:
- the generic LU API used to manage the device stack and objects (see chapter
  III)
- the DT API (most commonly called OSD API) used to manipulate on-disk
  structures (see chapter IV).

The goal of this document is to provide the reader with the information
necessary to accurately construct a new Object Storage Device (OSD) module
interface layer for Lustre in order to use a new backend file system with
Lustre 2.4 and greater.

==============================================
= II. Backend Storage Subsystem Requirements =
==============================================

The purpose of this section is to gather the requirements for the storage
subsystems below the OSD API.

1. Atomicity of Updates
=======================

The underlying OSD storage must be able to provide some form of atomic commit
for multiple arbitrary updates to OSD storage within a single transaction.
The OSD will always know, before the transaction starts, which objects will
be modified and how they will be modified.

If any of the updates associated with a transaction are stored persistently
(i.e. some state in the OSD is modified), then all of the updates in that
transaction must also be stored persistently (Atomic). If the OSD should fail
in some manner that prevents all the updates of a transaction from being
completed, then none of the updates shall be completed (Consistent).
Once the updates have been reported committed to the caller (i.e. commit
callbacks have been run), they cannot be rolled back for any reason (Durable).
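
The all-or-nothing rule above can be illustrated with a small user-space
sketch. It is not OSD code: the demo_* names, the fixed-size toy store and the
fail_at parameter (which simulates a failure in the middle of a commit) are
assumptions made for illustration only. Updates are declared first, applied to
a shadow copy, and published in a single step.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS   4  /* objects in the toy store (hypothetical) */
#define DEMO_MAX_UPD 8  /* updates per transaction (hypothetical) */

/* Hypothetical persistent state: one integer per object. */
struct demo_store {
    int value[DEMO_SLOTS];
};

struct demo_update {
    int slot;
    int new_value;
};

/* Hypothetical transaction: the update list is known before it starts. */
struct demo_txn {
    struct demo_update upd[DEMO_MAX_UPD];
    int nr;
};

/* Declare one update; rejects unknown objects and overfull transactions. */
static int demo_txn_declare(struct demo_txn *txn, int slot, int new_value)
{
    assert(txn != NULL);
    if (slot < 0 || slot >= DEMO_SLOTS || txn->nr == DEMO_MAX_UPD)
        return -1;
    txn->upd[txn->nr].slot = slot;
    txn->upd[txn->nr].new_value = new_value;
    txn->nr++;
    return 0;
}

/*
 * Apply every declared update to a shadow copy and publish it in one step.
 * fail_at simulates a failure before update number fail_at (-1: no failure).
 */
static int demo_txn_commit(struct demo_store *store,
                           const struct demo_txn *txn, int fail_at)
{
    struct demo_store shadow = *store;
    int i;

    for (i = 0; i < txn->nr; i++) {
        if (i == fail_at)
            return -1;  /* the store is left untouched */
        shadow.value[txn->upd[i].slot] = txn->upd[i].new_value;
    }
    *store = shadow;
    return 0;
}
```

A failed commit leaves every object at its old value, while a successful one
makes all declared updates visible together. A real backend would achieve the
same effect with its journal or copy-on-write machinery.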

i. Standard POSIX Attributes
----------------------------
The OSD object should be able to store normal POSIX attributes on each object
as specified by Lustre:
- object type (16 bits)
- access mode (16 bits)
- metadata change time (96 bits, 64-bit seconds, 32-bit nanoseconds)
- data modification time (96 bits, 64-bit seconds, 32-bit nanoseconds)
- data access time (96 bits, 64-bit seconds, 32-bit nanoseconds)
- creation time (96 bits, 64-bit seconds, 32-bit nanoseconds, optional)
- object size (64 bits)
- link count (32 bits)
- object version (64 bits)

The OSD object shall not modify these attributes itself.

In addition, it is desirable to track the object allocation size (“blocks”),
which the OSD manages itself. Lustre will query the object allocation size, but
will never modify it. If these attributes are not managed by the OSD natively as
part of the object itself, they can be stored in an extended attribute
associated with the object.

ii. Extended Attributes
-----------------------
The OSD should have an efficient mechanism for storing small extended attributes
with each object. This implies that the extended attributes can be accessed at
the same time as the object (without extra seek/read operations). There is also
a requirement to store larger extended attributes in some cases (over 1kB in
size), but the performance of such attributes can be slower proportional to the
attribute size.

The OSD must provide a mechanism for efficient key=value retrieval, for both
fixed-length and variable-length keys and values. It is expected that an index
may hold tens of millions of keys, and must be able to do random key lookups
in an efficient manner. It must also provide a mechanism for iterating over all
of the keys in a particular index and returning these to the caller in a
consistent order across multiple calls. It must be able to provide a cookie that
defines the current index at which the iteration is positioned, and must be able
to continue iteration at this index at a later time.
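
A minimal sketch of the cookie requirement, assuming a sorted in-memory array
stands in for the on-disk index. All demo_* names are hypothetical, keys are
assumed to be non-zero so that a cookie of 0 can mean "start from the
beginning", and the cookie is simply the last key returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index: unique keys kept sorted so the order is stable. */
struct demo_index {
    const uint64_t *keys;   /* ascending, all non-zero */
    size_t          nr;
};

/* Hypothetical iterator over a demo_index. */
struct demo_iter {
    const struct demo_index *idx;
    size_t                   pos;
};

/*
 * Position the iterator at the first key strictly greater than cookie.
 * Binary search keeps the resume cost logarithmic in the index size.
 */
static void demo_iter_load(struct demo_iter *it, const struct demo_index *idx,
                           uint64_t cookie)
{
    size_t lo = 0, hi;

    assert(it != NULL && idx != NULL);
    hi = idx->nr;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (idx->keys[mid] <= cookie)
            lo = mid + 1;
        else
            hi = mid;
    }
    it->idx = idx;
    it->pos = lo;
}

/* Return 1 and store the next key, or 0 at the end of the index. */
static int demo_iter_next(struct demo_iter *it, uint64_t *key)
{
    if (it->pos >= it->idx->nr)
        return 0;
    *key = it->idx->keys[it->pos++];
    return 1;
}

/* The cookie is the last key returned, or 0 if nothing was returned yet. */
static uint64_t demo_iter_store(const struct demo_iter *it)
{
    return it->pos ? it->idx->keys[it->pos - 1] : 0;
}
```

Using the last key rather than an array offset as the cookie keeps the
position meaningful even if entries are inserted or removed between two calls,
which is one way to meet the "continue at a later time" requirement.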

The OSD must provide some mechanism to register multiple arbitrary callback
functions for each transaction, and call these functions after the transaction
with which they are associated has committed to persistent storage.
It is not required that they be called immediately at transaction commit time,
but they cannot be delayed an arbitrarily long time, or other parts of the
system may suffer resource exhaustion. If this mechanism is not implemented by
the underlying storage, then it needs to be provided in some manner by the OSD
implementation itself.
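
If the backend has no native commit notification, the OSD has to keep the
callback list itself. The following user-space sketch shows one possible
shape. The demo_* names and the fixed-size array are assumptions, and the real
interface (dt_trans_cb_add() with struct dt_txn_commit_cb) differs in detail.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CB 8   /* hypothetical per-transaction limit */

struct demo_txn;
typedef void (*demo_commit_cb_t)(struct demo_txn *txn, void *data);

/* Hypothetical transaction handle carrying its own callback list. */
struct demo_txn {
    demo_commit_cb_t cb[DEMO_MAX_CB];
    void            *cb_data[DEMO_MAX_CB];
    int              nr_cb;
    int              committed;
};

/* Register a callback; fails once the transaction has committed. */
static int demo_txn_cb_add(struct demo_txn *txn, demo_commit_cb_t cb,
                           void *data)
{
    assert(txn != NULL && cb != NULL);
    if (txn->committed || txn->nr_cb == DEMO_MAX_CB)
        return -1;
    txn->cb[txn->nr_cb] = cb;
    txn->cb_data[txn->nr_cb] = data;
    txn->nr_cb++;
    return 0;
}

/* To be called once the backend reports the transaction as persistent. */
static void demo_txn_commit(struct demo_txn *txn)
{
    int i;

    txn->committed = 1;
    for (i = 0; i < txn->nr_cb; i++)
        txn->cb[i](txn, txn->cb_data[i]);
    txn->nr_cb = 0;
}

/* Example callback: counts how many times it has been run. */
static void demo_count_cb(struct demo_txn *txn, void *data)
{
    (void)txn;
    ++*(int *)data;
}
```

Callbacks run only after the commit point, in registration order, and each
runs exactly once.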

In order to provide quota functionality for the OSD, it must be able to track
the object allocation size against at least two different keys (typically User
ID and Group ID). The actual mechanism of tracking this allocation is internal
to the OSD. Lustre will specify the owners of the object against which to track
this space. Space accounting information will be accessed by Lustre via the
index API on special objects dedicated to space allocation management.
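
As an illustration only, per-owner accounting can be pictured as two small
tables, one keyed by user ID and one by group ID, that are charged together
whenever an object's allocation changes. The demo_* names and the fixed table
size are assumptions. A real OSD would keep these records in its own on-disk
structures and expose them through the index API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ACCT_SLOTS 16  /* hypothetical table size */

/* One accounting record: usage charged to a single ID. */
struct demo_acct_rec {
    int      used;
    uint32_t id;
    uint64_t bytes;
    uint64_t objects;
};

/* Hypothetical accounting table (one instance for users, one for groups). */
struct demo_acct {
    struct demo_acct_rec rec[DEMO_ACCT_SLOTS];
};

/* Find the record for id, optionally creating it in a free slot. */
static struct demo_acct_rec *demo_acct_find(struct demo_acct *a, uint32_t id,
                                            int create)
{
    struct demo_acct_rec *spare = NULL;
    size_t i;

    assert(a != NULL);
    for (i = 0; i < DEMO_ACCT_SLOTS; i++) {
        if (a->rec[i].used && a->rec[i].id == id)
            return &a->rec[i];
        if (!a->rec[i].used && spare == NULL)
            spare = &a->rec[i];
    }
    if (!create || spare == NULL)
        return NULL;
    spare->used = 1;
    spare->id = id;
    spare->bytes = 0;
    spare->objects = 0;
    return spare;
}

/* Charge (or, with negative deltas, release) space against both owners. */
static int demo_acct_charge(struct demo_acct *users, struct demo_acct *groups,
                            uint32_t uid, uint32_t gid,
                            int64_t bytes, int64_t objects)
{
    struct demo_acct_rec *u = demo_acct_find(users, uid, 1);
    struct demo_acct_rec *g = demo_acct_find(groups, gid, 1);

    if (u == NULL || g == NULL)
        return -1;
    /* Unsigned arithmetic wraps, so a negative delta subtracts. */
    u->bytes += (uint64_t)bytes;
    u->objects += (uint64_t)objects;
    g->bytes += (uint64_t)bytes;
    g->objects += (uint64_t)objects;
    return 0;
}

/* Key lookup, as an index-style read of the accounting data. */
static int demo_acct_lookup(struct demo_acct *a, uint32_t id, uint64_t *bytes)
{
    struct demo_acct_rec *r = demo_acct_find(a, id, 0);

    assert(bytes != NULL);
    if (r == NULL)
        return -1;
    *bytes = r->bytes;
    return 0;
}
```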

================================
= III. OSD & LU Infrastructure =
================================

As a member of the LU stack, each OSD module is expected to implement the
generic LU API used to manage devices and objects.

Each layer in the stack is represented by a lu_device structure which holds
the very basic data like a reference counter, a reference to the site (the
in-core collection of Lustre objects, very similar to the inode cache) and a
reference to struct lu_device_type, which in turn describes this specific type
of device (type name, operations, etc.).

The OSD device is created and initialized at mount time to let the
configuration component access the data it needs before the whole Lustre stack
is ready. The OSD device is destroyed once all the devices using it have been
destroyed. Usually this happens when the server stack shuts down at unmount
time.

There may be several OSD devices of a given type (say, several zfs-osd and
ldiskfs-osd instances). The type stores the methods common to all OSD instances
of that type (below they start with the ldto_ prefix). Every OSD device
instance then gets its own specific methods (below they start with the ldo_
prefix).

To connect devices into a stack, the ->o_connect() method is used (see struct
obd_ops). Currently the OSD should implement this method to track all its
users. To disconnect, the ->o_disconnect() method is used. The OSD should
implement this method, track the remaining users and, once no users are left,
call the class_manual_cleanup() function, which initiates removal of the OSD.

As the stack involves many devices and there may be cross-references between
them, it is easier to break the whole shutdown procedure into two steps and
not impose a specific order in which the devices shut down: in the first
step the devices release all the resources they use internally (the so-called
pre-cleanup procedure), in the second step they are actually freed.

ii. Device Type & Operations
----------------------------
The first thing to do when developing a new OSD is to define a lu_device_type
structure to define and register the new OSD type. The following fields of the
lu_device_type need to be filled appropriately:
ldt_tags
    is the type of device, typically data, metadata or client (see
    lu_device_tag). An OSD device is of data type and should always
    register as such by setting this field to LU_DEVICE_DT.
ldt_name
    is the name associated with the new OSD type.
    See LUSTRE_OSD_{LDISKFS,ZFS}_NAME for reference.
ldt_ops
    is the vector of lu_device_type operations, please see below for
    details.
ldt_ctx_tags
    is the lu_context_tag to be used for operations.
    This should be set to LCT_LOCAL for OSDs.

In the original 2.0 MDS stack the devices were built from the top down and the
OSD was the final device to set up. This scheme does not work very well when
on-disk data has to be accessed early or when an OSD is shared among several
services (e.g. MDS + MGS on the same storage). So the scheme was reversed: the
mount procedure sets up the correct OSD, then the stack is built from the
bottom up. Instead of introducing another set of methods, the existing
obd_connect() and obd_disconnect() are used, given that many existing devices
are already configured this way by the configuration component. Notice also
that configuration profiles are organized in this order (LOV/LOD go first, then
MDT). Given that the device “below” is ready at every step, there is no point
in calling a separate init method.

Because a device can be referenced by a number of entities in other modules
(exports, RPCs, transactions, callbacks, access via procfs), the notion of
precleanup was introduced to be able to stop all the activity safely before the
actual cleanup takes place. For this purpose ->ldto_device_fini() and
->ldto_device_free() were introduced: the former should be used to break
any interaction with the outside, the latter to actually free the device.

So, when the configuration component encounters a SETUP command in the
configuration profile (see Appendix 1), it finds the appropriate device and
calls ->ldto_device_alloc() to set it up as an LU device.

The prototypes of device type operations are the following:

    struct lu_device *(*ldto_device_alloc)(const struct lu_env *,
                                           struct lu_device_type *,
                                           struct lustre_cfg *);
    struct lu_device *(*ldto_device_free)(const struct lu_env *,
                                          struct lu_device *);
    int (*ldto_device_init)(const struct lu_env *, struct lu_device *,
                            const char *, struct lu_device *);
    struct lu_device *(*ldto_device_fini)(const struct lu_env *,
                                          struct lu_device *);
    int (*ldto_init)(struct lu_device_type *t);
    void (*ldto_fini)(struct lu_device_type *t);
    void (*ldto_start)(struct lu_device_type *t);
    void (*ldto_stop)(struct lu_device_type *t);

ldto_device_alloc
    The method is called by the configuration component (in case of a disk
    file system OSD, this is lustre/obdclass/obd_mount.c) to allocate a
    device. Notice that the generic struct lu_device does not hold a pointer
    to private data. Instead the OSD should embed struct lu_device into its
    own structure (like struct osd_device) and return the address of the
    lu_device in that structure.
ldto_device_fini
    The method is called when the OSD is about to be released. The OSD should
    detach from resources like the disk file system and procfs, release
    objects it holds internally, etc. This is the so-called precleanup
    procedure.
ldto_device_free
    The method is called to actually release memory allocated in
    ->ldto_device_alloc().
ldto_device_init
    The method is not used by OSD currently.
ldto_init
    The method is called when a specific type of OSD is registered in the
    system. Currently the method is used to register OSD-specific data for
    environments (see Lustre environment in section 3).
    See LU_TYPE_INIT_FINI() macro as an example.
ldto_fini
    The method is called when a specific type of OSD unregisters.
    Currently used to unregister OSD-specific data from environments.
ldto_start
    The method is called when the first device of this type is being
    instantiated. Currently used to fill existing environments with
    OSD-specific data.
ldto_stop
    This method is called when the last instance of a specific OSD has gone.
    Currently used to release OSD-specific data from environments.
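
The embedding pattern described for ->ldto_device_alloc() can be sketched in
user space as follows. The demo_* structures are simplified stand-ins, not the
real struct lu_device or struct osd_device, and the three functions only mimic
the alloc/fini/free split described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic device; the real structure has more fields. */
struct demo_lu_device {
    int ld_ref;
};

/* Hypothetical OSD-private device embedding the generic one. */
struct demo_osd_device {
    int                   od_mounted;  /* private state of the OSD */
    struct demo_lu_device od_lu;       /* generic part handed to the stack */
};

/* Recover the enclosing structure from a pointer to an embedded member. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct demo_osd_device *demo_osd_dev(struct demo_lu_device *d)
{
    return demo_container_of(d, struct demo_osd_device, od_lu);
}

/* Like ->ldto_device_alloc(): allocate private data, return generic part. */
static struct demo_lu_device *demo_device_alloc(void)
{
    struct demo_osd_device *osd = calloc(1, sizeof(*osd));

    if (osd == NULL)
        return NULL;
    osd->od_mounted = 1;
    return &osd->od_lu;
}

/* Like ->ldto_device_fini(): detach from resources, keep the memory. */
static void demo_device_fini(struct demo_lu_device *d)
{
    assert(d != NULL);
    demo_osd_dev(d)->od_mounted = 0;
}

/* Like ->ldto_device_free(): release what the alloc step allocated. */
static void demo_device_free(struct demo_lu_device *d)
{
    free(demo_osd_dev(d));
}
```

Because the generic structure carries no private-data pointer, every method
that receives it converts back to the enclosing structure first, as
demo_osd_dev() does here.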

iii. Device Operations
----------------------
Now that the OSD device can be set up, we need to export methods to handle
device-level operations. All those methods are listed in the
lu_device_operations structure, which includes:

    struct lu_object *(*ldo_object_alloc)(const struct lu_env *,
                                          const struct lu_object_header *,
                                          struct lu_device *);
    int (*ldo_process_config)(const struct lu_env *, struct lu_device *,
                              struct lustre_cfg *);
    int (*ldo_recovery_complete)(const struct lu_env *, struct lu_device *);
    int (*ldo_prepare)(const struct lu_env *, struct lu_device *,
                       struct lu_device *);

ldo_object_alloc
    The method is called when a high-level service wants to access an
    object not found in the local Lustre cache (see struct lu_site).
    The OSD should allocate a structure, initialize the object's methods
    and return a pointer to the struct lu_object which is embedded into
    the OSD object structure.
ldo_process_config
    The method is called in case of configuration changes. Mostly used by
    high-level services to update local tunables. It is also possible to let
    the MGS store OSD tunables and set them properly on every server mount
    or when tunables change at run-time.
ldo_recovery_complete
    The method is called when the recovery procedure between a server and
    clients is completed. This method is used by high-level devices mostly
    (like OSP to clean up OST orphans, MDD to clean up open unlinked files
    left by a missing client, etc).
ldo_prepare
    The method is called when all the devices belonging to the stack are
    configured and set up properly. At this point the server becomes ready
    to handle RPCs and start the recovery procedure.
    In current implementation OSD uses this method to initialize local quota

Although the LU infrastructure aims at replacing the storage operations of the
legacy OBD API (see struct obd_ops in lustre/include/obd.h), the OBD API is
still used in several places for device configuration and on the Lustre client
(e.g. it is still used on the client for LDLM locking). The OBD API storage
operations are not needed for server components, and should be ignored.

As far as the OSD layer is concerned, upper layers still connect/disconnect
to/from the OSD instance via obd_ops::o_connect/disconnect. As a result, each
OSD should implement those two operations:

    int (*o_connect)(const struct lu_env *, struct obd_export **,
                     struct obd_device *, struct obd_uuid *,
                     struct obd_connect_data *, void *);
    int (*o_disconnect)(struct obd_export *);

o_connect
    The method should track the number of connections made (i.e. the number
    of active users of this OSD), call class_connect() and return a struct
    obd_export via class_conn2export(), see osd_obd_connect(). The structure
    holds a reference on the device, preventing it from early release.
o_disconnect
    The method is called when someone using this OSD does not need its
    service any more (i.e. at unmount). For every passed struct obd_export
    the method should call class_disconnect(export). Once the last user has
    gone, the OSD should call class_manual_cleanup() to schedule the device
    removal.
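
The user tracking expected from ->o_connect() and ->o_disconnect() boils down
to a counter, as in the user-space sketch below. The demo_* names are
hypothetical, and the cleanup_scheduled flag merely stands in for the call to
class_manual_cleanup(). Kernel code would also need to protect the counter
against concurrent callers, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OSD state relevant to connection tracking. */
struct demo_osd {
    int connections;        /* number of active users */
    int cleanup_scheduled;  /* stands in for class_manual_cleanup() */
};

/* Like ->o_connect(): account for one more user of the OSD. */
static int demo_osd_connect(struct demo_osd *osd)
{
    assert(osd != NULL);
    if (osd->cleanup_scheduled)
        return -1;  /* device is already on its way out */
    osd->connections++;
    return 0;
}

/* Like ->o_disconnect(): drop one user, schedule removal after the last. */
static int demo_osd_disconnect(struct demo_osd *osd)
{
    assert(osd != NULL && osd->connections > 0);
    if (--osd->connections == 0)
        osd->cleanup_scheduled = 1;
    return 0;
}
```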

Lustre identifies objects in the underlying OSD storage by a unique 128-bit
File IDentifier (FID) that is specified by Lustre and is the only identifier
that Lustre is aware of for this object. The FID is known to Lustre before any
access to the object is done (even before it is created), using
lu_object_find(). Since Lustre only uses the FID to identify an object, if the
underlying OSD storage cannot directly use the Lustre-specified FID to retrieve
the object at a later time, it must create a table or index object (normally
called the Object Index (OI)) to map Lustre FIDs to an internal object
identifier. Lustre does not need to understand the format or value of the
internal object identifier at any time outside of the OSD.

The FID itself is composed of 3 members: a 64-bit sequence (f_seq), a 32-bit
object ID (f_oid) and a 32-bit version (f_ver).

While the OSD itself should typically not interpret the FID, it may be possible
to optimize the OSD performance by understanding the properties of a FID.

The f_seq (sequence) component is allocated in piecewise (though not contiguous)
manner to different nodes, and each sequence forms a “group” of related objects.
The sequence number may be any value in the range [1, 2^63], but there are
typically not a huge number of sequences in use at one time (typically less than
one million at the maximum). Within a single sequence, it is likely that tens to
thousands (and less commonly millions) of mostly-sequential f_oid values will be
allocated. In order to efficiently map FIDs into objects, it is desirable to
also be able to associate the OSD-internal index with key-value pairs.

Every object is represented by a header (struct lu_object_header) and a
so-called slice on every layer of the stack. Core Lustre code maintains a cache
of objects (the so-called lu-site, see struct lu_site), which is very similar
to the Linux inode cache.

An in-core object is created when a high-level service needs it to process an
RPC or perform some background job like LFSCK. The FID of the object is supposed
to be known before the object is created. The FID can come from an RPC or from
disk. Given the FID, the lu_object_find() function is called: it searches for
the object in the cache (see struct lu_site) and, if the object is not found,
creates it using the ->ldo_object_alloc(), ->loo_object_init() and
->loo_object_start() methods.

Objects are referenced and tracked by Lustre core. If an object is not in use,
it is put on an LRU list and at some point (subject to internal caching policy
or memory pressure callbacks from the kernel) Lustre schedules such an object
for removal from the cache. To do so, Lustre core marks the object as going out
and calls ->loo_object_release() and ->loo_object_free(), iterating over all
the layers of the stack.

Lustre uses a set of special objects using the FID_SEQ_LOCAL_FILE sequence.
All the objects are listed in the local_oid enum, which includes:
- OTABLE_IT_OID which is an index object providing a list of all existing
  objects on this storage. The key is an opaque string and the record is a FID.
  This object is used by high-level components like LFSCK to iterate over
  all objects.
- ACCT_USER_OID/ACCT_GROUP_OID are used for accessing space accounting
  information for users and groups respectively.
- LAST_RECV_OID is the last_rcvd file for respectively

iv. Object Operations
---------------------
Object management methods are called by Lustre to manipulate OSD-specific
(private) data associated with a specific object during the lifetime of an
object. All the object operations are described in struct lu_object_operations:

    int (*loo_object_init)(const struct lu_env *, struct lu_object *,
                           const struct lu_object_conf *);
    int (*loo_object_start)(const struct lu_env *, struct lu_object *);
    void (*loo_object_delete)(const struct lu_env *, struct lu_object *);
    void (*loo_object_free)(const struct lu_env *, struct lu_object *);
    void (*loo_object_release)(const struct lu_env *, struct lu_object *);
    int (*loo_object_print)(const struct lu_env *, void *, lu_printer_t,
                            const struct lu_object *);
    int (*loo_object_invariant)(const struct lu_object *);

loo_object_init
    This method is called when a new object is being created (see
    lu_object_alloc()). Its purpose is to initialize the object's internals;
    usually the file system looks up the object on disk (notice that a
    header storing the FID is already created by the top device) using the
    Object Index, which maps the FID to a local object id like a dnode.
    LOC_F_NEW can be passed to the method when the caller knows the object
    is new, so the OSD can skip the OI lookup to improve performance. If the
    object exists, then the LOHA_EXISTS flag in loh_flags (struct
    lu_object_header) is set.
loo_object_start
    The method is called when all the structures and the header are
    initialized. Currently used by high-level services as a post-init
    procedure (i.e. to set up their own methods depending on the object type,
    which is brought into the header by the OSD's ->loo_object_init()).
loo_object_delete
    is called to let the OSD release resources behind an object (except
    memory allocated for the object), like releasing the file system's inode.
    It is separated from ->loo_object_free() to be able to iterate over
    still-existing objects. The main purpose of separating
    ->loo_object_delete() and ->loo_object_free() is to avoid recursion
    during potentially stack-consuming resource release.
loo_object_free
    is called to actually release memory allocated by ->ldo_object_alloc().
loo_object_release
    is called when the object loses its last user and moves onto the LRU
    list of unused objects. Implementation of this method is optional for
    an OSD.
loo_object_print
    is used for debugging purposes; it should output details of an object in
    human-readable format. Details usually include information like the
    address of the object, local object number (dnode/inode), type of the
    object, etc.
loo_object_invariant
    another optional method for debugging purposes, which is called to
    verify the internal consistency of an object.

3. Lustre Environment
=====================

There is a notion of an environment, represented by struct lu_env, in many
functions and methods. Essentially this is Thread Local Storage (TLS), which is
bound to every service thread and used by that thread exclusively, so there is
no need to serialize access to the data stored there.
The original purpose of the environment was to work around the small Linux
kernel stack (4-8K). A component (like a device or library) can register its own
descriptor (see the LU_KEY_INIT macro) and then every new thread will populate
the environment with the buffers described.

=====================
= IV. Data (DT) API =
=====================

The previous section listed all the methods that have to be provided by an OSD
module in order to fit in the LU stack. In addition to those generic functions,
each layer should implement a different class of operations depending on its
nature. There are currently 3 classes of devices:
- LU_DEVICE_DT: DaTa device (e.g. lod, osp, osd, ofd),
- LU_DEVICE_MD: MetaData device (e.g. mdt, mdd),
- LU_DEVICE_CL: CLient I/O device (e.g. vvp, lov, lovsub, osc).

The purpose of this section is to document the DT API (used for devices and
objects) which has to be implemented by each OSD module. The DT API is most
commonly called the OSD API.

To access the disk file system, Lustre defines a new device type called
dt_device which is a sub-class of the generic lu_device. It includes a new
operation vector (namely the dt_device_operations structure) defining all the
actions that can be performed against a dt_device. Here are the operation
prototypes:

    int (*dt_statfs)(const struct lu_env *, struct dt_device *,
                     struct obd_statfs *);
    struct thandle *(*dt_trans_create)(const struct lu_env *,
                                       struct dt_device *);
    int (*dt_trans_start)(const struct lu_env *, struct dt_device *,
                          struct thandle *);
    int (*dt_trans_stop)(const struct lu_env *, struct thandle *);
    int (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *);
    int (*dt_root_get)(const struct lu_env *, struct dt_device *,
                       struct lu_fid *);
    void (*dt_conf_get)(const struct lu_env *, const struct dt_device *,
                        struct dt_device_param *);
    int (*dt_sync)(const struct lu_env *, struct dt_device *);
    int (*dt_ro)(const struct lu_env *, struct dt_device *);
    int (*dt_commit_async)(const struct lu_env *, struct dt_device *);

dt_statfs
    called to report current file system usage information: all, free and
    available blocks/objects.
dt_root_get
    called to get the FID of the root object. Used to follow backend file
    system rules and support the backend file system in a state where users
    can mount it directly (with ldiskfs/zfs/etc).
dt_sync
    called to flush all complete but not yet written transactions. Should
    block until the flush is completed.
dt_ro
    called to turn the backend into read-only mode.
    Used by testing infrastructure to simulate recovery cases.
dt_commit_async
    called to notify the OSD/backend that a higher level needs transactions
    to be flushed as soon as possible. Used by the Commit-on-Share feature.
    Should return immediately and not block for long.

There are two types of DT objects:
1) regular objects, storing unstructured data (e.g. flat files, OST objects,
2) index objects, storing key=value pairs (e.g. directories, quota indexes,

As a result, there are 3 sets of methods that should be implemented by the OSD:
- core methods used to create/destroy/manipulate attributes of objects
- data methods used to access the object body as a flat address space
  (read/write/truncate/punch) for regular objects
- index operations to access index objects as a key-value association

A data object is represented by the dt_object structure which is defined as
a sub-class of lu_object, plus operation vectors for the core, data and index
methods as listed above.

i. Common Storage Operations
----------------------------
The core methods are defined in dt_object_operations as follows:

    void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned);
    void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned);
    void (*do_read_unlock)(const struct lu_env *, struct dt_object *);
    void (*do_write_unlock)(const struct lu_env *, struct dt_object *);
    int (*do_write_locked)(const struct lu_env *, struct dt_object *);
    int (*do_attr_get)(const struct lu_env *, struct dt_object *,
                       struct lu_attr *);
    int (*do_declare_attr_set)(const struct lu_env *, struct dt_object *,
                               const struct lu_attr *, struct thandle *);
    int (*do_attr_set)(const struct lu_env *, struct dt_object *,
                       const struct lu_attr *, struct thandle *);
    int (*do_xattr_get)(const struct lu_env *, struct dt_object *,
                        struct lu_buf *, const char *);
    int (*do_declare_xattr_set)(const struct lu_env *, struct dt_object *,
                                const struct lu_buf *, const char *, int,
                                struct thandle *);
    int (*do_xattr_set)(const struct lu_env *, struct dt_object *,
                        const struct lu_buf *, const char *, int,
                        struct thandle *);
    int (*do_declare_xattr_del)(const struct lu_env *, struct dt_object *,
                                const char *, struct thandle *);
    int (*do_xattr_del)(const struct lu_env *, struct dt_object *,
                        const char *, struct thandle *);
    int (*do_xattr_list)(const struct lu_env *, struct dt_object *,
                         struct lu_buf *);
    void (*do_ah_init)(const struct lu_env *, struct dt_allocation_hint *,
                       struct dt_object *, struct dt_object *, cfs_umode_t);
    int (*do_declare_create)(const struct lu_env *, struct dt_object *,
                             struct lu_attr *, struct dt_allocation_hint *,
                             struct dt_object_format *, struct thandle *);
    int (*do_create)(const struct lu_env *, struct dt_object *,
                     struct lu_attr *, struct dt_allocation_hint *,
                     struct dt_object_format *, struct thandle *);
    int (*do_declare_destroy)(const struct lu_env *, struct dt_object *,
                              struct thandle *);
    int (*do_destroy)(const struct lu_env *, struct dt_object *,
                      struct thandle *);
    int (*do_index_try)(const struct lu_env *, struct dt_object *,
                        const struct dt_index_features *);
    int (*do_declare_ref_add)(const struct lu_env *, struct dt_object *,
                              struct thandle *);
    int (*do_ref_add)(const struct lu_env *, struct dt_object *,
                      struct thandle *);
    int (*do_declare_ref_del)(const struct lu_env *, struct dt_object *,
                              struct thandle *);
    int (*do_ref_del)(const struct lu_env *, struct dt_object *,
                      struct thandle *);
    int (*do_object_sync)(const struct lu_env *, struct dt_object *);

679 The method is called to get regular attributes an object stores.
680 The lu_attr fields maps the usual unix file attributes, like ownership
681 or size. The object must exist.
683 the method is called to notify OSD the caller is going to modify regular
684 attributes of an object in specified transaction. OSD should use this
685 method to reserve resources needed to change attributes. Can be called
686 on an non-existing object.
688 the method is called to change attributes of an object. The object
689 must exist. If the fl argument has LU_XATTR_CREATE, the extended
690 argument must not exist, otherwise -EEXIST should be returned.
691 If the fl argument has LU_XATTR_REPLACE, the extended argument must
692 exist, otherwise -ENODATA should be returned. The object must exist.
693 The maximum size of extended attribute supported by OSD should be
694 present in struct dt_device_param the caller can get with
695 ->dt_conf_get() method.
697 called when the caller needs to get an extended attribute with a
698 specified name. If the struct lu_buf argument has a null lb_buf, the
699 size of the extended attribute should be returned. If the requested
700 extended attribute does not exist, -ENODATA should be returned.
701 The object must exist. If buffer space (specified in lu_buf.lb_len) is
702 not enough to fit the value, then return -ERANGE.
704 called to notify OSD the caller is going to set/change an extended
705 attribute on an object. OSD should use this method to reserve resources
706 needed to change an attribute.
708 called when the caller needs to change an extended attribute with
711 called to notify OSD the caller is going to remove an extended attribute
712 with a specified name
714 called when the caller needs to remove an extended attribute with a
715 specified name. Deleting an nonexistent extended attribute is allowed.
716 The object must exist. The method called on a non-existing attribute
719 called when the caller needs to get a list of existing extended
720 attributes (only names of attributes are returned). The size of the list
721 is returned, including the string terminator. If the lu_buf argument has
722 a null lb_buf, how many bytes the list would require is returned to help
723 the caller to allocate a buffer of an appropriate size.
724 The object must exist.
726 called to let OSD to prepare allocation hint which stores information
727 about object locality, type. later this allocation hint is passed to
728 ->do_create() method and use OSD can use this information to optimize
729 on-disk object location. allocation hint is opaque for the caller and
730 can contain OSD-specific information.
732 called to notify OSD the caller is going to create a new object in a
733 specified transaction.
735 called to create an object on the OSD in a specified transaction.
736 For index objects the caller can request a set of index properties (like
737 key/value size). If OSD cannot support the requested properties, it should
738 return an error. The object must not exist already (i.e.
739 dt_object_exist() should return false).
741 called to notify OSD the caller is going to destroy an object in a
742 specified transaction.
744 called to destroy an object in a specified transaction. Semantically,
745 it is the dual of object creation and does not care about on-disk references
746 to the object (in contrast with the POSIX unlink operation).
747 The object must exist (i.e. dt_object_exist() must return true).
749 called when the caller needs to use an object as an index (the object
750 should have been created as an index before). The caller also specifies a
751 set of properties it expects the index to support.
753 called to notify OSD the caller is going to increment nlink attribute
754 in a specified transaction.
756 called to increment nlink attribute in a specified transaction.
757 The object must exist.
759 called to notify OSD the caller is going to decrement nlink attribute
760 in a specified transaction.
762 called to decrement nlink attribute in a specified transaction.
763 This is typically done on an object when a record referring to it is
764 deleted from an index object. The object must exist.
766 called to flush a given object to disk. It’s a fine-grained version of the
767 ->do_sync() method which should make sure an object is stored on-disk.
768 OSD (or the backend file system) can track the status of every object, and
769 if an object is already flushed, the method can simply return
770 immediately. The method is used on OSS now, but can also be used on MDS
771 at some point to improve performance of COS.
773 the method is not used any more and planned for removal.
775 ii. Data Object Operations
776 --------------------------
777 Set of methods described in struct dt_body_operations which should be used with
778 regular objects storing unstructured data:
780 ssize_t (*dbo_read)(const struct lu_env *, struct dt_object *, struct lu_buf *,
782 ssize_t (*dbo_declare_write)(const struct lu_env *, struct dt_object *,
783 const loff_t, loff_t, struct thandle *);
784 ssize_t (*dbo_write)(const struct lu_env *, struct dt_object *,
785 const struct lu_buf *, loff_t *, struct thandle *, int);
786 int (*dbo_bufs_get)(const struct lu_env *, struct dt_object *, loff_t,
787 ssize_t, struct niobuf_local *, int);
788 int (*dbo_bufs_put)(const struct lu_env *, struct dt_object *,
789 struct niobuf_local *, int);
790 int (*dbo_write_prep)(const struct lu_env *, struct dt_object *,
791 struct niobuf_local *, int);
792 int (*dbo_declare_write_commit)(const struct lu_env *, struct dt_object *,
793 struct niobuf_local *, int, struct thandle *);
794 int (*dbo_write_commit)(const struct lu_env *, struct dt_object *,
795 struct niobuf_local *, int, struct thandle *);
796 int (*dbo_read_prep)(const struct lu_env *, struct dt_object *,
797 struct niobuf_local *, int);
798 int (*dbo_fiemap_get)(const struct lu_env *, struct dt_object *,
799 struct ll_user_fiemap *);
800 int (*dbo_declare_punch)(const struct lu_env *, struct dt_object *, __u64,
801 __u64, struct thandle *);
802 int (*dbo_punch)(const struct lu_env *, struct dt_object *, __u64, __u64,
806 dbo_read is called to read raw unstructured data from a specified range
807 of an object. It returns the number of bytes read or an error. Usually OSD
808 implements this method using internal buffering (to be able to put data
809 at a non-aligned address), so this method should not be used to move a
810 lot of data. Lustre services use it to read small internal data such as
811 the last_rcvd file and llog files. It's also used to fetch symlink bodies.
813 dbo_declare_write is called to notify OSD that the caller will write data
814 to a specific range of an object in a specified transaction.
816 dbo_write is called to write raw unstructured data to a specified range
817 of an object in a specified transaction. Data should be written atomically
818 with the other changes in the transaction. The method is used by Lustre
819 services to update small portions on a disk. OSD should keep the size
820 attribute consistent with the data written.
822 dbo_bufs_get is called to fill memory with buffer descriptors (see struct
823 niobuf_local) for a specified range of an object. Memory for the set is
824 provided by the caller; no concurrent access to this memory is allowed.
825 OSD can fill all fields of the descriptor except lnb_grant_used.
826 The caller specifies whether buffers will be used to read or write data.
827 This method is used to access file system's internal buffers for
828 zero-copy IO. Internal buffers referenced by descriptors are supposed to
831 dbo_bufs_put is called to unpin/release internal buffers referenced by
832 the descriptors dbo_bufs_get returns. After this point the pointers in the
833 descriptors are not valid.
835 dbo_write_prep is called to fill internal buffers with actual data. This
836 is required for buffers which do not match the filesystem block size, as
837 later the buffer is supposed to be written as a whole. For example,
838 ldiskfs uses 4k blocks, but the caller wants to update just half of one.
839 To prevent data corruption, when this method is called OSD compares the
840 range to be written with the 4k block; if they do not match, OSD fetches
841 the data from disk. If they do match, then all the data will be
842 overwritten and there is no need to fetch data from disk.
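The block-alignment comparison described for dbo_write_prep can be sketched as
a small predicate. This is an illustrative user-space sketch only:
MOCK_BLOCK_SIZE and mock_needs_read_before_write() are hypothetical names, and
a real OSD may apply further checks (for example for blocks that are not yet
allocated) that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_BLOCK_SIZE 4096u	/* the 4k ldiskfs block size from the text */

/*
 * Returns true when a write of 'len' bytes at 'offset' only partially
 * covers the first or the last block it touches, i.e. when the old
 * block contents must be read before the block is written as a whole.
 */
static bool mock_needs_read_before_write(uint64_t offset, uint64_t len)
{
	assert(len > 0);

	return (offset % MOCK_BLOCK_SIZE) != 0 ||
	       ((offset + len) % MOCK_BLOCK_SIZE) != 0;
}
```

With these definitions a 4096-byte write at offset 0 needs no read, while a
2048-byte write at offset 0 does.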
843 dbo_declare_write_commit
844 is called to notify OSD the caller is going to write internal buffers
845 and OSD needs to reserve enough resource in a transaction.
847 dbo_write_commit is called to actually make data in internal buffers part
848 of a specified transaction. Data is supposed to be written by the time the
849 transaction is considered committed. This is slightly different from the
850 generic transaction model because in this case it's allowed to have
851 data written, but not have the transaction committed.
852 If no dbo_write_commit is called, then dbo_bufs_put should discard
853 internal buffers and possible changes made to internal buffers should
856 dbo_read_prep is called to fill all internal buffers referenced by
857 descriptors with actual data. Buffers may already contain valid data (be
858 cached), so OSD can just verify the data is valid and return immediately.
860 dbo_fiemap_get is called to map a logical range of an object to the
861 physical blocks where the corresponding range of data is actually stored.
863 dbo_declare_punch is called to notify OSD that the caller is going to
864 punch (deallocate) a specified range in a transaction.
866 dbo_punch is called to punch (deallocate) a specified range of data in a
867 transaction. This method is allowed to use several disk file system
868 transactions (within the same Lustre transaction handle).
869 Currently Lustre calls the method only in the form of truncate, where the
870 end offset is always EOF.
872 iii. Indice Operations
873 ----------------------
874 In contrast with raw unstructured data, indices are collections of key=value
875 pairs. OSD should provide a few methods to look up, insert, delete and scan
876 pairs. Indices may have different properties like key/value size, string/binary
877 keys, etc. A user that needs an index must first check, with a special method,
878 that the index has the required properties. Indices are used by Lustre services
879 to maintain the user-visible namespace, FLD, the index of unlinked files, etc.
881 The method prototypes are defined in dt_index_operations as follows:
883 int (*dio_lookup)(const struct lu_env *, struct dt_object *, struct dt_rec *,
884 const struct dt_key *);
885 int (*dio_declare_insert)(const struct lu_env *, struct dt_object *,
886 const struct dt_rec *, const struct dt_key *,
888 int (*dio_insert)(const struct lu_env *, struct dt_object *,
889 const struct dt_rec *, const struct dt_key *,
890 struct thandle *, int);
891 int (*dio_declare_delete)(const struct lu_env *, struct dt_object *,
892 const struct dt_key *, struct thandle *);
893 int (*dio_delete)(const struct lu_env *, struct dt_object *,
894 const struct dt_key *, struct thandle *);
897 dio_lookup is called to look up an exact key=value pair. The value is
898 copied into a buffer provided by the caller, so the caller should make
899 sure the buffer is big enough. This should be done with ->do_index_try()
902 dio_declare_insert is called to notify OSD that the caller is going to
903 insert a key=value pair in a transaction. The exact key is specified by the
904 caller so OSD can use it to make the reservation better (i.e. smaller).
906 dio_insert is called to insert a key/value pair into an index object. It's
907 up to OSD whether to allow concurrent inserts or not. The caller is not
908 required to serialize access to an index.
910 dio_declare_delete is called to notify OSD that the caller is going to
911 remove a specified key in a transaction. The exact key is specified by the
912 caller so OSD can use it to make the reservation better.
914 dio_delete is called to remove a key/value pair specified by the caller.
916 To iterate over all key=value pairs stored in an index, OSD should provide the
917 following set of methods:
919 struct dt_it *(*init)(const struct lu_env *, struct dt_object *, __u32);
920 void (*fini)(const struct lu_env *, struct dt_it *);
921 int (*get)(const struct lu_env *, struct dt_it *, const struct dt_key *);
922 void (*put)(const struct lu_env *, struct dt_it *);
923 int (*next)(const struct lu_env *, struct dt_it *);
924 struct dt_key *(*key)(const struct lu_env *, const struct dt_it *);
925 int (*key_size)(const struct lu_env *, const struct dt_it *);
926 int (*rec)(const struct lu_env *, const struct dt_it *, struct dt_rec *,
928 __u64 (*store)(const struct lu_env *, const struct dt_it *);
929 int (*load)(const struct lu_env *, const struct dt_it *, __u64);
930 int (*key_rec)(const struct lu_env *, const struct dt_it *, void *);
933 ->init() is called to allocate and initialize an instance of an "iterator"
934 which is passed to subsequent methods. The structure is not accessed by
935 Lustre and its content is totally internal to OSD. Usually it contains a
936 reference to the index and the current position in the index.
937 It may contain prefetched key/value pairs. It's not required to keep
938 this cache up-to-date: if the index changes, this is not required to be
939 reflected by an already initialized iterator. In the extreme case
940 ->init() can prefetch all existing pairs to be returned by subsequent
941 calls to an iterator.
943 ->fini() is called to release an iterator and all its resources.
944 For example, the iterator can unpin an index, free prefetched pairs, etc.
946 ->get() is called to move an iterator to a specified key. If the key does
947 not exist, it should be the closest position from the beginning of iteration.
949 ->put() is called to release an iterator.
951 ->next() is called to move an iterator to the next item.
953 ->key() is called to fill a specified buffer with the key at the current
954 position of an iterator. The caller is responsible for passing a big enough
955 buffer. In turn OSD should not exceed sizes negotiated with ->do_index_try()
958 ->key_size() returns the size of the key at the iterator's current position.
960 ->rec() is called to fill a specified buffer with the value at the current
961 position of an iterator. The caller is responsible for passing a big enough
962 buffer. In turn OSD should not exceed sizes negotiated with ->do_index_try()
965 ->store() is called to get a 64-bit cookie for the iterator's current position.
967 ->load() is called to reset the current position of an iterator to match a
968 64-bit cookie returned by ->store(). These two methods allow implementing
969 functionality like POSIX readdir, where the current position is stored as an
972 ->key_rec() is not used currently.
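How ->store() and ->load() let a scan be resumed later (readdir-style) can be
sketched with a toy iterator over an in-memory key array. Everything here is
hypothetical: struct mock_index, struct mock_it, the mock_it_*() helpers, the
use of the array position as the cookie, and the 0/1/-1 return values are
assumptions of this sketch, not the OSD implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory index: an array of keys in iteration order. */
struct mock_index {
	const uint64_t *keys;
	size_t          count;
};

/* Hypothetical iterator: a reference to the index plus a position. */
struct mock_it {
	const struct mock_index *idx;
	size_t                   pos;
};

static void mock_it_init(struct mock_it *it, const struct mock_index *idx)
{
	assert(it != NULL && idx != NULL);
	it->idx = idx;
	it->pos = 0;
}

/* Moves to the next item; returns 0 on success, 1 at the end. */
static int mock_it_next(struct mock_it *it)
{
	if (it->pos + 1 >= it->idx->count)
		return 1;
	it->pos++;
	return 0;
}

/* Returns the key at the current position. */
static uint64_t mock_it_key(const struct mock_it *it)
{
	assert(it->pos < it->idx->count);
	return it->idx->keys[it->pos];
}

/* Returns a 64-bit cookie describing the current position. */
static uint64_t mock_it_store(const struct mock_it *it)
{
	return (uint64_t)it->pos;
}

/* Repositions the iterator from a cookie; -1 if the cookie is out of range. */
static int mock_it_load(struct mock_it *it, uint64_t cookie)
{
	if (cookie >= it->idx->count)
		return -1;
	it->pos = (size_t)cookie;
	return 0;
}
```

A caller can save the cookie from one iterator, release it, and later load the
cookie into a freshly initialized iterator to continue from the same position.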
979 Transactions are used by Lustre to implement the recovery protocol and support
980 failover. The main purpose of transactions is to atomically update the backend
981 file system. This includes both regular changes (file creation, for example) and
982 special Lustre changes (last_rcvd file, lastid, llogs). OSD is supposed to provide
983 the transactional mechanism and let Lustre control what specific updates to put
986 Lustre relies on the following rule for transaction order: if transaction T1
987 starts before transaction T2 starts, then the commit of T2 means that T1 is
988 committed at the same time or earlier. Notice that the creation of a transaction
989 does not imply the immediate start of the updates on storage; do not confuse
990 the creation of a transaction with the start of a transaction.
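The ordering rule can be expressed as a simple invariant check over a set of
transactions. The sketch below is illustrative only; struct mock_txn, its
start_seq field and mock_commit_order_ok() are hypothetical names that do not
exist in the OSD API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one transaction, for illustration only. */
struct mock_txn {
	unsigned long start_seq;	/* order in which it was started */
	bool          committed;
};

/*
 * Checks the ordering rule: if T1 started before T2 and T2 is
 * committed, then T1 must be committed as well.
 */
static bool mock_commit_order_ok(const struct mock_txn *txns, size_t count)
{
	size_t i, j;

	assert(txns != NULL || count == 0);

	for (i = 0; i < count; i++) {
		for (j = 0; j < count; j++) {
			if (txns[i].start_seq < txns[j].start_seq &&
			    txns[j].committed && !txns[i].committed)
				return false;
		}
	}
	return true;
}
```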
992 It’s up to OSD and the backend file system to group several transactions for
993 better performance, provided they still follow the rule above.
995 Transactions are identified in the OSD API by an opaque transaction handle,
996 which is a pointer to an OSD-private data structure that it can use to track
997 (and optionally verify) the updates done within that transaction. This handle is
998 returned by the OSD to the caller when the transaction is first created.
999 Any potential updates (modifications to the underlying storage) must be declared
1000 as part of a transaction, after the transaction has been created, and before the
1001 transaction is started. The transaction handle is passed when declaring all
1002 updates. If any part of the declaration should fail, the transaction is aborted
1003 without having modified the storage.
1005 After all updates have been declared, and have completed successfully, the
1006 handle is passed to the transaction start. After the transaction has started,
1007 the handle will be passed to every update that is done as part of that
1008 transaction. All updates done under the transaction must previously have been
1009 declared. Once the transaction has started, it is not permitted to add new
1010 updates to the transaction, nor is it possible to roll back the transaction
1011 after this point. Should some update to the storage fail, the caller will try
1012 to undo the previous updates within the context of the transaction itself, to
1013 ensure that the resulting OSD state is correct.
1015 Any update that was not previously declared is an implementation error in the
1016 caller. Not all declared updates need to be executed, as they form a worst-case
1017 superset of the possible updates that may be required in order to complete the
1018 desired operation in a consistent manner.
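The declare/start/execute discipline described above can be modeled as a small
state machine. This is a sketch under stated assumptions: struct mock_thandle
only counts updates instead of reserving real resources, and all mock_*()
names and the error codes chosen (-EINVAL, -ENOSPC) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum mock_th_state { MOCK_TH_CREATED, MOCK_TH_STARTED, MOCK_TH_STOPPED };

/* Hypothetical transaction handle that only counts updates. */
struct mock_thandle {
	enum mock_th_state state;
	unsigned int       declared;	/* worst-case number of updates */
	unsigned int       executed;	/* updates actually performed */
};

static void mock_trans_create(struct mock_thandle *th)
{
	assert(th != NULL);
	th->state = MOCK_TH_CREATED;
	th->declared = 0;
	th->executed = 0;
}

/* Declarations are accepted only before the transaction is started. */
static int mock_declare_update(struct mock_thandle *th)
{
	if (th->state != MOCK_TH_CREATED)
		return -EINVAL;
	th->declared++;
	return 0;
}

static int mock_trans_start(struct mock_thandle *th)
{
	if (th->state != MOCK_TH_CREATED)
		return -EINVAL;
	th->state = MOCK_TH_STARTED;
	return 0;
}

/* An update is legal only after start and within what was declared. */
static int mock_update(struct mock_thandle *th)
{
	if (th->state != MOCK_TH_STARTED)
		return -EINVAL;
	if (th->executed >= th->declared)
		return -ENOSPC;	/* undeclared update: a caller bug */
	th->executed++;
	return 0;
}

static int mock_trans_stop(struct mock_thandle *th)
{
	if (th->state != MOCK_TH_STARTED)
		return -EINVAL;
	th->state = MOCK_TH_STOPPED;
	return 0;
}
```

Executing fewer updates than were declared is fine in this model, which
mirrors the point that declarations form a worst-case superset.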
1020 OSD should let a caller register callback function(s) to be called on
1021 transaction commit to disk. OSD should also be able to call a special set of
1022 transaction hooks at all stages (creation, start, stop, commit) on a
1023 per-device basis, so that high-level services (like MDT) which are not directly
1024 involved in controlling transactions can still take part.
1025 Every commit callback gets the result of the transaction commit: if the disk
1026 filesystem was not able to commit the transaction, an appropriate error code is passed.
1028 It’s important to note that OSD and the disk file system should use asynchronous
1029 IO to implement transactions; otherwise performance is expected to be poor.
1031 The maximum number of updates that make up a single transaction is OSD-specific,
1032 but is expected to be at least in the tens of updates to multiple objects in the
1033 OSD (extending writes of multiple MB of data, modifying or adding attributes,
1034 extended attributes, references, etc). For example, in ext4, each update to the
1035 filesystem will modify one or more blocks of storage. Since one transaction is
1036 limited to one quarter of the journal size, if the caller declares a series of
1037 updates that modify more than this number of blocks, the declaration must fail
1038 or it could not be committed atomically.
1039 In general, every constraint must be checked here to ensure that all changes
1040 that must commit atomically can complete successfully.
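As a worked example of the ext4 limit mentioned above, a declaration check
could compare the number of declared blocks with one quarter of the journal
size. The function name mock_check_declared() and the -ENOSPC error code are
assumptions of this sketch; the real check lives in the backend file system.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Returns 0 if 'declared_blocks' fits into a single transaction under
 * the "one quarter of the journal" limit quoted above for ext4, or
 * -ENOSPC otherwise. Both sizes are counted in journal blocks.
 */
static int mock_check_declared(uint64_t journal_blocks,
			       uint64_t declared_blocks)
{
	uint64_t limit;

	assert(journal_blocks > 0);

	limit = journal_blocks / 4;
	return declared_blocks <= limit ? 0 : -ENOSPC;
}
```

For a journal of 32768 blocks the limit is 8192 blocks, so a declaration of
8193 blocks would have to fail in this model.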
1044 From Lustre's point of view a transaction goes through the following steps:
1045 1. transaction creation
1046 2. declaration of all possible changes planned in transaction
1047 3. transaction start
1048 4. execution of planned and declared changes
1049 5. transaction stop
1050 6. commit callback(s)
1054 OSD should implement the following methods to let Lustre control transactions:
1056 struct thandle *(*dt_trans_create)(const struct lu_env *, struct dt_device *);
1057 int (*dt_trans_start)(const struct lu_env *, struct dt_device *,
1059 int (*dt_trans_stop)(const struct lu_env *, struct thandle *);
1060 int (*dt_trans_cb_add)(struct thandle *, struct dt_txn_commit_cb *);
1063 dt_trans_create is called to allocate and initialize a transaction handle
1064 (see struct thandle). This structure has no pointer to private data, so it
1065 should be embedded into the private representation of a transaction at the
1066 OSD layer. This method can block.
1068 dt_trans_start is called to notify OSD that a specified transaction has
1069 got all the declarations, and OSD should now tell whether it has enough
1070 resources to proceed with the declared changes or return an error to the caller.
1071 This method can block. OSD should call the dt_txn_hook_start() function
1072 before the underlying file system’s transaction starts, to support per-device
1073 transaction hooks. If OSD (or the disk file system) cannot start the
1074 transaction, then an error is returned and the transaction handle is
1075 destroyed; no commit callbacks are called.
1077 dt_trans_stop is called to notify OSD that a specified transaction has been
1078 executed and no more changes are expected in its context. Usually this means
1079 that at this point OSD is free to start writeout, preserving the
1080 all-or-nothing notion. This method can block.
1081 If the th_sync flag is set at this point, then OSD should start to commit
1082 this transaction and block until the transaction is committed. The order
1083 of the unblock event and the transaction’s commit callback functions is not
1084 defined by the API. OSD should call the dt_txn_hook_stop() function once the
1085 underlying file system’s transaction is stopped to support per-device
1088 dt_trans_cb_add is called to register commit callback function(s), which
1089 OSD will call upon transaction commit to storage. When all the callback
1090 functions are processed, the transaction handle can be freed by OSD.
1091 There are no constraints on how many callback functions can be running
1092 concurrently. They should not be running in an interrupt context.
1093 Usually this method should not block and should use spinlocks. As part of
1094 commit callback function processing, the dt_txn_hook_commit() function
1095 should be called to support per-device transaction hooks.
1097 The callback mechanism lets layers that do not command transactions be involved.
1098 For example, MDT registers its set of hooks, and now every transaction happening
1099 on the corresponding OSD will be seen by MDT, which adds recovery information to
1100 the transactions: it generates a transaction number and puts it into a special
1101 file -- all this happens within the context of the transaction, so atomically.
1102 Similarly, VBR functionality in MDT updates object versions.
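The registration of commit callbacks and the delivery of the commit result to
each of them can be sketched as follows. This is a single-threaded user-space
mock: struct mock_thandle, mock_commit_cb_t, the fixed-size callback table and
the mock_*() functions are hypothetical and only loosely modeled on
dt_trans_cb_add; locking and the per-device hooks are not shown.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_CB 4

struct mock_thandle;

/* Hypothetical commit callback: receives the commit result (0 or -errno). */
typedef void (*mock_commit_cb_t)(struct mock_thandle *th, void *data, int err);

struct mock_thandle {
	mock_commit_cb_t cb[MOCK_MAX_CB];
	void            *cb_data[MOCK_MAX_CB];
	size_t           cb_count;
};

/* Registers a callback; returns 0, or -1 if the table is full. */
static int mock_trans_cb_add(struct mock_thandle *th, mock_commit_cb_t cb,
			     void *data)
{
	assert(th != NULL && cb != NULL);

	if (th->cb_count >= MOCK_MAX_CB)
		return -1;
	th->cb[th->cb_count] = cb;
	th->cb_data[th->cb_count] = data;
	th->cb_count++;
	return 0;
}

/* Called by the mock "OSD" once the commit result is known. */
static void mock_trans_committed(struct mock_thandle *th, int err)
{
	size_t i;

	for (i = 0; i < th->cb_count; i++)
		th->cb[i](th, th->cb_data[i], err);
	th->cb_count = 0;	/* the handle could be freed at this point */
}

/* Example callback: records the commit result it was given. */
static void mock_record_result(struct mock_thandle *th, void *data, int err)
{
	(void)th;
	*(int *)data = err;
}
```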
1109 OSD is expected to maintain internal consistency of the file system and its
1110 objects on its own, requiring no additional locking or serialization from higher
1111 levels. This lets OSD control how fine the locking is depending on the
1112 internal structuring of a specific file system. If several updates conflict,
1113 then the result is not defined by the OSD API and is left to OSD.
1115 OSD should provide the caller with a few methods to serialize access to an object
1116 in shared and exclusive mode. It’s up to the caller how to use them and how to
1117 order locking. In general the locks provided by OSD are used to group complex
1118 updates so that other threads do not see intermediate results of operations.
1122 Methods to lock/unlock object
1123 The set of methods exported by each OSD to manage locking is the following:
1124 void (*do_read_lock)(const struct lu_env *, struct dt_object *, unsigned);
1125 void (*do_write_lock)(const struct lu_env *, struct dt_object *, unsigned);
1126 void (*do_read_unlock)(const struct lu_env *, struct dt_object *);
1127 void (*do_write_unlock)(const struct lu_env *, struct dt_object *);
1128 int (*do_write_locked)(const struct lu_env *, struct dt_object *);
1131 do_read_lock: get a shared lock on the object; this is a blocking lock.
1133 do_write_lock: get an exclusive lock on the object; this is a blocking lock.
1135 do_read_unlock: release a shared lock on an object; this is a blocking lock.
1137 do_write_unlock: release an exclusive lock on an object; this is a blocking lock.
1139 do_write_locked: check whether an object is exclusive-locked.
1141 It is highly desirable that an OSD object can be accessed and modified by
1142 multiple threads concurrently.
1144 For regular objects, the preferred implementation allows an object to be read
1145 concurrently at overlapping offsets, and written by multiple threads at
1146 non-overlapping offsets with the minimum amount of contention possible, or any
1147 combination of concurrent read/write operations. Lustre will not itself perform
1148 concurrent overlapping writes to a single region of the object, due to
1149 serialization at a higher level.
1151 For index objects, the preferred implementation allows key/value pairs to be
1152 looked up concurrently, allows non-conflicting keys to be inserted or removed
1153 concurrently, or any combination of concurrent lookup, insertion, or removal.
1154 Lustre does not require the storage of multiple identical keys. Operations on
1155 the same key should be serialized.
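Whether two writes to a regular object may proceed without contention depends
on whether their byte ranges overlap. A minimal sketch of such a test is shown
below; struct mock_range and mock_ranges_overlap() are hypothetical names, and
how an OSD actually tracks ranges is left to the implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open byte range [start, end) within an object. */
struct mock_range {
	uint64_t start;
	uint64_t end;
};

/* Two ranges overlap when each one starts before the other ends. */
static bool mock_ranges_overlap(struct mock_range a, struct mock_range b)
{
	assert(a.start <= a.end && b.start <= b.end);

	return a.start < b.end && b.start < a.end;
}
```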
1157 ========================
1158 = V. Quota Enforcement =
1159 ========================
1164 The OSD layer is in charge of setting up a Quota Slave Device (aka QSD) to
1165 manage quota enforcement for a specific OSD device. The QSD is implemented in
1166 the form of a library. Each OSD device should create a QSD instance which will
1167 be used to manage quota enforcement for this device. This implies:
1168 - completing the reintegration procedure with the quota master (aka QMT) to
1169 retrieve the latest quota settings and quota space distribution for each
1171 - managing quota locks in order to be notified of configuration changes.
1172 - acquiring space from the QMT when quota space for a given user/group is
1173 close to exhaustion.
1174 - allocating quota space to service threads for local request processing.
1176 The reintegration procedure allows a disconnected slave to re-synchronize with
1177 the quota master, which means:
1178 - re-acquiring quota locks,
1179 - fetching up-to-date quota settings (e.g. list of UIDs with quota enforced),
1180 - reporting space usage to the master for newly enforced UID/GID (e.g. setquota
1181 was run while the slave wasn't connected),
1182 - adjusting spare quota space (e.g. the slave holds a large amount of unused quota
1183 space for a user who ran out of quota space on the master while the slave
1186 The latter two actions are known as reconciliation.
1191 The QSD API is defined in lustre/include/lustre_quota.h as follows:
1193 struct qsd_instance *qsd_init(const struct lu_env *, char *, struct dt_device *,
1194 struct proc_dir_entry *);
1195 int qsd_prepare(const struct lu_env *, struct qsd_instance *);
1196 int qsd_start(const struct lu_env *, struct qsd_instance *);
1197 void qsd_fini(const struct lu_env *, struct qsd_instance *);
1198 int qsd_op_begin(const struct lu_env *, struct qsd_instance *,
1199 struct lquota_trans *, struct lquota_id_info *, int *);
1200 void qsd_op_end(const struct lu_env *, struct qsd_instance *,
1201 struct lquota_trans *);
1202 void qsd_op_adjust(const struct lu_env *, struct qsd_instance *,
1203 union lquota_id *, int);
1206 The OSD module should first allocate a qsd instance via qsd_init.
1207 This creates all required structures to manage quota enforcement for
1208 this target and performs all low-level initialization which does not
1209 involve any lustre object. qsd_init should typically be called when
1210 the OSD is being set up.
1213 This sets up on-disk objects associated with the quota slave feature
1214 and initiates the quota reintegration procedure if needed.
1215 qsd_prepare should typically be called when ->ldo_prepare is invoked.
1218 A qsd instance should be started once recovery is completed (i.e. when
1219 ->ldo_recovery_complete is called). This is used to notify the qsd layer
1220 that quota should now be enforced again via the qsd_op_begin/end
1221 functions. The last step of the reintegration procedure (namely usage
1222 reconciliation) will be completed during start.
1225 qsd_fini is used to release a qsd_instance structure allocated with qsd_init.
1226 This releases all quota slave objects and frees the structures
1227 associated with the qsd_instance.
1230 qsd_op_begin is used to enforce quota; it must be called in the declaration
1231 of each operation. qsd_op_end should then be invoked later once all operations
1232 have been completed in order to release/adjust the quota space.
1233 Running qsd_op_begin before qsd_start isn't fatal and will return
1234 success. Once qsd_start has been run, qsd_op_begin will block until the
1235 reintegration procedure is completed.
1238 qsd_op_end performs the post-operation quota processing. This must be called
1239 after the operation transaction has stopped. While qsd_op_begin must be invoked
1240 each time a new operation is declared, qsd_op_end should be called only
1241 once for the whole transaction.
1244 qsd_op_adjust triggers pre-acquire/release if necessary; it's only used for
1245 the ldiskfs OSD so far. When a file is unlinked in ldiskfs, the quota
1246 accounting isn't updated when the transaction stops. Instead, it is updated on
1247 the final iput, so qsd_op_adjust() will be called then (in
1248 osd_object_delete()) to trigger quota release if necessary.
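The calling pattern of qsd_op_begin and qsd_op_end (one begin per declared
operation, one end per transaction, no enforcement before start) can be
illustrated with a toy single-ID model. This sketch does not use the real QSD
library: struct mock_qsd, the mock_qsd_*() functions, the space accounting and
the -EDQUOT result are assumptions made for the example, and blocking on
reintegration is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, single-ID model of a quota slave; not the real qsd code. */
struct mock_qsd {
	bool     started;	/* set by mock_qsd_start() */
	uint64_t granted;	/* space granted to this slave */
	uint64_t used;		/* space already consumed */
	uint64_t pending;	/* space reserved by declared operations */
};

static void mock_qsd_init(struct mock_qsd *q, uint64_t granted)
{
	assert(q != NULL);
	q->started = false;
	q->granted = granted;
	q->used = 0;
	q->pending = 0;
}

static void mock_qsd_start(struct mock_qsd *q)
{
	q->started = true;
}

/*
 * Called once per declared operation. Before start, quota is not
 * enforced and the call simply succeeds.
 */
static int mock_qsd_op_begin(struct mock_qsd *q, uint64_t space)
{
	if (!q->started)
		return 0;
	if (q->used + q->pending + space > q->granted)
		return -EDQUOT;
	q->pending += space;
	return 0;
}

/* Called once per transaction, after the transaction has stopped. */
static void mock_qsd_op_end(struct mock_qsd *q, uint64_t consumed)
{
	if (q->started)
		q->used += consumed;
	q->pending = 0;		/* release whatever was reserved */
}
```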
1250 Appendix 1. A brief note on Lustre configuration.
1251 =================================================
1253 In the current versions (1.8, 2.x) MGS is used to store the configuration of
1254 the servers, the so-called profile. The profile stores configuration commands and
1255 arguments to set up a specific stack. To see exactly how it looks you can fetch
1256 the MDT profile with debugfs -R "dump /CONFIGS/lustre-MDT0000 <tempfile>", then
1257 parse it with: llog_reader <tempfile>. Here is a short extract:
1259 #02 (136)attach 0:lustre-MDT0000-mdtlov 1:lov 2:lustre-MDT0000-mdtlov_UUID
1260 #03 (176)lov_setup 0:lustre-MDT0000-mdtlov 1:(struct lov_desc)
1261 uuid=lustre-MDT0000-mdtlov_UUID stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1
1262 #06 (120)attach 0:lustre-MDT0000 1:mdt 2:lustre-MDT0000_UUID
1263 #07 (112)mount_option 0: 1:lustre-MDT0000 2:lustre-MDT0000-mdtlov
1264 #08 (160)setup 0:lustre-MDT0000 1:lustre-MDT0000_UUID 2:0 3:lustre-MDT0000-mdtlov 4:f
1265 #23 (080)add_uuid nid=10.0.2.15@tcp(0x200000a00020f) 0: 1:10.0.2.15@tcp
1266 #24 (144)attach 0:lustre-OST0000-osc-MDT0000 1:osc 2:lustre-MDT0000-mdtlov_UUID
1267 #25 (144)setup 0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:10.0.2.15@tcp
1268 #26 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov 1:lustre-OST0000_UUID 2:0 3:1
1269 #32 (080)add_uuid nid=10.0.2.15@tcp(0x200000a00020f) 0: 1:10.0.2.15@tcp
1270 #33 (144)attach 0:lustre-OST0001-osc-MDT0000 1:osc 2:lustre-MDT0000-mdtlov_UUID
1271 #34 (144)setup 0:lustre-OST0001-osc-MDT0000 1:lustre-OST0001_UUID 2:10.0.2.15@tcp
1272 #35 (136)lov_modify_tgts add 0:lustre-MDT0000-mdtlov 1:lustre-OST0001_UUID 2:1 3:1
1273 #41 (120)param 0: 1:sys.jobid_var=procname_uid 2:procname_uid
1274 #44 (080)set_timeout=20
1275 #48 (112)param 0:lustre-MDT0000-mdtlov 1:lov.stripesize=1048576
1276 #51 (112)param 0:lustre-MDT0000-mdtlov 1:lov.stripecount=-1
1277 #54 (160)param 0:lustre-MDT0000 1:mdt.identity_upcall=/work/lustre/head/lustre-release/lustre/utils/l_getidentity
1279 Every line starts with a specific command (attach, lov_setup, set, etc) that
1280 performs a specific configuration action. The arguments follow. Often the first
1281 argument is a device name. For example,
1282 #02 (136)attach 0:lustre-MDT0000-mdtlov 1:lov 2:lustre-MDT0000-mdtlov_UUID
1284 This command sets up the device “lustre-MDT0000-mdtlov” of type “lov”
1285 with the additional argument “lustre-MDT0000-mdtlov_UUID”. All these arguments are
1286 packed into Lustre configuration buffers (struct lustre_cfg).
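To make the "N:value" argument layout of the llog_reader output concrete, the
following sketch extracts one numbered argument from a record line. It is a
text-parsing illustration only: mock_cfg_arg() is a hypothetical helper, it
assumes arguments are separated by single spaces as in the extract above, and
it does not handle values that themselves contain spaces (such as the
lov_setup descriptor). It does not touch struct lustre_cfg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copies the value of argument number 'index' (the text after "N:")
 * from one llog_reader output line into 'out'. Returns 0 on success,
 * -1 if the argument is absent or does not fit into 'out'.
 */
static int mock_cfg_arg(const char *line, int index, char *out, size_t out_len)
{
	const char *p = line;

	assert(line != NULL && out != NULL && out_len > 0);

	while ((p = strchr(p, ' ')) != NULL) {
		char *end;
		long n;
		size_t len;

		p++;				/* skip the space */
		n = strtol(p, &end, 10);
		if (end == p || *end != ':')
			continue;		/* token is not "N:value" */
		if (n != index)
			continue;
		end++;				/* skip ':' */
		len = strcspn(end, " ");
		if (len >= out_len)
			return -1;
		memcpy(out, end, len);
		out[len] = '\0';
		return 0;
	}
	return -1;
}
```

For the attach record shown above, argument 0 is the device name and argument
1 is the device type.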
1288 Other commands attach a device into the stack (like setup and
1291 Appendix 2. Sample Code
1292 =======================
1294 Lustre currently has 2 different OSD implementations:
1295 - ldiskfs OSD under lustre/osd-ldiskfs
1296 http://git.whamcloud.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-ldiskfs;hb=HEAD
1297 - ZFS OSD under lustre/osd-zfs
1298 http://git.whamcloud.com/?p=fs/lustre-release.git;a=tree;f=lustre/osd-zfs;hb=HEAD