Lustre Protocol Documentation
=============================
Andrew Uselton <andrew.c.uselton@intel.com>
:author: Andrew Uselton
:website: http://lustre.org/
:keywords: PtlRPC, Lustre, Protocol
:numbered!:

[abstract]
Abstract
--------
The Lustre parallel file
system <<lustre>> provides a global POSIX <<POSIX>> namespace for the
computing resources of a data center. Lustre runs on Linux-based hosts
via kernel modules, and delegates block storage management to the
back-end servers while providing object-based storage to its
clients. Servers are responsible for both data objects (the contents
of actual files) and index objects (for directory information). Data
objects are stored on Object Storage Servers (OSSs), and index
objects are stored on Metadata Servers (MDSs). Each back-end
storage volume is a target: Object Storage Targets (OSTs) reside on
OSSs, and Metadata Targets (MDTs) reside on MDSs. Clients assemble the
data from the MDT(s) and OST(s) to present a single coherent
POSIX-compliant file system. The clients and servers communicate and
coordinate among themselves via network protocols. A low-level
protocol, LNet, abstracts the details of the underlying networking
hardware and presents a uniform interface, originally based on Sandia
Portals <<PORTALS>>, to Lustre clients and servers. Lustre, in turn,
layers its own protocol, PtlRPC, atop LNet. This document describes the
Lustre protocols, including how they are implemented via PtlRPC and the
Lustre Distributed Lock Manager (based on the VAX/VMS Distributed Lock
Manager <<VAX_DLM>>). This document does not describe Lustre itself in
any detail, except where doing so explains concepts that allow this
document to be self-contained.

'Content to be provided'

These are the messages that traverse the network using PtlRPC.

This initial list combines some actual message names or types with the
POSIX semantic operations they are being used to implement, as well as
a few other underlying mechanisms (cf. "grant"). A subsequent
refinement will separate the various items and relate them to one
another.

Client-MDS RPCs for POSIX namespace operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'Content to be provided'

image:mkdir1.png[mkdir]

Client-MDS RPCs for internal state management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'Content to be provided'

Client-OSS RPCs for IO Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'Content to be provided'

MDS-OSS RPCs for internal state management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'Content to be provided'

=== object precreation ===

'Content to be provided'

=== orphan recovery ===

'Content to be provided'

=== UID/GID change ===

'Content to be provided'

MDS-OSS RPCs for quota management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'Content to be provided'

MDS-OSS OUT RPCs for distributed updates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'Content to be provided'

=== DNE1 remote directories ===

'Content to be provided'

=== DNE2 striped directories ===

'Content to be provided'

=== LFSCK2/3 verification and repair ===

'Content to be provided'

Each file operation in Lustre generates a set of messages in a
particular sequence. A given concrete operation produces exactly one
sequence, but the same file operation may produce a different
sequence under different circumstances.

For each file operation, the collection of possible message
sequences is governed by a state machine.
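
The idea of a state machine governing message sequences can be sketched as follows. This is a minimal illustration only: the state names, the message names (`lock_request`, `lock_reply`, `op_request`, `op_reply`), and the transition table are all hypothetical and are not taken from PtlRPC or from the Lustre source. The sketch assumes a single file operation that may or may not need to acquire a lock first, which gives two distinct valid sequences for the same operation.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    AWAIT_LOCK = auto()
    LOCKED = auto()
    AWAIT_REPLY = auto()
    DONE = auto()


# Hypothetical transition table: (current state, message) -> next state.
# The message names are placeholders, not actual Lustre message types.
TRANSITIONS = {
    (State.IDLE, "lock_request"): State.AWAIT_LOCK,
    (State.AWAIT_LOCK, "lock_reply"): State.LOCKED,
    (State.LOCKED, "op_request"): State.AWAIT_REPLY,
    # Assumed shortcut: no lock exchange is needed, so the
    # operation request is sent directly from the idle state.
    (State.IDLE, "op_request"): State.AWAIT_REPLY,
    (State.AWAIT_REPLY, "op_reply"): State.DONE,
}


def accepts(messages):
    """Return True if the message sequence drives the machine to DONE."""
    state = State.IDLE
    for message in messages:
        next_state = TRANSITIONS.get((state, message))
        if next_state is None:
            # No transition is defined: the sequence is not valid.
            return False
        state = next_state
    return state is State.DONE
```

Under these assumptions, both `["op_request", "op_reply"]` and `["lock_request", "lock_reply", "op_request", "op_reply"]` are accepted, while a sequence that skips a reply or ends early is rejected.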

Here are some common terms used in discussing Lustre, POSIX semantics,
and the protocols used to implement them.

Object Storage Server::
An object storage server (OSS) is a computer responsible for
running Lustre kernel services in support of managing bulk data
objects on the underlying storage. There can be multiple OSSs in a
Lustre file system.
Metadata Server::
A metadata server (MDS) is a computer responsible for running the
Lustre kernel services in support of managing the POSIX-compliant
namespace and the indices associating files in that namespace with
the locations of their corresponding objects. As of v2.4 there can
be multiple MDSs in a Lustre file system.
Object Storage Target::
An object storage target (OST) is the service provided by an OSS
that mediates the placement of data objects on the specific
underlying file system hardware. There can be multiple OSTs on an
OSS.
Metadata Target::
A metadata target (MDT) is the service provided by an MDS that
mediates the management of namespace indices on the underlying file
system hardware. As of v2.4 there can be multiple MDTs on an MDS.
Server::
A computer providing a service, such as an OSS or an MDS.
Target::
Storage available to be served, such as an OST or an MDT. Also the
service being provided.
Protocol::
An agreed-upon formalism for communicating between two entities,
such as between two servers or between a client and a server.
Client::
A computer taking advantage of a service provided by a server, such
as a Lustre client using MDS(s) and OSS(s) to assemble a
POSIX-compliant file system with its namespace and data storage
PtlRPC::
The protocol (or set of protocols), implemented via RPCs, that
Lustre employs to communicate between its clients and servers.
Remote Procedure Call::
A remote procedure call (RPC) is a mechanism by which one computer
performs an operation on behalf of another.
LNet::
A lower-level protocol employed by PtlRPC to abstract the mechanisms
provided by various hardware-centric protocols, such as TCP or

'Content to be provided'

Copyright (C) Intel 2015

This work is licensed under a Creative Commons Attribution-ShareAlike
4.0 International License (CC BY-SA 4.0). See
<https://creativecommons.org/licenses/by-sa/4.0/> for more detail.

Here is a selected list of references, including those cited in the

- [[[lustre]]] 'Lustre'. http://lustre.opensfs.org
- [[[POSIX]]] 'POSIX'. http://pubs.opengroup.org/onlinepubs/9699919799/
- [[[PORTALS]]] 'The Portals 3.0 Message Passing
  Interface Revision 1.1'. Ron Brightwell, Trammel
  Hudson, Rolf Riesen, and Arthur B. Maccabe. Technical
  report, December 1999.
- [[[VAX_DLM]]] 'The VAX/VMS Distributed Lock Manager'. W Snaman and
  D Thiel. Digital Technical Journal, September 1987.
- [[[Barton_Dilger]]] 'Lustre'. Eric Barton and Andreas Dilger. A book
  on parallel file systems. Chapter 8. High
  Performance Parallel I/O, Prabhat and Quincey
  Koziol, Chapman and Hall/CRC Press, 2014, ISBN: