Documentation/ladvise_lockahead.txt

   1 Ladvise Lock Ahead design
   2
   3 Lock ahead is a new Lustre feature aimed at solving a long standing problem
   4 with shared file write performance in Lustre.  It requires client and server
   5 support.  It will be used primarily via the MPI-I/O library, not directly from
   6 user applications.
   7
   8 The first part of this document (sections 1 and 2) is an overview of the
   9 problem and high level description of the solution.  Section 3 explains how the
  10 library will make use of this feature, and sections 4 and 5 describe the design
  11 of the Lustre changes.
  12
  13 1. Overview: Purpose & Interface
  14 Lock ahead is intended to allow optimization of certain I/O patterns which
  15 would otherwise suffer LDLM* lock contention.  It allows applications to
  16 manually request locks on specific extents of a file, avoiding the usual
  17 server side optimizations. This applications which know their I/O pattern to
  18 use that information to avoid false conflicts due to server side optimizations.
  19
  20 *Lustre distributed lock manager.  This is the locking layer shared between
  21 clients and servers, to manage access between clients.
  22
  23 Normally, clients get locks automatically as the first step of an I/O.
  24 The client asks for a lock which covers exactly the area of interest (ie, a
  25 read or write lock of n bytes at offset x), but the server attempts to optimize
  26 this by expanding the lock to cover as much of the file as possible.  This is
  27 useful for a single client, but can be trouble for multiple clients.
  28
  29 In cases where multiple clients wish to write to the same file, this
  30 optimization can result in locks that conflict when the actual I/O operations
  31 do not.  This requires clients to wait for one another to complete I/O, even
  32 when there is no conflict between actual I/O requests.  This can significantly
  33 reduce performance (Anywhere from 40-90%, depending on system specs) for some
  34 workloads.
  35
  36 The lockahead feature makes it possible to avoid this problem by acquiring the
  37 necessary locks in advance, by explicit requests with server side extent
  38 changes disabled.  We add a new lfs advice type, LU_LADVISE_LOCKAHEAD,
  39 which allows lock requests from userspace on the client, specifying the extent
  40 and the I/O mode (read/write) for the lock.  These lock requests explicitly
  41 disable server side changes to the lock extent, so the lock returned to the
  42 client covers only the extent requested.
  43
  44 When using this feature, clients which intend to write to a file can request
  45 locks to cover their I/O pattern, wait a moment for the locks to be granted,
  46 then write or read the file.
  47
  48 In this way, a set of clients which knows their I/O pattern in advance can
  49 force the LDLM layer to grant locks appropriate for that I/O pattern.  This
  50 allows applications which are poorly handled by the default lock optimization
  51 behavior to significantly improve their performance.
  52
  53 2. I/O Pattern & Locking problems
  54 2. A. Strided writing and MPI-I/O
  55 There is a thorough explanation and overview of strided writing and the
  56 benefits of this functionality in the slides from the lock ahead presentation
  57 at LUG 2015.  It is highly recommended to read that first, as the graphics are
  58 much clearer than the prose here.
  59
  60 See slides 1-13:
  61 http://wiki.lustre.org/images/f/f9/Shared-File-Performance-in-Lustre_Farrell.pdf
  62
  63 MPI-I/O uses strided writing when doing I/O from a large job to a single file.
  64 I/O is aggregated from all the nodes running a particular application to a
  65 small number of I/O aggregator nodes which then write out the data, in a
  66 strided manner.
  67
  68 In strided writing, different clients take turns writing different blocks of a
  69 file (A block is some arbitrary number of bytes).  Client 1 is responsible for
  70 writes to block 0, block 2, block 4, etc., client 2 is responsible for block 1,
  71 block 3, etc.
  72
  73 Without the ability to manually request locks, strided writing is set up in
  74 concert with Lustre file striping so each client writes to one OST.  (IE, for a
  75 file striped to three OSTs, we would write from three clients.)
  76
  77 The particular case of interest is when we want to use more than one client
  78 per OST.  This is important, because an OST typically has much more bandwidth
  79 than one client.  Strided writes are non-overlapping, so they should be able to
  80 proceed in parallel with more than one client per OST.  In practice, on Lustre,
  81 they do not, due to lock expansion.
  82
  83 2. B. Locking problems
  84 We will now describe locking when there is more than one client per OST.  This
  85 behavior is the same on a per OST basis in a file striped across multiple OSTs.
  86 When the first client asks to write block 0, it asks for the required lock from
  87 the server.  When it receives this request, the server sees that there are no
  88 other locks on the file.  Since it assumes the client will want to write to the
  89 file again, the server expands the lock as far as possible.  In this case, it
  90 expands the lock to the maximum file size (effectively, to infinity), then
  91 grants it to client 1.
  92
  93 When client 2 wants to write block 1, it conflicts with the expanded lock
  94 granted to client 1.  The server then must revoke (In Lustre terms,
  95 'call back') the lock granted to client 1 so it can grant a lock to client 2.
  96 After the lock granted to client is revoked, there are no locks on the file.
  97 The server sees this when processing the lock request from client 2, and
  98 expands that lock to cover the whole file.
  99
 100 Client 1 then wishes to write block 3 of the file...  And the cycle continues.
 101 The two clients exchange the extended lock throughout the write, allowing only
 102 one client to write at a time, plus latency to exchange the lock.  The effect is
 103 dramatic: Two clients are actually slower than one.  (Similar behavior is seen
 104 with more than two clients.)
 105
 106 The solution is to use this new advice type to acquire locks before they are
 107 needed.  In effect, before it starts writing to the file, client 1 requests
 108 locks on block 0, block 2, etc. It locks 'ahead' a certain (tunable) number of
 109 locks. Client 2 does the same.  Then they both begin to write, and are able to
 110 do so in parallel.  A description of the actual library implementation follows.
 111
 112 3. Library implementation
 113 Actually implementing this in the library carries a number of wrinkles.
 114 The basic pattern is this:
 115 Before writing, an I/O aggregator requests a certain number of locks on blocks
 116 that it is responsible for.  It may or may not ever write to these blocks, but
 117 it takes locks knowing it might.  It then begins to write, tracking how many of
 118 the locks it has used.  When the number of locks 'ahead' of the I/O is low
 119 enough, it requests more locks in advance of the I/O.
 120
 121 For technical reasons which are explained in the implementation section, these
 122 lock requests are either asynchronous and non-blocking or synchronous and
 123 blocking.  In Lustre terms, non-blocking means if there is already a lock on
 124 the relevant extent of the file, the manual lock request is not granted.  This
 125 means that if there is already a lock on the file (quite common; imagine
 126 writing to a file which was previously read by another process), these lock
 127 requests will be denied.  However, once the first 'real' write arrives that
 128 was hoping to use a lockahead lock, that write will cause the blocking lock to
 129 be cancelled, so this interference is not fatal.
 130
 131 It is of course possible for another process to get in the way by immediately
 132 asking for a lock on the file.  This is something users should try to avoid.
 133 When writing out a file, repeatedly trying to read it will impact performance
 134 even without this feature.
 135
 136 These interfering locks can also happen if a manually requested lock is, for
 137 some reason, not available in time for the write which intended to use it.
 138 The lock which results from this write request is expanded using the
 139 normal rules.  So it's possible for that lock (depending on the position of
 140 other locks at the time) to be extended to cover the rest of the file.  That
 141 will block future lockahead locks.
 142
 143 The expanded lock will be revoked when a write happens (from another client)
 144 in the range covered by that lock, but the lock for that write will be expanded
 145 as well - And then we return to handing the lock back and forth between
 146 clients.  These expanded locks will still block future lockahead locks,
 147 rendering them useless.
 148
 149 The way to avoid this is to turn off lock expansion for I/Os which are
 150 supposed to be using these manually requested locks.  That way, if the
 151 manually requested lock is not available, the lock request for the I/O will not
 152 be expanded.  Instead, that request (which is blocking, unlike a lockahead
 153 request) will cancel any interfering locks, but the resulting lock will not be
 154 expanded.  This leaves the later parts of the file open, allowing future
 155 manual lock requests to succeed.  This means that if an interfering lock blocks
 156 some manual requests, those are lost, but the next set of manual requests can
 157 proceed as normal.
 158
 159 In effect, the 'locking ahead of I/O' is interrupted, but then is able to
 160 re-assert itself. The feature used here is referred to as 'no expansion'
 161 locking (as only the extent required by the actual I/O operation is locked)
 162 and is turned on with another new ladvise advice, LU_LADVISE_NOEXPAND.  This
 163 feature is added as part of the lockahead patch.  The strided writing library
 164 will use this advice on the file descriptor it uses for writing.
 165
 166 4. Client side design
 167 4. A. Ladvise lockahead
 168 Requestlock uses the existing asynchronous lock request functionality
 169 implemented for asynchronous glimpse locks (AGLs), a long standing Lustre
 170 feature.  AGLs are locks which are requested by statahead, which are used to
 171 get file size information before it's requested.  The key thing about an
 172 asynchronous lock request is that it does not have a specific I/O operation
 173 waiting for the lock.
 174
 175 This means two key things:
 176
 177 1. There is no OSC lock (lock layer above LDLM for data locking) associated
 178 with the LDLM lock
 179 2. There is no thread waiting for the LDLM lock, so lock grant processing
 180 must be handled by the ptlrpc daemon thread which received the reply
 181
 182 Since both of these issues are addressed by the asynchronous lock request code
 183 which lockahead shares with AGL, we will not explore them in depth here.
 184
 185 Finally, lockahead requests set the CEF_LOCK_NO_EXPAND flag, which tells the
 186 OSC (the per OST layer of the client) to set LDLM_FL_NO_EXPANSION on any lock
 187 requests.  LDLM_FL_NO_EXPANSION is a new LDLM lock flag which tells the server
 188 not to expand the lock extent.
 189
 190 This leaves the user facing interface.  Requestlock is implemented as a new
 191 ladvise advice, and it uses the ladvise feature of multiple advices in one API
 192 call to put many lock requests in to an array of advices.
 193
 194 The arguments required for this advice are a mode (read or write), range (start
 195 and end), and flags.
 196
 197 The client will then make lock requests on these extents, one at a time.
 198 Because the lock requests are asynchronous (replies are handled by ptlrpcd),
 199 many requests can be made quickly by overlapping them, rather than waiting for
 200 each one to complete.  (This requires that they be non-blocking, as the
 201 ptlrpcd threads must not wait in the ldlm layer.)
 202
 203 4. B. LU_LADVISE_LOCKNOEXPAND
 204 The lock no expand ladvise advice sets a boolean in a Lustre data structure
 205 associated with a file descriptor.  When an I/O is done to this file
 206 descriptor, the flag is picked up and passed through to the ldlm layer, where
 207 it sets LDLM_FL_NO_EXPANSION on lock requests made for that I/O.
 208
 209 5. Server side changes
 210 Implementing lockahead requires server support for LDLM_FL_NO_EXPANSION, but
 211 it also required an additional pair of server side changes to fix issues which
 212 came up because of lockahead.  These changes are not part of the core design
 213 instead, they are separate fixes which are required for it to work.
 214
 215 5. A. Support LDLM_FL_NO_EXPANSION
 216
 217 Disabling server side lock expansion is done with a new LDLM flag.  This is
 218 done with a simple check for that flag on the server before attempting to
 219 expand the lock.  If the flag is found, lock expansion is skipped.
 220
 221 5. B. Implement LDLM_FL_SPECULATIVE
 222
 223 As described above, lock ahead locks are non-blocking. The BLOCK_NOWAIT LDLM
 224 flag is used now to implement some nonblocking behavior, but it only considers
 225 group locks blocking.  But, for asynchronous lock requests to work correctly,
 226 they cannot wait for any other locks.  For this purpose, we add
 227 LDLM_FL_SPECULATIVE.  This new flag is used for asynchronous lock requests,
 228 and implements the broader non-blocking behavior they require.
 229
 230 5. C. File size & ofd_intent_policy changes
 231
 232 Knowing the current file size during writes is tricky on a distributed file
 233 system, because multiple clients can be writing to a file at any time.  When
 234 writes are in progress, the server must identify which client is currently
 235 responsible for growing the file size, and ask that client what the file size
 236 is.
 237
 238 To do this, the server uses glimpse locking (in ofd_intent_policy) to get the
 239 current file size from the clients.  This code uses the assumption that the
 240 holder of the highest write lock (PW lock) knows the current file size.  A
 241 client learns the (then current) file size when a lock is granted.  Because
 242 only the holder of the highest lock can grow a file, either the size hasn't
 243 changed, or that client knows the new size; so the server only has to contact
 244 the client which holds this lock, and it knows the current file size.
 245
 246 Note that the above is actually racy. When the server asks, the client can
 247 still be writing, or another client could acquire a higher lock during this
 248 time.  The goal is a good approximation while the file is being written, and a
 249 correct answer once all the clients are done writing.  This is achieved because
 250 once writes to a file are complete, the holder of that highest lock is
 251 guaranteed to know the current file size.  This is where manually requested
 252 locks cause trouble.
 253
 254 By creating write locks in advance of an actual I/O, lockahead breaks the
 255 assumption that the holder of the highest lock knows the file size.
 256
 257 This assumption is normally true because locks which are created as part of
 258 IO - rather than in advance of it - are guaranteed to be 'active', IE,
 259 involved in IO, and the holder of the highest 'active' lock always knows the
 260 current file size, because the size is either not changing or the holder of
 261 that lock is responsible for updating it.
 262
 263 Consider:  Two clients, A and B, strided writing.  Each client requests, for
 264 example, 2 manually requested locks.  (Real numbers are much higher.)  Client A
 265 holds locks on segments 0 and 2, client B holds locks on segments 1 and 3.
 266
 267 The request comes to write 3 segments of data.  Client A writes to segment 0,
 268 client B writes to segment 1, and client A also writes to segment 2.  No data
 269 is written to segment 3.  At this point, the server checks the file size, by
 270 glimpsing the highest lock . The lock on segment 3.  Client B does not know
 271 about the writing done by client A to segment 2, so it gives an incorrect file
 272 size.
 273
 274 This would be OK if client B had pending writes to segment 3, but it does not.
 275 In this situation, the server will never get the correct file size while this
 276 lock exists.
 277
 278 The solution is relatively straightforward: The server needs to glimpse every
 279 client holding a write lock (starting from the top) until we find one holding
 280 an 'active' lock (because the size is known to be at least the size returned
 281 from an 'active' lock), and take the largest size returned. This avoids asking
 282 only a client which may not know the correct file size.
 283
 284 Unfortunately, there is no way to know if a manually requested lock is active
 285 from the server side.  So when we see such a lock, we must send a glimpse to
 286 the holder (unless we have already sent a glimpse to that client*).  However,
 287 because locks without LDLM_FL_NO_EXPANSION set are guaranteed to be 'active',
 288 once we reach the first such lock, we can stop glimpsing.
 289
 290 *This is because when we glimpse a specific lock, the client holding it returns
 291 its best idea of the size information, so we only need to send one glimpse to
 292 each client.
 293
 294 This is less efficient than the standard "glimpse only the top lock"
 295 methodology, but since we only need to glimpse one lock per client (and the
 296 number of clients writing to the part of a file on a given OST is fairly
 297 limited), the cost is restrained.
 298
 299 Additionally, lock cancellation methods such as early lock cancel aggressively
 300 clean up older locks, particularly when the LRU limit is exceeded, so the
 301 total lock count should also remain manageable.
 302
 303 In the end, the final verdict here is performance. Requestlock testing for the
 304 strided I/O case has shown good performance results.