--- /dev/null
+Ladvise Lock Ahead design
+
+Lock ahead is a new Lustre feature aimed at solving a long standing problem
+with shared file write performance in Lustre. It requires client and server
+support. It will be used primarily via the MPI-I/O library, not directly from
+user applications.
+
+The first part of this document (sections 1 and 2) is an overview of the
+problem and high level description of the solution. Section 3 explains how the
+library will make use of this feature, and sections 4 and 5 describe the design
+of the Lustre changes.
+
+1. Overview: Purpose & Interface
+Lock ahead is intended to allow optimization of certain I/O patterns which
+would otherwise suffer LDLM* lock contention. It allows applications to
+manually request locks on specific extents of a file, avoiding the usual
+server side optimizations. This lets applications which know their I/O
+pattern use that information to avoid false conflicts caused by those
+optimizations.
+
+*Lustre distributed lock manager. This is the locking layer shared between
+clients and servers, used to coordinate access to files among clients.
+
+Normally, clients get locks automatically as the first step of an I/O.
+The client asks for a lock which covers exactly the area of interest (ie, a
+read or write lock of n bytes at offset x), but the server attempts to optimize
+this by expanding the lock to cover as much of the file as possible. This is
+useful for a single client, but can cause trouble for multiple clients.
+
+In cases where multiple clients wish to write to the same file, this
+optimization can result in locks that conflict when the actual I/O operations
+do not. This requires clients to wait for one another to complete I/O, even
+when there is no conflict between actual I/O requests. This can significantly
+reduce performance (Anywhere from 40-90%, depending on system specs) for some
+workloads.
+
+The lockahead feature makes it possible to avoid this problem by acquiring the
+necessary locks in advance, using explicit requests with server side extent
+changes disabled. We add a new lfs advice type, LU_LADVISE_LOCKAHEAD,
+which allows lock requests from userspace on the client, specifying the extent
+and the I/O mode (read/write) for the lock. These lock requests explicitly
+disable server side changes to the lock extent, so the lock returned to the
+client covers only the extent requested.
+
+When using this feature, clients which intend to write to a file can request
+locks to cover their I/O pattern, wait a moment for the locks to be granted,
+then write or read the file.
+
+In this way, a set of clients which know their I/O pattern in advance can
+force the LDLM layer to grant locks appropriate for that I/O pattern. This
+allows applications which are poorly handled by the default lock optimization
+behavior to significantly improve their performance.
+
+2. I/O Pattern & Locking problems
+2. A. Strided writing and MPI-I/O
+There is a thorough explanation and overview of strided writing and the
+benefits of this functionality in the slides from the lock ahead presentation
+at LUG 2015. It is highly recommended to read that first, as the graphics are
+much clearer than the prose here.
+
+See slides 1-13:
+http://wiki.lustre.org/images/f/f9/Shared-File-Performance-in-Lustre_Farrell.pdf
+
+MPI-I/O uses strided writing when doing I/O from a large job to a single file.
+I/O is aggregated from all the nodes running a particular application to a
+small number of I/O aggregator nodes which then write out the data, in a
+strided manner.
+
+In strided writing, different clients take turns writing different blocks of a
+file (A block is some arbitrary number of bytes). Client 1 is responsible for
+writes to block 0, block 2, block 4, etc., client 2 is responsible for block 1,
+block 3, etc.
+
+Without the ability to manually request locks, strided writing is set up in
+concert with Lustre file striping so each client writes to one OST. (IE, for a
+file striped to three OSTs, we would write from three clients.)
+
+The particular case of interest is when we want to use more than one client
+per OST. This is important, because an OST typically has much more bandwidth
+than one client. Strided writes are non-overlapping, so they should be able to
+proceed in parallel with more than one client per OST. In practice, on Lustre,
+they do not, due to lock expansion.
+
+2. B. Locking problems
+We will now describe locking when there is more than one client per OST. This
+behavior is the same on a per OST basis in a file striped across multiple OSTs.
+When the first client asks to write block 0, it asks for the required lock from
+the server. When it receives this request, the server sees that there are no
+other locks on the file. Since it assumes the client will want to write to the
+file again, the server expands the lock as far as possible. In this case, it
+expands the lock to the maximum file size (effectively, to infinity), then
+grants it to client 1.
+
+When client 2 wants to write block 1, it conflicts with the expanded lock
+granted to client 1. The server then must revoke (In Lustre terms,
+'call back') the lock granted to client 1 so it can grant a lock to client 2.
+After the lock granted to client 1 is revoked, there are no locks on the file.
+The server sees this when processing the lock request from client 2, and
+expands that lock to cover the whole file.
+
+Client 1 then wishes to write block 2 of the file... And the cycle continues.
+The two clients exchange the extended lock throughout the write, allowing only
+one client to write at a time, plus latency to exchange the lock. The effect is
+dramatic: Two clients are actually slower than one. (Similar behavior is seen
+with more than two clients.)
+
+The solution is to use this new advice type to acquire locks before they are
+needed. In effect, before it starts writing to the file, client 1 requests
+locks on block 0, block 2, etc. It locks 'ahead' a certain (tunable) number of
+locks. Client 2 does the same. Then they both begin to write, and are able to
+do so in parallel. A description of the actual library implementation follows.
+
+3. Library implementation
+Actually implementing this in the library carries a number of wrinkles.
+The basic pattern is this:
+Before writing, an I/O aggregator requests a certain number of locks on blocks
+that it is responsible for. It may or may not ever write to these blocks, but
+it takes locks knowing it might. It then begins to write, tracking how many of
+the locks it has used. When the number of locks 'ahead' of the I/O is low
+enough, it requests more locks in advance of the I/O.
+
+For technical reasons which are explained in the implementation section, these
+lock requests are either asynchronous and non-blocking or synchronous and
+blocking. In Lustre terms, non-blocking means if there is already a lock on
+the relevant extent of the file, the manual lock request is not granted. This
+means that if there is already a lock on the file (quite common; imagine
+writing to a file which was previously read by another process), these lock
+requests will be denied. However, once the first 'real' write arrives that
+was hoping to use a lockahead lock, that write will cause the blocking lock to
+be cancelled, so this interference is not fatal.
+
+It is of course possible for another process to get in the way by immediately
+asking for a lock on the file. This is something users should try to avoid.
+When writing out a file, repeatedly trying to read it will impact performance
+even without this feature.
+
+These interfering locks can also happen if a manually requested lock is, for
+some reason, not available in time for the write which intended to use it.
+The lock which results from this write request is expanded using the
+normal rules. So it's possible for that lock (depending on the position of
+other locks at the time) to be extended to cover the rest of the file. That
+will block future lockahead locks.
+
+The expanded lock will be revoked when a write happens (from another client)
+in the range covered by that lock, but the lock for that write will be expanded
+as well, and then we return to handing the lock back and forth between
+clients. These expanded locks will still block future lockahead locks,
+rendering them useless.
+
+The way to avoid this is to turn off lock expansion for I/Os which are
+supposed to be using these manually requested locks. That way, if the
+manually requested lock is not available, the lock request for the I/O will not
+be expanded. Instead, that request (which is blocking, unlike a lockahead
+request) will cancel any interfering locks, but the resulting lock will not be
+expanded. This leaves the later parts of the file open, allowing future
+manual lock requests to succeed. This means that if an interfering lock blocks
+some manual requests, those are lost, but the next set of manual requests can
+proceed as normal.
+
+In effect, the 'locking ahead of I/O' is interrupted, but then is able to
+re-assert itself. The feature used here is referred to as 'no expansion'
+locking (as only the extent required by the actual I/O operation is locked)
+and is turned on with another new ladvise advice, LU_LADVISE_NOEXPAND. This
+feature is added as part of the lockahead patch. The strided writing library
+will use this advice on the file descriptor it uses for writing.
+
+4. Client side design
+4. A. Ladvise lockahead
+Lockahead uses the existing asynchronous lock request functionality
+implemented for asynchronous glimpse locks (AGLs), a long standing Lustre
+feature. AGLs are locks requested by statahead, used to get file size
+information before it is needed. The key thing about an asynchronous lock
+request is that it does not have a specific I/O operation waiting for the
+lock.
+
+This means two key things:
+
+1. There is no OSC lock (lock layer above LDLM for data locking) associated
+with the LDLM lock
+2. There is no thread waiting for the LDLM lock, so lock grant processing
+must be handled by the ptlrpc daemon thread which received the reply
+
+Since both of these issues are addressed by the asynchronous lock request code
+which lockahead shares with AGL, we will not explore them in depth here.
+
+Finally, lockahead requests set the CEF_LOCK_NO_EXPAND flag, which tells the
+OSC (the per OST layer of the client) to set LDLM_FL_NO_EXPANSION on any lock
+requests. LDLM_FL_NO_EXPANSION is a new LDLM lock flag which tells the server
+not to expand the lock extent.
+
+This leaves the user facing interface. Lockahead is implemented as a new
+ladvise advice, and it uses the ladvise feature of multiple advices in one API
+call to put many lock requests into an array of advices.
+
+The arguments required for this advice are a mode (read or write), range (start
+and end), and flags.
+
+The client will then make lock requests on these extents, one at a time.
+Because the lock requests are asynchronous (replies are handled by ptlrpcd),
+many requests can be made quickly by overlapping them, rather than waiting for
+each one to complete. (This requires that they be non-blocking, as the
+ptlrpcd threads must not wait in the ldlm layer.)
+
+4. B. LU_LADVISE_LOCKNOEXPAND
+The lock no expand ladvise advice sets a boolean in a Lustre data structure
+associated with a file descriptor. When an I/O is done to this file
+descriptor, the flag is picked up and passed through to the ldlm layer, where
+it sets LDLM_FL_NO_EXPANSION on lock requests made for that I/O.
+
+5. Server side changes
+Implementing lockahead requires server support for LDLM_FL_NO_EXPANSION, but
+it also requires an additional pair of server side changes to fix issues which
+came up because of lockahead. These changes are not part of the core design;
+instead, they are separate fixes which are required for it to work.
+
+5. A. Support LDLM_FL_NO_EXPANSION
+
+Disabling server side lock expansion is done with a new LDLM flag. This is
+done with a simple check for that flag on the server before attempting to
+expand the lock. If the flag is found, lock expansion is skipped.
+
+5. B. Implement LDLM_FL_SPECULATIVE
+
+As described above, lockahead locks are non-blocking. The BLOCK_NOWAIT LDLM
+flag currently implements some non-blocking behavior, but it only considers
+group locks blocking. However, for asynchronous lock requests to work correctly,
+they cannot wait for any other locks. For this purpose, we add
+LDLM_FL_SPECULATIVE. This new flag is used for asynchronous lock requests,
+and implements the broader non-blocking behavior they require.
+
+5. C. File size & ofd_intent_policy changes
+
+Knowing the current file size during writes is tricky on a distributed file
+system, because multiple clients can be writing to a file at any time. When
+writes are in progress, the server must identify which client is currently
+responsible for growing the file size, and ask that client what the file size
+is.
+
+To do this, the server uses glimpse locking (in ofd_intent_policy) to get the
+current file size from the clients. This code uses the assumption that the
+holder of the highest write lock (PW lock) knows the current file size. A
+client learns the (then current) file size when a lock is granted. Because
+only the holder of the highest lock can grow a file, either the size hasn't
+changed, or that client knows the new size; so the server only has to contact
+the client which holds this lock, and it knows the current file size.
+
+Note that the above is actually racy. When the server asks, the client can
+still be writing, or another client could acquire a higher lock during this
+time. The goal is a good approximation while the file is being written, and a
+correct answer once all the clients are done writing. This is achieved because
+once writes to a file are complete, the holder of that highest lock is
+guaranteed to know the current file size. This is where manually requested
+locks cause trouble.
+
+By creating write locks in advance of an actual I/O, lockahead breaks the
+assumption that the holder of the highest lock knows the file size.
+
+This assumption is normally true because locks which are created as part of
+IO - rather than in advance of it - are guaranteed to be 'active', IE,
+involved in IO, and the holder of the highest 'active' lock always knows the
+current file size, because the size is either not changing or the holder of
+that lock is responsible for updating it.
+
+Consider: Two clients, A and B, strided writing. Each client requests, for
+example, 2 manually requested locks. (Real numbers are much higher.) Client A
+holds locks on segments 0 and 2, client B holds locks on segments 1 and 3.
+
+The request comes to write 3 segments of data. Client A writes to segment 0,
+client B writes to segment 1, and client A also writes to segment 2. No data
+is written to segment 3. At this point, the server checks the file size by
+glimpsing the highest lock: the lock on segment 3. Client B does not know
+about the writing done by client A to segment 2, so it gives an incorrect file
+size.
+
+This would be OK if client B had pending writes to segment 3, but it does not.
+In this situation, the server will never get the correct file size while this
+lock exists.
+
+The solution is relatively straightforward: The server needs to glimpse every
+client holding a write lock (starting from the top) until we find one holding
+an 'active' lock (because the size is known to be at least the size returned
+from an 'active' lock), and take the largest size returned. This avoids asking
+only a client which may not know the correct file size.
+
+Unfortunately, there is no way to know if a manually requested lock is active
+from the server side. So when we see such a lock, we must send a glimpse to
+the holder (unless we have already sent a glimpse to that client*). However,
+because locks without LDLM_FL_NO_EXPANSION set are guaranteed to be 'active',
+once we reach the first such lock, we can stop glimpsing.
+
+*This is because when we glimpse a specific lock, the client holding it returns
+its best idea of the size information, so we only need to send one glimpse to
+each client.
+
+This is less efficient than the standard "glimpse only the top lock"
+methodology, but since we only need to glimpse one lock per client (and the
+number of clients writing to the part of a file on a given OST is fairly
+limited), the cost is restrained.
+
+Additionally, lock cancellation methods such as early lock cancel aggressively
+clean up older locks, particularly when the LRU limit is exceeded, so the
+total lock count should also remain manageable.
+
+In the end, the verdict here is performance. Lockahead testing for the
+strided I/O case has shown good performance results.
static int hf_lustre_ldlm_fl_block_granted = -1;
static int hf_lustre_ldlm_fl_block_conv = -1;
static int hf_lustre_ldlm_fl_block_wait = -1;
+static int hf_lustre_ldlm_fl_speculative = -1;
static int hf_lustre_ldlm_fl_ast_sent = -1;
static int hf_lustre_ldlm_fl_replay = -1;
static int hf_lustre_ldlm_fl_intent_only = -1;
static int hf_lustre_ldlm_fl_test_lock = -1;
static int hf_lustre_ldlm_fl_cancel_on_block = -1;
static int hf_lustre_ldlm_fl_cos_incompat = -1;
+static int hf_lustre_ldlm_fl_no_expansion = -1;
static int hf_lustre_ldlm_fl_deny_on_contention = -1;
static int hf_lustre_ldlm_fl_ast_discard_data = -1;
{LDLM_FL_BLOCK_GRANTED, "LDLM_FL_BLOCK_GRANTED"},
{LDLM_FL_BLOCK_CONV, "LDLM_FL_BLOCK_CONV"},
{LDLM_FL_BLOCK_WAIT, "LDLM_FL_BLOCK_WAIT"},
+ {LDLM_FL_SPECULATIVE, "LDLM_FL_SPECULATIVE"},
{LDLM_FL_AST_SENT, "LDLM_FL_AST_SENT"},
{LDLM_FL_REPLAY, "LDLM_FL_REPLAY"},
{LDLM_FL_INTENT_ONLY, "LDLM_FL_INTENT_ONLY"},
{LDLM_FL_TEST_LOCK, "LDLM_FL_TEST_LOCK"},
{LDLM_FL_CANCEL_ON_BLOCK, "LDLM_FL_CANCEL_ON_BLOCK"},
{LDLM_FL_COS_INCOMPAT, "LDLM_FL_COS_INCOMPAT"},
+ {LDLM_FL_NO_EXPANSION, "LDLM_FL_NO_EXPANSION"},
{LDLM_FL_DENY_ON_CONTENTION, "LDLM_FL_DENY_ON_CONTENTION"},
{LDLM_FL_AST_DISCARD_DATA, "LDLM_FL_AST_DISCARD_DATA"},
{ 0, NULL }
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_granted);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_conv);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_wait);
+ dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_speculative);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_ast_sent);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_replay);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_intent_only);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_test_lock);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_cancel_on_block);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_cos_incompat);
+ dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_no_expansion);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_deny_on_contention);
return
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_ast_discard_data);
}
},
{
+ /* p_id */ &hf_lustre_ldlm_fl_speculative,
+ /* hfinfo */ {
+ /* name */ "LDLM_FL_SPECULATIVE",
+ /* abbrev */ "lustre.ldlm_fl_speculative",
+ /* type */ FT_BOOLEAN,
+ /* display */ 32,
+ /* strings */ TFS(&lnet_flags_set_truth),
+ /* bitmask */ LDLM_FL_SPECULATIVE,
+	/* blurb */ "Lock request is speculative/asynchronous, and cannot\n"
+	"wait for any reason. Fail the lock request if any blocking locks\n"
+	"are encountered.",
+	/* id */ HFILL
+ }
+ },
+ {
/* p_id */ &hf_lustre_ldlm_fl_ast_sent,
/* hfinfo */ {
/* name */ "LDLM_FL_AST_SENT",
}
},
{
+ /* p_id */ &hf_lustre_ldlm_fl_no_expansion,
+ /* hfinfo */ {
+ /* name */ "LDLM_FL_NO_EXPANSION",
+	/* abbrev */ "lustre.ldlm_fl_no_expansion",
+ /* type */ FT_BOOLEAN,
+ /* display */ 32,
+ /* strings */ TFS(&lnet_flags_set_truth),
+ /* bitmask */ LDLM_FL_NO_EXPANSION,
+	/* blurb */ "Do not expand this lock. Grant it only on the extent\n"
+	"requested. Used for manually requested locks from the client\n"
+	"(LU_LADVISE_LOCKAHEAD).",
+	/* id */ HFILL
+ }
+ },
+ {
/* p_id */ &hf_lustre_ldlm_fl_deny_on_contention,
/* hfinfo */ {
/* name */ "LDLM_FL_DENY_ON_CONTENTION",
.B lfs ladvise [--advice|-a ADVICE ] [--background|-b]
\fB[--start|-s START[kMGT]]
\fB{[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
+ \fB{[--mode|-m MODE] | [--unset|-u]}
\fB<FILE> ...\fR
.br
.SH DESCRIPTION
\fBwillread\fR to prefetch data into server cache
.TP
\fBdontneed\fR to cleanup data cache on server
+.TP
+\fBlockahead\fR to request a lock on a specified extent of a file
+.TP
+\fBlocknoexpand\fR to disable server side lock expansion for a file
.RE
.TP
\fB\-b\fR, \fB\-\-background
\fB\-l\fR, \fB\-\-length\fR=\fILENGTH\fR
File range has length of \fILENGTH\fR. This option may not be specified at the
same time as the -e option.
+.TP
+\fB\-m\fR, \fB\-\-mode\fR=\fIMODE\fR
+Specify the lock \fIMODE\fR. This option is only valid with lockahead
+advice. Valid modes are: READ, WRITE
+.TP
+\fB\-u\fR, \fB\-\-unset\fR
+Unset the previous advice. Currently only valid with locknoexpand advice.
.SH NOTE
.PP
Typically,
This gives the OST(s) holding the first 1GB of \fB/mnt/lustre/file1\fR a hint
that the first 1GB of file will not be read in the near future, thus the OST(s)
could clear the cache of that file in the memory.
+.B $ lfs ladvise -a lockahead -s 0 -e 1048576 -m READ /mnt/lustre/file1
+Request a read lock on the first 1 MiB of /mnt/lustre/file1.
+.B $ lfs ladvise -a lockahead -s 0 -e 4096 -m WRITE ./file1
+Request a write lock on the first 4 KiB of ./file1.
+.B $ lfs ladvise -a locknoexpand ./file1
+Disable server side lock expansion for I/O to ./file1.
+.B $ lfs ladvise -a locknoexpand -u ./file1
+Re-enable server side lock expansion for I/O to ./file1.
.SH AVAILABILITY
The lfs ladvise command is part of the Lustre filesystem.
.SH SEE ALSO
* -EWOULDBLOCK is returned immediately.
*/
CEF_NONBLOCK = 0x00000001,
- /**
- * take lock asynchronously (out of order), as it cannot
- * deadlock. This is for LDLM_FL_HAS_INTENT locks used for glimpsing.
- */
- CEF_ASYNC = 0x00000002,
+ /**
+ * Tell lower layers this is a glimpse request, translated to
+ * LDLM_FL_HAS_INTENT at LDLM layer.
+ *
+ * Also, because glimpse locks never block other locks, we count this
+ * as automatically compatible with other osc locks.
+ * (see osc_lock_compatible)
+ */
+ CEF_GLIMPSE = 0x00000002,
/**
* tell the server to instruct (though a flag in the blocking ast) an
* owner of the conflicting lock, that it can drop dirty pages
* protected by this lock, without sending them to the server.
*/
CEF_DISCARD_DATA = 0x00000004,
- /**
- * tell the sub layers that it must be a `real' lock. This is used for
- * mmapped-buffer locks and glimpse locks that must be never converted
- * into lockless mode.
- *
- * \see vvp_mmap_locks(), cl_glimpse_lock().
- */
- CEF_MUST = 0x00000008,
+ /**
+ * tell the sub layers that it must be a `real' lock. This is used for
+ * mmapped-buffer locks, glimpse locks, manually requested locks
+ * (LU_LADVISE_LOCKAHEAD) that must never be converted into lockless
+ * mode.
+ *
+	 * \see vvp_mmap_locks(), cl_glimpse_lock(), cl_request_lock().
+ */
+ CEF_MUST = 0x00000008,
/**
* tell the sub layers that never request a `real' lock. This flag is
* not used currently.
*/
CEF_NEVER = 0x00000010,
/**
- * for async glimpse lock.
+	 * tell the dlm layer this is a speculative lock request.
+	 * Speculative lock requests are locks which are not requested as part
+ * of an I/O operation. Instead, they are requested because we expect
+ * to use them in the future. They are requested asynchronously at the
+ * ptlrpc layer.
+ *
+ * Currently used for asynchronous glimpse locks and manually requested
+ * locks (LU_LADVISE_LOCKAHEAD).
*/
- CEF_AGL = 0x00000020,
+ CEF_SPECULATIVE = 0x00000020,
/**
* enqueue a lock to test DLM lock existence.
*/
*/
CEF_LOCK_MATCH = 0x00000080,
/**
+ * tell the DLM layer to lock only the requested range
+ */
+ CEF_LOCK_NO_EXPAND = 0x00000100,
+ /**
* mask of enq_flags.
*/
- CEF_MASK = 0x000000ff,
+ CEF_MASK = 0x000001ff,
};
/**
*/
ci_noatime:1,
/** Set to 1 if parallel execution is allowed for current I/O? */
- ci_pio:1;
+ ci_pio:1,
+ /* Tell sublayers not to expand LDLM locks requested for this IO */
+ ci_lock_no_expand:1;
/**
* Number of pages owned by this IO. For invariant checking.
*/
struct ldlm_lock *ca_lock;
};
-/** The ldlm_glimpse_work is allocated on the stack and should not be freed. */
-#define LDLM_GL_WORK_NOFREE 0x1
+/** The ldlm_glimpse_work was slab allocated & must be freed accordingly.*/
+#define LDLM_GL_WORK_SLAB_ALLOCATED 0x1
/** Interval node data for each LDLM_EXTENT lock. */
struct ldlm_interval {
#define ldlm_set_block_wait(_l) LDLM_SET_FLAG(( _l), 1ULL << 3)
#define ldlm_clear_block_wait(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 3)
+/**
+ * Lock request is speculative/asynchronous, and cannot wait for any reason.
+ * Fail the lock request if any blocking locks are encountered.
+ * */
+#define LDLM_FL_SPECULATIVE 0x0000000000000010ULL /* bit 4 */
+#define ldlm_is_speculative(_l) LDLM_TEST_FLAG((_l), 1ULL << 4)
+#define ldlm_set_speculative(_l) LDLM_SET_FLAG((_l), 1ULL << 4)
+#define ldlm_clear_speculative(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 4)
+
/** blocking or cancel packet was queued for sending. */
#define LDLM_FL_AST_SENT 0x0000000000000020ULL // bit 5
#define ldlm_is_ast_sent(_l) LDLM_TEST_FLAG(( _l), 1ULL << 5)
#define ldlm_clear_cos_incompat(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 24)
/**
+ * Part of original lockahead implementation, OBD_CONNECT_LOCKAHEAD_OLD.
+ * Reserved temporarily to allow those implementations to keep working.
+ * Will be removed after 2.12 release.
+ * */
+#define LDLM_FL_LOCKAHEAD_OLD_RESERVED 0x0000000010000000ULL /* bit 28 */
+#define ldlm_is_do_not_expand_io(_l) LDLM_TEST_FLAG((_l), 1ULL << 28)
+#define ldlm_set_do_not_expand_io(_l) LDLM_SET_FLAG((_l), 1ULL << 28)
+#define ldlm_clear_do_not_expand_io(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 28)
+
+/**
+ * Do not expand this lock. Grant it only on the extent requested.
+ * Used for manually requested locks from the client (LU_LADVISE_LOCKAHEAD).
+ * */
+#define LDLM_FL_NO_EXPANSION 0x0000000020000000ULL /* bit 29 */
+#define ldlm_is_do_not_expand(_l) LDLM_TEST_FLAG((_l), 1ULL << 29)
+#define ldlm_set_do_not_expand(_l) LDLM_SET_FLAG((_l), 1ULL << 29)
+#define ldlm_clear_do_not_expand(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 29)
+
+/**
* measure lock contention and return -EUSERS if locking contention is high */
#define LDLM_FL_DENY_ON_CONTENTION 0x0000000040000000ULL // bit 30
#define ldlm_is_deny_on_contention(_l) LDLM_TEST_FLAG(( _l), 1ULL << 30)
#define LDLM_FL_GONE_MASK (LDLM_FL_DESTROYED |\
LDLM_FL_FAILED)
-/** l_flags bits marked as "inherit" bits */
-/* Flags inherited from wire on enqueue/reply between client/server. */
-/* NO_TIMEOUT flag to force ldlm_lock_match() to wait with no timeout. */
-/* TEST_LOCK flag to not let TEST lock to be granted. */
+/** l_flags bits marked as "inherit" bits
+ * Flags inherited from wire on enqueue/reply between client/server.
+ * CANCEL_ON_BLOCK so server will not grant if a blocking lock is found
+ * NO_TIMEOUT flag to force ldlm_lock_match() to wait with no timeout.
+ * TEST_LOCK flag to not let TEST lock to be granted.
+ * NO_EXPANSION to tell server not to expand extent of lock request */
#define LDLM_FL_INHERIT_MASK (LDLM_FL_CANCEL_ON_BLOCK |\
LDLM_FL_NO_TIMEOUT |\
- LDLM_FL_TEST_LOCK)
+ LDLM_FL_TEST_LOCK |\
+ LDLM_FL_NO_EXPANSION)
/** flags returned in @flags parameter on ldlm_lock_enqueue,
* to be re-constructed on re-send */
return *exp_connect_flags_ptr(exp);
}
+static inline __u64 *exp_connect_flags2_ptr(struct obd_export *exp)
+{
+ return &exp->exp_connect_data.ocd_connect_flags2;
+}
+
+static inline __u64 exp_connect_flags2(struct obd_export *exp)
+{
+ return *exp_connect_flags2_ptr(exp);
+}
+
static inline int exp_max_brw_size(struct obd_export *exp)
{
LASSERT(exp != NULL);
return !!(exp_connect_flags(exp) & OBD_CONNECT_LARGE_ACL);
}
+static inline int exp_connect_lockahead_old(struct obd_export *exp)
+{
+ return !!(exp_connect_flags(exp) & OBD_CONNECT_LOCKAHEAD_OLD);
+}
+
+static inline int exp_connect_lockahead(struct obd_export *exp)
+{
+ return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LOCKAHEAD);
+}
+
extern struct obd_export *class_conn2export(struct lustre_handle *conn);
extern struct obd_device *class_conn2obd(struct lustre_handle *conn);
/**
* For async glimpse lock.
*/
- ols_agl:1;
+ ols_agl:1,
+ /**
+ * for speculative locks - asynchronous glimpse locks and ladvise
+ * lockahead manual lock requests
+ *
+ * Used to tell osc layer to not wait for the ldlm reply from the
+ * server, so the osc lock will be short lived - It only exists to
+ * create the ldlm request and is not updated on request completion.
+ */
+ ols_speculative:1;
};
#define OBD_FAIL_OST_LADVISE_PAUSE 0x237
#define OBD_FAIL_OST_FAKE_RW 0x238
#define OBD_FAIL_OST_LIST_ASSERT 0x239
+#define OBD_FAIL_OST_GL_WORK_ALLOC 0x240
#define OBD_FAIL_LDLM 0x300
#define OBD_FAIL_LDLM_NAMESPACE_NEW 0x301
RPCs in parallel */
#define OBD_CONNECT_DIR_STRIPE 0x400000000000000ULL /* striped DNE dir */
#define OBD_CONNECT_SUBTREE 0x800000000000000ULL /* fileset mount */
-#define OBD_CONNECT_LOCK_AHEAD 0x1000000000000000ULL /* lock ahead */
+#define OBD_CONNECT_LOCKAHEAD_OLD 0x1000000000000000ULL /* Old Cray lockahead */
+
/** bulk matchbits is sent within ptlrpc_body */
#define OBD_CONNECT_BULK_MBITS 0x2000000000000000ULL
#define OBD_CONNECT_OBDOPACK 0x4000000000000000ULL /* compact OUT obdo */
#define OBD_CONNECT_FLAGS2 0x8000000000000000ULL /* second flags word */
/* ocd_connect_flags2 flags */
#define OBD_CONNECT2_FILE_SECCTX 0x1ULL /* set file security context at create */
+#define OBD_CONNECT2_LOCKAHEAD 0x2ULL /* ladvise lockahead v2 */
/* XXX README XXX:
* Please DO NOT add flag values here before first ensuring that this same
OBD_CONNECT_LAYOUTLOCK | OBD_CONNECT_FID | \
OBD_CONNECT_PINGLESS | OBD_CONNECT_LFSCK | \
OBD_CONNECT_BULK_MBITS | \
- OBD_CONNECT_GRANT_PARAM)
-#define OST_CONNECT_SUPPORTED2 0
+ OBD_CONNECT_GRANT_PARAM | OBD_CONNECT_FLAGS2)
+
+#define OST_CONNECT_SUPPORTED2 OBD_CONNECT2_LOCKAHEAD
#define ECHO_CONNECT_SUPPORTED 0
#define ECHO_CONNECT_SUPPORTED2 0
__u64 gid;
};
+static inline bool ldlm_extent_equal(const struct ldlm_extent *ex1,
+ const struct ldlm_extent *ex2)
+{
+ return ex1->start == ex2->start && ex1->end == ex2->end;
+}
+
struct ldlm_inodebits {
__u64 bits;
__u64 try_bits; /* optional bits to try */
LU_LADVISE_INVALID = 0,
LU_LADVISE_WILLREAD = 1,
LU_LADVISE_DONTNEED = 2,
+ LU_LADVISE_LOCKNOEXPAND = 3,
+ LU_LADVISE_LOCKAHEAD = 4,
+ LU_LADVISE_MAX
};
#define LU_LADVISE_NAMES { \
- [LU_LADVISE_WILLREAD] = "willread", \
- [LU_LADVISE_DONTNEED] = "dontneed", \
+ [LU_LADVISE_WILLREAD] = "willread", \
+ [LU_LADVISE_DONTNEED] = "dontneed", \
+ [LU_LADVISE_LOCKNOEXPAND] = "locknoexpand", \
+ [LU_LADVISE_LOCKAHEAD] = "lockahead", \
}
/* This is the userspace argument for ladvise. It is currently the same as
enum ladvise_flag {
LF_ASYNC = 0x00000001,
+ LF_UNSET = 0x00000002,
};
#define LADVISE_MAGIC 0x1ADF1CE0
-#define LF_MASK LF_ASYNC
+/* Masks of valid flags for each advice */
+#define LF_LOCKNOEXPAND_MASK LF_UNSET
+/* Flags valid for all advices not explicitly specified */
+#define LF_DEFAULT_MASK LF_ASYNC
+/* All flags */
+#define LF_MASK (LF_ASYNC | LF_UNSET)
+
+#define lla_lockahead_mode lla_value1
+#define lla_peradvice_flags lla_value2
+#define lla_lockahead_result lla_value3
/* This is the userspace argument for ladvise, corresponds to ladvise_hdr which
* is used on the wire. It is defined separately as we may need info which is
size_t sht_bytes;
};
+enum lock_mode_user {
+ MODE_READ_USER = 1,
+ MODE_WRITE_USER,
+ MODE_MAX_USER,
+};
+
+#define LOCK_MODE_NAMES { \
+ [MODE_READ_USER] = "READ",\
+ [MODE_WRITE_USER] = "WRITE"\
+}
+
+enum lockahead_results {
+ LLA_RESULT_SENT = 0,
+ LLA_RESULT_DIFFERENT,
+ LLA_RESULT_SAME,
+};
+
/** @} lustreuser */
+
#endif /* _LUSTRE_USER_H */
static void ldlm_extent_policy(struct ldlm_resource *res,
struct ldlm_lock *lock, __u64 *flags)
{
- struct ldlm_extent new_ex = { .start = 0, .end = OBD_OBJECT_EOF };
-
- if (lock->l_export == NULL)
- /*
- * this is local lock taken by server (e.g., as a part of
- * OST-side locking, or unlink handling). Expansion doesn't
- * make a lot of sense for local locks, because they are
- * dropped immediately on operation completion and would only
- * conflict with other threads.
- */
- return;
+ struct ldlm_extent new_ex = { .start = 0, .end = OBD_OBJECT_EOF };
+
+ if (lock->l_export == NULL)
+ /*
+	 * this is a local lock taken by the server (e.g., as part of
+ * OST-side locking, or unlink handling). Expansion doesn't
+ * make a lot of sense for local locks, because they are
+ * dropped immediately on operation completion and would only
+ * conflict with other threads.
+ */
+ return;
- if (lock->l_policy_data.l_extent.start == 0 &&
- lock->l_policy_data.l_extent.end == OBD_OBJECT_EOF)
- /* fast-path whole file locks */
- return;
+ if (lock->l_policy_data.l_extent.start == 0 &&
+ lock->l_policy_data.l_extent.end == OBD_OBJECT_EOF)
+ /* fast-path whole file locks */
+ return;
- ldlm_extent_internal_policy_granted(lock, &new_ex);
- ldlm_extent_internal_policy_waiting(lock, &new_ex);
+ /* Because reprocess_queue zeroes flags and uses it to return
+ * LDLM_FL_LOCK_CHANGED, we must check for the NO_EXPANSION flag
+ * in the lock flags rather than the 'flags' argument */
+ if (likely(!(lock->l_flags & LDLM_FL_NO_EXPANSION))) {
+ ldlm_extent_internal_policy_granted(lock, &new_ex);
+ ldlm_extent_internal_policy_waiting(lock, &new_ex);
+ } else {
+ LDLM_DEBUG(lock, "Not expanding manually requested lock.\n");
+ new_ex.start = lock->l_policy_data.l_extent.start;
+ new_ex.end = lock->l_policy_data.l_extent.end;
+ /* In case the request is not on correct boundaries, we call
+ * fixup. (normally called in ldlm_extent_internal_policy_*) */
+ ldlm_extent_internal_policy_fixup(lock, &new_ex, 0);
+ }
- if (new_ex.start != lock->l_policy_data.l_extent.start ||
- new_ex.end != lock->l_policy_data.l_extent.end) {
- *flags |= LDLM_FL_LOCK_CHANGED;
- lock->l_policy_data.l_extent.start = new_ex.start;
- lock->l_policy_data.l_extent.end = new_ex.end;
- }
+ if (!ldlm_extent_equal(&new_ex, &lock->l_policy_data.l_extent)) {
+ *flags |= LDLM_FL_LOCK_CHANGED;
+ lock->l_policy_data.l_extent.start = new_ex.start;
+ lock->l_policy_data.l_extent.end = new_ex.end;
+ }
}
static int ldlm_check_contention(struct ldlm_lock *lock, int contended_locks)
}
if (tree->lit_mode == LCK_GROUP) {
- if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+ if (*flags & (LDLM_FL_BLOCK_NOWAIT |
+ LDLM_FL_SPECULATIVE)) {
compat = -EWOULDBLOCK;
goto destroylock;
}
continue;
}
- if (!work_list) {
- rc = interval_is_overlapped(tree->lit_root,&ex);
- if (rc)
- RETURN(0);
+ /* We've found a potentially blocking lock, check
+ * compatibility. This handles locks other than GROUP
+ * locks, which are handled separately above.
+ *
+ * Locks with FL_SPECULATIVE are asynchronous requests
+ * which must never wait behind another lock, so they
+ * fail if any conflicting lock is found. */
+ if (!work_list || (*flags & LDLM_FL_SPECULATIVE)) {
+ rc = interval_is_overlapped(tree->lit_root,
+ &ex);
+ if (rc) {
+ if (!work_list) {
+ RETURN(0);
+ } else {
+ compat = -EWOULDBLOCK;
+ goto destroylock;
+ }
+ }
} else {
interval_search(tree->lit_root, &ex,
ldlm_extent_compat_cb, &data);
* already blocked.
* If we are in nonblocking mode - return
* immediately */
- if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+ if (*flags & (LDLM_FL_BLOCK_NOWAIT
+ | LDLM_FL_SPECULATIVE)) {
compat = -EWOULDBLOCK;
goto destroylock;
}
}
if (unlikely(lock->l_req_mode == LCK_GROUP)) {
- /* If compared lock is GROUP, then requested is PR/PW/
- * so this is not compatible; extent range does not
- * matter */
- if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+ /* If compared lock is GROUP, then requested is
+ * PR/PW so this is not compatible; extent
+ * range does not matter */
+ if (*flags & (LDLM_FL_BLOCK_NOWAIT
+ | LDLM_FL_SPECULATIVE)) {
compat = -EWOULDBLOCK;
goto destroylock;
} else {
if (!work_list)
RETURN(0);
+ if (*flags & LDLM_FL_SPECULATIVE) {
+ compat = -EWOULDBLOCK;
+ goto destroylock;
+ }
+
/* don't count conflicting glimpse locks */
if (lock->l_req_mode == LCK_PR &&
lock->l_policy_data.l_extent.start == 0 &&
*err = ELDLM_OK;
if (intention == LDLM_PROCESS_RESCAN) {
- /* Careful observers will note that we don't handle -EWOULDBLOCK
- * here, but it's ok for a non-obvious reason -- compat_queue
- * can only return -EWOULDBLOCK if (flags & BLOCK_NOWAIT).
- * flags should always be zero here, and if that ever stops
- * being true, we want to find out. */
+ /* Careful observers will note that we don't handle -EWOULDBLOCK
+ * here, but it's ok for a non-obvious reason -- compat_queue
+ * can only return -EWOULDBLOCK if (flags & BLOCK_NOWAIT |
+ * SPECULATIVE). flags should always be zero here, and if that
+ * ever stops being true, we want to find out. */
LASSERT(*flags == 0);
rc = ldlm_extent_compat_queue(&res->lr_granted, lock, flags,
err, NULL, &contended_locks);
ocd->ocd_connect_flags, "old %#llx, new %#llx\n",
data->ocd_connect_flags, ocd->ocd_connect_flags);
data->ocd_connect_flags = ocd->ocd_connect_flags;
+ data->ocd_connect_flags2 = ocd->ocd_connect_flags2;
}
ptlrpc_pinger_add_import(imp);
#include "ldlm_internal.h"
+struct kmem_cache *ldlm_glimpse_work_kmem;
+EXPORT_SYMBOL(ldlm_glimpse_work_kmem);
+
/* lock types */
char *ldlm_lockname[] = {
[0] = "--",
rc = 1;
LDLM_LOCK_RELEASE(lock);
-
- if ((gl_work->gl_flags & LDLM_GL_WORK_NOFREE) == 0)
+ if (gl_work->gl_flags & LDLM_GL_WORK_SLAB_ALLOCATED)
+ OBD_SLAB_FREE_PTR(gl_work, ldlm_glimpse_work_kmem);
+ else
OBD_FREE_PTR(gl_work);
RETURN(rc);
static void ll_io_init(struct cl_io *io, struct file *file, enum cl_io_type iot)
{
struct inode *inode = file_inode(file);
+ struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
memset(&io->u.ci_rw.rw_iter, 0, sizeof(io->u.ci_rw.rw_iter));
init_sync_kiocb(&io->u.ci_rw.rw_iocb, file);
io->u.ci_rw.rw_file = file;
io->u.ci_rw.rw_ptask = ll_file_io_ptask;
io->u.ci_rw.rw_nonblock = !!(file->f_flags & O_NONBLOCK);
+ io->ci_lock_no_expand = fd->ll_lock_no_expand;
+
if (iot == CIT_WRITE) {
io->u.ci_rw.rw_append = !!(file->f_flags & O_APPEND);
io->u.ci_rw.rw_sync = !!(file->f_flags & O_SYNC ||
RETURN(rc);
}
+static enum cl_lock_mode cl_mode_user_to_kernel(enum lock_mode_user mode)
+{
+ switch (mode) {
+ case MODE_READ_USER:
+ return CLM_READ;
+ case MODE_WRITE_USER:
+ return CLM_WRITE;
+ default:
+ return -EINVAL;
+ }
+}
+
+static const char *const user_lockname[] = LOCK_MODE_NAMES;
+
+/* Used to allow the upper layers of the client to request an LDLM lock
+ * without doing an actual read or write.
+ *
+ * Used for ladvise lockahead to manually request specific locks.
+ *
+ * \param[in] file file this ladvise lock request is on
+ * \param[in] ladvise ladvise struct describing this lock request
+ *
+ * \retval 0 success, no detailed result available (sync requests
+ * and requests sent to the server [not handled locally]
+ * cannot return detailed results)
+ * \retval LLA_RESULT_{SAME,DIFFERENT} - detailed result of the lock request,
+ * see definitions for details.
+ * \retval negative negative errno on error
+ */
+int ll_file_lock_ahead(struct file *file, struct llapi_lu_ladvise *ladvise)
+{
+ struct lu_env *env = NULL;
+ struct cl_io *io = NULL;
+ struct cl_lock *lock = NULL;
+ struct cl_lock_descr *descr = NULL;
+ struct dentry *dentry = file->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+ enum cl_lock_mode cl_mode;
+ off_t start = ladvise->lla_start;
+ off_t end = ladvise->lla_end;
+ int result;
+ __u16 refcheck;
+
+ ENTRY;
+
+ CDEBUG(D_VFSTRACE, "Lock request: file=%.*s, inode=%p, mode=%s "
+ "start=%llu, end=%llu\n", dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode,
+ user_lockname[ladvise->lla_lockahead_mode], (__u64) start,
+ (__u64) end);
+
+ cl_mode = cl_mode_user_to_kernel(ladvise->lla_lockahead_mode);
+ if (cl_mode < 0)
+ GOTO(out, result = cl_mode);
+
+ /* Get IO environment */
+ result = cl_io_get(inode, &env, &io, &refcheck);
+ if (result <= 0)
+ GOTO(out, result);
+
+ result = cl_io_init(env, io, CIT_MISC, io->ci_obj);
+ if (result > 0) {
+ /*
+ * nothing to do for this io. This currently happens when
+		 * stripe sub-objects are not yet created.
+ */
+ result = io->ci_result;
+ } else if (result == 0) {
+ lock = vvp_env_lock(env);
+ descr = &lock->cll_descr;
+
+ descr->cld_obj = io->ci_obj;
+ /* Convert byte offsets to pages */
+ descr->cld_start = cl_index(io->ci_obj, start);
+ descr->cld_end = cl_index(io->ci_obj, end);
+ descr->cld_mode = cl_mode;
+ /* CEF_MUST is used because we do not want to convert a
+ * lockahead request to a lockless lock */
+ descr->cld_enq_flags = CEF_MUST | CEF_LOCK_NO_EXPAND |
+ CEF_NONBLOCK;
+
+ if (ladvise->lla_peradvice_flags & LF_ASYNC)
+ descr->cld_enq_flags |= CEF_SPECULATIVE;
+
+ result = cl_lock_request(env, io, lock);
+
+ /* On success, we need to release the lock */
+ if (result >= 0)
+ cl_lock_release(env, lock);
+ }
+ cl_io_fini(env, io);
+ cl_env_put(env, &refcheck);
+
+ /* -ECANCELED indicates a matching lock with a different extent
+ * was already present, and -EEXIST indicates a matching lock
+ * on exactly the same extent was already present.
+ * We convert them to positive values for userspace to make
+ * recognizing true errors easier.
+	 * Note we can only return these detailed results on async requests,
+	 * as sync requests look the same as I/O requests for locking. */
+ if (result == -ECANCELED)
+ result = LLA_RESULT_DIFFERENT;
+ else if (result == -EEXIST)
+ result = LLA_RESULT_SAME;
+
+out:
+ RETURN(result);
+}
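The errno conversion at the end of ll_file_lock_ahead() can be shown in isolation. This is a minimal sketch, not the kernel code: the enum values mirror lockahead_results from the uapi header, and the function name is illustrative.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors of the lockahead_results values defined in the uapi header */
enum { LLA_RESULT_SENT = 0, LLA_RESULT_DIFFERENT = 1, LLA_RESULT_SAME = 2 };

/* The two "benign conflict" errnos become positive detail codes so
 * userspace can tell them apart from true errors; anything else (zero
 * or a real negative errno) passes through unchanged. */
static int lockahead_result(int rc)
{
	if (rc == -ECANCELED)
		return LLA_RESULT_DIFFERENT;	/* matching lock, other extent */
	if (rc == -EEXIST)
		return LLA_RESULT_SAME;		/* matching lock, same extent */
	return rc;
}
```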
+static const char *const ladvise_names[] = LU_LADVISE_NAMES;
+
+static int ll_ladvise_sanity(struct inode *inode,
+ struct llapi_lu_ladvise *ladvise)
+{
+ enum lu_ladvise_type advice = ladvise->lla_advice;
+	/* Note that lla_peradvice_flags is a 32-bit field, so per-advice flags
+	 * must be in the first 32 bits of enum ladvise_flags */
+ __u32 flags = ladvise->lla_peradvice_flags;
+ int rc = 0;
+
+	if (advice >= LU_LADVISE_MAX || advice == LU_LADVISE_INVALID) {
+ rc = -EINVAL;
+		CDEBUG(D_VFSTRACE, "%s: advice with value '%d' not recognized, "
+		       "last supported advice is %s (value '%d'): rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0), advice,
+ ladvise_names[LU_LADVISE_MAX-1], LU_LADVISE_MAX-1, rc);
+ GOTO(out, rc);
+ }
+
+ /* Per-advice checks */
+ switch (advice) {
+ case LU_LADVISE_LOCKNOEXPAND:
+ if (flags & ~LF_LOCKNOEXPAND_MASK) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid flags (%x) for %s: "
+ "rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0), flags,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ break;
+ case LU_LADVISE_LOCKAHEAD:
+ /* Currently only READ and WRITE modes can be requested */
+ if (ladvise->lla_lockahead_mode >= MODE_MAX_USER ||
+ ladvise->lla_lockahead_mode == 0) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid mode (%d) for %s: "
+ "rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0),
+ ladvise->lla_lockahead_mode,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ case LU_LADVISE_WILLREAD:
+ case LU_LADVISE_DONTNEED:
+ default:
+		/* Note the fall-through above - these checks apply to all
+		 * advices except LOCKNOEXPAND */
+ if (flags & ~LF_DEFAULT_MASK) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid flags (%x) for %s: "
+ "rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0), flags,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ if (ladvise->lla_start >= ladvise->lla_end) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid range (%llu to %llu) "
+ "for %s: rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0),
+ ladvise->lla_start, ladvise->lla_end,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ break;
+ }
+
+out:
+ return rc;
+}
+#undef ERRSIZE
+
/*
* Give file access advices
*
RETURN(rc);
}
+static int ll_lock_noexpand(struct file *file, int flags)
+{
+ struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+
+ fd->ll_lock_no_expand = !(flags & LF_UNSET);
+
+ return 0;
+}
+
int ll_ioctl_fsgetxattr(struct inode *inode, unsigned int cmd,
unsigned long arg)
{
RETURN(ll_file_futimes_3(file, &lfu));
}
case LL_IOC_LADVISE: {
- struct llapi_ladvise_hdr *ladvise_hdr;
+ struct llapi_ladvise_hdr *k_ladvise_hdr;
+ struct llapi_ladvise_hdr __user *u_ladvise_hdr;
int i;
int num_advise;
- int alloc_size = sizeof(*ladvise_hdr);
+ int alloc_size = sizeof(*k_ladvise_hdr);
rc = 0;
- OBD_ALLOC_PTR(ladvise_hdr);
- if (ladvise_hdr == NULL)
+ u_ladvise_hdr = (void __user *)arg;
+ OBD_ALLOC_PTR(k_ladvise_hdr);
+ if (k_ladvise_hdr == NULL)
RETURN(-ENOMEM);
- if (copy_from_user(ladvise_hdr,
- (const struct llapi_ladvise_hdr __user *)arg,
- alloc_size))
+ if (copy_from_user(k_ladvise_hdr, u_ladvise_hdr, alloc_size))
GOTO(out_ladvise, rc = -EFAULT);
- if (ladvise_hdr->lah_magic != LADVISE_MAGIC ||
- ladvise_hdr->lah_count < 1)
+ if (k_ladvise_hdr->lah_magic != LADVISE_MAGIC ||
+ k_ladvise_hdr->lah_count < 1)
GOTO(out_ladvise, rc = -EINVAL);
- num_advise = ladvise_hdr->lah_count;
+ num_advise = k_ladvise_hdr->lah_count;
if (num_advise >= LAH_COUNT_MAX)
GOTO(out_ladvise, rc = -EFBIG);
- OBD_FREE_PTR(ladvise_hdr);
- alloc_size = offsetof(typeof(*ladvise_hdr),
+ OBD_FREE_PTR(k_ladvise_hdr);
+ alloc_size = offsetof(typeof(*k_ladvise_hdr),
lah_advise[num_advise]);
- OBD_ALLOC(ladvise_hdr, alloc_size);
- if (ladvise_hdr == NULL)
+ OBD_ALLOC(k_ladvise_hdr, alloc_size);
+ if (k_ladvise_hdr == NULL)
RETURN(-ENOMEM);
/*
* TODO: submit multiple advices to one server in a single RPC
*/
- if (copy_from_user(ladvise_hdr,
- (const struct llapi_ladvise_hdr __user *)arg,
- alloc_size))
+ if (copy_from_user(k_ladvise_hdr, u_ladvise_hdr, alloc_size))
GOTO(out_ladvise, rc = -EFAULT);
for (i = 0; i < num_advise; i++) {
- rc = ll_ladvise(inode, file, ladvise_hdr->lah_flags,
- &ladvise_hdr->lah_advise[i]);
+ struct llapi_lu_ladvise *k_ladvise =
+ &k_ladvise_hdr->lah_advise[i];
+ struct llapi_lu_ladvise __user *u_ladvise =
+ &u_ladvise_hdr->lah_advise[i];
+
+ rc = ll_ladvise_sanity(inode, k_ladvise);
if (rc)
+ GOTO(out_ladvise, rc);
+
+ switch (k_ladvise->lla_advice) {
+ case LU_LADVISE_LOCKNOEXPAND:
+ rc = ll_lock_noexpand(file,
+ k_ladvise->lla_peradvice_flags);
+ GOTO(out_ladvise, rc);
+			case LU_LADVISE_LOCKAHEAD:
+ rc = ll_file_lock_ahead(file, k_ladvise);
+
+ if (rc < 0)
+ GOTO(out_ladvise, rc);
+
+ if (put_user(rc,
+ &u_ladvise->lla_lockahead_result))
+ GOTO(out_ladvise, rc = -EFAULT);
+ break;
+ default:
+ rc = ll_ladvise(inode, file,
+ k_ladvise_hdr->lah_flags,
+ k_ladvise);
+ if (rc)
+ GOTO(out_ladvise, rc);
break;
+ }
+
}
out_ladvise:
- OBD_FREE(ladvise_hdr, alloc_size);
+ OBD_FREE(k_ladvise_hdr, alloc_size);
RETURN(rc);
}
case LL_IOC_FSGETXATTR:
CDEBUG(D_DLMTRACE, "Glimpsing inode "DFID"\n", PFID(fid));
/* NOTE: this looks like DLM lock request, but it may
- * not be one. Due to CEF_ASYNC flag (translated
+ * not be one. Due to CEF_GLIMPSE flag (translated
* to LDLM_FL_HAS_INTENT by osc), this is
* glimpse request, that won't revoke any
* conflicting DLM locks held. Instead,
*descr = whole_file;
descr->cld_obj = clob;
descr->cld_mode = CLM_READ;
- descr->cld_enq_flags = CEF_ASYNC | CEF_MUST;
+ descr->cld_enq_flags = CEF_GLIMPSE | CEF_MUST;
if (agl)
- descr->cld_enq_flags |= CEF_AGL;
+ descr->cld_enq_flags |= CEF_SPECULATIVE | CEF_NONBLOCK;
/*
- * CEF_ASYNC is used because glimpse sub-locks cannot
- * deadlock (because they never conflict with other
- * locks) and, hence, can be enqueued out-of-order.
- *
* CEF_MUST protects glimpse lock from conversion into
* a lockless mode.
*/
RETURN(result);
}
-static int cl_io_get(struct inode *inode, struct lu_env **envout,
+/**
+ * Get an IO environment for special operations such as glimpse locks and
+ * manually requested locks (ladvise lockahead)
+ *
+ * \param[in] inode inode the operation is being performed on
+ * \param[out] envout thread specific execution environment
+ * \param[out] ioout client io description
+ * \param[out] refcheck reference check
+ *
+ * \retval 1 on success
+ * \retval 0 not a regular file, cannot get environment
+ * \retval negative negative errno on error
+ */
+int cl_io_get(struct inode *inode, struct lu_env **envout,
struct cl_io **ioout, __u16 *refcheck)
{
struct lu_env *env;
* true: failure is known, not report again.
* false: unknown failure, should report. */
bool fd_write_failed;
+ bool ll_lock_no_expand;
rwlock_t fd_lock; /* protect lcc list */
struct list_head fd_lccs; /* list of ll_cl_context */
};
return cl_glimpse_size0(inode, 0);
}
+/* AGL is 'asynchronous glimpse lock', which is a speculative lock taken as
+ * part of statahead */
static inline int cl_agl(struct inode *inode)
{
return cl_glimpse_size0(inode, 1);
}
+int ll_file_lock_ahead(struct file *file, struct llapi_lu_ladvise *ladvise);
+
+int cl_io_get(struct inode *inode, struct lu_env **envout,
+ struct cl_io **ioout, __u16 *refcheck);
+
static inline int ll_glimpse_size(struct inode *inode)
{
struct ll_inode_info *lli = ll_i2info(inode);
RETURN(-ENOMEM);
}
- /* indicate the features supported by this client */
+ /* indicate MDT features supported by this client */
data->ocd_connect_flags = OBD_CONNECT_IBITS | OBD_CONNECT_NODEVOH |
OBD_CONNECT_ATTRFID |
OBD_CONNECT_VERSION | OBD_CONNECT_BRW_SIZE |
* back its backend blocksize for grant calculation purpose */
data->ocd_grant_blkbits = PAGE_SHIFT;
+ /* indicate OST features supported by this client */
data->ocd_connect_flags = OBD_CONNECT_GRANT | OBD_CONNECT_VERSION |
OBD_CONNECT_REQPORTAL | OBD_CONNECT_BRW_SIZE |
OBD_CONNECT_CANCELSET | OBD_CONNECT_FID |
OBD_CONNECT_JOBSTATS | OBD_CONNECT_LVB_TYPE |
OBD_CONNECT_LAYOUTLOCK |
OBD_CONNECT_PINGLESS | OBD_CONNECT_LFSCK |
- OBD_CONNECT_BULK_MBITS;
+ OBD_CONNECT_BULK_MBITS |
+ OBD_CONNECT_FLAGS2;
- data->ocd_connect_flags2 = 0;
+/* The client currently advertises support for OBD_CONNECT_LOCKAHEAD_OLD so it
+ * can interoperate with an older version of lockahead which was released prior
+ * to landing in master. This support will be dropped when 2.13 development
+ * starts. At that point, we should not just drop the connect flag (below), we
+ * should also remove the support in the code.
+ *
+ * Removing it means a few things:
+ * 1. Remove this section here
+ * 2. Remove CEF_NONBLOCK in ll_file_lockahead()
+ * 3. Remove function exp_connect_lockahead_old
+ * 4. Remove LDLM_FL_LOCKAHEAD_OLD_RESERVED in lustre_dlm_flags.h
+ */
+#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 12, 50, 0)
+ data->ocd_connect_flags |= OBD_CONNECT_LOCKAHEAD_OLD;
+#endif
+
+ data->ocd_connect_flags2 = OBD_CONNECT2_LOCKAHEAD;
if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_GRANT_PARAM))
data->ocd_connect_flags |= OBD_CONNECT_GRANT_PARAM;
if (io->u.ci_rw.rw_nonblock)
ast_flags |= CEF_NONBLOCK;
+ if (io->ci_lock_no_expand)
+ ast_flags |= CEF_LOCK_NO_EXPAND;
result = vvp_mmap_locks(env, io);
if (result == 0)
sub_io->ci_no_srvlock = io->ci_no_srvlock;
sub_io->ci_noatime = io->ci_noatime;
sub_io->ci_pio = io->ci_pio;
+ sub_io->ci_lock_no_expand = io->ci_lock_no_expand;
result = cl_io_sub_init(sub->sub_env, sub_io, io->ci_type, sub_obj);
if (rc < 0)
RETURN(rc);
- if ((enq_flags & CEF_ASYNC) && !(enq_flags & CEF_AGL)) {
+ if ((enq_flags & CEF_GLIMPSE) && !(enq_flags & CEF_SPECULATIVE)) {
anchor = &cl_env_info(env)->clt_anchor;
cl_sync_io_init(anchor, 1, cl_sync_io_end);
}
"multi_mod_rpcs",
"dir_stripe",
"subtree",
- "lock_ahead",
+ "lockahead",
"bulk_mbits",
"compact_obdo",
"second_flags",
/* flags2 names */
"file_secctx",
+ "lockaheadv2",
NULL
};
return(rc);
}
+ rc = ofd_dlm_init();
+ if (rc) {
+ lu_kmem_fini(ofd_caches);
+ ofd_fmd_exit();
+ return rc;
+ }
+
rc = class_register_type(&ofd_obd_ops, NULL, true, NULL,
LUSTRE_OST_NAME, &ofd_device_type);
return rc;
static void __exit ofd_exit(void)
{
ofd_fmd_exit();
+ ofd_dlm_exit();
lu_kmem_fini(ofd_caches);
class_unregister_type(LUSTRE_OST_NAME);
}
#include "ofd_internal.h"
struct ofd_intent_args {
- struct ldlm_lock **victim;
+ struct list_head gl_list;
__u64 size;
- int *liblustre;
+ bool no_glimpse_ast;
+ int error;
};
+int ofd_dlm_init(void)
+{
+ ldlm_glimpse_work_kmem = kmem_cache_create("ldlm_glimpse_work_kmem",
+ sizeof(struct ldlm_glimpse_work),
+ 0, 0, NULL);
+ if (ldlm_glimpse_work_kmem == NULL)
+ return -ENOMEM;
+ else
+ return 0;
+}
+
+void ofd_dlm_exit(void)
+{
+ if (ldlm_glimpse_work_kmem) {
+ kmem_cache_destroy(ldlm_glimpse_work_kmem);
+ ldlm_glimpse_work_kmem = NULL;
+ }
+}
+
/**
* OFD interval callback.
*
* The interval_callback_t is part of interval_iterate_reverse() and is called
* for each interval in tree. The OFD interval callback searches for locks
- * covering extents beyond the given args->size. This is used to decide if LVB
- * data is outdated.
+ * covering extents beyond the given args->size. This is used to decide if the
+ * size is too small and needs to be updated. Note that we are only interested
+ * in growing the size, as truncate is the only operation which can shrink it,
+ * and it is handled differently. This is why we only look at locks beyond the
+ * current size.
+ *
+ * It finds the highest lock (by starting point) in this interval, and adds it
+ * to the list of locks to glimpse. We must glimpse a list of locks - rather
+ * than only the highest lock on the file - because lockahead creates extent
+ * locks in advance of IO, and so breaks the assumption that the holder of the
+ * highest lock knows the current file size.
+ *
+ * This assumption is normally true because locks which are created as part of
+ * IO - rather than in advance of it - are guaranteed to be 'active', i.e.,
+ * involved in IO, and the holder of the highest 'active' lock always knows the
+ * current file size, because the size is either not changing or the holder of
+ * that lock is responsible for updating it.
+ *
+ * So we need only glimpse until we find the first client with an 'active'
+ * lock.
+ *
+ * Unfortunately, there is no way to know if a manually requested/speculative
+ * lock is 'active' from the server side. So when we see a potentially
+ * speculative lock, we must send a glimpse for that lock unless we have
+ * already sent a glimpse to the holder of that lock.
+ *
+ * However, *all* non-speculative locks are active. So we can stop glimpsing
+ * as soon as we find a non-speculative lock. Currently, all speculative PW
+ * locks have LDLM_FL_NO_EXPANSION set, and we use this to identify them. This
+ * is enforced by an assertion in osc_lock_init, which references this comment.
+ *
+ * If that ever changes, we will either need to find a new way to identify
+ * active locks or we will need to consider all PW locks (we will still only
+ * glimpse one per client).
+ *
+ * Note that it is safe to glimpse only the 'top' lock from each interval
+ * because ofd_intent_cb is only called for PW extent locks, and for PW locks,
+ * there is only one lock per interval.
*
* \param[in] n interval node
- * \param[in] args intent arguments
+ * \param[in,out] args intent arguments, gl work list for identified locks
*
* \retval INTERVAL_ITER_STOP if the interval is lower than
* file size, caller stops execution
struct ldlm_interval *node = (struct ldlm_interval *)n;
struct ofd_intent_args *arg = args;
__u64 size = arg->size;
- struct ldlm_lock **v = arg->victim;
+ struct ldlm_lock *victim_lock = NULL;
struct ldlm_lock *lck;
+ struct ldlm_glimpse_work *gl_work = NULL;
+ int rc = 0;
/* If the interval is lower than the current file size, just break. */
if (interval_high(n) <= size)
- return INTERVAL_ITER_STOP;
+ GOTO(out, rc = INTERVAL_ITER_STOP);
+ /* Find the 'victim' lock from this interval */
list_for_each_entry(lck, &node->li_group, l_sl_policy) {
- /* Don't send glimpse ASTs to liblustre clients.
- * They aren't listening for them, and they do
- * entirely synchronous I/O anyways. */
- if (lck->l_export == NULL || lck->l_export->exp_libclient)
- continue;
-
- if (*arg->liblustre)
- *arg->liblustre = 0;
- if (*v == NULL) {
- *v = LDLM_LOCK_GET(lck);
- } else if ((*v)->l_policy_data.l_extent.start <
- lck->l_policy_data.l_extent.start) {
- LDLM_LOCK_RELEASE(*v);
- *v = LDLM_LOCK_GET(lck);
- }
+ victim_lock = LDLM_LOCK_GET(lck);
/* the same policy group - every lock has the
* same extent, so needn't do it any more */
break;
}
- return INTERVAL_ITER_CONT;
-}
+	/* l_export can be NULL in a race with eviction; in that case, we will
+	 * not find any locks in this interval */
+ if (!victim_lock)
+ GOTO(out, rc = INTERVAL_ITER_CONT);
+
+ /*
+ * This check is for lock taken in ofd_destroy_by_fid() that does
+ * not have l_glimpse_ast set. So the logic is: if there is a lock
+ * with no l_glimpse_ast set, this object is being destroyed already.
+ * Hence, if you are grabbing DLM locks on the server, always set
+ * non-NULL glimpse_ast (e.g., ldlm_request.c::ldlm_glimpse_ast()).
+ */
+ if (victim_lock->l_glimpse_ast == NULL) {
+ LDLM_DEBUG(victim_lock, "no l_glimpse_ast");
+ arg->no_glimpse_ast = true;
+ GOTO(out_release, rc = INTERVAL_ITER_STOP);
+ }
+ /* If NO_EXPANSION is not set, this is an active lock, and we don't need
+ * to glimpse any further once we've glimpsed the client holding this
+ * lock. So set us up to stop. See comment above this function. */
+ if (!(victim_lock->l_flags & LDLM_FL_NO_EXPANSION))
+ rc = INTERVAL_ITER_STOP;
+ else
+ rc = INTERVAL_ITER_CONT;
+
+ /* Check to see if we're already set up to send a glimpse to this
+ * client; if so, don't add this lock to the glimpse list - We need
+ * only glimpse each client once. (And if we know that client holds
+ * an active lock, we can stop glimpsing. So keep the rc set in the
+ * check above.) */
+ list_for_each_entry(gl_work, &arg->gl_list, gl_list) {
+ if (gl_work->gl_lock->l_export == victim_lock->l_export)
+ GOTO(out_release, rc);
+ }
+
+ if (!OBD_FAIL_CHECK(OBD_FAIL_OST_GL_WORK_ALLOC))
+ OBD_SLAB_ALLOC_PTR_GFP(gl_work, ldlm_glimpse_work_kmem,
+ GFP_ATOMIC);
+
+ if (!gl_work) {
+ arg->error = -ENOMEM;
+ GOTO(out_release, rc = INTERVAL_ITER_STOP);
+ }
+
+ /* Populate the gl_work structure. */
+ gl_work->gl_lock = victim_lock;
+ list_add_tail(&gl_work->gl_list, &arg->gl_list);
+ /* There is actually no need for a glimpse descriptor when glimpsing
+ * extent locks */
+ gl_work->gl_desc = NULL;
+ /* This tells ldlm_work_gl_ast_lock this was allocated from a slab and
+ * must be freed in a slab-aware manner. */
+ gl_work->gl_flags = LDLM_GL_WORK_SLAB_ALLOCATED;
+
+ GOTO(out, rc);
+
+out_release:
+ /* If the victim doesn't go on the glimpse list, we must release it */
+ LDLM_LOCK_RELEASE(victim_lock);
+
+out:
+ return rc;
+}
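The glimpse-selection rules described in the comment above ofd_intent_cb (glimpse each export at most once, stop scanning at the first non-speculative lock) can be modeled independently of the LDLM structures. This is a simplified standalone sketch with illustrative names, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* One candidate PW lock, as seen walking intervals from the highest
 * offset downward: which client (export) holds it, and whether it was
 * manually requested (NO_EXPANSION set, so possibly speculative). */
struct cand {
	int export_id;
	bool no_expansion;
};

/* Returns the number of glimpses that would be sent and fills
 * glimpsed[] with the chosen export ids. */
static int pick_glimpses(const struct cand *locks, int n,
			 int *glimpsed, int max)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		bool seen = false;

		/* Each client need only be glimpsed once */
		for (int j = 0; j < count; j++)
			if (glimpsed[j] == locks[i].export_id)
				seen = true;
		if (!seen && count < max)
			glimpsed[count++] = locks[i].export_id;
		/* An expandable lock is guaranteed active, and its
		 * holder knows the current size: stop scanning. */
		if (!locks[i].no_expansion)
			break;
	}
	return count;
}
```

With locks from three exports where export 2 holds an active lock, only exports 1 and 2 are glimpsed and the scan stops before export 3.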
/**
* OFD lock intent policy
*
* \retval ELDLM_LOCK_REPLACED if already granted lock was found
* and placed in \a lockp
* \retval ELDLM_LOCK_ABORTED in other cases except error
- * \retval negative value on error
+ * \retval negative errno on error
*/
int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
void *req_cookie, enum ldlm_mode mode, __u64 flags,
void *data)
{
struct ptlrpc_request *req = req_cookie;
- struct ldlm_lock *lock = *lockp, *l = NULL;
+ struct ldlm_lock *lock = *lockp;
struct ldlm_resource *res = lock->l_resource;
ldlm_processing_policy policy;
struct ost_lvb *res_lvb, *reply_lvb;
struct ldlm_reply *rep;
enum ldlm_error err;
- int idx, rc, only_liblustre = 1;
+ int idx, rc;
struct ldlm_interval_tree *tree;
struct ofd_intent_args arg;
__u32 repsize[3] = {
[DLM_LOCKREPLY_OFF] = sizeof(*rep),
[DLM_REPLY_REC_OFF] = sizeof(*reply_lvb)
};
- struct ldlm_glimpse_work gl_work = {};
- struct list_head gl_list;
+ struct ldlm_glimpse_work *pos, *tmp;
ENTRY;
- INIT_LIST_HEAD(&gl_list);
+ INIT_LIST_HEAD(&arg.gl_list);
+ arg.no_glimpse_ast = false;
+ arg.error = 0;
lock->l_lvb_type = LVB_T_OST;
policy = ldlm_get_processing_policy(res);
LASSERT(policy != NULL);
/* The lock met with no resistance; we're finished. */
if (rc == LDLM_ITER_CONTINUE) {
- /* do not grant locks to the liblustre clients: they cannot
- * handle ASTs robustly. We need to do this while still
- * holding ns_lock to avoid the lock remaining on the res_link
- * list (and potentially being added to l_pending_list by an
- * AST) when we are going to drop this lock ASAP. */
- if (lock->l_export->exp_libclient ||
- OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_GLIMPSE, 2)) {
+ if (OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_GLIMPSE, 2)) {
ldlm_resource_unlink_lock(lock);
err = ELDLM_LOCK_ABORTED;
} else {
* res->lr_lvb_sem.
*/
arg.size = reply_lvb->lvb_size;
- arg.victim = &l;
- arg.liblustre = &only_liblustre;
+ /* Check for PW locks beyond the size in the LVB, build the list
+ * of locks to glimpse (arg.gl_list) */
for (idx = 0; idx < LCK_MODE_NUM; idx++) {
tree = &res->lr_itree[idx];
if (tree->lit_mode == LCK_PR)
continue;
interval_iterate_reverse(tree->lit_root, ofd_intent_cb, &arg);
+ if (arg.error) {
+ unlock_res(res);
+ GOTO(out, rc = arg.error);
+ }
}
unlock_res(res);
/* There were no PW locks beyond the size in the LVB; finished. */
- if (l == NULL) {
- if (only_liblustre) {
- /* If we discovered a liblustre client with a PW lock,
- * however, the LVB may be out of date! The LVB is
- * updated only on glimpse (which we don't do for
- * liblustre clients) and cancel (which the client
- * obviously has not yet done). So if it has written
- * data but kept the lock, the LVB is stale and needs
- * to be updated from disk.
- *
- * Of course, this will all disappear when we switch to
- * taking liblustre locks on the OST. */
- ldlm_res_lvbo_update(res, NULL, 1);
- }
+ if (list_empty(&arg.gl_list))
RETURN(ELDLM_LOCK_ABORTED);
- }
- /*
- * This check is for lock taken in ofd_destroy_by_fid() that does
- * not have l_glimpse_ast set. So the logic is: if there is a lock
- * with no l_glimpse_ast set, this object is being destroyed already.
- * Hence, if you are grabbing DLM locks on the server, always set
- * non-NULL glimpse_ast (e.g., ldlm_request.c::ldlm_glimpse_ast()).
- */
- if (l->l_glimpse_ast == NULL) {
+ if (arg.no_glimpse_ast) {
/* We are racing with unlink(); just return -ENOENT */
rep->lock_policy_res1 = ptlrpc_status_hton(-ENOENT);
- goto out;
+ GOTO(out, ELDLM_LOCK_ABORTED);
}
- /* Populate the gl_work structure.
- * Grab additional reference on the lock which will be released in
- * ldlm_work_gl_ast_lock() */
- gl_work.gl_lock = LDLM_LOCK_GET(l);
- /* The glimpse callback is sent to one single extent lock. As a result,
- * the gl_work list is just composed of one element */
- list_add_tail(&gl_work.gl_list, &gl_list);
- /* There is actually no need for a glimpse descriptor when glimpsing
- * extent locks */
- gl_work.gl_desc = NULL;
- /* the ldlm_glimpse_work structure is allocated on the stack */
- gl_work.gl_flags = LDLM_GL_WORK_NOFREE;
-
- rc = ldlm_glimpse_locks(res, &gl_list); /* this will update the LVB */
-
- if (!list_empty(&gl_list))
- LDLM_LOCK_RELEASE(l);
+ /* this will update the LVB */
+ ldlm_glimpse_locks(res, &arg.gl_list);
lock_res(res);
*reply_lvb = *res_lvb;
unlock_res(res);
out:
- LDLM_LOCK_RELEASE(l);
+	/* If the list is not empty, we failed to glimpse some locks and
+	 * must clean up, usually due to a race with unlink. */
+ list_for_each_entry_safe(pos, tmp, &arg.gl_list, gl_list) {
+ list_del(&pos->gl_list);
+ LDLM_LOCK_RELEASE(pos->gl_lock);
+ OBD_SLAB_FREE_PTR(pos, ldlm_glimpse_work_kmem);
+ }
- RETURN(ELDLM_LOCK_ABORTED);
+ RETURN(rc < 0 ? rc : ELDLM_LOCK_ABORTED);
}
extern struct ldlm_valblock_ops ofd_lvbo;
/* ofd_dlm.c */
+extern struct kmem_cache *ldlm_glimpse_work_kmem;
+int ofd_dlm_init(void);
+void ofd_dlm_exit(void);
int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
void *req_cookie, enum ldlm_mode mode, __u64 flags,
void *data);
struct ost_lvb *lvb, int kms_valid,
osc_enqueue_upcall_f upcall,
void *cookie, struct ldlm_enqueue_info *einfo,
- struct ptlrpc_request_set *rqset, int async, int agl);
+ struct ptlrpc_request_set *rqset, int async,
+ bool speculative);
int osc_match_base(struct obd_export *exp, struct ldlm_res_id *res_id,
enum ldlm_type type, union ldlm_policy_data *policy,
{
__u64 result = 0;
+ CDEBUG(D_DLMTRACE, "flags: %x\n", enqflags);
+
LASSERT((enqflags & ~CEF_MASK) == 0);
if (enqflags & CEF_NONBLOCK)
result |= LDLM_FL_BLOCK_NOWAIT;
- if (enqflags & CEF_ASYNC)
+ if (enqflags & CEF_GLIMPSE)
result |= LDLM_FL_HAS_INTENT;
if (enqflags & CEF_DISCARD_DATA)
result |= LDLM_FL_AST_DISCARD_DATA;
result |= LDLM_FL_TEST_LOCK;
if (enqflags & CEF_LOCK_MATCH)
result |= LDLM_FL_MATCH_LOCK;
+ if (enqflags & CEF_LOCK_NO_EXPAND)
+ result |= LDLM_FL_NO_EXPANSION;
+ if (enqflags & CEF_SPECULATIVE)
+ result |= LDLM_FL_SPECULATIVE;
return result;
}
RETURN(rc);
}
-static int osc_lock_upcall_agl(void *cookie, struct lustre_handle *lockh,
- int errcode)
+static int osc_lock_upcall_speculative(void *cookie,
+ struct lustre_handle *lockh,
+ int errcode)
{
struct osc_object *osc = cookie;
struct ldlm_lock *dlmlock;
lock_res_and_lock(dlmlock);
LASSERT(dlmlock->l_granted_mode == dlmlock->l_req_mode);
- /* there is no osc_lock associated with AGL lock */
+ /* there is no osc_lock associated with speculative locks */
osc_lock_lvb_update(env, osc, dlmlock, NULL);
unlock_res_and_lock(dlmlock);
struct cl_lock_descr *qed_descr = &qed->ols_cl.cls_lock->cll_descr;
struct cl_lock_descr *qing_descr = &qing->ols_cl.cls_lock->cll_descr;
- if (qed->ols_glimpse)
+ if (qed->ols_glimpse || qed->ols_speculative)
return true;
if (qing_descr->cld_mode == CLM_READ && qed_descr->cld_mode == CLM_READ)
struct osc_io *oio = osc_env_io(env);
struct osc_object *osc = cl2osc(slice->cls_obj);
struct osc_lock *oscl = cl2osc_lock(slice);
+ struct obd_export *exp = osc_export(osc);
struct cl_lock *lock = slice->cls_lock;
struct ldlm_res_id *resname = &info->oti_resname;
union ldlm_policy_data *policy = &info->oti_policy;
if (oscl->ols_state == OLS_GRANTED)
RETURN(0);
+ if ((oscl->ols_flags & LDLM_FL_NO_EXPANSION) &&
+ !(exp_connect_lockahead_old(exp) || exp_connect_lockahead(exp))) {
+ result = -EOPNOTSUPP;
+ CERROR("%s: server does not support lockahead/locknoexpand: "
+ "rc = %d\n", exp->exp_obd->obd_name, result);
+ RETURN(result);
+ }
+
if (oscl->ols_flags & LDLM_FL_TEST_LOCK)
GOTO(enqueue_base, 0);
- if (oscl->ols_glimpse) {
- LASSERT(equi(oscl->ols_agl, anchor == NULL));
+ /* For glimpse and/or speculative locks, do not wait for reply from
+ * server on LDLM request */
+ if (oscl->ols_glimpse || oscl->ols_speculative) {
+ /* Speculative and glimpse locks do not have an anchor */
+ LASSERT(equi(oscl->ols_speculative, anchor == NULL));
async = true;
GOTO(enqueue_base, 0);
}
/**
* DLM lock's ast data must be osc_object;
- * if glimpse or AGL lock, async of osc_enqueue_base() must be true,
+ * if glimpse or speculative lock, async of osc_enqueue_base()
+ * must be true
+ *
+ * For non-speculative locks:
* DLM's enqueue callback set to osc_lock_upcall() with cookie as
* osc_lock.
+ * For speculative locks:
+ * DLM's enqueue callback set to osc_lock_upcall_speculative()
+ * with cookie as the osc object, since there is no osc_lock
*/
ostid_build_res_name(&osc->oo_oinfo->loi_oi, resname);
osc_lock_build_policy(env, lock, policy);
- if (oscl->ols_agl) {
+ if (oscl->ols_speculative) {
oscl->ols_einfo.ei_cbdata = NULL;
/* hold a reference for callback */
cl_object_get(osc2cl(osc));
- upcall = osc_lock_upcall_agl;
+ upcall = osc_lock_upcall_speculative;
cookie = osc;
}
- result = osc_enqueue_base(osc_export(osc), resname, &oscl->ols_flags,
+ result = osc_enqueue_base(exp, resname, &oscl->ols_flags,
policy, &oscl->ols_lvb,
osc->oo_oinfo->loi_kms_valid,
upcall, cookie,
&oscl->ols_einfo, PTLRPCD_SET, async,
- oscl->ols_agl);
+ oscl->ols_speculative);
if (result == 0) {
if (osc_lock_is_lockless(oscl)) {
oio->oi_lockless = 1;
LASSERT(oscl->ols_hold);
LASSERT(oscl->ols_dlmlock != NULL);
}
- } else if (oscl->ols_agl) {
+ } else if (oscl->ols_speculative) {
cl_object_put(env, osc2cl(osc));
- result = 0;
+ if (oscl->ols_glimpse) {
+ /* hide error for AGL request */
+ result = 0;
+ }
}
out:
INIT_LIST_HEAD(&oscl->ols_wait_entry);
INIT_LIST_HEAD(&oscl->ols_nextlock_oscobj);
+ /* Speculative lock requests must be either no_expand or glimpse
+ * requests (CEF_GLIMPSE). Non-glimpse no_expand speculative extent
+ * locks will break ofd_intent_cb (see comment there). */
+ LASSERT(ergo((enqflags & CEF_SPECULATIVE) != 0,
+ (enqflags & (CEF_LOCK_NO_EXPAND | CEF_GLIMPSE)) != 0));
+
oscl->ols_flags = osc_enq2ldlm_flags(enqflags);
- oscl->ols_agl = !!(enqflags & CEF_AGL);
- if (oscl->ols_agl)
- oscl->ols_flags |= LDLM_FL_BLOCK_NOWAIT;
+ oscl->ols_speculative = !!(enqflags & CEF_SPECULATIVE);
+
if (oscl->ols_flags & LDLM_FL_HAS_INTENT) {
oscl->ols_flags |= LDLM_FL_BLOCK_GRANTED;
oscl->ols_glimpse = 1;
void *oa_cookie;
struct ost_lvb *oa_lvb;
struct lustre_handle oa_lockh;
- unsigned int oa_agl:1;
+ bool oa_speculative;
};
static void osc_release_ppga(struct brw_page **ppga, size_t count);
static int osc_enqueue_fini(struct ptlrpc_request *req,
osc_enqueue_upcall_f upcall, void *cookie,
struct lustre_handle *lockh, enum ldlm_mode mode,
- __u64 *flags, int agl, int errcode)
+ __u64 *flags, bool speculative, int errcode)
{
bool intent = *flags & LDLM_FL_HAS_INTENT;
int rc;
ptlrpc_status_ntoh(rep->lock_policy_res1);
if (rep->lock_policy_res1)
errcode = rep->lock_policy_res1;
- if (!agl)
+ if (!speculative)
*flags |= LDLM_FL_LVB_READY;
} else if (errcode == ELDLM_OK) {
*flags |= LDLM_FL_LVB_READY;
/* Let CP AST to grant the lock first. */
OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_CP_ENQ_RACE, 1);
- if (aa->oa_agl) {
+ if (aa->oa_speculative) {
LASSERT(aa->oa_lvb == NULL);
LASSERT(aa->oa_flags == NULL);
aa->oa_flags = &flags;
lockh, rc);
/* Complete osc stuff. */
rc = osc_enqueue_fini(req, aa->oa_upcall, aa->oa_cookie, lockh, mode,
- aa->oa_flags, aa->oa_agl, rc);
+ aa->oa_flags, aa->oa_speculative, rc);
OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_CP_CANCEL_RACE, 10);
struct ost_lvb *lvb, int kms_valid,
osc_enqueue_upcall_f upcall, void *cookie,
struct ldlm_enqueue_info *einfo,
- struct ptlrpc_request_set *rqset, int async, int agl)
+ struct ptlrpc_request_set *rqset, int async,
+ bool speculative)
{
struct obd_device *obd = exp->exp_obd;
struct lustre_handle lockh = { 0 };
policy->l_extent.start -= policy->l_extent.start & ~PAGE_MASK;
policy->l_extent.end |= ~PAGE_MASK;
- /*
- * kms is not valid when either object is completely fresh (so that no
- * locks are cached), or object was evicted. In the latter case cached
- * lock cannot be used, because it would prime inode state with
- * potentially stale LVB.
- */
- if (!kms_valid)
- goto no_match;
+ /*
+ * kms is not valid when either object is completely fresh (so that no
+ * locks are cached), or object was evicted. In the latter case cached
+ * lock cannot be used, because it would prime inode state with
+ * potentially stale LVB.
+ */
+ if (!kms_valid)
+ goto no_match;
/* Next, search for already existing extent locks that will cover us */
/* If we're trying to read, we also search for an existing PW lock. The
mode = einfo->ei_mode;
if (einfo->ei_mode == LCK_PR)
mode |= LCK_PW;
- if (agl == 0)
+ /* Normal lock requests must wait for the LVB to be ready before
+ * matching a lock; speculative lock requests do not need to,
+ * because they will not actually use the lock. */
+ if (!speculative)
match_flags |= LDLM_FL_LVB_READY;
if (intent != 0)
match_flags |= LDLM_FL_BLOCK_GRANTED;
RETURN(ELDLM_OK);
matched = ldlm_handle2lock(&lockh);
- if (agl) {
- /* AGL enqueues DLM locks speculatively. Therefore if
- * it already exists a DLM lock, it wll just inform the
- * caller to cancel the AGL process for this stripe. */
+ if (speculative) {
+ /* This DLM lock request is speculative, and does not
+ * have an associated IO request. Therefore if there
+ * is already a DLM lock, it will just inform the
+ * caller to cancel the request for this stripe. */
+ lock_res_and_lock(matched);
+ if (ldlm_extent_equal(&policy->l_extent,
+ &matched->l_policy_data.l_extent))
+ rc = -EEXIST;
+ else
+ rc = -ECANCELED;
+ unlock_res_and_lock(matched);
+
ldlm_lock_decref(&lockh, mode);
LDLM_LOCK_PUT(matched);
- RETURN(-ECANCELED);
+ RETURN(rc);
} else if (osc_set_lock_data(matched, einfo->ei_cbdata)) {
*flags |= LDLM_FL_LVB_READY;
struct osc_enqueue_args *aa;
CLASSERT(sizeof(*aa) <= sizeof(req->rq_async_args));
aa = ptlrpc_req_async_args(req);
- aa->oa_exp = exp;
- aa->oa_mode = einfo->ei_mode;
- aa->oa_type = einfo->ei_type;
+ aa->oa_exp = exp;
+ aa->oa_mode = einfo->ei_mode;
+ aa->oa_type = einfo->ei_type;
lustre_handle_copy(&aa->oa_lockh, &lockh);
- aa->oa_upcall = upcall;
- aa->oa_cookie = cookie;
- aa->oa_agl = !!agl;
- if (!agl) {
+ aa->oa_upcall = upcall;
+ aa->oa_cookie = cookie;
+ aa->oa_speculative = speculative;
+ if (!speculative) {
aa->oa_flags = flags;
aa->oa_lvb = lvb;
} else {
- /* AGL is essentially to enqueue an DLM lock
- * in advance, so we don't care about the
- * result of AGL enqueue. */
+ /* a speculative lock essentially enqueues a DLM
+ * lock in advance, so we don't care about the
+ * result of the enqueue. */
aa->oa_lvb = NULL;
aa->oa_flags = NULL;
}
}
rc = osc_enqueue_fini(req, upcall, cookie, &lockh, einfo->ei_mode,
- flags, agl, rc);
+ flags, speculative, rc);
if (intent)
ptlrpc_req_finished(req);
OBD_CONNECT_DIR_STRIPE);
LASSERTF(OBD_CONNECT_SUBTREE == 0x800000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_SUBTREE);
- LASSERTF(OBD_CONNECT_LOCK_AHEAD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
- OBD_CONNECT_LOCK_AHEAD);
+ LASSERTF(OBD_CONNECT_LOCKAHEAD_OLD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT_LOCKAHEAD_OLD);
LASSERTF(OBD_CONNECT_BULK_MBITS == 0x2000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_BULK_MBITS);
LASSERTF(OBD_CONNECT_OBDOPACK == 0x4000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_FLAGS2);
LASSERTF(OBD_CONNECT2_FILE_SECCTX == 0x1ULL, "found 0x%.16llxULL\n",
OBD_CONNECT2_FILE_SECCTX);
+ LASSERTF(OBD_CONNECT2_LOCKAHEAD == 0x2ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT2_LOCKAHEAD);
LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
(unsigned)OBD_CKSUM_CRC32);
LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
noinst_PROGRAMS += listxattr_size_check check_fhandle_syscalls badarea_io
noinst_PROGRAMS += llapi_layout_test orphan_linkea_check llapi_hsm_test
noinst_PROGRAMS += group_lock_test llapi_fid_test sendfile_grouplock mmap_cat
-noinst_PROGRAMS += swap_lock_test
+noinst_PROGRAMS += swap_lock_test lockahead_test
bin_PROGRAMS = mcreate munlink
testdir = $(libdir)/lustre/tests
statmany_LDADD=$(LIBLUSTREAPI)
statone_LDADD=$(LIBLUSTREAPI)
rwv_LDADD=$(LIBCFS)
+lockahead_test_LDADD=$(LIBLUSTREAPI)
ll_dirstripe_verify_SOURCES = ll_dirstripe_verify.c
ll_dirstripe_verify_LDADD = $(LIBLUSTREAPI) $(LIBCFS) $(PTHREAD_LIBS)
--- /dev/null
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+
+/*
+ * Copyright 2016 Cray Inc. All rights reserved.
+ * Authors: Patrick Farrell, Frank Zago
+ *
+ * A few portions are extracted from llapi_layout_test.c
+ *
+ * The purpose of this test is to exercise the lockahead advice of ladvise.
+ *
+ * The program will exit as soon as a test fails.
+ */
+
+#include <stdlib.h>
+#include <errno.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <poll.h>
+#include <time.h>
+
+#include <lustre/lustreapi.h>
+#include <linux/lustre/lustre_idl.h>
+
+#define ERROR(fmt, ...) \
+ fprintf(stderr, "%s: %s:%d: %s: " fmt "\n", \
+ program_invocation_short_name, __FILE__, __LINE__, \
+ __func__, ## __VA_ARGS__)
+
+#define DIE(fmt, ...) \
+ do { \
+ ERROR(fmt, ## __VA_ARGS__); \
+ exit(-1); \
+ } while (0)
+
+#define ASSERTF(cond, fmt, ...) \
+ do { \
+ if (!(cond)) \
+ DIE("assertion '%s' failed: "fmt, \
+ #cond, ## __VA_ARGS__); \
+ } while (0)
+
+#define PERFORM(testfn) \
+ do { \
+ cleanup(); \
+ fprintf(stderr, "Starting test " #testfn " at %lld\n", \
+ (unsigned long long)time(NULL)); \
+ rc = testfn(); \
+ fprintf(stderr, "Finishing test " #testfn " at %lld\n", \
+ (unsigned long long)time(NULL)); \
+ cleanup(); \
+ } while (0)
+
+/* Name of file/directory. Will be set once and will not change. */
+static char mainpath[PATH_MAX];
+static const char *mainfile = "lockahead_test_654";
+
+static char fsmountdir[PATH_MAX]; /* Lustre mountpoint */
+static char *lustre_dir; /* Test directory inside Lustre */
+static int single_test; /* Number of a single test to execute */
+
+/* Cleanup our test file. */
+static void cleanup(void)
+{
+ unlink(mainpath);
+}
+
+/* Trivial helper for one advice */
+void setup_ladvise_lockahead(struct llapi_lu_ladvise *advice, int mode,
+ int flags, size_t start, size_t end, bool async)
+{
+ advice->lla_advice = LU_LADVISE_LOCKAHEAD;
+ advice->lla_lockahead_mode = mode;
+ if (async)
+ advice->lla_peradvice_flags = flags | LF_ASYNC;
+ else
+ advice->lla_peradvice_flags = flags;
+ advice->lla_start = start;
+ advice->lla_end = end;
+ advice->lla_value3 = 0;
+ advice->lla_value4 = 0;
+}
+
+/* Test valid single lock ahead request */
+static int test10(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+
+
+ return 0;
+}
+
+/* Get lock, wait until lock is taken */
+static int test11(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int enqueue_requests = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ enqueue_requests++;
+
+ /* Ask again until we get the lock (status 1). */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice.lla_lockahead_result > 0)
+ break;
+
+ enqueue_requests++;
+ }
+
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Again. This time it is always there. */
+ for (i = 0; i < 100; i++) {
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result > 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+ }
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return enqueue_requests;
+}
+
+/* Test with several times the same extent */
+static int test12(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 10;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ for (i = 0; i < count; i++) {
+ setup_ladvise_lockahead(&(advice[i]), MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+ advice[i].lla_lockahead_result = 98674;
+ }
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ for (i = 0; i < count; i++) {
+ ASSERTF(advice[i].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[i].lla_lockahead_result);
+ }
+ /* Since all the requests are for the same extent, we should only have
+ * one lock at the end. */
+ expected_lock_count = 1;
+
+ /* Ask again until we get the locks. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[count-1].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[count-1].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[count-1].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[count-1].lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ free(advice);
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Grow a lock forward */
+static int test13(void)
+{
+ struct llapi_lu_ladvise *advice = NULL;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ for (i = 0; i < 100; i++) {
+ if (advice)
+ free(advice);
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0,
+ i * write_size, (i+1)*write_size - 1,
+ true);
+ advice[0].lla_lockahead_result = 98674;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' at offset %llu: %s",
+ mainpath,
+ advice[0].lla_end,
+ strerror(errno));
+
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+
+ expected_lock_count++;
+ }
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Grow a lock backward */
+static int test14(void)
+{
+ struct llapi_lu_ladvise *advice = NULL;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ const int num_blocks = 100;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ for (i = 0; i < num_blocks; i++) {
+ size_t start = (num_blocks - i - 1) * write_size;
+ size_t end = (num_blocks - i) * write_size - 1;
+
+ if (advice)
+ free(advice);
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+ advice[0].lla_lockahead_result = 98674;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' at offset %llu: %s",
+ mainpath,
+ advice[0].lla_end,
+ strerror(errno));
+
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+
+ expected_lock_count++;
+ }
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Request many locks at 10MiB intervals */
+static int test15(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ for (i = 0; i < 5000; i++) {
+ /* The 'UL' designators are required to avoid undefined
+ * behavior which GCC turns into an infinite loop */
+ __u64 start = i * 1024UL * 1024UL * 10UL;
+ __u64 end = start + 1;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+ expected_lock_count++;
+ }
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* The write should cancel the first lock (which was too small)
+ * and create one of its own, so the net effect on lock count is 0. */
+
+ free(advice);
+
+ close(fd);
+
+ /* We have to map our expected return into the range of valid return
+ * codes, 0-255. */
+ expected_lock_count = expected_lock_count/1000;
+
+ return expected_lock_count;
+}
+
+/* Use lockahead to verify behavior of ladvise locknoexpand */
+static int test16(void)
+{
+ struct llapi_lu_ladvise *advice;
+ struct llapi_lu_ladvise *advice_noexpand;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ __u64 start = 0;
+ __u64 end = write_size - 1;
+ int rc;
+ char buf[write_size];
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ advice_noexpand = malloc(sizeof(struct llapi_lu_ladvise));
+
+ /* First ask for a read lock, which will conflict with the write */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, false);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == 0,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Use an async request to verify we got the read lock we asked for */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, true);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Set noexpand */
+ advice_noexpand[0].lla_advice = LU_LADVISE_LOCKNOEXPAND;
+ advice_noexpand[0].lla_peradvice_flags = 0;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+
+ /* This write should generate a lock on exactly "write_size" bytes */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Now, disable locknoexpand and try writing again. */
+ advice_noexpand[0].lla_peradvice_flags = LF_UNSET;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ /* This write should get an expanded lock */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ /* Verify it didn't get a lock on just the bytes it wrote. */
+ usleep(100000); /* 0.1 second, plenty of time to get the lock */
+
+ start = start + write_size;
+ end = end + write_size;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Use lockahead to verify behavior of ladvise locknoexpand, with O_NONBLOCK.
+ * There should be no change in behavior. */
+static int test17(void)
+{
+ struct llapi_lu_ladvise *advice;
+ struct llapi_lu_ladvise *advice_noexpand;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ __u64 start = 0;
+ __u64 end = write_size - 1;
+ int rc;
+ char buf[write_size];
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC | O_NONBLOCK,
+ S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ advice_noexpand = malloc(sizeof(struct llapi_lu_ladvise));
+
+ /* First ask for a read lock, which will conflict with the write */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, false);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == 0,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Use an async request to verify we got the read lock we asked for */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, true);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Set noexpand */
+ advice_noexpand[0].lla_advice = LU_LADVISE_LOCKNOEXPAND;
+ advice_noexpand[0].lla_peradvice_flags = 0;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+
+ /* This write should generate a lock on exactly "write_size" bytes */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Now, disable locknoexpand and try writing again. */
+ advice_noexpand[0].lla_peradvice_flags = LF_UNSET;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ /* This write should get an expanded lock */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ /* Verify it didn't get a lock on just the bytes it wrote. */
+ usleep(100000); /* 0.1 second, plenty of time to get the lock */
+
+ start = start + write_size;
+ end = end + write_size;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test overlapping requests */
+static int test18(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ int rc;
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ /* Overlapping locks - Should only end up with 1 */
+ for (i = 0; i < 10; i++) {
+ __u64 start = i;
+ __u64 end = start + 4096;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+ }
+ expected_lock_count = 1;
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, 0, 4096,
+ true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test that normal request blocks lock ahead requests */
+static int test19(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ /* This should create a lock on the whole file, which will block lock
+ * ahead requests. */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ expected_lock_count = 1;
+
+ /* These should all be blocked. */
+ for (i = 0; i < 10; i++) {
+ __u64 start = i * 4096;
+ __u64 end = start + 4096;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+ }
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test sync requests, and matching with async requests */
+static int test20(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 1;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ /* Async request */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Ask again until we get the lock (status 1). */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice.lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Convert to a sync request on smaller range, should match and not
+ * cancel */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1 - write_size/2, false);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ /* Sync requests cannot give detailed results */
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Use an async request to test original lock is still present */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test sync requests, and conflict with async requests */
+static int test21(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 1;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ /* Async request */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Ask again until we get the lock (status 1). */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice.lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Convert to a sync request on larger range, should cancel existing
+ * lock */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size*2 - 1, false);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ /* Sync requests cannot give detailed results */
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Use an async request to test new lock is there */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size*2 - 1, true);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test various valid and invalid inputs */
+static int test22(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ int rc;
+ size_t start = 0;
+ size_t end = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ /* A valid async request first */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024*1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ free(advice);
+
+ /* Valid sync request */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024*1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, false);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ free(advice);
+
+ /* No actual block (zero-length extent) */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 0;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for no block lock: %d %s",
+ rc, strerror(errno));
+ free(advice);
+
+ /* end before start */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 1024 * 1024;
+ end = 0;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for reversed block: %d %s",
+ rc, strerror(errno));
+ free(advice);
+
+ /* bogus lock mode - 0x65464 */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, 0x65464, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus lock mode: %d %s",
+ rc, strerror(errno));
+ free(advice);
+
+ /* bogus flags, 0x80 */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0x80, start, end,
+ true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus flags: %u %d %s",
+ 0x80, rc, strerror(errno));
+ free(advice);
+
+ /* bogus flags, 0xff - CEF_MASK */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0xff, start, end,
+ true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus flags: %u %d %s",
+ 0xff, rc, strerror(errno));
+ free(advice);
+
+ /* bogus flags, 0xffffffff */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0xffffffff, start,
+ end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus flags: %u %d %s",
+ 0xffffffff, rc, strerror(errno));
+ free(advice);
+
+ close(fd);
+
+ return 0;
+}
+
+static void usage(char *prog)
+{
+ fprintf(stderr, "Usage: %s [-d lustre_dir] [-t single_test]\n", prog);
+ exit(-1);
+}
+
+static void process_args(int argc, char *argv[])
+{
+ int c;
+
+ while ((c = getopt(argc, argv, "d:t:")) != -1) {
+ switch (c) {
+ case 'd':
+ lustre_dir = optarg;
+ break;
+ case 't':
+ single_test = atoi(optarg);
+ break;
+ case '?':
+ default:
+ fprintf(stderr, "Invalid option '%c'\n", optopt);
+ usage(argv[0]);
+ break;
+ }
+ }
+}
+
+int main(int argc, char *argv[])
+{
+ char fsname[8];
+ int rc;
+
+ process_args(argc, argv);
+ if (lustre_dir == NULL)
+ lustre_dir = "/mnt/lustre";
+
+ rc = llapi_search_mounts(lustre_dir, 0, fsmountdir, fsname);
+ if (rc != 0) {
+ fprintf(stderr, "Error: '%s': not a Lustre filesystem\n",
+ lustre_dir);
+ return -1;
+ }
+
+ /* Play nice with Lustre test scripts. Without line buffering,
+ * output under I/O redirection may appear out of order. */
+ setvbuf(stdout, NULL, _IOLBF, 0);
+
+ /* Create a test filename and reuse it. Remove possibly old files. */
+ rc = snprintf(mainpath, sizeof(mainpath), "%s/%s", lustre_dir,
+ mainfile);
+ ASSERTF(rc > 0 && rc < sizeof(mainpath), "invalid name for mainpath");
+ cleanup();
+
+ atexit(cleanup);
+
+ switch (single_test) {
+ case 0:
+ PERFORM(test10);
+ PERFORM(test11);
+ PERFORM(test12);
+ PERFORM(test13);
+ PERFORM(test14);
+ PERFORM(test15);
+ PERFORM(test16);
+ PERFORM(test17);
+ PERFORM(test18);
+ PERFORM(test19);
+ PERFORM(test20);
+ PERFORM(test21);
+ PERFORM(test22);
+ /* When running all the test cases, we can't use the return
+ * from the last test case, as it might be non-zero to return
+ * info, rather than for an error. Test cases assert and exit
+ * if an error occurs. */
+ rc = 0;
+ break;
+ case 10:
+ PERFORM(test10);
+ break;
+ case 11:
+ PERFORM(test11);
+ break;
+ case 12:
+ PERFORM(test12);
+ break;
+ case 13:
+ PERFORM(test13);
+ break;
+ case 14:
+ PERFORM(test14);
+ break;
+ case 15:
+ PERFORM(test15);
+ break;
+ case 16:
+ PERFORM(test16);
+ break;
+ case 17:
+ PERFORM(test17);
+ break;
+ case 18:
+ PERFORM(test18);
+ break;
+ case 19:
+ PERFORM(test19);
+ break;
+ case 20:
+ PERFORM(test20);
+ break;
+ case 21:
+ PERFORM(test21);
+ break;
+ case 22:
+ PERFORM(test22);
+ break;
+ default:
+ fprintf(stderr, "impossible value of single_test %d\n",
+ single_test);
+ rc = -1;
+ break;
+ }
+
+ return rc;
+}
}
run_test 255b "check 'lfs ladvise -a dontneed'"
+test_255c() {
+ local count
+ local new_count
+ local difference
+ local i
+ local rc
+ test_mkdir -p $DIR/$tdir
+ $SETSTRIPE -i 0 $DIR/$tdir
+
+ #test 10 returns only success/failure
+ i=10
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+
+ #test 11 counts lock enqueue requests, all others count new locks
+ i=11
+ count=$(do_facet ost1 \
+ $LCTL get_param -n ost.OSS.ost.stats)
+ count=$(echo "$count" | grep ldlm_extent_enqueue | awk '{ print $2 }')
+
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+
+ new_count=$(do_facet ost1 \
+ $LCTL get_param -n ost.OSS.ost.stats)
+ new_count=$(echo "$new_count" | grep ldlm_extent_enqueue | \
+ awk '{ print $2 }')
+
+ difference="$((new_count - count))"
+ if [ $difference -ne $rc ]; then
+ error "Ladvise test${i}, bad enqueue count, returned " \
+ "${rc}, actual ${difference}"
+ fi
+
+ for i in $(seq 12 21); do
+ # If we do not do this, we run the risk of having too many
+ # locks and starting lock cancellation while we are checking
+ # lock counts.
+ cancel_lru_locks osc
+
+ count=$($LCTL get_param -n \
+ ldlm.namespaces.$FSNAME-OST0000*osc-f*.lock_unused_count)
+
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+
+ new_count=$($LCTL get_param -n \
+ ldlm.namespaces.$FSNAME-OST0000*osc-f*.lock_unused_count)
+ difference="$((new_count - count))"
+
+ # Test 15 output is divided by 1000 to map down to valid return
+ if [ $i -eq 15 ]; then
+ rc="$((rc * 1000))"
+ fi
+
+ if [ $difference -ne $rc ]; then
+ error "Ladvise test ${i}, bad lock count, returned " \
+ "${rc}, actual ${difference}"
+ fi
+ done
+
+ #test 22 returns only success/failure
+ i=22
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+}
+run_test 255c "suite of ladvise lockahead tests"
+
test_256() {
local cl_user
local cat_sl
{"ladvise", lfs_ladvise, 0,
"Provide servers with advice about access patterns for a file.\n"
"usage: ladvise [--advice|-a ADVICE] [--start|-s START[kMGT]]\n"
- " [--background|-b]\n"
+ " [--background|-b] [--unset|-u]\n"
" {[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}\n"
+ " [--mode|-m {READ,WRITE}]\n"
" <file> ...\n"},
{"help", Parser_help, 0, "help"},
{"exit", Parser_quit, 0, "quit"},
static const char *const ladvise_names[] = LU_LADVISE_NAMES;
+static const char *const lock_mode_names[] = LOCK_MODE_NAMES;
+
+static const char *const lockahead_results[] = {
+ [LLA_RESULT_SENT] = "Lock request sent",
+ [LLA_RESULT_DIFFERENT] = "Different matching lock found",
+ [LLA_RESULT_SAME] = "Matching lock on identical extent found",
+};
+
+static int lfs_get_mode(const char *string)
+{
+ enum lock_mode_user mode;
+
+ for (mode = 0; mode < ARRAY_SIZE(lock_mode_names); mode++) {
+ if (lock_mode_names[mode] == NULL)
+ continue;
+ if (strcmp(string, lock_mode_names[mode]) == 0)
+ return mode;
+ }
+
+ return -EINVAL;
+}
+
static enum lu_ladvise_type lfs_get_ladvice(const char *string)
{
enum lu_ladvise_type advice;
{ .val = 'b', .name = "background", .has_arg = no_argument },
{ .val = 'e', .name = "end", .has_arg = required_argument },
{ .val = 'l', .name = "length", .has_arg = required_argument },
+ { .val = 'm', .name = "mode", .has_arg = required_argument },
{ .val = 's', .name = "start", .has_arg = required_argument },
+ { .val = 'u', .name = "unset", .has_arg = no_argument },
{ .name = NULL } };
- char short_opts[] = "a:be:l:s:";
+ char short_opts[] = "a:be:l:m:s:u";
int c;
int rc = 0;
const char *path;
unsigned long long length = 0;
unsigned long long size_units;
unsigned long long flags = 0;
+ int mode = 0;
optind = 0;
while ((c = getopt_long(argc, argv, short_opts,
case 'b':
flags |= LF_ASYNC;
break;
+ case 'u':
+ flags |= LF_UNSET;
+ break;
case 'e':
size_units = 1;
rc = llapi_parse_size(optarg, &end,
return CMD_HELP;
}
break;
+ case 'm':
+ mode = lfs_get_mode(optarg);
+ if (mode < 0) {
+ fprintf(stderr, "%s: bad mode '%s', valid "
+ "modes are READ or WRITE\n",
+ argv[0], optarg);
+ return CMD_HELP;
+ }
+ break;
case '?':
return CMD_HELP;
default:
return CMD_HELP;
}
+ if (advice_type == LU_LADVISE_LOCKNOEXPAND) {
+ fprintf(stderr, "%s: lock no-expand advice applies per "
+ "file descriptor, so it has no effect when "
+ "invoked from lfs\n", argv[0]);
+ return CMD_HELP;
+ }
+
if (argc <= optind) {
fprintf(stderr, "%s: please give one or more file names\n",
argv[0]);
return CMD_HELP;
}
+ if (advice_type != LU_LADVISE_LOCKAHEAD && mode != 0) {
+ fprintf(stderr, "%s: mode is only valid with lockahead\n",
+ argv[0]);
+ return CMD_HELP;
+ }
+
+ if (advice_type == LU_LADVISE_LOCKAHEAD && mode == 0) {
+ fprintf(stderr, "%s: mode is required with lockahead\n",
+ argv[0]);
+ return CMD_HELP;
+ }
+
while (optind < argc) {
int rc2;
advice.lla_value2 = 0;
advice.lla_value3 = 0;
advice.lla_value4 = 0;
+ if (advice_type == LU_LADVISE_LOCKAHEAD) {
+ advice.lla_lockahead_mode = mode;
+ advice.lla_peradvice_flags = flags;
+ }
+
rc2 = llapi_ladvise(fd, flags, 1, &advice);
close(fd);
if (rc2 < 0) {
"'%s': %s\n", argv[0],
ladvise_names[advice_type],
path, strerror(errno));
+
+ goto next;
}
+
next:
if (rc == 0 && rc2 < 0)
rc = rc2;
int llapi_ladvise(int fd, unsigned long long flags, int num_advise,
struct llapi_lu_ladvise *ladvise)
{
- int rc;
struct llapi_ladvise_hdr *ladvise_hdr;
+ int rc;
+ int i;
if (num_advise < 1 || num_advise >= LAH_COUNT_MAX) {
errno = EINVAL;
llapi_error(LLAPI_MSG_ERROR, -errno, "cannot give advice");
return -1;
}
+
+ /* Copy results back in to caller provided structs */
+ for (i = 0; i < num_advise; i++) {
+ struct llapi_lu_ladvise *ladvise_iter;
+
+ ladvise_iter = &ladvise_hdr->lah_advise[i];
+
+ if (ladvise_iter->lla_advice == LU_LADVISE_LOCKAHEAD)
+ ladvise[i].lla_lockahead_result =
+ ladvise_iter->lla_lockahead_result;
+ }
+
return 0;
}
CHECK_DEFINE_64X(OBD_CONNECT_MULTIMODRPCS);
CHECK_DEFINE_64X(OBD_CONNECT_DIR_STRIPE);
CHECK_DEFINE_64X(OBD_CONNECT_SUBTREE);
- CHECK_DEFINE_64X(OBD_CONNECT_LOCK_AHEAD);
+ CHECK_DEFINE_64X(OBD_CONNECT_LOCKAHEAD_OLD);
CHECK_DEFINE_64X(OBD_CONNECT_BULK_MBITS);
CHECK_DEFINE_64X(OBD_CONNECT_OBDOPACK);
CHECK_DEFINE_64X(OBD_CONNECT_FLAGS2);
CHECK_DEFINE_64X(OBD_CONNECT2_FILE_SECCTX);
+ CHECK_DEFINE_64X(OBD_CONNECT2_LOCKAHEAD);
CHECK_VALUE_X(OBD_CKSUM_CRC32);
CHECK_VALUE_X(OBD_CKSUM_ADLER);
OBD_CONNECT_DIR_STRIPE);
LASSERTF(OBD_CONNECT_SUBTREE == 0x800000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_SUBTREE);
- LASSERTF(OBD_CONNECT_LOCK_AHEAD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
- OBD_CONNECT_LOCK_AHEAD);
+ LASSERTF(OBD_CONNECT_LOCKAHEAD_OLD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT_LOCKAHEAD_OLD);
LASSERTF(OBD_CONNECT_BULK_MBITS == 0x2000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_BULK_MBITS);
LASSERTF(OBD_CONNECT_OBDOPACK == 0x4000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_FLAGS2);
LASSERTF(OBD_CONNECT2_FILE_SECCTX == 0x1ULL, "found 0x%.16llxULL\n",
OBD_CONNECT2_FILE_SECCTX);
+ LASSERTF(OBD_CONNECT2_LOCKAHEAD == 0x2ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT2_LOCKAHEAD);
LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
(unsigned)OBD_CKSUM_CRC32);
LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",