--- /dev/null
+Ladvise Lock Ahead design
+
+Lock ahead is a new Lustre feature aimed at solving a long standing problem
+with shared file write performance in Lustre. It requires client and server
+support. It will be used primarily via the MPI-I/O library, not directly from
+user applications.
+
+The first part of this document (sections 1 and 2) is an overview of the
+problem and high level description of the solution. Section 3 explains how the
+library will make use of this feature, and sections 4 and 5 describe the design
+of the Lustre changes.
+
+1. Overview: Purpose & Interface
+Lock ahead is intended to allow optimization of certain I/O patterns which
+would otherwise suffer LDLM* lock contention. It allows applications to
+manually request locks on specific extents of a file, avoiding the usual
+server side optimizations. This lets applications which know their I/O
+pattern use that information to avoid false conflicts caused by those
+optimizations.
+
+*Lustre distributed lock manager. This is the locking layer shared between
+clients and servers, used to coordinate access to files among clients.
+
+Normally, clients get locks automatically as the first step of an I/O.
+The client asks for a lock which covers exactly the area of interest (ie, a
+read or write lock of n bytes at offset x), but the server attempts to optimize
+this by expanding the lock to cover as much of the file as possible. This is
+useful for a single client, but can cause trouble for multiple clients.
+
+In cases where multiple clients wish to write to the same file, this
+optimization can result in locks that conflict when the actual I/O operations
+do not. This requires clients to wait for one another to complete I/O, even
+when there is no conflict between actual I/O requests. This can significantly
+reduce performance (Anywhere from 40-90%, depending on system specs) for some
+workloads.
+
+The lockahead feature makes it possible to avoid this problem by acquiring the
+necessary locks in advance, using explicit requests with server side extent
+changes disabled. We add a new lfs advice type, LU_LADVISE_LOCKAHEAD,
+which allows lock requests from userspace on the client, specifying the extent
+and the I/O mode (read/write) for the lock. These lock requests explicitly
+disable server side changes to the lock extent, so the lock returned to the
+client covers only the extent requested.
+
+When using this feature, clients which intend to write to a file can request
+locks to cover their I/O pattern, wait a moment for the locks to be granted,
+then write or read the file.
+
+In this way, a set of clients which know their I/O pattern in advance can
+force the LDLM layer to grant locks appropriate for that I/O pattern. This
+allows applications which are poorly handled by the default lock optimization
+behavior to significantly improve their performance.
+
+2. I/O Pattern & Locking problems
+2. A. Strided writing and MPI-I/O
+There is a thorough explanation and overview of strided writing and the
+benefits of this functionality in the slides from the lock ahead presentation
+at LUG 2015. It is highly recommended to read that first, as the graphics are
+much clearer than the prose here.
+
+See slides 1-13:
+http://wiki.lustre.org/images/f/f9/Shared-File-Performance-in-Lustre_Farrell.pdf
+
+MPI-I/O uses strided writing when doing I/O from a large job to a single file.
+I/O is aggregated from all the nodes running a particular application to a
+small number of I/O aggregator nodes which then write out the data, in a
+strided manner.
+
+In strided writing, different clients take turns writing different blocks of a
+file (A block is some arbitrary number of bytes). Client 1 is responsible for
+writes to block 0, block 2, block 4, etc., client 2 is responsible for block 1,
+block 3, etc.
+
+Without the ability to manually request locks, strided writing is set up in
+concert with Lustre file striping so each client writes to one OST. (IE, for a
+file striped to three OSTs, we would write from three clients.)
+
+The particular case of interest is when we want to use more than one client
+per OST. This is important, because an OST typically has much more bandwidth
+than one client. Strided writes are non-overlapping, so they should be able to
+proceed in parallel with more than one client per OST. In practice, on Lustre,
+they do not, due to lock expansion.
+
+2. B. Locking problems
+We will now describe locking when there is more than one client per OST. This
+behavior is the same on a per OST basis in a file striped across multiple OSTs.
+When the first client asks to write block 0, it asks for the required lock from
+the server. When it receives this request, the server sees that there are no
+other locks on the file. Since it assumes the client will want to write to the
+file again, the server expands the lock as far as possible. In this case, it
+expands the lock to the maximum file size (effectively, to infinity), then
+grants it to client 1.
+
+When client 2 wants to write block 1, it conflicts with the expanded lock
+granted to client 1. The server then must revoke (In Lustre terms,
+'call back') the lock granted to client 1 so it can grant a lock to client 2.
+After the lock granted to client 1 is revoked, there are no locks on the file.
+The server sees this when processing the lock request from client 2, and
+expands that lock to cover the whole file.
+
+Client 1 then wishes to write block 2 of the file... And the cycle continues.
+The two clients exchange the extended lock throughout the write, allowing only
+one client to write at a time, plus latency to exchange the lock. The effect is
+dramatic: Two clients are actually slower than one. (Similar behavior is seen
+with more than two clients.)
+
+The solution is to use this new advice type to acquire locks before they are
+needed. In effect, before it starts writing to the file, client 1 requests
+locks on block 0, block 2, etc. It locks 'ahead' a certain (tunable) number of
+locks. Client 2 does the same. Then they both begin to write, and are able to
+do so in parallel. A description of the actual library implementation follows.
+
+3. Library implementation
+Actually implementing this in the library carries a number of wrinkles.
+The basic pattern is this:
+Before writing, an I/O aggregator requests a certain number of locks on blocks
+that it is responsible for. It may or may not ever write to these blocks, but
+it takes locks knowing it might. It then begins to write, tracking how many of
+the locks it has used. When the number of locks 'ahead' of the I/O is low
+enough, it requests more locks in advance of the I/O.
+
+For technical reasons which are explained in the implementation section, these
+lock requests are either asynchronous and non-blocking or synchronous and
+blocking. In Lustre terms, non-blocking means if there is already a lock on
+the relevant extent of the file, the manual lock request is not granted. This
+means that if there is already a lock on the file (quite common; imagine
+writing to a file which was previously read by another process), these lock
+requests will be denied. However, once the first 'real' write arrives that
+was hoping to use a lockahead lock, that write will cause the blocking lock to
+be cancelled, so this interference is not fatal.
+
+It is of course possible for another process to get in the way by immediately
+asking for a lock on the file. This is something users should try to avoid.
+When writing out a file, repeatedly trying to read it will impact performance
+even without this feature.
+
+These interfering locks can also happen if a manually requested lock is, for
+some reason, not available in time for the write which intended to use it.
+The lock which results from this write request is expanded using the
+normal rules. So it's possible for that lock (depending on the position of
+other locks at the time) to be extended to cover the rest of the file. That
+will block future lockahead locks.
+
+The expanded lock will be revoked when a write happens (from another client)
+in the range covered by that lock, but the lock for that write will be expanded
+as well, and then we return to handing the lock back and forth between
+clients. These expanded locks will still block future lockahead locks,
+rendering them useless.
+
+The way to avoid this is to turn off lock expansion for I/Os which are
+supposed to be using these manually requested locks. That way, if the
+manually requested lock is not available, the lock request for the I/O will not
+be expanded. Instead, that request (which is blocking, unlike a lockahead
+request) will cancel any interfering locks, but the resulting lock will not be
+expanded. This leaves the later parts of the file open, allowing future
+manual lock requests to succeed. This means that if an interfering lock blocks
+some manual requests, those are lost, but the next set of manual requests can
+proceed as normal.
+
+In effect, the 'locking ahead of I/O' is interrupted, but then is able to
+re-assert itself. The feature used here is referred to as 'no expansion'
+locking (as only the extent required by the actual I/O operation is locked)
+and is turned on with another new ladvise advice, LU_LADVISE_NOEXPAND. This
+feature is added as part of the lockahead patch. The strided writing library
+will use this advice on the file descriptor it uses for writing.
+
+4. Client side design
+4. A. Ladvise lockahead
+Lockahead uses the existing asynchronous lock request functionality
+implemented for asynchronous glimpse locks (AGLs), a long standing Lustre
+feature. AGLs are locks requested by statahead, used to get file size
+information before it is needed. The key thing about an asynchronous lock
+request is that it does not have a specific I/O operation waiting for the
+lock.
+
+This means two key things:
+
+1. There is no OSC lock (lock layer above LDLM for data locking) associated
+with the LDLM lock
+2. There is no thread waiting for the LDLM lock, so lock grant processing
+must be handled by the ptlrpc daemon thread which received the reply
+
+Since both of these issues are addressed by the asynchronous lock request code
+which lockahead shares with AGL, we will not explore them in depth here.
+
+Finally, lockahead requests set the CEF_LOCK_NO_EXPAND flag, which tells the
+OSC (the per OST layer of the client) to set LDLM_FL_NO_EXPANSION on any lock
+requests. LDLM_FL_NO_EXPANSION is a new LDLM lock flag which tells the server
+not to expand the lock extent.
+
+This leaves the user facing interface. Lockahead is implemented as a new
+ladvise advice, and it uses the ladvise feature of multiple advices in one API
+call to put many lock requests into an array of advices.
+
+The arguments required for this advice are a mode (read or write), range (start
+and end), and flags.
+
+The client will then make lock requests on these extents, one at a time.
+Because the lock requests are asynchronous (replies are handled by ptlrpcd),
+many requests can be made quickly by overlapping them, rather than waiting for
+each one to complete. (This requires that they be non-blocking, as the
+ptlrpcd threads must not wait in the ldlm layer.)
+
+4. B. LU_LADVISE_LOCKNOEXPAND
+The lock no expand ladvise advice sets a boolean in a Lustre data structure
+associated with a file descriptor. When an I/O is done to this file
+descriptor, the flag is picked up and passed through to the ldlm layer, where
+it sets LDLM_FL_NO_EXPANSION on lock requests made for that I/O.
+
+5. Server side changes
+Implementing lockahead requires server support for LDLM_FL_NO_EXPANSION, but
+it also requires an additional pair of server side changes to fix issues which
+came up because of lockahead. These changes are not part of the core design;
+instead, they are separate fixes which are required for it to work.
+
+5. A. Support LDLM_FL_NO_EXPANSION
+
+Disabling server side lock expansion is done with a new LDLM flag. This is
+done with a simple check for that flag on the server before attempting to
+expand the lock. If the flag is found, lock expansion is skipped.
+
+5. B. Implement LDLM_FL_SPECULATIVE
+
+As described above, lockahead locks are non-blocking. The BLOCK_NOWAIT LDLM
+flag currently implements some non-blocking behavior, but it only considers
+group locks blocking. However, for asynchronous lock requests to work correctly,
+they cannot wait for any other locks. For this purpose, we add
+LDLM_FL_SPECULATIVE. This new flag is used for asynchronous lock requests,
+and implements the broader non-blocking behavior they require.
+
+5. C. File size & ofd_intent_policy changes
+
+Knowing the current file size during writes is tricky on a distributed file
+system, because multiple clients can be writing to a file at any time. When
+writes are in progress, the server must identify which client is currently
+responsible for growing the file size, and ask that client what the file size
+is.
+
+To do this, the server uses glimpse locking (in ofd_intent_policy) to get the
+current file size from the clients. This code uses the assumption that the
+holder of the highest write lock (PW lock) knows the current file size. A
+client learns the (then current) file size when a lock is granted. Because
+only the holder of the highest lock can grow a file, either the size hasn't
+changed, or that client knows the new size; so the server only has to contact
+the client which holds this lock, and it knows the current file size.
+
+Note that the above is actually racy. When the server asks, the client can
+still be writing, or another client could acquire a higher lock during this
+time. The goal is a good approximation while the file is being written, and a
+correct answer once all the clients are done writing. This is achieved because
+once writes to a file are complete, the holder of that highest lock is
+guaranteed to know the current file size. This is where manually requested
+locks cause trouble.
+
+By creating write locks in advance of an actual I/O, lockahead breaks the
+assumption that the holder of the highest lock knows the file size.
+
+This assumption is normally true because locks which are created as part of
+IO - rather than in advance of it - are guaranteed to be 'active', IE,
+involved in IO, and the holder of the highest 'active' lock always knows the
+current file size, because the size is either not changing or the holder of
+that lock is responsible for updating it.
+
+Consider: Two clients, A and B, strided writing. Each client requests, for
+example, 2 manually requested locks. (Real numbers are much higher.) Client A
+holds locks on segments 0 and 2, client B holds locks on segments 1 and 3.
+
+The request comes to write 3 segments of data. Client A writes to segment 0,
+client B writes to segment 1, and client A also writes to segment 2. No data
+is written to segment 3. At this point, the server checks the file size by
+glimpsing the highest lock: the lock on segment 3. Client B does not know
+about the writing done by client A to segment 2, so it gives an incorrect file
+size.
+
+This would be OK if client B had pending writes to segment 3, but it does not.
+In this situation, the server will never get the correct file size while this
+lock exists.
+
+The solution is relatively straightforward: The server needs to glimpse every
+client holding a write lock (starting from the top) until we find one holding
+an 'active' lock (because the size is known to be at least the size returned
+from an 'active' lock), and take the largest size returned. This avoids asking
+only a client which may not know the correct file size.
+
+Unfortunately, there is no way to know if a manually requested lock is active
+from the server side. So when we see such a lock, we must send a glimpse to
+the holder (unless we have already sent a glimpse to that client*). However,
+because locks without LDLM_FL_NO_EXPANSION set are guaranteed to be 'active',
+once we reach the first such lock, we can stop glimpsing.
+
+*This is because when we glimpse a specific lock, the client holding it returns
+its best idea of the size information, so we only need to send one glimpse to
+each client.
+
+This is less efficient than the standard "glimpse only the top lock"
+methodology, but since we only need to glimpse one lock per client (and the
+number of clients writing to the part of a file on a given OST is fairly
+limited), the cost is restrained.
+
+Additionally, lock cancellation methods such as early lock cancel aggressively
+clean up older locks, particularly when the LRU limit is exceeded, so the
+total lock count should also remain manageable.
+
+In the end, the verdict here is performance. Lockahead testing for the
+strided I/O case has shown good performance results.
static int hf_lustre_ldlm_fl_block_granted = -1;
static int hf_lustre_ldlm_fl_block_conv = -1;
static int hf_lustre_ldlm_fl_block_wait = -1;
+static int hf_lustre_ldlm_fl_speculative = -1;
static int hf_lustre_ldlm_fl_ast_sent = -1;
static int hf_lustre_ldlm_fl_replay = -1;
static int hf_lustre_ldlm_fl_intent_only = -1;
static int hf_lustre_ldlm_fl_test_lock = -1;
static int hf_lustre_ldlm_fl_cancel_on_block = -1;
static int hf_lustre_ldlm_fl_cos_incompat = -1;
+static int hf_lustre_ldlm_fl_no_expansion = -1;
static int hf_lustre_ldlm_fl_deny_on_contention = -1;
static int hf_lustre_ldlm_fl_ast_discard_data = -1;
{LDLM_FL_BLOCK_GRANTED, "LDLM_FL_BLOCK_GRANTED"},
{LDLM_FL_BLOCK_CONV, "LDLM_FL_BLOCK_CONV"},
{LDLM_FL_BLOCK_WAIT, "LDLM_FL_BLOCK_WAIT"},
+ {LDLM_FL_SPECULATIVE, "LDLM_FL_SPECULATIVE"},
{LDLM_FL_AST_SENT, "LDLM_FL_AST_SENT"},
{LDLM_FL_REPLAY, "LDLM_FL_REPLAY"},
{LDLM_FL_INTENT_ONLY, "LDLM_FL_INTENT_ONLY"},
{LDLM_FL_TEST_LOCK, "LDLM_FL_TEST_LOCK"},
{LDLM_FL_CANCEL_ON_BLOCK, "LDLM_FL_CANCEL_ON_BLOCK"},
{LDLM_FL_COS_INCOMPAT, "LDLM_FL_COS_INCOMPAT"},
+ {LDLM_FL_NO_EXPANSION, "LDLM_FL_NO_EXPANSION"},
{LDLM_FL_DENY_ON_CONTENTION, "LDLM_FL_DENY_ON_CONTENTION"},
{LDLM_FL_AST_DISCARD_DATA, "LDLM_FL_AST_DISCARD_DATA"},
{ 0, NULL }
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_granted);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_conv);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_wait);
+ dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_speculative);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_ast_sent);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_replay);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_intent_only);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_test_lock);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_cancel_on_block);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_cos_incompat);
+ dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_no_expansion);
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_deny_on_contention);
return
dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_ast_discard_data);
}
},
{
+ /* p_id */ &hf_lustre_ldlm_fl_speculative,
+ /* hfinfo */ {
+ /* name */ "LDLM_FL_SPECULATIVE",
+ /* abbrev */ "lustre.ldlm_fl_speculative",
+ /* type */ FT_BOOLEAN,
+ /* display */ 32,
+ /* strings */ TFS(&lnet_flags_set_truth),
+ /* bitmask */ LDLM_FL_SPECULATIVE,
+	/* blurb */ "Lock request is speculative/asynchronous, and cannot\n"
+	"wait for any reason. Fail the lock request if any blocking locks\n"
+	"are encountered.",
+	/* id */ HFILL
+ }
+ },
+ {
/* p_id */ &hf_lustre_ldlm_fl_ast_sent,
/* hfinfo */ {
/* name */ "LDLM_FL_AST_SENT",
}
},
{
+ /* p_id */ &hf_lustre_ldlm_fl_no_expansion,
+ /* hfinfo */ {
+ /* name */ "LDLM_FL_NO_EXPANSION",
+	/* abbrev */ "lustre.ldlm_fl_no_expansion",
+ /* type */ FT_BOOLEAN,
+ /* display */ 32,
+ /* strings */ TFS(&lnet_flags_set_truth),
+ /* bitmask */ LDLM_FL_NO_EXPANSION,
+	/* blurb */ "Do not expand this lock. Grant it only on the extent\n"
+	"requested. Used for manually requested locks from the client\n"
+	"(LU_LADVISE_LOCKAHEAD).",
+	/* id */ HFILL
+ }
+ },
+ {
/* p_id */ &hf_lustre_ldlm_fl_deny_on_contention,
/* hfinfo */ {
/* name */ "LDLM_FL_DENY_ON_CONTENTION",
.B lfs ladvise [--advice|-a ADVICE ] [--background|-b]
\fB[--start|-s START[kMGT]]
\fB{[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
+ \fB{[--mode|-m MODE] | [--unset|-u]}
\fB<FILE> ...\fR
.br
.SH DESCRIPTION
\fBwillread\fR to prefetch data into server cache
.TP
\fBdontneed\fR to cleanup data cache on server
+.TP
+\fBlockahead\fR to request a lock on a specified extent of a file
+.TP
+\fBlocknoexpand\fR to disable server side lock expansion for a file
.RE
.TP
\fB\-b\fR, \fB\-\-background
\fB\-l\fR, \fB\-\-length\fR=\fILENGTH\fR
File range has length of \fILENGTH\fR. This option may not be specified at the
same time as the -e option.
+.TP
+\fB\-m\fR, \fB\-\-mode\fR=\fIMODE\fR
+Specify the lock \fIMODE\fR. This option is only valid with lockahead
+advice. Valid modes are: READ, WRITE
+.TP
+\fB\-u\fR, \fB\-\-unset\fR
+Unset the previous advice. Currently only valid with locknoexpand advice.
.SH NOTE
.PP
Typically,
This gives the OST(s) holding the first 1GB of \fB/mnt/lustre/file1\fR a hint
that the first 1GB of file will not be read in the near future, thus the OST(s)
could clear the cache of that file in the memory.
+.B $ lfs ladvise -a lockahead -s 0 -e 1048576 -m READ /mnt/lustre/file1
+Request a read lock on the first 1 MiB of /mnt/lustre/file1.
+.B $ lfs ladvise -a lockahead -s 0 -e 4096 -m WRITE ./file1
+Request a write lock on the first 4 KiB of ./file1.
+.B $ lfs ladvise -a locknoexpand ./file1
+Disable server side lock expansion for I/O to ./file1.
+.B $ lfs ladvise -a locknoexpand -u ./file1
+Re-enable server side lock expansion for I/O to ./file1.
.SH AVAILABILITY
The lfs ladvise command is part of the Lustre filesystem.
.SH SEE ALSO
* -EWOULDBLOCK is returned immediately.
*/
CEF_NONBLOCK = 0x00000001,
- /**
- * take lock asynchronously (out of order), as it cannot
- * deadlock. This is for LDLM_FL_HAS_INTENT locks used for glimpsing.
- */
- CEF_ASYNC = 0x00000002,
+ /**
+ * Tell lower layers this is a glimpse request, translated to
+ * LDLM_FL_HAS_INTENT at LDLM layer.
+ *
+ * Also, because glimpse locks never block other locks, we count this
+ * as automatically compatible with other osc locks.
+ * (see osc_lock_compatible)
+ */
+ CEF_GLIMPSE = 0x00000002,
/**
* tell the server to instruct (though a flag in the blocking ast) an
* owner of the conflicting lock, that it can drop dirty pages
* protected by this lock, without sending them to the server.
*/
CEF_DISCARD_DATA = 0x00000004,
- /**
- * tell the sub layers that it must be a `real' lock. This is used for
- * mmapped-buffer locks and glimpse locks that must be never converted
- * into lockless mode.
- *
- * \see vvp_mmap_locks(), cl_glimpse_lock().
- */
- CEF_MUST = 0x00000008,
+ /**
+ * tell the sub layers that it must be a `real' lock. This is used for
+ * mmapped-buffer locks, glimpse locks, manually requested locks
+ * (LU_LADVISE_LOCKAHEAD) that must never be converted into lockless
+ * mode.
+ *
+	 * \see vvp_mmap_locks(), cl_glimpse_lock(), cl_request_lock().
+ */
+ CEF_MUST = 0x00000008,
/**
* tell the sub layers that never request a `real' lock. This flag is
* not used currently.
*/
CEF_NEVER = 0x00000010,
/**
- * for async glimpse lock.
+	 * tell the dlm layer this is a speculative lock request.
+	 * Speculative lock requests are locks which are not requested as part
+ * of an I/O operation. Instead, they are requested because we expect
+ * to use them in the future. They are requested asynchronously at the
+ * ptlrpc layer.
+ *
+ * Currently used for asynchronous glimpse locks and manually requested
+ * locks (LU_LADVISE_LOCKAHEAD).
*/
- CEF_AGL = 0x00000020,
+ CEF_SPECULATIVE = 0x00000020,
/**
* enqueue a lock to test DLM lock existence.
*/
*/
CEF_LOCK_MATCH = 0x00000080,
/**
+ * tell the DLM layer to lock only the requested range
+ */
+ CEF_LOCK_NO_EXPAND = 0x00000100,
+ /**
* mask of enq_flags.
*/
- CEF_MASK = 0x000000ff,
+ CEF_MASK = 0x000001ff,
};
/**
*/
ci_noatime:1,
/** Set to 1 if parallel execution is allowed for current I/O? */
- ci_pio:1;
+ ci_pio:1,
+ /* Tell sublayers not to expand LDLM locks requested for this IO */
+ ci_lock_no_expand:1;
/**
* Number of pages owned by this IO. For invariant checking.
*/
struct ldlm_lock *ca_lock;
};
-/** The ldlm_glimpse_work is allocated on the stack and should not be freed. */
-#define LDLM_GL_WORK_NOFREE 0x1
+/** The ldlm_glimpse_work was slab allocated & must be freed accordingly.*/
+#define LDLM_GL_WORK_SLAB_ALLOCATED 0x1
/** Interval node data for each LDLM_EXTENT lock. */
struct ldlm_interval {
#define ldlm_set_block_wait(_l) LDLM_SET_FLAG(( _l), 1ULL << 3)
#define ldlm_clear_block_wait(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 3)
+/**
+ * Lock request is speculative/asynchronous, and cannot wait for any reason.
+ * Fail the lock request if any blocking locks are encountered.
+ * */
+#define LDLM_FL_SPECULATIVE 0x0000000000000010ULL /* bit 4 */
+#define ldlm_is_speculative(_l) LDLM_TEST_FLAG((_l), 1ULL << 4)
+#define ldlm_set_speculative(_l) LDLM_SET_FLAG((_l), 1ULL << 4)
+#define ldlm_clear_speculative(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 4)
+
/** blocking or cancel packet was queued for sending. */
#define LDLM_FL_AST_SENT 0x0000000000000020ULL // bit 5
#define ldlm_is_ast_sent(_l) LDLM_TEST_FLAG(( _l), 1ULL << 5)
#define ldlm_clear_cos_incompat(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 24)
/**
+ * Part of original lockahead implementation, OBD_CONNECT_LOCKAHEAD_OLD.
+ * Reserved temporarily to allow those implementations to keep working.
+ * Will be removed after 2.12 release.
+ * */
+#define LDLM_FL_LOCKAHEAD_OLD_RESERVED 0x0000000010000000ULL /* bit 28 */
+#define ldlm_is_do_not_expand_io(_l) LDLM_TEST_FLAG((_l), 1ULL << 28)
+#define ldlm_set_do_not_expand_io(_l) LDLM_SET_FLAG((_l), 1ULL << 28)
+#define ldlm_clear_do_not_expand_io(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 28)
+
+/**
+ * Do not expand this lock. Grant it only on the extent requested.
+ * Used for manually requested locks from the client (LU_LADVISE_LOCKAHEAD).
+ * */
+#define LDLM_FL_NO_EXPANSION 0x0000000020000000ULL /* bit 29 */
+#define ldlm_is_do_not_expand(_l) LDLM_TEST_FLAG((_l), 1ULL << 29)
+#define ldlm_set_do_not_expand(_l) LDLM_SET_FLAG((_l), 1ULL << 29)
+#define ldlm_clear_do_not_expand(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 29)
+
+/**
* measure lock contention and return -EUSERS if locking contention is high */
#define LDLM_FL_DENY_ON_CONTENTION 0x0000000040000000ULL // bit 30
#define ldlm_is_deny_on_contention(_l) LDLM_TEST_FLAG(( _l), 1ULL << 30)
#define LDLM_FL_GONE_MASK (LDLM_FL_DESTROYED |\
LDLM_FL_FAILED)
-/** l_flags bits marked as "inherit" bits */
-/* Flags inherited from wire on enqueue/reply between client/server. */
-/* NO_TIMEOUT flag to force ldlm_lock_match() to wait with no timeout. */
-/* TEST_LOCK flag to not let TEST lock to be granted. */
+/** l_flags bits marked as "inherit" bits
+ * Flags inherited from wire on enqueue/reply between client/server.
+ * CANCEL_ON_BLOCK so server will not grant if a blocking lock is found
+ * NO_TIMEOUT flag to force ldlm_lock_match() to wait with no timeout.
+ * TEST_LOCK flag to not let TEST lock to be granted.
+ * NO_EXPANSION to tell server not to expand extent of lock request */
#define LDLM_FL_INHERIT_MASK (LDLM_FL_CANCEL_ON_BLOCK |\
LDLM_FL_NO_TIMEOUT |\
- LDLM_FL_TEST_LOCK)
+ LDLM_FL_TEST_LOCK |\
+ LDLM_FL_NO_EXPANSION)
/** flags returned in @flags parameter on ldlm_lock_enqueue,
* to be re-constructed on re-send */
return *exp_connect_flags_ptr(exp);
}
+static inline __u64 *exp_connect_flags2_ptr(struct obd_export *exp)
+{
+ return &exp->exp_connect_data.ocd_connect_flags2;
+}
+
+static inline __u64 exp_connect_flags2(struct obd_export *exp)
+{
+ return *exp_connect_flags2_ptr(exp);
+}
+
static inline int exp_max_brw_size(struct obd_export *exp)
{
LASSERT(exp != NULL);
return !!(exp_connect_flags(exp) & OBD_CONNECT_LARGE_ACL);
}
+static inline int exp_connect_lockahead_old(struct obd_export *exp)
+{
+ return !!(exp_connect_flags(exp) & OBD_CONNECT_LOCKAHEAD_OLD);
+}
+
+static inline int exp_connect_lockahead(struct obd_export *exp)
+{
+ return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LOCKAHEAD);
+}
+
extern struct obd_export *class_conn2export(struct lustre_handle *conn);
extern struct obd_device *class_conn2obd(struct lustre_handle *conn);
/**
* For async glimpse lock.
*/
- ols_agl:1;
+ ols_agl:1,
+ /**
+ * for speculative locks - asynchronous glimpse locks and ladvise
+ * lockahead manual lock requests
+ *
+ * Used to tell osc layer to not wait for the ldlm reply from the
+ * server, so the osc lock will be short lived - It only exists to
+ * create the ldlm request and is not updated on request completion.
+ */
+ ols_speculative:1;
};
#define OBD_FAIL_OST_LADVISE_PAUSE 0x237
#define OBD_FAIL_OST_FAKE_RW 0x238
#define OBD_FAIL_OST_LIST_ASSERT 0x239
+#define OBD_FAIL_OST_GL_WORK_ALLOC 0x240
#define OBD_FAIL_LDLM 0x300
#define OBD_FAIL_LDLM_NAMESPACE_NEW 0x301
RPCs in parallel */
#define OBD_CONNECT_DIR_STRIPE 0x400000000000000ULL /* striped DNE dir */
#define OBD_CONNECT_SUBTREE 0x800000000000000ULL /* fileset mount */
-#define OBD_CONNECT_LOCK_AHEAD 0x1000000000000000ULL /* lock ahead */
+#define OBD_CONNECT_LOCKAHEAD_OLD 0x1000000000000000ULL /* Old Cray lockahead */
+
/** bulk matchbits is sent within ptlrpc_body */
#define OBD_CONNECT_BULK_MBITS 0x2000000000000000ULL
#define OBD_CONNECT_OBDOPACK 0x4000000000000000ULL /* compact OUT obdo */
#define OBD_CONNECT_FLAGS2 0x8000000000000000ULL /* second flags word */
/* ocd_connect_flags2 flags */
#define OBD_CONNECT2_FILE_SECCTX 0x1ULL /* set file security context at create */
+#define OBD_CONNECT2_LOCKAHEAD 0x2ULL /* ladvise lockahead v2 */
/* XXX README XXX:
* Please DO NOT add flag values here before first ensuring that this same
OBD_CONNECT_LAYOUTLOCK | OBD_CONNECT_FID | \
OBD_CONNECT_PINGLESS | OBD_CONNECT_LFSCK | \
OBD_CONNECT_BULK_MBITS | \
- OBD_CONNECT_GRANT_PARAM)
-#define OST_CONNECT_SUPPORTED2 0
+ OBD_CONNECT_GRANT_PARAM | OBD_CONNECT_FLAGS2)
+
+#define OST_CONNECT_SUPPORTED2 OBD_CONNECT2_LOCKAHEAD
#define ECHO_CONNECT_SUPPORTED 0
#define ECHO_CONNECT_SUPPORTED2 0
__u64 gid;
};
+static inline bool ldlm_extent_equal(const struct ldlm_extent *ex1,
+ const struct ldlm_extent *ex2)
+{
+ return ex1->start == ex2->start && ex1->end == ex2->end;
+}
+
struct ldlm_inodebits {
__u64 bits;
__u64 try_bits; /* optional bits to try */
LU_LADVISE_INVALID = 0,
LU_LADVISE_WILLREAD = 1,
LU_LADVISE_DONTNEED = 2,
+ LU_LADVISE_LOCKNOEXPAND = 3,
+ LU_LADVISE_LOCKAHEAD = 4,
+ LU_LADVISE_MAX
};
#define LU_LADVISE_NAMES { \
- [LU_LADVISE_WILLREAD] = "willread", \
- [LU_LADVISE_DONTNEED] = "dontneed", \
+ [LU_LADVISE_WILLREAD] = "willread", \
+ [LU_LADVISE_DONTNEED] = "dontneed", \
+ [LU_LADVISE_LOCKNOEXPAND] = "locknoexpand", \
+ [LU_LADVISE_LOCKAHEAD] = "lockahead", \
}
/* This is the userspace argument for ladvise. It is currently the same as
enum ladvise_flag {
LF_ASYNC = 0x00000001,
+ LF_UNSET = 0x00000002,
};
#define LADVISE_MAGIC 0x1ADF1CE0
-#define LF_MASK LF_ASYNC
+/* Masks of valid flags for each advice */
+#define LF_LOCKNOEXPAND_MASK LF_UNSET
+/* Flags valid for all advices not explicitly specified */
+#define LF_DEFAULT_MASK LF_ASYNC
+/* All flags */
+#define LF_MASK (LF_ASYNC | LF_UNSET)
+
+#define lla_lockahead_mode lla_value1
+#define lla_peradvice_flags lla_value2
+#define lla_lockahead_result lla_value3
/* This is the userspace argument for ladvise, corresponds to ladvise_hdr which
* is used on the wire. It is defined separately as we may need info which is
size_t sht_bytes;
};
+enum lock_mode_user {
+ MODE_READ_USER = 1,
+ MODE_WRITE_USER,
+ MODE_MAX_USER,
+};
+
+#define LOCK_MODE_NAMES { \
+ [MODE_READ_USER] = "READ",\
+ [MODE_WRITE_USER] = "WRITE"\
+}
+
+enum lockahead_results {
+ LLA_RESULT_SENT = 0,
+ LLA_RESULT_DIFFERENT,
+ LLA_RESULT_SAME,
+};
+
/** @} lustreuser */
+
#endif /* _LUSTRE_USER_H */
static void ldlm_extent_policy(struct ldlm_resource *res,
struct ldlm_lock *lock, __u64 *flags)
{
- struct ldlm_extent new_ex = { .start = 0, .end = OBD_OBJECT_EOF };
-
- if (lock->l_export == NULL)
- /*
- * this is local lock taken by server (e.g., as a part of
- * OST-side locking, or unlink handling). Expansion doesn't
- * make a lot of sense for local locks, because they are
- * dropped immediately on operation completion and would only
- * conflict with other threads.
- */
- return;
+ struct ldlm_extent new_ex = { .start = 0, .end = OBD_OBJECT_EOF };
+
+ if (lock->l_export == NULL)
+ /*
+	 * this is a local lock taken by the server (e.g., as part of
+ * OST-side locking, or unlink handling). Expansion doesn't
+ * make a lot of sense for local locks, because they are
+ * dropped immediately on operation completion and would only
+ * conflict with other threads.
+ */
+ return;
- if (lock->l_policy_data.l_extent.start == 0 &&
- lock->l_policy_data.l_extent.end == OBD_OBJECT_EOF)
- /* fast-path whole file locks */
- return;
+ if (lock->l_policy_data.l_extent.start == 0 &&
+ lock->l_policy_data.l_extent.end == OBD_OBJECT_EOF)
+ /* fast-path whole file locks */
+ return;
- ldlm_extent_internal_policy_granted(lock, &new_ex);
- ldlm_extent_internal_policy_waiting(lock, &new_ex);
+ /* Because reprocess_queue zeroes flags and uses it to return
+ * LDLM_FL_LOCK_CHANGED, we must check for the NO_EXPANSION flag
+ * in the lock flags rather than the 'flags' argument */
+ if (likely(!(lock->l_flags & LDLM_FL_NO_EXPANSION))) {
+ ldlm_extent_internal_policy_granted(lock, &new_ex);
+ ldlm_extent_internal_policy_waiting(lock, &new_ex);
+ } else {
+ LDLM_DEBUG(lock, "Not expanding manually requested lock.\n");
+ new_ex.start = lock->l_policy_data.l_extent.start;
+ new_ex.end = lock->l_policy_data.l_extent.end;
+ /* In case the request is not on correct boundaries, we call
+ * fixup. (normally called in ldlm_extent_internal_policy_*) */
+ ldlm_extent_internal_policy_fixup(lock, &new_ex, 0);
+ }
- if (new_ex.start != lock->l_policy_data.l_extent.start ||
- new_ex.end != lock->l_policy_data.l_extent.end) {
- *flags |= LDLM_FL_LOCK_CHANGED;
- lock->l_policy_data.l_extent.start = new_ex.start;
- lock->l_policy_data.l_extent.end = new_ex.end;
- }
+ if (!ldlm_extent_equal(&new_ex, &lock->l_policy_data.l_extent)) {
+ *flags |= LDLM_FL_LOCK_CHANGED;
+ lock->l_policy_data.l_extent.start = new_ex.start;
+ lock->l_policy_data.l_extent.end = new_ex.end;
+ }
}
static int ldlm_check_contention(struct ldlm_lock *lock, int contended_locks)
}
if (tree->lit_mode == LCK_GROUP) {
- if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+ if (*flags & (LDLM_FL_BLOCK_NOWAIT |
+ LDLM_FL_SPECULATIVE)) {
compat = -EWOULDBLOCK;
goto destroylock;
}
continue;
}
- if (!work_list) {
- rc = interval_is_overlapped(tree->lit_root,&ex);
- if (rc)
- RETURN(0);
+ /* We've found a potentially blocking lock, check
+ * compatibility. This handles locks other than GROUP
+ * locks, which are handled separately above.
+ *
+ * Locks with FL_SPECULATIVE are asynchronous requests
+ * which must never wait behind another lock, so they
+ * fail if any conflicting lock is found. */
+ if (!work_list || (*flags & LDLM_FL_SPECULATIVE)) {
+ rc = interval_is_overlapped(tree->lit_root,
+ &ex);
+ if (rc) {
+ if (!work_list) {
+ RETURN(0);
+ } else {
+ compat = -EWOULDBLOCK;
+ goto destroylock;
+ }
+ }
} else {
interval_search(tree->lit_root, &ex,
ldlm_extent_compat_cb, &data);
* already blocked.
* If we are in nonblocking mode - return
* immediately */
- if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+ if (*flags & (LDLM_FL_BLOCK_NOWAIT
+ | LDLM_FL_SPECULATIVE)) {
compat = -EWOULDBLOCK;
goto destroylock;
}
}
if (unlikely(lock->l_req_mode == LCK_GROUP)) {
- /* If compared lock is GROUP, then requested is PR/PW/
- * so this is not compatible; extent range does not
- * matter */
- if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+ /* If compared lock is GROUP, then requested is
+ * PR/PW so this is not compatible; extent
+ * range does not matter */
+ if (*flags & (LDLM_FL_BLOCK_NOWAIT
+ | LDLM_FL_SPECULATIVE)) {
compat = -EWOULDBLOCK;
goto destroylock;
} else {
if (!work_list)
RETURN(0);
+ if (*flags & LDLM_FL_SPECULATIVE) {
+ compat = -EWOULDBLOCK;
+ goto destroylock;
+ }
+
/* don't count conflicting glimpse locks */
if (lock->l_req_mode == LCK_PR &&
lock->l_policy_data.l_extent.start == 0 &&
*err = ELDLM_OK;
if (intention == LDLM_PROCESS_RESCAN) {
- /* Careful observers will note that we don't handle -EWOULDBLOCK
- * here, but it's ok for a non-obvious reason -- compat_queue
- * can only return -EWOULDBLOCK if (flags & BLOCK_NOWAIT).
- * flags should always be zero here, and if that ever stops
- * being true, we want to find out. */
+ /* Careful observers will note that we don't handle -EWOULDBLOCK
+ * here, but it's ok for a non-obvious reason -- compat_queue
+ * can only return -EWOULDBLOCK if (flags & BLOCK_NOWAIT |
+ * SPECULATIVE). flags should always be zero here, and if that
+ * ever stops being true, we want to find out. */
LASSERT(*flags == 0);
rc = ldlm_extent_compat_queue(&res->lr_granted, lock, flags,
err, NULL, &contended_locks);
ocd->ocd_connect_flags, "old %#llx, new %#llx\n",
data->ocd_connect_flags, ocd->ocd_connect_flags);
data->ocd_connect_flags = ocd->ocd_connect_flags;
+ data->ocd_connect_flags2 = ocd->ocd_connect_flags2;
}
ptlrpc_pinger_add_import(imp);
#include "ldlm_internal.h"
+struct kmem_cache *ldlm_glimpse_work_kmem;
+EXPORT_SYMBOL(ldlm_glimpse_work_kmem);
+
/* lock types */
char *ldlm_lockname[] = {
[0] = "--",
rc = 1;
LDLM_LOCK_RELEASE(lock);
-
- if ((gl_work->gl_flags & LDLM_GL_WORK_NOFREE) == 0)
+ if (gl_work->gl_flags & LDLM_GL_WORK_SLAB_ALLOCATED)
+ OBD_SLAB_FREE_PTR(gl_work, ldlm_glimpse_work_kmem);
+ else
OBD_FREE_PTR(gl_work);
RETURN(rc);
static void ll_io_init(struct cl_io *io, struct file *file, enum cl_io_type iot)
{
struct inode *inode = file_inode(file);
+ struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
memset(&io->u.ci_rw.rw_iter, 0, sizeof(io->u.ci_rw.rw_iter));
init_sync_kiocb(&io->u.ci_rw.rw_iocb, file);
io->u.ci_rw.rw_file = file;
io->u.ci_rw.rw_ptask = ll_file_io_ptask;
io->u.ci_rw.rw_nonblock = !!(file->f_flags & O_NONBLOCK);
+ io->ci_lock_no_expand = fd->ll_lock_no_expand;
+
if (iot == CIT_WRITE) {
io->u.ci_rw.rw_append = !!(file->f_flags & O_APPEND);
io->u.ci_rw.rw_sync = !!(file->f_flags & O_SYNC ||
RETURN(rc);
}
+static enum cl_lock_mode cl_mode_user_to_kernel(enum lock_mode_user mode)
+{
+ switch (mode) {
+ case MODE_READ_USER:
+ return CLM_READ;
+ case MODE_WRITE_USER:
+ return CLM_WRITE;
+ default:
+ return -EINVAL;
+ }
+}
+
+static const char *const user_lockname[] = LOCK_MODE_NAMES;
+
+/* Used to allow the upper layers of the client to request an LDLM lock
+ * without doing an actual read or write.
+ *
+ * Used for ladvise lockahead to manually request specific locks.
+ *
+ * \param[in] file file this ladvise lock request is on
+ * \param[in] ladvise ladvise struct describing this lock request
+ *
+ * \retval 0 success, no detailed result available (sync requests
+ * and requests sent to the server [not handled locally]
+ * cannot return detailed results)
+ * \retval LLA_RESULT_{SAME,DIFFERENT} - detailed result of the lock request,
+ * see definitions for details.
+ * \retval negative negative errno on error
+ */
+int ll_file_lock_ahead(struct file *file, struct llapi_lu_ladvise *ladvise)
+{
+ struct lu_env *env = NULL;
+ struct cl_io *io = NULL;
+ struct cl_lock *lock = NULL;
+ struct cl_lock_descr *descr = NULL;
+ struct dentry *dentry = file->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+ enum cl_lock_mode cl_mode;
+ off_t start = ladvise->lla_start;
+ off_t end = ladvise->lla_end;
+ int result;
+ __u16 refcheck;
+
+ ENTRY;
+
+ CDEBUG(D_VFSTRACE, "Lock request: file=%.*s, inode=%p, mode=%s "
+ "start=%llu, end=%llu\n", dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode,
+ user_lockname[ladvise->lla_lockahead_mode], (__u64) start,
+ (__u64) end);
+
+ cl_mode = cl_mode_user_to_kernel(ladvise->lla_lockahead_mode);
+ if (cl_mode < 0)
+ GOTO(out, result = cl_mode);
+
+ /* Get IO environment */
+ result = cl_io_get(inode, &env, &io, &refcheck);
+ if (result <= 0)
+ GOTO(out, result);
+
+ result = cl_io_init(env, io, CIT_MISC, io->ci_obj);
+ if (result > 0) {
+ /*
+ * nothing to do for this io. This currently happens when
+		 * stripe sub-objects are not yet created.
+ */
+ result = io->ci_result;
+ } else if (result == 0) {
+ lock = vvp_env_lock(env);
+ descr = &lock->cll_descr;
+
+ descr->cld_obj = io->ci_obj;
+ /* Convert byte offsets to pages */
+ descr->cld_start = cl_index(io->ci_obj, start);
+ descr->cld_end = cl_index(io->ci_obj, end);
+ descr->cld_mode = cl_mode;
+ /* CEF_MUST is used because we do not want to convert a
+ * lockahead request to a lockless lock */
+ descr->cld_enq_flags = CEF_MUST | CEF_LOCK_NO_EXPAND |
+ CEF_NONBLOCK;
+
+ if (ladvise->lla_peradvice_flags & LF_ASYNC)
+ descr->cld_enq_flags |= CEF_SPECULATIVE;
+
+ result = cl_lock_request(env, io, lock);
+
+ /* On success, we need to release the lock */
+ if (result >= 0)
+ cl_lock_release(env, lock);
+ }
+ cl_io_fini(env, io);
+ cl_env_put(env, &refcheck);
+
+ /* -ECANCELED indicates a matching lock with a different extent
+ * was already present, and -EEXIST indicates a matching lock
+ * on exactly the same extent was already present.
+ * We convert them to positive values for userspace to make
+ * recognizing true errors easier.
+	 * Note we can only return these detailed results on async requests,
+	 * as sync requests look the same as I/O requests for locking. */
+ if (result == -ECANCELED)
+ result = LLA_RESULT_DIFFERENT;
+ else if (result == -EEXIST)
+ result = LLA_RESULT_SAME;
+
+out:
+ RETURN(result);
+}
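The errno conversion at the end of ll_file_lock_ahead() can be shown in isolation. This is a minimal sketch, not the kernel code: the enum values mirror lockahead_results from the uapi header, and the function name is illustrative.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors of the lockahead_results values defined in the uapi header */
enum { LLA_RESULT_SENT = 0, LLA_RESULT_DIFFERENT = 1, LLA_RESULT_SAME = 2 };

/* The two "benign conflict" errnos become positive detail codes so
 * userspace can tell them apart from true errors; anything else (zero
 * or a real negative errno) passes through unchanged. */
static int lockahead_result(int rc)
{
	if (rc == -ECANCELED)
		return LLA_RESULT_DIFFERENT;	/* matching lock, other extent */
	if (rc == -EEXIST)
		return LLA_RESULT_SAME;		/* matching lock, same extent */
	return rc;
}
```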
+static const char *const ladvise_names[] = LU_LADVISE_NAMES;
+
+static int ll_ladvise_sanity(struct inode *inode,
+ struct llapi_lu_ladvise *ladvise)
+{
+ enum lu_ladvise_type advice = ladvise->lla_advice;
+	/* Note that lla_peradvice_flags is a 32-bit field, so per-advice flags
+	 * must be in the first 32 bits of enum ladvise_flags */
+ __u32 flags = ladvise->lla_peradvice_flags;
+ int rc = 0;
+
+	if (advice >= LU_LADVISE_MAX || advice == LU_LADVISE_INVALID) {
+ rc = -EINVAL;
+		CDEBUG(D_VFSTRACE, "%s: advice with value '%d' not recognized, "
+		       "last supported advice is %s (value '%d'): rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0), advice,
+ ladvise_names[LU_LADVISE_MAX-1], LU_LADVISE_MAX-1, rc);
+ GOTO(out, rc);
+ }
+
+ /* Per-advice checks */
+ switch (advice) {
+ case LU_LADVISE_LOCKNOEXPAND:
+ if (flags & ~LF_LOCKNOEXPAND_MASK) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid flags (%x) for %s: "
+ "rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0), flags,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ break;
+ case LU_LADVISE_LOCKAHEAD:
+ /* Currently only READ and WRITE modes can be requested */
+ if (ladvise->lla_lockahead_mode >= MODE_MAX_USER ||
+ ladvise->lla_lockahead_mode == 0) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid mode (%d) for %s: "
+ "rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0),
+ ladvise->lla_lockahead_mode,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ case LU_LADVISE_WILLREAD:
+ case LU_LADVISE_DONTNEED:
+ default:
+		/* Note the fall-through above - these checks apply to all
+		 * advices except LOCKNOEXPAND */
+ if (flags & ~LF_DEFAULT_MASK) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid flags (%x) for %s: "
+ "rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0), flags,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ if (ladvise->lla_start >= ladvise->lla_end) {
+ rc = -EINVAL;
+ CDEBUG(D_VFSTRACE, "%s: Invalid range (%llu to %llu) "
+ "for %s: rc = %d\n",
+ ll_get_fsname(inode->i_sb, NULL, 0),
+ ladvise->lla_start, ladvise->lla_end,
+ ladvise_names[advice], rc);
+ GOTO(out, rc);
+ }
+ break;
+ }
+
+out:
+ return rc;
+}
+#undef ERRSIZE
+
/*
* Give file access advices
*
RETURN(rc);
}
+static int ll_lock_noexpand(struct file *file, int flags)
+{
+ struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+
+ fd->ll_lock_no_expand = !(flags & LF_UNSET);
+
+ return 0;
+}
+
int ll_ioctl_fsgetxattr(struct inode *inode, unsigned int cmd,
unsigned long arg)
{
RETURN(ll_file_futimes_3(file, &lfu));
}
case LL_IOC_LADVISE: {
- struct llapi_ladvise_hdr *ladvise_hdr;
+ struct llapi_ladvise_hdr *k_ladvise_hdr;
+ struct llapi_ladvise_hdr __user *u_ladvise_hdr;
int i;
int num_advise;
- int alloc_size = sizeof(*ladvise_hdr);
+ int alloc_size = sizeof(*k_ladvise_hdr);
rc = 0;
- OBD_ALLOC_PTR(ladvise_hdr);
- if (ladvise_hdr == NULL)
+ u_ladvise_hdr = (void __user *)arg;
+ OBD_ALLOC_PTR(k_ladvise_hdr);
+ if (k_ladvise_hdr == NULL)
RETURN(-ENOMEM);
- if (copy_from_user(ladvise_hdr,
- (const struct llapi_ladvise_hdr __user *)arg,
- alloc_size))
+ if (copy_from_user(k_ladvise_hdr, u_ladvise_hdr, alloc_size))
GOTO(out_ladvise, rc = -EFAULT);
- if (ladvise_hdr->lah_magic != LADVISE_MAGIC ||
- ladvise_hdr->lah_count < 1)
+ if (k_ladvise_hdr->lah_magic != LADVISE_MAGIC ||
+ k_ladvise_hdr->lah_count < 1)
GOTO(out_ladvise, rc = -EINVAL);
- num_advise = ladvise_hdr->lah_count;
+ num_advise = k_ladvise_hdr->lah_count;
if (num_advise >= LAH_COUNT_MAX)
GOTO(out_ladvise, rc = -EFBIG);
- OBD_FREE_PTR(ladvise_hdr);
- alloc_size = offsetof(typeof(*ladvise_hdr),
+ OBD_FREE_PTR(k_ladvise_hdr);
+ alloc_size = offsetof(typeof(*k_ladvise_hdr),
lah_advise[num_advise]);
- OBD_ALLOC(ladvise_hdr, alloc_size);
- if (ladvise_hdr == NULL)
+ OBD_ALLOC(k_ladvise_hdr, alloc_size);
+ if (k_ladvise_hdr == NULL)
RETURN(-ENOMEM);
/*
* TODO: submit multiple advices to one server in a single RPC
*/
- if (copy_from_user(ladvise_hdr,
- (const struct llapi_ladvise_hdr __user *)arg,
- alloc_size))
+ if (copy_from_user(k_ladvise_hdr, u_ladvise_hdr, alloc_size))
GOTO(out_ladvise, rc = -EFAULT);
for (i = 0; i < num_advise; i++) {
- rc = ll_ladvise(inode, file, ladvise_hdr->lah_flags,
- &ladvise_hdr->lah_advise[i]);
+ struct llapi_lu_ladvise *k_ladvise =
+ &k_ladvise_hdr->lah_advise[i];
+ struct llapi_lu_ladvise __user *u_ladvise =
+ &u_ladvise_hdr->lah_advise[i];
+
+ rc = ll_ladvise_sanity(inode, k_ladvise);
if (rc)
+ GOTO(out_ladvise, rc);
+
+ switch (k_ladvise->lla_advice) {
+ case LU_LADVISE_LOCKNOEXPAND:
+ rc = ll_lock_noexpand(file,
+ k_ladvise->lla_peradvice_flags);
+ GOTO(out_ladvise, rc);
+			case LU_LADVISE_LOCKAHEAD:
+ rc = ll_file_lock_ahead(file, k_ladvise);
+
+ if (rc < 0)
+ GOTO(out_ladvise, rc);
+
+ if (put_user(rc,
+ &u_ladvise->lla_lockahead_result))
+ GOTO(out_ladvise, rc = -EFAULT);
+ break;
+ default:
+ rc = ll_ladvise(inode, file,
+ k_ladvise_hdr->lah_flags,
+ k_ladvise);
+ if (rc)
+ GOTO(out_ladvise, rc);
break;
+ }
+
}
out_ladvise:
- OBD_FREE(ladvise_hdr, alloc_size);
+ OBD_FREE(k_ladvise_hdr, alloc_size);
RETURN(rc);
}
case LL_IOC_FSGETXATTR:
CDEBUG(D_DLMTRACE, "Glimpsing inode "DFID"\n", PFID(fid));
/* NOTE: this looks like DLM lock request, but it may
- * not be one. Due to CEF_ASYNC flag (translated
+ * not be one. Due to CEF_GLIMPSE flag (translated
* to LDLM_FL_HAS_INTENT by osc), this is
* glimpse request, that won't revoke any
* conflicting DLM locks held. Instead,
*descr = whole_file;
descr->cld_obj = clob;
descr->cld_mode = CLM_READ;
- descr->cld_enq_flags = CEF_ASYNC | CEF_MUST;
+ descr->cld_enq_flags = CEF_GLIMPSE | CEF_MUST;
if (agl)
- descr->cld_enq_flags |= CEF_AGL;
+ descr->cld_enq_flags |= CEF_SPECULATIVE | CEF_NONBLOCK;
/*
- * CEF_ASYNC is used because glimpse sub-locks cannot
- * deadlock (because they never conflict with other
- * locks) and, hence, can be enqueued out-of-order.
- *
* CEF_MUST protects glimpse lock from conversion into
* a lockless mode.
*/
RETURN(result);
}
-static int cl_io_get(struct inode *inode, struct lu_env **envout,
+/**
+ * Get an IO environment for special operations such as glimpse locks and
+ * manually requested locks (ladvise lockahead)
+ *
+ * \param[in] inode inode the operation is being performed on
+ * \param[out] envout thread specific execution environment
+ * \param[out] ioout client io description
+ * \param[out] refcheck reference check
+ *
+ * \retval 1 on success
+ * \retval 0 not a regular file, cannot get environment
+ * \retval negative negative errno on error
+ */
+int cl_io_get(struct inode *inode, struct lu_env **envout,
struct cl_io **ioout, __u16 *refcheck)
{
struct lu_env *env;
* true: failure is known, not report again.
* false: unknown failure, should report. */
bool fd_write_failed;
+ bool ll_lock_no_expand;
rwlock_t fd_lock; /* protect lcc list */
struct list_head fd_lccs; /* list of ll_cl_context */
};
return cl_glimpse_size0(inode, 0);
}
+/* AGL is 'asynchronous glimpse lock', which is a speculative lock taken as
+ * part of statahead */
static inline int cl_agl(struct inode *inode)
{
return cl_glimpse_size0(inode, 1);
}
+int ll_file_lock_ahead(struct file *file, struct llapi_lu_ladvise *ladvise);
+
+int cl_io_get(struct inode *inode, struct lu_env **envout,
+ struct cl_io **ioout, __u16 *refcheck);
+
static inline int ll_glimpse_size(struct inode *inode)
{
struct ll_inode_info *lli = ll_i2info(inode);
RETURN(-ENOMEM);
}
- /* indicate the features supported by this client */
+ /* indicate MDT features supported by this client */
data->ocd_connect_flags = OBD_CONNECT_IBITS | OBD_CONNECT_NODEVOH |
OBD_CONNECT_ATTRFID |
OBD_CONNECT_VERSION | OBD_CONNECT_BRW_SIZE |
* back its backend blocksize for grant calculation purpose */
data->ocd_grant_blkbits = PAGE_SHIFT;
+ /* indicate OST features supported by this client */
data->ocd_connect_flags = OBD_CONNECT_GRANT | OBD_CONNECT_VERSION |
OBD_CONNECT_REQPORTAL | OBD_CONNECT_BRW_SIZE |
OBD_CONNECT_CANCELSET | OBD_CONNECT_FID |
OBD_CONNECT_JOBSTATS | OBD_CONNECT_LVB_TYPE |
OBD_CONNECT_LAYOUTLOCK |
OBD_CONNECT_PINGLESS | OBD_CONNECT_LFSCK |
- OBD_CONNECT_BULK_MBITS;
+ OBD_CONNECT_BULK_MBITS |
+ OBD_CONNECT_FLAGS2;
- data->ocd_connect_flags2 = 0;
+/* The client currently advertises support for OBD_CONNECT_LOCKAHEAD_OLD so it
+ * can interoperate with an older version of lockahead which was released prior
+ * to landing in master. This support will be dropped when 2.13 development
+ * starts. At that point, we should not just drop the connect flag (below), we
+ * should also remove the support in the code.
+ *
+ * Removing it means a few things:
+ * 1. Remove this section here
+ * 2. Remove CEF_NONBLOCK in ll_file_lockahead()
+ * 3. Remove function exp_connect_lockahead_old
+ * 4. Remove LDLM_FL_LOCKAHEAD_OLD_RESERVED in lustre_dlm_flags.h
+ */
+#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 12, 50, 0)
+ data->ocd_connect_flags |= OBD_CONNECT_LOCKAHEAD_OLD;
+#endif
+
+ data->ocd_connect_flags2 = OBD_CONNECT2_LOCKAHEAD;
if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_GRANT_PARAM))
data->ocd_connect_flags |= OBD_CONNECT_GRANT_PARAM;
if (io->u.ci_rw.rw_nonblock)
ast_flags |= CEF_NONBLOCK;
+ if (io->ci_lock_no_expand)
+ ast_flags |= CEF_LOCK_NO_EXPAND;
result = vvp_mmap_locks(env, io);
if (result == 0)
sub_io->ci_no_srvlock = io->ci_no_srvlock;
sub_io->ci_noatime = io->ci_noatime;
sub_io->ci_pio = io->ci_pio;
+ sub_io->ci_lock_no_expand = io->ci_lock_no_expand;
result = cl_io_sub_init(sub->sub_env, sub_io, io->ci_type, sub_obj);
if (rc < 0)
RETURN(rc);
- if ((enq_flags & CEF_ASYNC) && !(enq_flags & CEF_AGL)) {
+ if ((enq_flags & CEF_GLIMPSE) && !(enq_flags & CEF_SPECULATIVE)) {
anchor = &cl_env_info(env)->clt_anchor;
cl_sync_io_init(anchor, 1, cl_sync_io_end);
}
"multi_mod_rpcs",
"dir_stripe",
"subtree",
- "lock_ahead",
+ "lockahead",
"bulk_mbits",
"compact_obdo",
"second_flags",
/* flags2 names */
"file_secctx",
+ "lockaheadv2",
NULL
};
return(rc);
}
+ rc = ofd_dlm_init();
+ if (rc) {
+ lu_kmem_fini(ofd_caches);
+ ofd_fmd_exit();
+ return rc;
+ }
+
rc = class_register_type(&ofd_obd_ops, NULL, true, NULL,
LUSTRE_OST_NAME, &ofd_device_type);
return rc;
static void __exit ofd_exit(void)
{
ofd_fmd_exit();
+ ofd_dlm_exit();
lu_kmem_fini(ofd_caches);
class_unregister_type(LUSTRE_OST_NAME);
}
#include "ofd_internal.h"
struct ofd_intent_args {
- struct ldlm_lock **victim;
+ struct list_head gl_list;
__u64 size;
- int *liblustre;
+ bool no_glimpse_ast;
+ int error;
};
+int ofd_dlm_init(void)
+{
+ ldlm_glimpse_work_kmem = kmem_cache_create("ldlm_glimpse_work_kmem",
+ sizeof(struct ldlm_glimpse_work),
+ 0, 0, NULL);
+ if (ldlm_glimpse_work_kmem == NULL)
+ return -ENOMEM;
+ else
+ return 0;
+}
+
+void ofd_dlm_exit(void)
+{
+ if (ldlm_glimpse_work_kmem) {
+ kmem_cache_destroy(ldlm_glimpse_work_kmem);
+ ldlm_glimpse_work_kmem = NULL;
+ }
+}
+
/**
* OFD interval callback.
*
* The interval_callback_t is part of interval_iterate_reverse() and is called
* for each interval in tree. The OFD interval callback searches for locks
- * covering extents beyond the given args->size. This is used to decide if LVB
- * data is outdated.
+ * covering extents beyond the given args->size. This is used to decide if the
+ * size is too small and needs to be updated. Note that we are only interested
+ * in growing the size, as truncate is the only operation which can shrink it,
+ * and it is handled differently. This is why we only look at locks beyond the
+ * current size.
+ *
+ * It finds the highest lock (by starting point) in this interval, and adds it
+ * to the list of locks to glimpse. We must glimpse a list of locks - rather
+ * than only the highest lock on the file - because lockahead creates extent
+ * locks in advance of IO, and so breaks the assumption that the holder of the
+ * highest lock knows the current file size.
+ *
+ * This assumption is normally true because locks which are created as part of
+ * IO - rather than in advance of it - are guaranteed to be 'active', i.e.,
+ * involved in IO, and the holder of the highest 'active' lock always knows the
+ * current file size, because the size is either not changing or the holder of
+ * that lock is responsible for updating it.
+ *
+ * So we need only glimpse until we find the first client with an 'active'
+ * lock.
+ *
+ * Unfortunately, there is no way to know if a manually requested/speculative
+ * lock is 'active' from the server side. So when we see a potentially
+ * speculative lock, we must send a glimpse for that lock unless we have
+ * already sent a glimpse to the holder of that lock.
+ *
+ * However, *all* non-speculative locks are active. So we can stop glimpsing
+ * as soon as we find a non-speculative lock. Currently, all speculative PW
+ * locks have LDLM_FL_NO_EXPANSION set, and we use this to identify them. This
+ * is enforced by an assertion in osc_lock_init, which references this comment.
+ *
+ * If that ever changes, we will either need to find a new way to identify
+ * active locks or we will need to consider all PW locks (we will still only
+ * glimpse one per client).
+ *
+ * Note that it is safe to glimpse only the 'top' lock from each interval
+ * because ofd_intent_cb is only called for PW extent locks, and for PW locks,
+ * there is only one lock per interval.
*
* \param[in] n interval node
- * \param[in] args intent arguments
+ * \param[in,out] args intent arguments, gl work list for identified locks
*
* \retval INTERVAL_ITER_STOP if the interval is lower than
* file size, caller stops execution
struct ldlm_interval *node = (struct ldlm_interval *)n;
struct ofd_intent_args *arg = args;
__u64 size = arg->size;
- struct ldlm_lock **v = arg->victim;
+ struct ldlm_lock *victim_lock = NULL;
struct ldlm_lock *lck;
+ struct ldlm_glimpse_work *gl_work = NULL;
+ int rc = 0;
/* If the interval is lower than the current file size, just break. */
if (interval_high(n) <= size)
- return INTERVAL_ITER_STOP;
+ GOTO(out, rc = INTERVAL_ITER_STOP);
+ /* Find the 'victim' lock from this interval */
list_for_each_entry(lck, &node->li_group, l_sl_policy) {
- /* Don't send glimpse ASTs to liblustre clients.
- * They aren't listening for them, and they do
- * entirely synchronous I/O anyways. */
- if (lck->l_export == NULL || lck->l_export->exp_libclient)
- continue;
-
- if (*arg->liblustre)
- *arg->liblustre = 0;
- if (*v == NULL) {
- *v = LDLM_LOCK_GET(lck);
- } else if ((*v)->l_policy_data.l_extent.start <
- lck->l_policy_data.l_extent.start) {
- LDLM_LOCK_RELEASE(*v);
- *v = LDLM_LOCK_GET(lck);
- }
+ victim_lock = LDLM_LOCK_GET(lck);
/* the same policy group - every lock has the
* same extent, so needn't do it any more */
break;
}
- return INTERVAL_ITER_CONT;
-}
+	/* l_export can be NULL in a race with eviction; in that case, we will
+	 * not find any locks in this interval */
+ if (!victim_lock)
+ GOTO(out, rc = INTERVAL_ITER_CONT);
+
+ /*
+ * This check is for lock taken in ofd_destroy_by_fid() that does
+ * not have l_glimpse_ast set. So the logic is: if there is a lock
+ * with no l_glimpse_ast set, this object is being destroyed already.
+ * Hence, if you are grabbing DLM locks on the server, always set
+ * non-NULL glimpse_ast (e.g., ldlm_request.c::ldlm_glimpse_ast()).
+ */
+ if (victim_lock->l_glimpse_ast == NULL) {
+ LDLM_DEBUG(victim_lock, "no l_glimpse_ast");
+ arg->no_glimpse_ast = true;
+ GOTO(out_release, rc = INTERVAL_ITER_STOP);
+ }
+ /* If NO_EXPANSION is not set, this is an active lock, and we don't need
+ * to glimpse any further once we've glimpsed the client holding this
+ * lock. So set us up to stop. See comment above this function. */
+ if (!(victim_lock->l_flags & LDLM_FL_NO_EXPANSION))
+ rc = INTERVAL_ITER_STOP;
+ else
+ rc = INTERVAL_ITER_CONT;
+
+ /* Check to see if we're already set up to send a glimpse to this
+ * client; if so, don't add this lock to the glimpse list - We need
+ * only glimpse each client once. (And if we know that client holds
+ * an active lock, we can stop glimpsing. So keep the rc set in the
+ * check above.) */
+ list_for_each_entry(gl_work, &arg->gl_list, gl_list) {
+ if (gl_work->gl_lock->l_export == victim_lock->l_export)
+ GOTO(out_release, rc);
+ }
+
+ if (!OBD_FAIL_CHECK(OBD_FAIL_OST_GL_WORK_ALLOC))
+ OBD_SLAB_ALLOC_PTR_GFP(gl_work, ldlm_glimpse_work_kmem,
+ GFP_ATOMIC);
+
+ if (!gl_work) {
+ arg->error = -ENOMEM;
+ GOTO(out_release, rc = INTERVAL_ITER_STOP);
+ }
+
+ /* Populate the gl_work structure. */
+ gl_work->gl_lock = victim_lock;
+ list_add_tail(&gl_work->gl_list, &arg->gl_list);
+ /* There is actually no need for a glimpse descriptor when glimpsing
+ * extent locks */
+ gl_work->gl_desc = NULL;
+ /* This tells ldlm_work_gl_ast_lock this was allocated from a slab and
+ * must be freed in a slab-aware manner. */
+ gl_work->gl_flags = LDLM_GL_WORK_SLAB_ALLOCATED;
+
+ GOTO(out, rc);
+
+out_release:
+ /* If the victim doesn't go on the glimpse list, we must release it */
+ LDLM_LOCK_RELEASE(victim_lock);
+
+out:
+ return rc;
+}
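The glimpse-selection rules described in the comment above ofd_intent_cb (glimpse each export at most once, stop scanning at the first non-speculative lock) can be modeled independently of the LDLM structures. This is a simplified standalone sketch with illustrative names, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* One candidate PW lock, as seen walking intervals from the highest
 * offset downward: which client (export) holds it, and whether it was
 * manually requested (NO_EXPANSION set, so possibly speculative). */
struct cand {
	int export_id;
	bool no_expansion;
};

/* Returns the number of glimpses that would be sent and fills
 * glimpsed[] with the chosen export ids. */
static int pick_glimpses(const struct cand *locks, int n,
			 int *glimpsed, int max)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		bool seen = false;

		/* Each client need only be glimpsed once */
		for (int j = 0; j < count; j++)
			if (glimpsed[j] == locks[i].export_id)
				seen = true;
		if (!seen && count < max)
			glimpsed[count++] = locks[i].export_id;
		/* An expandable lock is guaranteed active, and its
		 * holder knows the current size: stop scanning. */
		if (!locks[i].no_expansion)
			break;
	}
	return count;
}
```

With locks from three exports where export 2 holds an active lock, only exports 1 and 2 are glimpsed and the scan stops before export 3.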
/**
* OFD lock intent policy
*
* \retval ELDLM_LOCK_REPLACED if already granted lock was found
* and placed in \a lockp
* \retval ELDLM_LOCK_ABORTED in other cases except error
- * \retval negative value on error
+ * \retval negative errno on error
*/
int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
void *req_cookie, enum ldlm_mode mode, __u64 flags,
void *data)
{
struct ptlrpc_request *req = req_cookie;
- struct ldlm_lock *lock = *lockp, *l = NULL;
+ struct ldlm_lock *lock = *lockp;
struct ldlm_resource *res = lock->l_resource;
ldlm_processing_policy policy;
struct ost_lvb *res_lvb, *reply_lvb;
struct ldlm_reply *rep;
enum ldlm_error err;
- int idx, rc, only_liblustre = 1;
+ int idx, rc;
struct ldlm_interval_tree *tree;
struct ofd_intent_args arg;
__u32 repsize[3] = {
[DLM_LOCKREPLY_OFF] = sizeof(*rep),
[DLM_REPLY_REC_OFF] = sizeof(*reply_lvb)
};
- struct ldlm_glimpse_work gl_work = {};
- struct list_head gl_list;
+ struct ldlm_glimpse_work *pos, *tmp;
ENTRY;
- INIT_LIST_HEAD(&gl_list);
+ INIT_LIST_HEAD(&arg.gl_list);
+ arg.no_glimpse_ast = false;
+ arg.error = 0;
lock->l_lvb_type = LVB_T_OST;
policy = ldlm_get_processing_policy(res);
LASSERT(policy != NULL);
/* The lock met with no resistance; we're finished. */
if (rc == LDLM_ITER_CONTINUE) {
- /* do not grant locks to the liblustre clients: they cannot
- * handle ASTs robustly. We need to do this while still
- * holding ns_lock to avoid the lock remaining on the res_link
- * list (and potentially being added to l_pending_list by an
- * AST) when we are going to drop this lock ASAP. */
- if (lock->l_export->exp_libclient ||
- OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_GLIMPSE, 2)) {
+ if (OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_GLIMPSE, 2)) {
ldlm_resource_unlink_lock(lock);
err = ELDLM_LOCK_ABORTED;
} else {
* res->lr_lvb_sem.
*/
arg.size = reply_lvb->lvb_size;
- arg.victim = &l;
- arg.liblustre = &only_liblustre;
+ /* Check for PW locks beyond the size in the LVB, build the list
+ * of locks to glimpse (arg.gl_list) */
for (idx = 0; idx < LCK_MODE_NUM; idx++) {
tree = &res->lr_itree[idx];
if (tree->lit_mode == LCK_PR)
continue;
interval_iterate_reverse(tree->lit_root, ofd_intent_cb, &arg);
+ if (arg.error) {
+ unlock_res(res);
+ GOTO(out, rc = arg.error);
+ }
}
unlock_res(res);
/* There were no PW locks beyond the size in the LVB; finished. */
- if (l == NULL) {
- if (only_liblustre) {
- /* If we discovered a liblustre client with a PW lock,
- * however, the LVB may be out of date! The LVB is
- * updated only on glimpse (which we don't do for
- * liblustre clients) and cancel (which the client
- * obviously has not yet done). So if it has written
- * data but kept the lock, the LVB is stale and needs
- * to be updated from disk.
- *
- * Of course, this will all disappear when we switch to
- * taking liblustre locks on the OST. */
- ldlm_res_lvbo_update(res, NULL, 1);
- }
+ if (list_empty(&arg.gl_list))
RETURN(ELDLM_LOCK_ABORTED);
- }
- /*
- * This check is for lock taken in ofd_destroy_by_fid() that does
- * not have l_glimpse_ast set. So the logic is: if there is a lock
- * with no l_glimpse_ast set, this object is being destroyed already.
- * Hence, if you are grabbing DLM locks on the server, always set
- * non-NULL glimpse_ast (e.g., ldlm_request.c::ldlm_glimpse_ast()).
- */
- if (l->l_glimpse_ast == NULL) {
+ if (arg.no_glimpse_ast) {
/* We are racing with unlink(); just return -ENOENT */
rep->lock_policy_res1 = ptlrpc_status_hton(-ENOENT);
- goto out;
+ GOTO(out, ELDLM_LOCK_ABORTED);
}
- /* Populate the gl_work structure.
- * Grab additional reference on the lock which will be released in
- * ldlm_work_gl_ast_lock() */
- gl_work.gl_lock = LDLM_LOCK_GET(l);
- /* The glimpse callback is sent to one single extent lock. As a result,
- * the gl_work list is just composed of one element */
- list_add_tail(&gl_work.gl_list, &gl_list);
- /* There is actually no need for a glimpse descriptor when glimpsing
- * extent locks */
- gl_work.gl_desc = NULL;
- /* the ldlm_glimpse_work structure is allocated on the stack */
- gl_work.gl_flags = LDLM_GL_WORK_NOFREE;
-
- rc = ldlm_glimpse_locks(res, &gl_list); /* this will update the LVB */
-
- if (!list_empty(&gl_list))
- LDLM_LOCK_RELEASE(l);
+ /* this will update the LVB */
+ ldlm_glimpse_locks(res, &arg.gl_list);
lock_res(res);
*reply_lvb = *res_lvb;
unlock_res(res);
out:
- LDLM_LOCK_RELEASE(l);
+	/* If the list is not empty, we failed to glimpse some locks and
+	 * must clean up, usually due to a race with unlink. */
+ list_for_each_entry_safe(pos, tmp, &arg.gl_list, gl_list) {
+ list_del(&pos->gl_list);
+ LDLM_LOCK_RELEASE(pos->gl_lock);
+ OBD_SLAB_FREE_PTR(pos, ldlm_glimpse_work_kmem);
+ }
- RETURN(ELDLM_LOCK_ABORTED);
+ RETURN(rc < 0 ? rc : ELDLM_LOCK_ABORTED);
}
extern struct ldlm_valblock_ops ofd_lvbo;
/* ofd_dlm.c */
+extern struct kmem_cache *ldlm_glimpse_work_kmem;
+int ofd_dlm_init(void);
+void ofd_dlm_exit(void);
int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
void *req_cookie, enum ldlm_mode mode, __u64 flags,
void *data);
struct ost_lvb *lvb, int kms_valid,
osc_enqueue_upcall_f upcall,
void *cookie, struct ldlm_enqueue_info *einfo,
- struct ptlrpc_request_set *rqset, int async, int agl);
+ struct ptlrpc_request_set *rqset, int async,
+ bool speculative);
int osc_match_base(struct obd_export *exp, struct ldlm_res_id *res_id,
enum ldlm_type type, union ldlm_policy_data *policy,
{
__u64 result = 0;
+ CDEBUG(D_DLMTRACE, "flags: %x\n", enqflags);
+
LASSERT((enqflags & ~CEF_MASK) == 0);
if (enqflags & CEF_NONBLOCK)
result |= LDLM_FL_BLOCK_NOWAIT;
- if (enqflags & CEF_ASYNC)
+ if (enqflags & CEF_GLIMPSE)
result |= LDLM_FL_HAS_INTENT;
if (enqflags & CEF_DISCARD_DATA)
result |= LDLM_FL_AST_DISCARD_DATA;
result |= LDLM_FL_TEST_LOCK;
if (enqflags & CEF_LOCK_MATCH)
result |= LDLM_FL_MATCH_LOCK;
+ if (enqflags & CEF_LOCK_NO_EXPAND)
+ result |= LDLM_FL_NO_EXPANSION;
+ if (enqflags & CEF_SPECULATIVE)
+ result |= LDLM_FL_SPECULATIVE;
return result;
}
RETURN(rc);
}
-static int osc_lock_upcall_agl(void *cookie, struct lustre_handle *lockh,
- int errcode)
+static int osc_lock_upcall_speculative(void *cookie,
+ struct lustre_handle *lockh,
+ int errcode)
{
struct osc_object *osc = cookie;
struct ldlm_lock *dlmlock;
lock_res_and_lock(dlmlock);
LASSERT(dlmlock->l_granted_mode == dlmlock->l_req_mode);
- /* there is no osc_lock associated with AGL lock */
+ /* there is no osc_lock associated with speculative locks */
osc_lock_lvb_update(env, osc, dlmlock, NULL);
unlock_res_and_lock(dlmlock);
struct cl_lock_descr *qed_descr = &qed->ols_cl.cls_lock->cll_descr;
struct cl_lock_descr *qing_descr = &qing->ols_cl.cls_lock->cll_descr;
- if (qed->ols_glimpse)
+ if (qed->ols_glimpse || qed->ols_speculative)
return true;
if (qing_descr->cld_mode == CLM_READ && qed_descr->cld_mode == CLM_READ)
struct osc_io *oio = osc_env_io(env);
struct osc_object *osc = cl2osc(slice->cls_obj);
struct osc_lock *oscl = cl2osc_lock(slice);
+ struct obd_export *exp = osc_export(osc);
struct cl_lock *lock = slice->cls_lock;
struct ldlm_res_id *resname = &info->oti_resname;
union ldlm_policy_data *policy = &info->oti_policy;
if (oscl->ols_state == OLS_GRANTED)
RETURN(0);
+ if ((oscl->ols_flags & LDLM_FL_NO_EXPANSION) &&
+ !(exp_connect_lockahead_old(exp) || exp_connect_lockahead(exp))) {
+ result = -EOPNOTSUPP;
+ CERROR("%s: server does not support lockahead/locknoexpand: "
+ "rc = %d\n", exp->exp_obd->obd_name, result);
+ RETURN(result);
+ }
+
if (oscl->ols_flags & LDLM_FL_TEST_LOCK)
GOTO(enqueue_base, 0);
- if (oscl->ols_glimpse) {
- LASSERT(equi(oscl->ols_agl, anchor == NULL));
+ /* For glimpse and/or speculative locks, do not wait for reply from
+ * server on LDLM request */
+ if (oscl->ols_glimpse || oscl->ols_speculative) {
+ /* Speculative and glimpse locks do not have an anchor */
+ LASSERT(equi(oscl->ols_speculative, anchor == NULL));
async = true;
GOTO(enqueue_base, 0);
}
/**
* DLM lock's ast data must be osc_object;
- * if glimpse or AGL lock, async of osc_enqueue_base() must be true,
+ * if glimpse or speculative lock, async of osc_enqueue_base()
+ * must be true
+ *
+ * For non-speculative locks:
* DLM's enqueue callback set to osc_lock_upcall() with cookie as
* osc_lock.
+ * For speculative locks:
+ * DLM's enqueue callback set to osc_lock_upcall_speculative()
+ * with cookie as the osc object, since there is no osc_lock
*/
ostid_build_res_name(&osc->oo_oinfo->loi_oi, resname);
osc_lock_build_policy(env, lock, policy);
- if (oscl->ols_agl) {
+ if (oscl->ols_speculative) {
oscl->ols_einfo.ei_cbdata = NULL;
/* hold a reference for callback */
cl_object_get(osc2cl(osc));
- upcall = osc_lock_upcall_agl;
+ upcall = osc_lock_upcall_speculative;
cookie = osc;
}
- result = osc_enqueue_base(osc_export(osc), resname, &oscl->ols_flags,
+ result = osc_enqueue_base(exp, resname, &oscl->ols_flags,
policy, &oscl->ols_lvb,
osc->oo_oinfo->loi_kms_valid,
upcall, cookie,
&oscl->ols_einfo, PTLRPCD_SET, async,
- oscl->ols_agl);
+ oscl->ols_speculative);
if (result == 0) {
if (osc_lock_is_lockless(oscl)) {
oio->oi_lockless = 1;
LASSERT(oscl->ols_hold);
LASSERT(oscl->ols_dlmlock != NULL);
}
- } else if (oscl->ols_agl) {
+ } else if (oscl->ols_speculative) {
cl_object_put(env, osc2cl(osc));
- result = 0;
+ if (oscl->ols_glimpse) {
+ /* hide error for AGL request */
+ result = 0;
+ }
}
out:
INIT_LIST_HEAD(&oscl->ols_wait_entry);
INIT_LIST_HEAD(&oscl->ols_nextlock_oscobj);
+ /* Speculative lock requests must be either no_expand or glimpse
+ * requests (CEF_GLIMPSE). Non-glimpse no_expand speculative extent
+ * locks will break ofd_intent_cb (see comment there). */
+ LASSERT(ergo((enqflags & CEF_SPECULATIVE) != 0,
+ (enqflags & (CEF_LOCK_NO_EXPAND | CEF_GLIMPSE)) != 0));
+
oscl->ols_flags = osc_enq2ldlm_flags(enqflags);
- oscl->ols_agl = !!(enqflags & CEF_AGL);
- if (oscl->ols_agl)
- oscl->ols_flags |= LDLM_FL_BLOCK_NOWAIT;
+ oscl->ols_speculative = !!(enqflags & CEF_SPECULATIVE);
+
if (oscl->ols_flags & LDLM_FL_HAS_INTENT) {
oscl->ols_flags |= LDLM_FL_BLOCK_GRANTED;
oscl->ols_glimpse = 1;
void *oa_cookie;
struct ost_lvb *oa_lvb;
struct lustre_handle oa_lockh;
- unsigned int oa_agl:1;
+ bool oa_speculative;
};
static void osc_release_ppga(struct brw_page **ppga, size_t count);
static int osc_enqueue_fini(struct ptlrpc_request *req,
osc_enqueue_upcall_f upcall, void *cookie,
struct lustre_handle *lockh, enum ldlm_mode mode,
- __u64 *flags, int agl, int errcode)
+ __u64 *flags, bool speculative, int errcode)
{
bool intent = *flags & LDLM_FL_HAS_INTENT;
int rc;
ptlrpc_status_ntoh(rep->lock_policy_res1);
if (rep->lock_policy_res1)
errcode = rep->lock_policy_res1;
- if (!agl)
+ if (!speculative)
*flags |= LDLM_FL_LVB_READY;
} else if (errcode == ELDLM_OK) {
*flags |= LDLM_FL_LVB_READY;
/* Let CP AST to grant the lock first. */
OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_CP_ENQ_RACE, 1);
- if (aa->oa_agl) {
+ if (aa->oa_speculative) {
LASSERT(aa->oa_lvb == NULL);
LASSERT(aa->oa_flags == NULL);
aa->oa_flags = &flags;
lockh, rc);
/* Complete osc stuff. */
rc = osc_enqueue_fini(req, aa->oa_upcall, aa->oa_cookie, lockh, mode,
- aa->oa_flags, aa->oa_agl, rc);
+ aa->oa_flags, aa->oa_speculative, rc);
OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_CP_CANCEL_RACE, 10);
struct ost_lvb *lvb, int kms_valid,
osc_enqueue_upcall_f upcall, void *cookie,
struct ldlm_enqueue_info *einfo,
- struct ptlrpc_request_set *rqset, int async, int agl)
+ struct ptlrpc_request_set *rqset, int async,
+ bool speculative)
{
struct obd_device *obd = exp->exp_obd;
struct lustre_handle lockh = { 0 };
policy->l_extent.start -= policy->l_extent.start & ~PAGE_MASK;
policy->l_extent.end |= ~PAGE_MASK;
- /*
- * kms is not valid when either object is completely fresh (so that no
- * locks are cached), or object was evicted. In the latter case cached
- * lock cannot be used, because it would prime inode state with
- * potentially stale LVB.
- */
- if (!kms_valid)
- goto no_match;
+ /*
+ * kms is not valid when either object is completely fresh (so that no
+ * locks are cached), or object was evicted. In the latter case cached
+ * lock cannot be used, because it would prime inode state with
+ * potentially stale LVB.
+ */
+ if (!kms_valid)
+ goto no_match;
/* Next, search for already existing extent locks that will cover us */
/* If we're trying to read, we also search for an existing PW lock. The
mode = einfo->ei_mode;
if (einfo->ei_mode == LCK_PR)
mode |= LCK_PW;
- if (agl == 0)
+ /* Normal lock requests must wait for the LVB to be ready before
+ * matching a lock; speculative lock requests do not need to,
+ * because they will not actually use the lock. */
+ if (!speculative)
match_flags |= LDLM_FL_LVB_READY;
if (intent != 0)
match_flags |= LDLM_FL_BLOCK_GRANTED;
RETURN(ELDLM_OK);
matched = ldlm_handle2lock(&lockh);
- if (agl) {
- /* AGL enqueues DLM locks speculatively. Therefore if
- * it already exists a DLM lock, it wll just inform the
- * caller to cancel the AGL process for this stripe. */
+ if (speculative) {
+ /* This DLM lock request is speculative, and does not
+ * have an associated IO request. Therefore if there
+ * is already a DLM lock, it will just inform the
+ * caller to cancel the request for this stripe. */
+ lock_res_and_lock(matched);
+ if (ldlm_extent_equal(&policy->l_extent,
+ &matched->l_policy_data.l_extent))
+ rc = -EEXIST;
+ else
+ rc = -ECANCELED;
+ unlock_res_and_lock(matched);
+
ldlm_lock_decref(&lockh, mode);
LDLM_LOCK_PUT(matched);
- RETURN(-ECANCELED);
+ RETURN(rc);
} else if (osc_set_lock_data(matched, einfo->ei_cbdata)) {
*flags |= LDLM_FL_LVB_READY;
struct osc_enqueue_args *aa;
CLASSERT(sizeof(*aa) <= sizeof(req->rq_async_args));
aa = ptlrpc_req_async_args(req);
- aa->oa_exp = exp;
- aa->oa_mode = einfo->ei_mode;
- aa->oa_type = einfo->ei_type;
+ aa->oa_exp = exp;
+ aa->oa_mode = einfo->ei_mode;
+ aa->oa_type = einfo->ei_type;
lustre_handle_copy(&aa->oa_lockh, &lockh);
- aa->oa_upcall = upcall;
- aa->oa_cookie = cookie;
- aa->oa_agl = !!agl;
- if (!agl) {
+ aa->oa_upcall = upcall;
+ aa->oa_cookie = cookie;
+ aa->oa_speculative = speculative;
+ if (!speculative) {
aa->oa_flags = flags;
aa->oa_lvb = lvb;
} else {
- /* AGL is essentially to enqueue an DLM lock
- * in advance, so we don't care about the
- * result of AGL enqueue. */
+ /* a speculative lock essentially enqueues a DLM
+ * lock in advance, so we don't care about the
+ * result of the enqueue. */
aa->oa_lvb = NULL;
aa->oa_flags = NULL;
}
}
rc = osc_enqueue_fini(req, upcall, cookie, &lockh, einfo->ei_mode,
- flags, agl, rc);
+ flags, speculative, rc);
if (intent)
ptlrpc_req_finished(req);
OBD_CONNECT_DIR_STRIPE);
LASSERTF(OBD_CONNECT_SUBTREE == 0x800000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_SUBTREE);
- LASSERTF(OBD_CONNECT_LOCK_AHEAD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
- OBD_CONNECT_LOCK_AHEAD);
+ LASSERTF(OBD_CONNECT_LOCKAHEAD_OLD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT_LOCKAHEAD_OLD);
LASSERTF(OBD_CONNECT_BULK_MBITS == 0x2000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_BULK_MBITS);
LASSERTF(OBD_CONNECT_OBDOPACK == 0x4000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_FLAGS2);
LASSERTF(OBD_CONNECT2_FILE_SECCTX == 0x1ULL, "found 0x%.16llxULL\n",
OBD_CONNECT2_FILE_SECCTX);
+ LASSERTF(OBD_CONNECT2_LOCKAHEAD == 0x2ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT2_LOCKAHEAD);
LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
(unsigned)OBD_CKSUM_CRC32);
LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
noinst_PROGRAMS += listxattr_size_check check_fhandle_syscalls badarea_io
noinst_PROGRAMS += llapi_layout_test orphan_linkea_check llapi_hsm_test
noinst_PROGRAMS += group_lock_test llapi_fid_test sendfile_grouplock mmap_cat
-noinst_PROGRAMS += swap_lock_test
+noinst_PROGRAMS += swap_lock_test lockahead_test
bin_PROGRAMS = mcreate munlink
testdir = $(libdir)/lustre/tests
statmany_LDADD=$(LIBLUSTREAPI)
statone_LDADD=$(LIBLUSTREAPI)
rwv_LDADD=$(LIBCFS)
+lockahead_test_LDADD=$(LIBLUSTREAPI)
ll_dirstripe_verify_SOURCES = ll_dirstripe_verify.c
ll_dirstripe_verify_LDADD = $(LIBLUSTREAPI) $(LIBCFS) $(PTHREAD_LIBS)
--- /dev/null
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+
+/*
+ * Copyright 2016 Cray Inc. All rights reserved.
+ * Authors: Patrick Farrell, Frank Zago
+ *
+ * A few portions are extracted from llapi_layout_test.c
+ *
+ * The purpose of this test is to exercise the lockahead advice of ladvise.
+ *
+ * The program will exit as soon as a test fails.
+ */
+
+#include <stdlib.h>
+#include <errno.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <poll.h>
+#include <time.h>
+
+#include <lustre/lustreapi.h>
+#include <linux/lustre/lustre_idl.h>
+
+#define ERROR(fmt, ...) \
+ fprintf(stderr, "%s: %s:%d: %s: " fmt "\n", \
+ program_invocation_short_name, __FILE__, __LINE__, \
+ __func__, ## __VA_ARGS__)
+
+#define DIE(fmt, ...) \
+ do { \
+ ERROR(fmt, ## __VA_ARGS__); \
+ exit(-1); \
+ } while (0)
+
+#define ASSERTF(cond, fmt, ...) \
+ do { \
+ if (!(cond)) \
+ DIE("assertion '%s' failed: "fmt, \
+ #cond, ## __VA_ARGS__); \
+ } while (0)
+
+#define PERFORM(testfn) \
+ do { \
+ cleanup(); \
+ fprintf(stderr, "Starting test " #testfn " at %lld\n", \
+ (unsigned long long)time(NULL)); \
+ rc = testfn(); \
+ fprintf(stderr, "Finishing test " #testfn " at %lld\n", \
+ (unsigned long long)time(NULL)); \
+ cleanup(); \
+ } while (0)
+
+/* Name of file/directory. Will be set once and will not change. */
+static char mainpath[PATH_MAX];
+static const char *mainfile = "lockahead_test_654";
+
+static char fsmountdir[PATH_MAX]; /* Lustre mountpoint */
+static char *lustre_dir; /* Test directory inside Lustre */
+static int single_test; /* Number of a single test to execute */
+
+/* Cleanup our test file. */
+static void cleanup(void)
+{
+ unlink(mainpath);
+}
+
+/* Trivial helper for one advice */
+void setup_ladvise_lockahead(struct llapi_lu_ladvise *advice, int mode,
+ int flags, size_t start, size_t end, bool async)
+{
+ advice->lla_advice = LU_LADVISE_LOCKAHEAD;
+ advice->lla_lockahead_mode = mode;
+ if (async)
+ advice->lla_peradvice_flags = flags | LF_ASYNC;
+ else
+ advice->lla_peradvice_flags = flags;
+ advice->lla_start = start;
+ advice->lla_end = end;
+ advice->lla_value3 = 0;
+ advice->lla_value4 = 0;
+}
+
+/* Test valid single lock ahead request */
+static int test10(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+
+
+ return 0;
+}
+
+/* Get lock, wait until lock is taken */
+static int test11(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int enqueue_requests = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ enqueue_requests++;
+
+ /* Ask again until we get the lock (status 1). */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice.lla_lockahead_result > 0)
+ break;
+
+ enqueue_requests++;
+ }
+
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Again. This time it is always there. */
+ for (i = 0; i < 100; i++) {
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result > 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+ }
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return enqueue_requests;
+}
+
+/* Test with several times the same extent */
+static int test12(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 10;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ for (i = 0; i < count; i++) {
+ setup_ladvise_lockahead(&(advice[i]), MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+ advice[i].lla_lockahead_result = 98674;
+ }
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ for (i = 0; i < count; i++) {
+ ASSERTF(advice[i].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[i].lla_lockahead_result);
+ }
+ /* Since all the requests are for the same extent, we should only have
+ * one lock at the end. */
+ expected_lock_count = 1;
+
+ /* Ask again until we get the locks. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[count-1].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[count-1].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[count-1].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[count-1].lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ free(advice);
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Grow a lock forward */
+static int test13(void)
+{
+ struct llapi_lu_ladvise *advice = NULL;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ for (i = 0; i < 100; i++) {
+ if (advice)
+ free(advice);
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0,
+ i * write_size, (i+1)*write_size - 1,
+ true);
+ advice[0].lla_lockahead_result = 98674;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' at offset %llu: %s",
+ mainpath,
+ advice[0].lla_end,
+ strerror(errno));
+
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+
+ expected_lock_count++;
+ }
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Grow a lock backward */
+static int test14(void)
+{
+ struct llapi_lu_ladvise *advice = NULL;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ const int num_blocks = 100;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ for (i = 0; i < num_blocks; i++) {
+ size_t start = (num_blocks - i - 1) * write_size;
+ size_t end = (num_blocks - i) * write_size - 1;
+
+ if (advice)
+ free(advice);
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+ advice[0].lla_lockahead_result = 98674;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' at offset %llu: %s",
+ mainpath,
+ advice[0].lla_end,
+ strerror(errno));
+
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+
+ expected_lock_count++;
+ }
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Request many locks at 10MiB intervals */
+static int test15(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ for (i = 0; i < 5000; i++) {
+ /* The 'UL' designators are required to avoid undefined
+ * behavior which GCC turns into an infinite loop */
+ __u64 start = i * 1024UL * 1024UL * 10UL;
+ __u64 end = start + 1;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+ expected_lock_count++;
+ }
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* The write should cancel the first lock (which was too small)
+ * and create one of its own, so the net effect on lock count is 0. */
+
+ free(advice);
+
+ close(fd);
+
+ /* We have to map our expected return into the range of valid return
+ * codes, 0-255. */
+ expected_lock_count = expected_lock_count/1000;
+
+ return expected_lock_count;
+}
+
+/* Use lockahead to verify behavior of ladvise locknoexpand */
+static int test16(void)
+{
+ struct llapi_lu_ladvise *advice;
+ struct llapi_lu_ladvise *advice_noexpand;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ __u64 start = 0;
+ __u64 end = write_size - 1;
+ int rc;
+ char buf[write_size];
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ advice_noexpand = malloc(sizeof(struct llapi_lu_ladvise));
+
+ /* First ask for a read lock, which will conflict with the write */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, false);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == 0,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Use an async request to verify we got the read lock we asked for */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, true);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Set noexpand */
+ advice_noexpand[0].lla_advice = LU_LADVISE_LOCKNOEXPAND;
+ advice_noexpand[0].lla_peradvice_flags = 0;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+
+ /* This write should generate a lock on exactly "write_size" bytes */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Now, disable locknoexpand and try writing again. */
+ advice_noexpand[0].lla_peradvice_flags = LF_UNSET;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ /* This write should get an expanded lock */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ /* Verify it didn't get a lock on just the bytes it wrote. */
+ usleep(100000); /* 0.1 second, plenty of time to get the lock */
+
+ start = start + write_size;
+ end = end + write_size;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Use lockahead to verify behavior of ladvise locknoexpand, with O_NONBLOCK.
+ * There should be no change in behavior. */
+static int test17(void)
+{
+ struct llapi_lu_ladvise *advice;
+ struct llapi_lu_ladvise *advice_noexpand;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ __u64 start = 0;
+ __u64 end = write_size - 1;
+ int rc;
+ char buf[write_size];
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC | O_NONBLOCK,
+ S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ advice_noexpand = malloc(sizeof(struct llapi_lu_ladvise));
+
+ /* First ask for a read lock, which will conflict with the write */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, false);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == 0,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Use an async request to verify we got the read lock we asked for */
+ setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, true);
+ advice[0].lla_lockahead_result = 345678;
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Set noexpand */
+ advice_noexpand[0].lla_advice = LU_LADVISE_LOCKNOEXPAND;
+ advice_noexpand[0].lla_peradvice_flags = 0;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+
+ /* This write should generate a lock on exactly "write_size" bytes */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ /* Now, disable locknoexpand and try writing again. */
+ advice_noexpand[0].lla_peradvice_flags = LF_UNSET;
+ rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+ /* This write should get an expanded lock */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+ /* Write should create one LDLM lock */
+ expected_lock_count++;
+
+ /* Verify it didn't get a lock on just the bytes it wrote. */
+ usleep(100000); /* 0.1 second, plenty of time to get the lock */
+
+ start = start + write_size;
+ end = end + write_size;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+ "unexpected extent result for extent: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test overlapping requests */
+static int test18(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ int rc;
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ /* Overlapping locks - Should only end up with 1 */
+ for (i = 0; i < 10; i++) {
+ __u64 start = i;
+ __u64 end = start + 4096;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result >= 0,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+ }
+ expected_lock_count = 1;
+
+ /* Ask again until we get the lock. */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice[0].lla_lockahead_result = 456789;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, 0, 4096,
+ true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice[0].lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice[0].lla_lockahead_result);
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test that normal request blocks lock ahead requests */
+static int test19(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+ /* This should create a lock on the whole file, which will block lock
+ * ahead requests. */
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ expected_lock_count = 1;
+
+ /* These should all be blocked. */
+ for (i = 0; i < 10; i++) {
+ __u64 start = i * 4096;
+ __u64 end = start + 4096;
+
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+ end, true);
+
+ advice[0].lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, advice);
+
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+ "unexpected extent result for extent %d: %d",
+ i, advice[0].lla_lockahead_result);
+ }
+
+ free(advice);
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test sync requests, and matching with async requests */
+static int test20(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 1;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ /* Async request */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Ask again until we get the lock (status 1). */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice.lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Convert to a sync request on smaller range, should match and not
+ * cancel */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1 - write_size/2, false);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ /* Sync requests cannot give detailed results */
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Use an async request to test original lock is still present */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test sync requests, and conflict with async requests */
+static int test21(void)
+{
+ struct llapi_lu_ladvise advice;
+ const int count = 1;
+ int fd;
+ size_t write_size = 1024 * 1024;
+ int rc;
+ char buf[write_size];
+ int i;
+ int expected_lock_count = 1;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ /* Async request */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size - 1, true);
+
+ /* Manually set the result so we can verify it's being modified */
+ advice.lla_lockahead_result = 345678;
+
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0,
+ "cannot lockahead '%s': %s", mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Ask again until we get the lock (status 1). */
+ for (i = 1; i < 100; i++) {
+ usleep(100000); /* 0.1 second */
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+
+ if (advice.lla_lockahead_result > 0)
+ break;
+ }
+
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Convert to a sync request on larger range, should cancel existing
+ * lock */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size*2 - 1, false);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ /* Sync requests cannot give detailed results */
+ ASSERTF(advice.lla_lockahead_result == 0,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ /* Use an async request to test new lock is there */
+ setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+ write_size*2 - 1, true);
+
+ advice.lla_lockahead_result = 456789;
+ rc = llapi_ladvise(fd, 0, count, &advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+ mainpath, strerror(errno));
+ ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+ "unexpected extent result: %d",
+ advice.lla_lockahead_result);
+
+ memset(buf, 0xaa, write_size);
+ rc = write(fd, buf, write_size);
+ ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ close(fd);
+
+ return expected_lock_count;
+}
+
+/* Test various valid and invalid inputs */
+static int test22(void)
+{
+ struct llapi_lu_ladvise *advice;
+ const int count = 1;
+ int fd;
+ int rc;
+ size_t start = 0;
+ size_t end = 0;
+
+ fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+ ASSERTF(fd >= 0, "open failed for '%s': %s",
+ mainpath, strerror(errno));
+
+ /* A valid async request first */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024*1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ free(advice);
+
+ /* Valid sync request */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024*1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, false);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+ mainpath, strerror(errno));
+ free(advice);
+
+ /* No actual block (zero-length extent) */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 0;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for no block lock: %d %s",
+ rc, strerror(errno));
+ free(advice);
+
+ /* end before start */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 1024 * 1024;
+ end = 0;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for reversed block: %d %s",
+ rc, strerror(errno));
+ free(advice);
+
+ /* bogus lock mode - 0x65464 */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, 0x65464, 0, start, end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus lock mode: %d %s",
+ rc, strerror(errno));
+ free(advice);
+
+ /* bogus flags, 0x80 */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0x80, start, end,
+ true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus flags: %u %d %s",
+ 0x80, rc, strerror(errno));
+ free(advice);
+
+ /* bogus flags, 0xff - CEF_MASK */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0xff, start, end,
+ true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus flags: %u %d %s",
+ 0xff, rc, strerror(errno));
+ free(advice);
+
+ /* bogus flags, 0xffffffff */
+ advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+ start = 0;
+ end = 1024 * 1024;
+ setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0xffffffff, start,
+ end, true);
+ rc = llapi_ladvise(fd, 0, count, advice);
+ ASSERTF(rc == -1 && errno == EINVAL,
+ "unexpected return for bogus flags: %u %d %s",
+ 0xffffffff, rc, strerror(errno));
+ free(advice);
+
+ close(fd);
+
+ return 0;
+}
+
+static void usage(char *prog)
+{
+ fprintf(stderr, "Usage: %s [-d lustre_dir] [-t single_test]\n", prog);
+ exit(-1);
+}
+
+static void process_args(int argc, char *argv[])
+{
+ int c;
+
+ while ((c = getopt(argc, argv, "d:t:")) != -1) {
+ switch (c) {
+ case 'd':
+ lustre_dir = optarg;
+ break;
+ case 't':
+ single_test = atoi(optarg);
+ break;
+ case '?':
+ default:
+ fprintf(stderr, "Invalid option '%c'\n", optopt);
+ usage(argv[0]);
+ break;
+ }
+ }
+}
+
+int main(int argc, char *argv[])
+{
+ char fsname[8];
+ int rc;
+
+ process_args(argc, argv);
+ if (lustre_dir == NULL)
+ lustre_dir = "/mnt/lustre";
+
+ rc = llapi_search_mounts(lustre_dir, 0, fsmountdir, fsname);
+ if (rc != 0) {
+ fprintf(stderr, "Error: '%s': not a Lustre filesystem\n",
+ lustre_dir);
+ return -1;
+ }
+
+ /* Play nice with Lustre test scripts. Without line buffering,
+ * output under I/O redirection may appear out of order. */
+ setvbuf(stdout, NULL, _IOLBF, 0);
+
+ /* Create a test filename and reuse it. Remove possibly old files. */
+ rc = snprintf(mainpath, sizeof(mainpath), "%s/%s", lustre_dir,
+ mainfile);
+ ASSERTF(rc > 0 && rc < sizeof(mainpath), "invalid name for mainpath");
+ cleanup();
+
+ atexit(cleanup);
+
+ switch (single_test) {
+ case 0:
+ PERFORM(test10);
+ PERFORM(test11);
+ PERFORM(test12);
+ PERFORM(test13);
+ PERFORM(test14);
+ PERFORM(test15);
+ PERFORM(test16);
+ PERFORM(test17);
+ PERFORM(test18);
+ PERFORM(test19);
+ PERFORM(test20);
+ PERFORM(test21);
+ PERFORM(test22);
+ /* When running all the test cases, we can't use the return
+ * from the last test case, as it might be non-zero to return
+ * info, rather than for an error. Test cases assert and exit
+ * if an error occurs. */
+ rc = 0;
+ break;
+ case 10:
+ PERFORM(test10);
+ break;
+ case 11:
+ PERFORM(test11);
+ break;
+ case 12:
+ PERFORM(test12);
+ break;
+ case 13:
+ PERFORM(test13);
+ break;
+ case 14:
+ PERFORM(test14);
+ break;
+ case 15:
+ PERFORM(test15);
+ break;
+ case 16:
+ PERFORM(test16);
+ break;
+ case 17:
+ PERFORM(test17);
+ break;
+ case 18:
+ PERFORM(test18);
+ break;
+ case 19:
+ PERFORM(test19);
+ break;
+ case 20:
+ PERFORM(test20);
+ break;
+ case 21:
+ PERFORM(test21);
+ break;
+ case 22:
+ PERFORM(test22);
+ break;
+ default:
+ fprintf(stderr, "impossible value of single_test %d\n",
+ single_test);
+ rc = -1;
+ break;
+ }
+
+ return rc;
+}
}
run_test 255b "check 'lfs ladvise -a dontneed'"
+test_255c() {
+ local count
+ local new_count
+ local difference
+ local i
+ local rc
+ test_mkdir -p $DIR/$tdir
+ $SETSTRIPE -i 0 $DIR/$tdir
+
+ #test 10 returns only success/failure
+ i=10
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+
+ #test 11 counts lock enqueue requests, all others count new locks
+ i=11
+ count=$(do_facet ost1 \
+ $LCTL get_param -n ost.OSS.ost.stats)
+ count=$(echo "$count" | grep ldlm_extent_enqueue | awk '{ print $2 }')
+
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+
+ new_count=$(do_facet ost1 \
+ $LCTL get_param -n ost.OSS.ost.stats)
+ new_count=$(echo "$new_count" | grep ldlm_extent_enqueue | \
+ awk '{ print $2 }')
+
+ difference="$((new_count - count))"
+ if [ $difference -ne $rc ]; then
+ error "Ladvise test${i}, bad enqueue count, returned " \
+ "${rc}, actual ${difference}"
+ fi
+
+ for i in $(seq 12 21); do
+ # If we do not do this, we run the risk of having too many
+ # locks and starting lock cancellation while we are checking
+ # lock counts.
+ cancel_lru_locks osc
+
+ count=$($LCTL get_param -n \
+ ldlm.namespaces.$FSNAME-OST0000*osc-f*.lock_unused_count)
+
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+
+ new_count=$($LCTL get_param -n \
+ ldlm.namespaces.$FSNAME-OST0000*osc-f*.lock_unused_count)
+ difference="$((new_count - count))"
+
+ # Test 15 output is divided by 1000 to map down to valid return
+ if [ $i -eq 15 ]; then
+ rc="$((rc * 1000))"
+ fi
+
+ if [ $difference -ne $rc ]; then
+ error "Ladvise test ${i}, bad lock count, returned " \
+ "${rc}, actual ${difference}"
+ fi
+ done
+
+ #test 22 returns only success/failure
+ i=22
+ lockahead_test -d $DIR/$tdir -t $i
+ rc=$?
+ if [ $rc -eq 255 ]; then
+ error "Ladvise test ${i} failed, ${rc}"
+ fi
+}
+run_test 255c "suite of ladvise lockahead tests"
+
test_256() {
local cl_user
local cat_sl
{"ladvise", lfs_ladvise, 0,
"Provide servers with advice about access patterns for a file.\n"
"usage: ladvise [--advice|-a ADVICE] [--start|-s START[kMGT]]\n"
- " [--background|-b]\n"
+ " [--background|-b] [--unset|-u]\n"
" {[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}\n"
+ " [--mode|-m {READ,WRITE}]\n"
" <file> ...\n"},
{"help", Parser_help, 0, "help"},
{"exit", Parser_quit, 0, "quit"},
static const char *const ladvise_names[] = LU_LADVISE_NAMES;
+static const char *const lock_mode_names[] = LOCK_MODE_NAMES;
+
+static const char *const lockahead_results[] = {
+ [LLA_RESULT_SENT] = "Lock request sent",
+ [LLA_RESULT_DIFFERENT] = "Different matching lock found",
+ [LLA_RESULT_SAME] = "Matching lock on identical extent found",
+};
+
+static int lfs_get_mode(const char *string)
+{
+ enum lock_mode_user mode;
+
+ for (mode = 0; mode < ARRAY_SIZE(lock_mode_names); mode++) {
+ if (lock_mode_names[mode] == NULL)
+ continue;
+ if (strcmp(string, lock_mode_names[mode]) == 0)
+ return mode;
+ }
+
+ return -EINVAL;
+}
+
static enum lu_ladvise_type lfs_get_ladvice(const char *string)
{
enum lu_ladvise_type advice;
{ .val = 'b', .name = "background", .has_arg = no_argument },
{ .val = 'e', .name = "end", .has_arg = required_argument },
{ .val = 'l', .name = "length", .has_arg = required_argument },
+ { .val = 'm', .name = "mode", .has_arg = required_argument },
{ .val = 's', .name = "start", .has_arg = required_argument },
+ { .val = 'u', .name = "unset", .has_arg = no_argument },
{ .name = NULL } };
- char short_opts[] = "a:be:l:s:";
+ char short_opts[] = "a:be:l:m:s:u";
int c;
int rc = 0;
const char *path;
unsigned long long length = 0;
unsigned long long size_units;
unsigned long long flags = 0;
+ int mode = 0;
optind = 0;
while ((c = getopt_long(argc, argv, short_opts,
case 'b':
flags |= LF_ASYNC;
break;
+ case 'u':
+ flags |= LF_UNSET;
+ break;
case 'e':
size_units = 1;
rc = llapi_parse_size(optarg, &end,
return CMD_HELP;
}
break;
+ case 'm':
+ mode = lfs_get_mode(optarg);
+ if (mode < 0) {
+ fprintf(stderr, "%s: bad mode '%s', valid "
+ "modes are READ or WRITE\n",
+ argv[0], optarg);
+ return CMD_HELP;
+ }
+ break;
case '?':
return CMD_HELP;
default:
return CMD_HELP;
}
+ if (advice_type == LU_LADVISE_LOCKNOEXPAND) {
+ fprintf(stderr, "%s: lock no-expand advice applies per "
+ "file descriptor, so it has no effect when "
+ "invoked from lfs\n", argv[0]);
+ return CMD_HELP;
+ }
+
if (argc <= optind) {
fprintf(stderr, "%s: please give one or more file names\n",
argv[0]);
return CMD_HELP;
}
+ if (advice_type != LU_LADVISE_LOCKAHEAD && mode != 0) {
+ fprintf(stderr, "%s: mode is only valid with lockahead\n",
+ argv[0]);
+ return CMD_HELP;
+ }
+
+ if (advice_type == LU_LADVISE_LOCKAHEAD && mode == 0) {
+ fprintf(stderr, "%s: mode is required with lockahead\n",
+ argv[0]);
+ return CMD_HELP;
+ }
+
while (optind < argc) {
int rc2;
advice.lla_value2 = 0;
advice.lla_value3 = 0;
advice.lla_value4 = 0;
+ if (advice_type == LU_LADVISE_LOCKAHEAD) {
+ advice.lla_lockahead_mode = mode;
+ advice.lla_peradvice_flags = flags;
+ }
+
rc2 = llapi_ladvise(fd, flags, 1, &advice);
close(fd);
if (rc2 < 0) {
"'%s': %s\n", argv[0],
ladvise_names[advice_type],
path, strerror(errno));
+
+ goto next;
}
+
next:
if (rc == 0 && rc2 < 0)
rc = rc2;
int llapi_ladvise(int fd, unsigned long long flags, int num_advise,
struct llapi_lu_ladvise *ladvise)
{
- int rc;
struct llapi_ladvise_hdr *ladvise_hdr;
+ int rc;
+ int i;
if (num_advise < 1 || num_advise >= LAH_COUNT_MAX) {
errno = EINVAL;
llapi_error(LLAPI_MSG_ERROR, -errno, "cannot give advice");
return -1;
}
+
+ /* Copy results back in to caller provided structs */
+ for (i = 0; i < num_advise; i++) {
+ struct llapi_lu_ladvise *ladvise_iter;
+
+ ladvise_iter = &ladvise_hdr->lah_advise[i];
+
+ if (ladvise_iter->lla_advice == LU_LADVISE_LOCKAHEAD)
+ ladvise[i].lla_lockahead_result =
+ ladvise_iter->lla_lockahead_result;
+ }
+
return 0;
}
CHECK_DEFINE_64X(OBD_CONNECT_MULTIMODRPCS);
CHECK_DEFINE_64X(OBD_CONNECT_DIR_STRIPE);
CHECK_DEFINE_64X(OBD_CONNECT_SUBTREE);
- CHECK_DEFINE_64X(OBD_CONNECT_LOCK_AHEAD);
+ CHECK_DEFINE_64X(OBD_CONNECT_LOCKAHEAD_OLD);
CHECK_DEFINE_64X(OBD_CONNECT_BULK_MBITS);
CHECK_DEFINE_64X(OBD_CONNECT_OBDOPACK);
CHECK_DEFINE_64X(OBD_CONNECT_FLAGS2);
CHECK_DEFINE_64X(OBD_CONNECT2_FILE_SECCTX);
+ CHECK_DEFINE_64X(OBD_CONNECT2_LOCKAHEAD);
CHECK_VALUE_X(OBD_CKSUM_CRC32);
CHECK_VALUE_X(OBD_CKSUM_ADLER);
OBD_CONNECT_DIR_STRIPE);
LASSERTF(OBD_CONNECT_SUBTREE == 0x800000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_SUBTREE);
- LASSERTF(OBD_CONNECT_LOCK_AHEAD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
- OBD_CONNECT_LOCK_AHEAD);
+ LASSERTF(OBD_CONNECT_LOCKAHEAD_OLD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT_LOCKAHEAD_OLD);
LASSERTF(OBD_CONNECT_BULK_MBITS == 0x2000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_BULK_MBITS);
LASSERTF(OBD_CONNECT_OBDOPACK == 0x4000000000000000ULL, "found 0x%.16llxULL\n",
OBD_CONNECT_FLAGS2);
LASSERTF(OBD_CONNECT2_FILE_SECCTX == 0x1ULL, "found 0x%.16llxULL\n",
OBD_CONNECT2_FILE_SECCTX);
+ LASSERTF(OBD_CONNECT2_LOCKAHEAD == 0x2ULL, "found 0x%.16llxULL\n",
+ OBD_CONNECT2_LOCKAHEAD);
LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
(unsigned)OBD_CKSUM_CRC32);
LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",