Whamcloud - gitweb
LU-6179 llite: Implement ladvise lockahead 64/13564/102
authorPatrick Farrell <paf@cray.com>
Thu, 14 Sep 2017 15:24:50 +0000 (10:24 -0500)
committerOleg Drokin <oleg.drokin@intel.com>
Thu, 21 Sep 2017 06:12:44 +0000 (06:12 +0000)
Ladvise lockahead is a new feature allowing userspace to
request extent locks in advance of the IO which will use
them. These locks are not expanded beyond the size
requested by userspace.  They are intended to make it
possible to address lock contention between multiple
clients resulting from lock expansion.  They should allow
optimizing various IO patterns, notably strided writing.
(Further information in LU-6179)

Asynchronous glimpse locks are a speculative version of
glimpse locks, and already implement the required behavior.
Lockahead requests share this behavior.

Additionally, lockahead creates extent locks in advance
of IO, and so breaks the assumption that the holder of the
highest lock knows the current file size.

So we also modify the ofd_intent_policy code to glimpse
PW locks until it finds one it knows to be in use, taking
care to send only one glimpse to each client.

The current patch allows asynchronous non-blocking lock
ahead requests and synchronous blocking requests.  We
cannot do asynchronous blocking requests, because of
deadlocks that occur in having ptlrpcd threads handle
blocking lock requests.

Finally, this patch also adds another advice to disable
lock expansion, setting a per-file descriptor flag.  This
allows user space to control whether or not lock requests
on this file descriptor will undergo lock expansion.

This means if lockahead locks are not created ahead of IO
(due to inherent raciness) or are cancelled by a competing
IO request, the IO requests that should have used the
manually requested locks will not result in expanded locks.
This avoids lock ping-pong, and because the resulting locks
will not extend to the end of the file, future lockahead
requests can be granted.  Effectively, this means that if
lockahead usage for strided IO is interrupted by a
competing request, it can re-assert itself.

Lockahead is implemented via the ladvise interface from
userspace.  As lockahead results in a DLM lock request
rather than file advice, we do not use the lower levels of
the ladvise implementation.

Note this patch has one oddity:
Cray released an earlier version of lockahead without
FL_SPECULATIVE support.  That version uses
OBD_CONNECT_LOCKAHEAD_OLD, this new one uses
OBD_CONNECT_LOCKAHEAD.

The client code in this patch is interoperable with that
version, so it also advertises OBD_CONNECT_LOCKAHEAD_OLD
support, but the server version is not, so the server
advertises only OBD_CONNECT_LOCKAHEAD support.

Client support for the original lockahead is slated for
removal after the release of 2.12.  This is enforced with
a compile time version test that will remove support.

Signed-off-by: Patrick Farrell <paf@cray.com>
Change-Id: I1e80286f54946a0df08b19b1339829fcfd1117e7
Reviewed-on: https://review.whamcloud.com/13564
Tested-by: Jenkins
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Frank Zago <fzago@cray.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
36 files changed:
Documentation/ladvise_lockahead.txt [new file with mode: 0644]
lustre/contrib/wireshark/lustre_dlm_flags_wshark.c
lustre/doc/lfs-ladvise.1
lustre/include/cl_object.h
lustre/include/lustre_dlm.h
lustre/include/lustre_dlm_flags.h
lustre/include/lustre_export.h
lustre/include/lustre_osc.h
lustre/include/obd_support.h
lustre/include/uapi/linux/lustre/lustre_idl.h
lustre/include/uapi/linux/lustre/lustre_user.h
lustre/ldlm/ldlm_extent.c
lustre/ldlm/ldlm_lib.c
lustre/ldlm/ldlm_lock.c
lustre/llite/file.c
lustre/llite/glimpse.c
lustre/llite/llite_internal.h
lustre/llite/llite_lib.c
lustre/llite/vvp_io.c
lustre/lov/lov_io.c
lustre/obdclass/cl_lock.c
lustre/obdclass/lprocfs_status.c
lustre/ofd/ofd_dev.c
lustre/ofd/ofd_dlm.c
lustre/ofd/ofd_internal.h
lustre/osc/osc_internal.h
lustre/osc/osc_lock.c
lustre/osc/osc_request.c
lustre/ptlrpc/wiretest.c
lustre/tests/Makefile.am
lustre/tests/lockahead_test.c [new file with mode: 0644]
lustre/tests/sanity.sh
lustre/utils/lfs.c
lustre/utils/liblustreapi_ladvise.c
lustre/utils/wirecheck.c
lustre/utils/wiretest.c

diff --git a/Documentation/ladvise_lockahead.txt b/Documentation/ladvise_lockahead.txt
new file mode 100644 (file)
index 0000000..91dffcd
--- /dev/null
@@ -0,0 +1,304 @@
+Ladvise Lock Ahead design
+
+Lock ahead is a new Lustre feature aimed at solving a long standing problem
+with shared file write performance in Lustre.  It requires client and server
+support.  It will be used primarily via the MPI-I/O library, not directly from
+user applications.
+
+The first part of this document (sections 1 and 2) is an overview of the
+problem and high level description of the solution.  Section 3 explains how the
+library will make use of this feature, and sections 4 and 5 describe the design
+of the Lustre changes.
+
+1. Overview: Purpose & Interface
+Lock ahead is intended to allow optimization of certain I/O patterns which
+would otherwise suffer LDLM* lock contention.  It allows applications to
+manually request locks on specific extents of a file, avoiding the usual
+server side optimizations.  This allows applications which know their I/O
+pattern to use that information to avoid false conflicts due to server side
+optimizations.
+
+*Lustre distributed lock manager.  This is the locking layer shared between
+clients and servers, to manage access between clients.
+
+Normally, clients get locks automatically as the first step of an I/O.
+The client asks for a lock which covers exactly the area of interest (ie, a
+read or write lock of n bytes at offset x), but the server attempts to optimize
+this by expanding the lock to cover as much of the file as possible.  This is
+useful for a single client, but can be trouble for multiple clients.
+
+In cases where multiple clients wish to write to the same file, this
+optimization can result in locks that conflict when the actual I/O operations
+do not.  This requires clients to wait for one another to complete I/O, even
+when there is no conflict between actual I/O requests.  This can significantly
+reduce performance (Anywhere from 40-90%, depending on system specs) for some
+workloads.
+
+The lockahead feature makes it possible to avoid this problem by acquiring the
+necessary locks in advance, by explicit requests with server side extent
+changes disabled.  We add a new lfs advice type, LU_LADVISE_LOCKAHEAD,
+which allows lock requests from userspace on the client, specifying the extent
+and the I/O mode (read/write) for the lock.  These lock requests explicitly
+disable server side changes to the lock extent, so the lock returned to the
+client covers only the extent requested.
+
+When using this feature, clients which intend to write to a file can request
+locks to cover their I/O pattern, wait a moment for the locks to be granted,
+then write or read the file.
+
+In this way, a set of clients which knows their I/O pattern in advance can
+force the LDLM layer to grant locks appropriate for that I/O pattern.  This
+allows applications which are poorly handled by the default lock optimization
+behavior to significantly improve their performance.
+
+2. I/O Pattern & Locking problems
+2. A. Strided writing and MPI-I/O
+There is a thorough explanation and overview of strided writing and the
+benefits of this functionality in the slides from the lock ahead presentation
+at LUG 2015.  It is highly recommended to read that first, as the graphics are
+much clearer than the prose here.
+
+See slides 1-13:
+http://wiki.lustre.org/images/f/f9/Shared-File-Performance-in-Lustre_Farrell.pdf
+
+MPI-I/O uses strided writing when doing I/O from a large job to a single file.
+I/O is aggregated from all the nodes running a particular application to a
+small number of I/O aggregator nodes which then write out the data, in a
+strided manner.
+
+In strided writing, different clients take turns writing different blocks of a
+file (A block is some arbitrary number of bytes).  Client 1 is responsible for
+writes to block 0, block 2, block 4, etc., client 2 is responsible for block 1,
+block 3, etc.
+
+Without the ability to manually request locks, strided writing is set up in
+concert with Lustre file striping so each client writes to one OST.  (IE, for a
+file striped to three OSTs, we would write from three clients.)
+
+The particular case of interest is when we want to use more than one client
+per OST.  This is important, because an OST typically has much more bandwidth
+than one client.  Strided writes are non-overlapping, so they should be able to
+proceed in parallel with more than one client per OST.  In practice, on Lustre,
+they do not, due to lock expansion.
+
+2. B. Locking problems
+We will now describe locking when there is more than one client per OST.  This
+behavior is the same on a per OST basis in a file striped across multiple OSTs.
+When the first client asks to write block 0, it asks for the required lock from
+the server.  When it receives this request, the server sees that there are no
+other locks on the file.  Since it assumes the client will want to write to the
+file again, the server expands the lock as far as possible.  In this case, it
+expands the lock to the maximum file size (effectively, to infinity), then
+grants it to client 1.
+
+When client 2 wants to write block 1, it conflicts with the expanded lock
+granted to client 1.  The server then must revoke (In Lustre terms,
+'call back') the lock granted to client 1 so it can grant a lock to client 2.
+After the lock granted to client 1 is revoked, there are no locks on the file.
+The server sees this when processing the lock request from client 2, and
+expands that lock to cover the whole file.
+
+Client 1 then wishes to write block 3 of the file...  And the cycle continues.
+The two clients exchange the extended lock throughout the write, allowing only
+one client to write at a time, plus latency to exchange the lock.  The effect is
+dramatic: Two clients are actually slower than one.  (Similar behavior is seen
+with more than two clients.)
+
+The solution is to use this new advice type to acquire locks before they are
+needed.  In effect, before it starts writing to the file, client 1 requests
+locks on block 0, block 2, etc. It locks 'ahead' a certain (tunable) number of
+locks. Client 2 does the same.  Then they both begin to write, and are able to
+do so in parallel.  A description of the actual library implementation follows.
+
+3. Library implementation
+Actually implementing this in the library carries a number of wrinkles.
+The basic pattern is this:
+Before writing, an I/O aggregator requests a certain number of locks on blocks
+that it is responsible for.  It may or may not ever write to these blocks, but
+it takes locks knowing it might.  It then begins to write, tracking how many of
+the locks it has used.  When the number of locks 'ahead' of the I/O is low
+enough, it requests more locks in advance of the I/O.
+
+For technical reasons which are explained in the implementation section, these
+lock requests are either asynchronous and non-blocking or synchronous and
+blocking.  In Lustre terms, non-blocking means if there is already a lock on
+the relevant extent of the file, the manual lock request is not granted.  This
+means that if there is already a lock on the file (quite common; imagine
+writing to a file which was previously read by another process), these lock
+requests will be denied.  However, once the first 'real' write arrives that
+was hoping to use a lockahead lock, that write will cause the blocking lock to
+be cancelled, so this interference is not fatal.
+
+It is of course possible for another process to get in the way by immediately
+asking for a lock on the file.  This is something users should try to avoid.
+When writing out a file, repeatedly trying to read it will impact performance
+even without this feature.
+
+These interfering locks can also happen if a manually requested lock is, for
+some reason, not available in time for the write which intended to use it.
+The lock which results from this write request is expanded using the
+normal rules.  So it's possible for that lock (depending on the position of
+other locks at the time) to be extended to cover the rest of the file.  That
+will block future lockahead locks.
+
+The expanded lock will be revoked when a write happens (from another client)
+in the range covered by that lock, but the lock for that write will be expanded
+as well - And then we return to handing the lock back and forth between
+clients.  These expanded locks will still block future lockahead locks,
+rendering them useless.
+
+The way to avoid this is to turn off lock expansion for I/Os which are
+supposed to be using these manually requested locks.  That way, if the
+manually requested lock is not available, the lock request for the I/O will not
+be expanded.  Instead, that request (which is blocking, unlike a lockahead
+request) will cancel any interfering locks, but the resulting lock will not be
+expanded.  This leaves the later parts of the file open, allowing future
+manual lock requests to succeed.  This means that if an interfering lock blocks
+some manual requests, those are lost, but the next set of manual requests can
+proceed as normal.
+
+In effect, the 'locking ahead of I/O' is interrupted, but then is able to
+re-assert itself. The feature used here is referred to as 'no expansion'
+locking (as only the extent required by the actual I/O operation is locked)
+and is turned on with another new ladvise advice, LU_LADVISE_NOEXPAND.  This
+feature is added as part of the lockahead patch.  The strided writing library
+will use this advice on the file descriptor it uses for writing.
+
+4. Client side design
+4. A. Ladvise lockahead
+Requestlock uses the existing asynchronous lock request functionality
+implemented for asynchronous glimpse locks (AGLs), a long standing Lustre
+feature.  AGLs are locks requested by statahead, used to get file size
+information before an application asks for it.  The key thing about an
+asynchronous lock request is that it does not have a specific I/O operation
+waiting for the lock.
+
+This means two key things:
+
+1. There is no OSC lock (lock layer above LDLM for data locking) associated
+with the LDLM lock
+2. There is no thread waiting for the LDLM lock, so lock grant processing
+must be handled by the ptlrpc daemon thread which received the reply
+
+Since both of these issues are addressed by the asynchronous lock request code
+which lockahead shares with AGL, we will not explore them in depth here.
+
+Finally, lockahead requests set the CEF_LOCK_NO_EXPAND flag, which tells the
+OSC (the per OST layer of the client) to set LDLM_FL_NO_EXPANSION on any lock
+requests.  LDLM_FL_NO_EXPANSION is a new LDLM lock flag which tells the server
+not to expand the lock extent.
+
+This leaves the user facing interface.  Requestlock is implemented as a new
+ladvise advice, and it uses the ladvise feature of multiple advices in one API
+call to put many lock requests into an array of advices.
+
+The arguments required for this advice are a mode (read or write), range (start
+and end), and flags.
+
+The client will then make lock requests on these extents, one at a time.
+Because the lock requests are asynchronous (replies are handled by ptlrpcd),
+many requests can be made quickly by overlapping them, rather than waiting for
+each one to complete.  (This requires that they be non-blocking, as the
+ptlrpcd threads must not wait in the ldlm layer.)
+
+4. B. LU_LADVISE_LOCKNOEXPAND
+The lock no expand ladvise advice sets a boolean in a Lustre data structure
+associated with a file descriptor.  When an I/O is done to this file
+descriptor, the flag is picked up and passed through to the ldlm layer, where
+it sets LDLM_FL_NO_EXPANSION on lock requests made for that I/O.
+
+5. Server side changes
+Implementing lockahead requires server support for LDLM_FL_NO_EXPANSION, but
+it also requires an additional pair of server side changes to fix issues which
+came up because of lockahead.  These changes are not part of the core design;
+instead, they are separate fixes which are required for it to work.
+
+5. A. Support LDLM_FL_NO_EXPANSION
+
+Disabling server side lock expansion is done with a new LDLM flag.  This is
+done with a simple check for that flag on the server before attempting to
+expand the lock.  If the flag is found, lock expansion is skipped.
+
+5. B. Implement LDLM_FL_SPECULATIVE
+
+As described above, lock ahead locks are non-blocking. The BLOCK_NOWAIT LDLM
+flag already implements some non-blocking behavior, but it only considers
+group locks to be blocking.  For asynchronous lock requests to work correctly,
+they cannot wait for any other locks.  For this purpose, we add
+LDLM_FL_SPECULATIVE.  This new flag is used for asynchronous lock requests,
+and implements the broader non-blocking behavior they require.
+
+5. C. File size & ofd_intent_policy changes
+
+Knowing the current file size during writes is tricky on a distributed file
+system, because multiple clients can be writing to a file at any time.  When
+writes are in progress, the server must identify which client is currently
+responsible for growing the file size, and ask that client what the file size
+is.
+
+To do this, the server uses glimpse locking (in ofd_intent_policy) to get the
+current file size from the clients.  This code uses the assumption that the
+holder of the highest write lock (PW lock) knows the current file size.  A
+client learns the (then current) file size when a lock is granted.  Because
+only the holder of the highest lock can grow a file, either the size hasn't
+changed, or that client knows the new size; so the server only has to contact
+the client which holds this lock, and it knows the current file size.
+
+Note that the above is actually racy. When the server asks, the client can
+still be writing, or another client could acquire a higher lock during this
+time.  The goal is a good approximation while the file is being written, and a
+correct answer once all the clients are done writing.  This is achieved because
+once writes to a file are complete, the holder of that highest lock is
+guaranteed to know the current file size.  This is where manually requested
+locks cause trouble.
+
+By creating write locks in advance of an actual I/O, lockahead breaks the
+assumption that the holder of the highest lock knows the file size.
+
+This assumption is normally true because locks which are created as part of
+IO - rather than in advance of it - are guaranteed to be 'active', IE,
+involved in IO, and the holder of the highest 'active' lock always knows the
+current file size, because the size is either not changing or the holder of
+that lock is responsible for updating it.
+
+Consider:  Two clients, A and B, strided writing.  Each client requests, for
+example, 2 manually requested locks.  (Real numbers are much higher.)  Client A
+holds locks on segments 0 and 2, client B holds locks on segments 1 and 3.
+
+The request comes to write 3 segments of data.  Client A writes to segment 0,
+client B writes to segment 1, and client A also writes to segment 2.  No data
+is written to segment 3.  At this point, the server checks the file size, by
+glimpsing the highest lock: the lock on segment 3.  Client B does not know
+about the writing done by client A to segment 2, so it gives an incorrect file
+size.
+
+This would be OK if client B had pending writes to segment 3, but it does not.
+In this situation, the server will never get the correct file size while this
+lock exists.
+
+The solution is relatively straightforward: The server needs to glimpse every
+client holding a write lock (starting from the top) until we find one holding
+an 'active' lock (because the size is known to be at least the size returned
+from an 'active' lock), and take the largest size returned. This avoids asking
+only a client which may not know the correct file size.
+
+Unfortunately, there is no way to know if a manually requested lock is active
+from the server side.  So when we see such a lock, we must send a glimpse to
+the holder (unless we have already sent a glimpse to that client*).  However,
+because locks without LDLM_FL_NO_EXPANSION set are guaranteed to be 'active',
+once we reach the first such lock, we can stop glimpsing.
+
+*This is because when we glimpse a specific lock, the client holding it returns
+its best idea of the size information, so we only need to send one glimpse to
+each client.
+
+This is less efficient than the standard "glimpse only the top lock"
+methodology, but since we only need to glimpse one lock per client (and the
+number of clients writing to the part of a file on a given OST is fairly
+limited), the cost is restrained.
+
+Additionally, lock cancellation methods such as early lock cancel aggressively
+clean up older locks, particularly when the LRU limit is exceeded, so the
+total lock count should also remain manageable.
+
+In the end, the verdict here is performance: Requestlock testing for the
+strided I/O case has shown good results.
index eb091fb..c94867e 100644 (file)
@@ -11,6 +11,7 @@ static int hf_lustre_ldlm_fl_lock_changed        = -1;
 static int hf_lustre_ldlm_fl_block_granted       = -1;
 static int hf_lustre_ldlm_fl_block_conv          = -1;
 static int hf_lustre_ldlm_fl_block_wait          = -1;
+static int hf_lustre_ldlm_fl_speculative         = -1;
 static int hf_lustre_ldlm_fl_ast_sent            = -1;
 static int hf_lustre_ldlm_fl_replay              = -1;
 static int hf_lustre_ldlm_fl_intent_only         = -1;
@@ -22,6 +23,7 @@ static int hf_lustre_ldlm_fl_block_nowait        = -1;
 static int hf_lustre_ldlm_fl_test_lock           = -1;
 static int hf_lustre_ldlm_fl_cancel_on_block     = -1;
 static int hf_lustre_ldlm_fl_cos_incompat        = -1;
+static int hf_lustre_ldlm_fl_no_expansion        = -1;
 static int hf_lustre_ldlm_fl_deny_on_contention  = -1;
 static int hf_lustre_ldlm_fl_ast_discard_data    = -1;
 
@@ -30,6 +32,7 @@ const value_string lustre_ldlm_flags_vals[] = {
   {LDLM_FL_BLOCK_GRANTED,       "LDLM_FL_BLOCK_GRANTED"},
   {LDLM_FL_BLOCK_CONV,          "LDLM_FL_BLOCK_CONV"},
   {LDLM_FL_BLOCK_WAIT,          "LDLM_FL_BLOCK_WAIT"},
+  {LDLM_FL_SPECULATIVE,         "LDLM_FL_SPECULATIVE"},
   {LDLM_FL_AST_SENT,            "LDLM_FL_AST_SENT"},
   {LDLM_FL_REPLAY,              "LDLM_FL_REPLAY"},
   {LDLM_FL_INTENT_ONLY,         "LDLM_FL_INTENT_ONLY"},
@@ -41,6 +44,7 @@ const value_string lustre_ldlm_flags_vals[] = {
   {LDLM_FL_TEST_LOCK,           "LDLM_FL_TEST_LOCK"},
   {LDLM_FL_CANCEL_ON_BLOCK,     "LDLM_FL_CANCEL_ON_BLOCK"},
   {LDLM_FL_COS_INCOMPAT,        "LDLM_FL_COS_INCOMPAT"},
+  {LDLM_FL_NO_EXPANSION,        "LDLM_FL_NO_EXPANSION"},
   {LDLM_FL_DENY_ON_CONTENTION,  "LDLM_FL_DENY_ON_CONTENTION"},
   {LDLM_FL_AST_DISCARD_DATA,    "LDLM_FL_AST_DISCARD_DATA"},
   { 0, NULL }
@@ -73,6 +77,7 @@ lustre_dissect_element_ldlm_lock_flags(
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_granted);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_conv);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_block_wait);
+  dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_speculative);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_ast_sent);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_replay);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_intent_only);
@@ -84,6 +89,7 @@ lustre_dissect_element_ldlm_lock_flags(
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_test_lock);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_cancel_on_block);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_cos_incompat);
+  dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_no_expansion);
   dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_deny_on_contention);
   return
     dissect_uint32(tvb, offset, pinfo, tree, hf_lustre_ldlm_fl_ast_discard_data);
@@ -147,6 +153,21 @@ lustre_dissect_element_ldlm_lock_flags(
     }
   },
   {
+    /* p_id    */ &hf_lustre_ldlm_fl_speculative,
+    /* hfinfo  */ {
+      /* name    */ "LDLM_FL_SPECULATIVE",
+      /* abbrev  */ "lustre.ldlm_fl_speculative",
+      /* type    */ FT_BOOLEAN,
+      /* display */ 32,
+      /* strings */ TFS(&lnet_flags_set_truth),
+      /* bitmask */ LDLM_FL_SPECULATIVE,
+      /* blurb   */ "Lock request is speculative/asynchronous, and cannot\n"
+       "wait for any reason.  Fail the lock request if any blocking locks\n"
+       "encountered."
+      /* id      */ HFILL
+    }
+  },
+  {
     /* p_id    */ &hf_lustre_ldlm_fl_ast_sent,
     /* hfinfo  */ {
       /* name    */ "LDLM_FL_AST_SENT",
@@ -298,6 +319,21 @@ lustre_dissect_element_ldlm_lock_flags(
     }
   },
   {
+    /* p_id    */ &hf_lustre_ldlm_fl_no_expansion,
+    /* hfinfo  */ {
+      /* name    */ "LDLM_FL_NO_EXPANSION",
+      /* abbrev  */ "lustre.ldlm_fl_NO_EXPANSION",
+      /* type    */ FT_BOOLEAN,
+      /* display */ 32,
+      /* strings */ TFS(&lnet_flags_set_truth),
+      /* bitmask */ LDLM_FL_NO_EXPANSION,
+      /* blurb   */ "Do not expand this lock.  Grant it only on the extent\n"
+       "requested. Used for manually requested locks from the client\n"
+       "(LU_LADVISE_LOCKAHEAD)."
+      /* id      */ HFILL
+    }
+  },
+  {
     /* p_id    */ &hf_lustre_ldlm_fl_deny_on_contention,
     /* hfinfo  */ {
       /* name    */ "LDLM_FL_DENY_ON_CONTENTION",
index b676480..c6a1f05 100644 (file)
@@ -6,6 +6,7 @@ lfs ladvise \- give file access advices or hints to server.
 .B lfs ladvise [--advice|-a ADVICE ] [--background|-b]
         \fB[--start|-s START[kMGT]]
         \fB{[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}
+        \fB{[--mode|-m MODE] | [--unset|-u]}
         \fB<FILE> ...\fR
 .br
 .SH DESCRIPTION
@@ -24,6 +25,9 @@ Give advice or hint of type \fIADVICE\fR. Advice types are:
 \fBwillread\fR to prefetch data into server cache
 .TP
 \fBdontneed\fR to cleanup data cache on server
+.TP
+\fBlockahead\fR to request a lock on a specified extent of a file
+.TP
+\fBlocknoexpand\fR to disable server side lock expansion for a file
 .RE
 .TP
 \fB\-b\fR, \fB\-\-background
@@ -39,6 +43,13 @@ This option may not be specified at the same time as the -l option.
 \fB\-l\fR, \fB\-\-length\fR=\fILENGTH\fR
 File range has length of \fILENGTH\fR. This option may not be specified at the
 same time as the -e option.
+.TP
+\fB\-m\fR, \fB\-\-mode\fR=\fIMODE\fR
+Specify the lock \fIMODE\fR. This option is only valid with lockahead
+advice.  Valid modes are: READ, WRITE
+.TP
+\fB\-u\fR, \fB\-\-unset\fR=\fIUNSET\fR
+Unset the previous advice.  Currently only valid with locknoexpand advice.
 .SH NOTE
 .PP
 Typically,
@@ -70,6 +81,14 @@ that the first 1GB of that file will be read soon.
 This gives the OST(s) holding the first 1GB of \fB/mnt/lustre/file1\fR a hint
 that the first 1GB of file will not be read in the near future, thus the OST(s)
 could clear the cache of that file in the memory.
+.B $ lfs ladvise -a lockahead -s 0 -e 1048576 -m READ /mnt/lustre/file1
+Request a read lock on the first 1 MiB of /mnt/lustre/file1.
+.B $ lfs ladvise -a lockahead -s 0 -e 4096 -m WRITE ./file1
+Request a write lock on the first 4 KiB of ./file1.
+.B $ lfs ladvise -a locknoexpand ./file1
+Disable lock expansion on ./file1.
+.B $ lfs ladvise -a locknoexpand -u ./file1
+Re-enable lock expansion on ./file1.
 .SH AVAILABILITY
 The lfs ladvise command is part of the Lustre filesystem.
 .SH SEE ALSO
index 78d0926..00bd414 100644 (file)
@@ -1607,25 +1607,30 @@ enum cl_enq_flags {
          * -EWOULDBLOCK is returned immediately.
          */
         CEF_NONBLOCK     = 0x00000001,
-        /**
-         * take lock asynchronously (out of order), as it cannot
-         * deadlock. This is for LDLM_FL_HAS_INTENT locks used for glimpsing.
-         */
-        CEF_ASYNC        = 0x00000002,
+       /**
+        * Tell lower layers this is a glimpse request, translated to
+        * LDLM_FL_HAS_INTENT at LDLM layer.
+        *
+        * Also, because glimpse locks never block other locks, we count this
+        * as automatically compatible with other osc locks.
+        * (see osc_lock_compatible)
+        */
+       CEF_GLIMPSE        = 0x00000002,
         /**
          * tell the server to instruct (though a flag in the blocking ast) an
          * owner of the conflicting lock, that it can drop dirty pages
          * protected by this lock, without sending them to the server.
          */
         CEF_DISCARD_DATA = 0x00000004,
-        /**
-         * tell the sub layers that it must be a `real' lock. This is used for
-         * mmapped-buffer locks and glimpse locks that must be never converted
-         * into lockless mode.
-         *
-         * \see vvp_mmap_locks(), cl_glimpse_lock().
-         */
-        CEF_MUST         = 0x00000008,
+       /**
+        * tell the sub layers that it must be a `real' lock. This is used for
+        * mmapped-buffer locks, glimpse locks, manually requested locks
+        * (LU_LADVISE_LOCKAHEAD) that must never be converted into lockless
+        * mode.
+        *
+        * \see vvp_mmap_locks(), cl_glimpse_lock, cl_request_lock().
+        */
+       CEF_MUST         = 0x00000008,
         /**
          * tell the sub layers that never request a `real' lock. This flag is
          * not used currently.
@@ -1638,9 +1643,16 @@ enum cl_enq_flags {
          */
         CEF_NEVER        = 0x00000010,
         /**
-         * for async glimpse lock.
+        * tell the dlm layer this is a speculative lock request.
+        * Speculative lock requests are locks which are not requested as part
+        * of an I/O operation.  Instead, they are requested because we expect
+        * to use them in the future.  They are requested asynchronously at the
+        * ptlrpc layer.
+        *
+        * Currently used for asynchronous glimpse locks and manually requested
+        * locks (LU_LADVISE_LOCKAHEAD).
          */
-        CEF_AGL          = 0x00000020,
+       CEF_SPECULATIVE          = 0x00000020,
        /**
         * enqueue a lock to test DLM lock existence.
         */
@@ -1651,9 +1663,13 @@ enum cl_enq_flags {
         */
        CEF_LOCK_MATCH  = 0x00000080,
        /**
+        * tell the DLM layer to lock only the requested range
+        */
+       CEF_LOCK_NO_EXPAND    = 0x00000100,
+       /**
         * mask of enq_flags.
         */
-       CEF_MASK         = 0x000000ff,
+       CEF_MASK         = 0x000001ff,
 };
 
 /**
@@ -1871,7 +1887,9 @@ struct cl_io {
         */
                             ci_noatime:1,
        /** Set to 1 if parallel execution is allowed for current I/O? */
-                            ci_pio:1;
+                            ci_pio:1,
+       /* Tell sublayers not to expand LDLM locks requested for this IO */
+                            ci_lock_no_expand:1;
        /**
         * Number of pages owned by this IO. For invariant checking.
         */
index 66e90c2..62382da 100644 (file)
@@ -607,8 +607,8 @@ struct ldlm_cb_async_args {
        struct ldlm_lock        *ca_lock;
 };
 
-/** The ldlm_glimpse_work is allocated on the stack and should not be freed. */
-#define LDLM_GL_WORK_NOFREE 0x1
+/** The ldlm_glimpse_work was slab allocated & must be freed accordingly.*/
+#define LDLM_GL_WORK_SLAB_ALLOCATED 0x1
 
 /** Interval node data for each LDLM_EXTENT lock. */
 struct ldlm_interval {
 
index 179cb71..7912883 100644 (file)
 #define ldlm_set_block_wait(_l)         LDLM_SET_FLAG((  _l), 1ULL <<  3)
 #define ldlm_clear_block_wait(_l)       LDLM_CLEAR_FLAG((_l), 1ULL <<  3)
 
+/**
+ * Lock request is speculative/asynchronous, and cannot wait for any reason.
+ * Fail the lock request if any blocking locks are encountered.
+ */
+#define LDLM_FL_SPECULATIVE            0x0000000000000010ULL /* bit   4 */
+#define ldlm_is_speculative(_l)                LDLM_TEST_FLAG((_l), 1ULL <<  4)
+#define ldlm_set_speculative(_l)       LDLM_SET_FLAG((_l), 1ULL <<  4)
+#define ldlm_clear_speculative(_l)     LDLM_CLEAR_FLAG((_l), 1ULL <<  4)
+
 /** blocking or cancel packet was queued for sending. */
 #define LDLM_FL_AST_SENT                0x0000000000000020ULL // bit   5
 #define ldlm_is_ast_sent(_l)            LDLM_TEST_FLAG(( _l), 1ULL <<  5)
 #define ldlm_clear_cos_incompat(_l)    LDLM_CLEAR_FLAG((_l), 1ULL << 24)
 
 /**
+ * Part of original lockahead implementation, OBD_CONNECT_LOCKAHEAD_OLD.
+ * Reserved temporarily to allow those implementations to keep working.
+ * Will be removed after 2.12 release.
+ */
+#define LDLM_FL_LOCKAHEAD_OLD_RESERVED 0x0000000010000000ULL /* bit  28 */
+#define ldlm_is_do_not_expand_io(_l)    LDLM_TEST_FLAG((_l), 1ULL << 28)
+#define ldlm_set_do_not_expand_io(_l)   LDLM_SET_FLAG((_l), 1ULL << 28)
+#define ldlm_clear_do_not_expand_io(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 28)
+
+/**
+ * Do not expand this lock.  Grant it only on the extent requested.
+ * Used for manually requested locks from the client (LU_LADVISE_LOCKAHEAD).
+ */
+#define LDLM_FL_NO_EXPANSION           0x0000000020000000ULL /* bit  29 */
+#define ldlm_is_do_not_expand(_l)      LDLM_TEST_FLAG((_l), 1ULL << 29)
+#define ldlm_set_do_not_expand(_l)     LDLM_SET_FLAG((_l), 1ULL << 29)
+#define ldlm_clear_do_not_expand(_l)   LDLM_CLEAR_FLAG((_l), 1ULL << 29)
+
+/**
  * measure lock contention and return -EUSERS if locking contention is high */
 #define LDLM_FL_DENY_ON_CONTENTION        0x0000000040000000ULL // bit  30
 #define ldlm_is_deny_on_contention(_l)    LDLM_TEST_FLAG(( _l), 1ULL << 30)
 #define LDLM_FL_GONE_MASK              (LDLM_FL_DESTROYED              |\
                                         LDLM_FL_FAILED)
 
-/** l_flags bits marked as "inherit" bits */
-/* Flags inherited from wire on enqueue/reply between client/server. */
-/* NO_TIMEOUT flag to force ldlm_lock_match() to wait with no timeout. */
-/* TEST_LOCK flag to not let TEST lock to be granted. */
+/** l_flags bits marked as "inherit" bits
+ * Flags inherited from wire on enqueue/reply between client/server.
+ * CANCEL_ON_BLOCK so server will not grant if a blocking lock is found
+ * NO_TIMEOUT flag to force ldlm_lock_match() to wait with no timeout.
+ * TEST_LOCK flag to not let TEST lock to be granted.
+ * NO_EXPANSION to tell server not to expand extent of lock request */
 #define LDLM_FL_INHERIT_MASK            (LDLM_FL_CANCEL_ON_BLOCK       |\
                                         LDLM_FL_NO_TIMEOUT             |\
-                                        LDLM_FL_TEST_LOCK)
+                                        LDLM_FL_TEST_LOCK              |\
+                                        LDLM_FL_NO_EXPANSION)
 
 /** flags returned in @flags parameter on ldlm_lock_enqueue,
  * to be re-constructed on re-send */
 
index 1c2e347..ac05be7 100644 (file)
@@ -318,6 +318,16 @@ static inline __u64 exp_connect_flags(struct obd_export *exp)
        return *exp_connect_flags_ptr(exp);
 }
 
+static inline __u64 *exp_connect_flags2_ptr(struct obd_export *exp)
+{
+       return &exp->exp_connect_data.ocd_connect_flags2;
+}
+
+static inline __u64 exp_connect_flags2(struct obd_export *exp)
+{
+       return *exp_connect_flags2_ptr(exp);
+}
+
 static inline int exp_max_brw_size(struct obd_export *exp)
 {
        LASSERT(exp != NULL);
@@ -420,6 +430,16 @@ static inline int exp_connect_large_acl(struct obd_export *exp)
        return !!(exp_connect_flags(exp) & OBD_CONNECT_LARGE_ACL);
 }
 
+static inline int exp_connect_lockahead_old(struct obd_export *exp)
+{
+       return !!(exp_connect_flags(exp) & OBD_CONNECT_LOCKAHEAD_OLD);
+}
+
+static inline int exp_connect_lockahead(struct obd_export *exp)
+{
+       return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LOCKAHEAD);
+}
+
 extern struct obd_export *class_conn2export(struct lustre_handle *conn);
 extern struct obd_device *class_conn2obd(struct lustre_handle *conn);
 
index f32f7a4..124300e 100644 (file)
@@ -390,7 +390,16 @@ struct osc_lock {
        /**
         * For async glimpse lock.
         */
-                               ols_agl:1;
+                               ols_agl:1,
+       /**
+        * for speculative locks - asynchronous glimpse locks and ladvise
+        * lockahead manual lock requests
+        *
+        * Used to tell osc layer to not wait for the ldlm reply from the
+        * server, so the osc lock will be short lived - it only exists to
+        * create the ldlm request and is not updated on request completion.
+        */
+                               ols_speculative:1;
 };
 
 
index 54843ee..e82f36d 100644 (file)
@@ -328,6 +328,7 @@ extern char obd_jobid_var[];
 #define OBD_FAIL_OST_LADVISE_PAUSE      0x237
 #define OBD_FAIL_OST_FAKE_RW            0x238
 #define OBD_FAIL_OST_LIST_ASSERT         0x239
+#define OBD_FAIL_OST_GL_WORK_ALLOC      0x240
 
 #define OBD_FAIL_LDLM                    0x300
 #define OBD_FAIL_LDLM_NAMESPACE_NEW      0x301
 
index 530e058..597bc36 100644 (file)
@@ -794,13 +794,15 @@ struct ptlrpc_body_v2 {
                                                         RPCs in parallel */
 #define OBD_CONNECT_DIR_STRIPE  0x400000000000000ULL /* striped DNE dir */
 #define OBD_CONNECT_SUBTREE    0x800000000000000ULL /* fileset mount */
-#define OBD_CONNECT_LOCK_AHEAD  0x1000000000000000ULL /* lock ahead */
+#define OBD_CONNECT_LOCKAHEAD_OLD 0x1000000000000000ULL /* Old Cray lockahead */
+
 /** bulk matchbits is sent within ptlrpc_body */
 #define OBD_CONNECT_BULK_MBITS  0x2000000000000000ULL
 #define OBD_CONNECT_OBDOPACK    0x4000000000000000ULL /* compact OUT obdo */
 #define OBD_CONNECT_FLAGS2      0x8000000000000000ULL /* second flags word */
 /* ocd_connect_flags2 flags */
 #define OBD_CONNECT2_FILE_SECCTX       0x1ULL /* set file security context at create */
+#define OBD_CONNECT2_LOCKAHEAD 0x2ULL /* ladvise lockahead v2 */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
 
@@ -867,8 +869,9 @@ struct ptlrpc_body_v2 {
                                OBD_CONNECT_LAYOUTLOCK | OBD_CONNECT_FID | \
                                OBD_CONNECT_PINGLESS | OBD_CONNECT_LFSCK | \
                                OBD_CONNECT_BULK_MBITS | \
-                               OBD_CONNECT_GRANT_PARAM)
-#define OST_CONNECT_SUPPORTED2 0
+                               OBD_CONNECT_GRANT_PARAM | OBD_CONNECT_FLAGS2)
+
+#define OST_CONNECT_SUPPORTED2 OBD_CONNECT2_LOCKAHEAD
 
 #define ECHO_CONNECT_SUPPORTED 0
 #define ECHO_CONNECT_SUPPORTED2 0
 
@@ -2291,6 +2294,12 @@ struct ldlm_extent {
         __u64 gid;
 };
 
+static inline bool ldlm_extent_equal(const struct ldlm_extent *ex1,
+                                   const struct ldlm_extent *ex2)
+{
+       return ex1->start == ex2->start && ex1->end == ex2->end;
+}
+
 struct ldlm_inodebits {
         __u64 bits;
        __u64 try_bits; /* optional bits to try */
index 6ae38c4..bb18450 100644 (file)
@@ -1551,11 +1551,16 @@ enum lu_ladvise_type {
        LU_LADVISE_INVALID      = 0,
        LU_LADVISE_WILLREAD     = 1,
        LU_LADVISE_DONTNEED     = 2,
+       LU_LADVISE_LOCKNOEXPAND = 3,
+       LU_LADVISE_LOCKAHEAD    = 4,
+       LU_LADVISE_MAX
 };
 
 #define LU_LADVISE_NAMES {                                             \
-       [LU_LADVISE_WILLREAD]   = "willread",                           \
-       [LU_LADVISE_DONTNEED]   = "dontneed",                           \
+       [LU_LADVISE_WILLREAD]           = "willread",                   \
+       [LU_LADVISE_DONTNEED]           = "dontneed",                   \
+       [LU_LADVISE_LOCKNOEXPAND]       = "locknoexpand",               \
+       [LU_LADVISE_LOCKAHEAD]          = "lockahead",                  \
 }
 
 /* This is the userspace argument for ladvise.  It is currently the same as
@@ -1573,10 +1578,20 @@ struct llapi_lu_ladvise {
 
 enum ladvise_flag {
        LF_ASYNC        = 0x00000001,
+       LF_UNSET        = 0x00000002,
 };
 
 #define LADVISE_MAGIC 0x1ADF1CE0
-#define LF_MASK LF_ASYNC
+/* Masks of valid flags for each advice */
+#define LF_LOCKNOEXPAND_MASK LF_UNSET
+/* Flags valid for all advices not explicitly specified */
+#define LF_DEFAULT_MASK LF_ASYNC
+/* All flags */
+#define LF_MASK (LF_ASYNC | LF_UNSET)
+
+#define lla_lockahead_mode   lla_value1
+#define lla_peradvice_flags    lla_value2
+#define lla_lockahead_result lla_value3
 
 /* This is the userspace argument for ladvise, corresponds to ladvise_hdr which
  * is used on the wire.  It is defined separately as we may need info which is
 
@@ -1619,5 +1634,23 @@ struct sk_hmac_type {
        size_t   sht_bytes;
 };
 
+enum lock_mode_user {
+       MODE_READ_USER = 1,
+       MODE_WRITE_USER,
+       MODE_MAX_USER,
+};
+
+#define LOCK_MODE_NAMES { \
+       [MODE_READ_USER]  = "READ",\
+       [MODE_WRITE_USER] = "WRITE"\
+}
+
+enum lockahead_results {
+       LLA_RESULT_SENT = 0,
+       LLA_RESULT_DIFFERENT,
+       LLA_RESULT_SAME,
+};
+
 /** @} lustreuser */
+
 #endif /* _LUSTRE_USER_H */
index a950b0b..5001b66 100644 (file)
@@ -269,32 +269,43 @@ ldlm_extent_internal_policy_waiting(struct ldlm_lock *req,
 static void ldlm_extent_policy(struct ldlm_resource *res,
                               struct ldlm_lock *lock, __u64 *flags)
 {
-        struct ldlm_extent new_ex = { .start = 0, .end = OBD_OBJECT_EOF };
-
-        if (lock->l_export == NULL)
-                /*
-                 * this is local lock taken by server (e.g., as a part of
-                 * OST-side locking, or unlink handling). Expansion doesn't
-                 * make a lot of sense for local locks, because they are
-                 * dropped immediately on operation completion and would only
-                 * conflict with other threads.
-                 */
-                return;
+       struct ldlm_extent new_ex = { .start = 0, .end = OBD_OBJECT_EOF };
+
+       if (lock->l_export == NULL)
+               /*
+                * this is a local lock taken by server (e.g., as a part of
+                * OST-side locking, or unlink handling). Expansion doesn't
+                * make a lot of sense for local locks, because they are
+                * dropped immediately on operation completion and would only
+                * conflict with other threads.
+                */
+               return;
 
-        if (lock->l_policy_data.l_extent.start == 0 &&
-            lock->l_policy_data.l_extent.end == OBD_OBJECT_EOF)
-                /* fast-path whole file locks */
-                return;
+       if (lock->l_policy_data.l_extent.start == 0 &&
+           lock->l_policy_data.l_extent.end == OBD_OBJECT_EOF)
+               /* fast-path whole file locks */
+               return;
 
-        ldlm_extent_internal_policy_granted(lock, &new_ex);
-        ldlm_extent_internal_policy_waiting(lock, &new_ex);
+       /* Because reprocess_queue zeroes flags and uses it to return
+        * LDLM_FL_LOCK_CHANGED, we must check for the NO_EXPANSION flag
+        * in the lock flags rather than the 'flags' argument */
+       if (likely(!(lock->l_flags & LDLM_FL_NO_EXPANSION))) {
+               ldlm_extent_internal_policy_granted(lock, &new_ex);
+               ldlm_extent_internal_policy_waiting(lock, &new_ex);
+       } else {
+               LDLM_DEBUG(lock, "Not expanding manually requested lock.\n");
+               new_ex.start = lock->l_policy_data.l_extent.start;
+               new_ex.end = lock->l_policy_data.l_extent.end;
+               /* In case the request is not on correct boundaries, we call
+                * fixup. (normally called in ldlm_extent_internal_policy_*) */
+               ldlm_extent_internal_policy_fixup(lock, &new_ex, 0);
+       }
 
-        if (new_ex.start != lock->l_policy_data.l_extent.start ||
-            new_ex.end != lock->l_policy_data.l_extent.end) {
-                *flags |= LDLM_FL_LOCK_CHANGED;
-                lock->l_policy_data.l_extent.start = new_ex.start;
-                lock->l_policy_data.l_extent.end = new_ex.end;
-        }
+       if (!ldlm_extent_equal(&new_ex, &lock->l_policy_data.l_extent)) {
+               *flags |= LDLM_FL_LOCK_CHANGED;
+               lock->l_policy_data.l_extent.start = new_ex.start;
+               lock->l_policy_data.l_extent.end = new_ex.end;
+       }
 }
 
 static int ldlm_check_contention(struct ldlm_lock *lock, int contended_locks)
@@ -421,7 +432,8 @@ ldlm_extent_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                         }
 
                         if (tree->lit_mode == LCK_GROUP) {
                         }
 
                         if (tree->lit_mode == LCK_GROUP) {
-                                if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+                               if (*flags & (LDLM_FL_BLOCK_NOWAIT |
+                                             LDLM_FL_SPECULATIVE)) {
                                         compat = -EWOULDBLOCK;
                                         goto destroylock;
                                 }
@@ -438,10 +450,24 @@ ldlm_extent_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                                 continue;
                         }
 
-                        if (!work_list) {
-                                rc = interval_is_overlapped(tree->lit_root,&ex);
-                                if (rc)
-                                        RETURN(0);
+                       /* We've found a potentially blocking lock, check
+                        * compatibility.  This handles locks other than GROUP
+                        * locks, which are handled separately above.
+                        *
+                        * Locks with FL_SPECULATIVE are asynchronous requests
+                        * which must never wait behind another lock, so they
+                        * fail if any conflicting lock is found. */
+                       if (!work_list || (*flags & LDLM_FL_SPECULATIVE)) {
+                               rc = interval_is_overlapped(tree->lit_root,
+                                                           &ex);
+                               if (rc) {
+                                       if (!work_list) {
+                                               RETURN(0);
+                                       } else {
+                                               compat = -EWOULDBLOCK;
+                                               goto destroylock;
+                                       }
+                               }
                         } else {
                                 interval_search(tree->lit_root, &ex,
                                                 ldlm_extent_compat_cb, &data);
@@ -537,7 +563,8 @@ ldlm_extent_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                                          * already blocked.
                                          * If we are in nonblocking mode - return
                                          * immediately */
-                                        if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+                                       if (*flags & (LDLM_FL_BLOCK_NOWAIT
+                                                     | LDLM_FL_SPECULATIVE)) {
                                                 compat = -EWOULDBLOCK;
                                                 goto destroylock;
                                         }
@@ -580,10 +607,11 @@ ldlm_extent_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                         }
 
                         if (unlikely(lock->l_req_mode == LCK_GROUP)) {
-                                /* If compared lock is GROUP, then requested is PR/PW/
-                                 * so this is not compatible; extent range does not
-                                 * matter */
-                                if (*flags & LDLM_FL_BLOCK_NOWAIT) {
+                               /* If compared lock is GROUP, then requested is
+                                * PR/PW so this is not compatible; extent
+                                * range does not matter */
+                               if (*flags & (LDLM_FL_BLOCK_NOWAIT
+                                             | LDLM_FL_SPECULATIVE)) {
                                         compat = -EWOULDBLOCK;
                                         goto destroylock;
                                 } else {
@@ -602,6 +630,11 @@ ldlm_extent_compat_queue(struct list_head *queue, struct ldlm_lock *req,
                         if (!work_list)
                                 RETURN(0);
 
+                       if (*flags & LDLM_FL_SPECULATIVE) {
+                               compat = -EWOULDBLOCK;
+                               goto destroylock;
+                       }
+
                         /* don't count conflicting glimpse locks */
                         if (lock->l_req_mode == LCK_PR &&
                             lock->l_policy_data.l_extent.start == 0 &&
@@ -764,11 +797,11 @@ int ldlm_process_extent_lock(struct ldlm_lock *lock, __u64 *flags,
        *err = ELDLM_OK;
 
        if (intention == LDLM_PROCESS_RESCAN) {
-                /* Careful observers will note that we don't handle -EWOULDBLOCK
-                 * here, but it's ok for a non-obvious reason -- compat_queue
-                 * can only return -EWOULDBLOCK if (flags & BLOCK_NOWAIT).
-                 * flags should always be zero here, and if that ever stops
-                 * being true, we want to find out. */
+               /* Careful observers will note that we don't handle -EWOULDBLOCK
+                * here, but it's ok for a non-obvious reason -- compat_queue
+                * can only return -EWOULDBLOCK if (flags & BLOCK_NOWAIT |
+                * SPECULATIVE). flags should always be zero here, and if that
+                * ever stops being true, we want to find out. */
                 LASSERT(*flags == 0);
                 rc = ldlm_extent_compat_queue(&res->lr_granted, lock, flags,
                                               err, NULL, &contended_locks);
index 66a88d4..ea5af07 100644 (file)
@@ -599,6 +599,7 @@ int client_connect_import(const struct lu_env *env,
                         ocd->ocd_connect_flags, "old %#llx, new %#llx\n",
                         data->ocd_connect_flags, ocd->ocd_connect_flags);
                data->ocd_connect_flags = ocd->ocd_connect_flags;
+               data->ocd_connect_flags2 = ocd->ocd_connect_flags2;
        }
 
        ptlrpc_pinger_add_import(imp);
index b712e22..a480054 100644 (file)
@@ -44,6 +44,9 @@
 
 #include "ldlm_internal.h"
 
+struct kmem_cache *ldlm_glimpse_work_kmem;
+EXPORT_SYMBOL(ldlm_glimpse_work_kmem);
+
 /* lock types */
 char *ldlm_lockname[] = {
        [0] = "--",
@@ -2138,8 +2141,9 @@ int ldlm_work_gl_ast_lock(struct ptlrpc_request_set *rqset, void *opaq)
                rc = 1;
 
        LDLM_LOCK_RELEASE(lock);
-
-       if ((gl_work->gl_flags & LDLM_GL_WORK_NOFREE) == 0)
+       if (gl_work->gl_flags & LDLM_GL_WORK_SLAB_ALLOCATED)
+               OBD_SLAB_FREE_PTR(gl_work, ldlm_glimpse_work_kmem);
+       else
                OBD_FREE_PTR(gl_work);
 
        RETURN(rc);
index 2e976c5..36e3a67 100644 (file)
@@ -1083,12 +1083,15 @@ static int ll_file_io_ptask(struct cfs_ptask *ptask);
 static void ll_io_init(struct cl_io *io, struct file *file, enum cl_io_type iot)
 {
        struct inode *inode = file_inode(file);
+       struct ll_file_data *fd  = LUSTRE_FPRIVATE(file);
 
        memset(&io->u.ci_rw.rw_iter, 0, sizeof(io->u.ci_rw.rw_iter));
        init_sync_kiocb(&io->u.ci_rw.rw_iocb, file);
        io->u.ci_rw.rw_file = file;
        io->u.ci_rw.rw_ptask = ll_file_io_ptask;
        io->u.ci_rw.rw_nonblock = !!(file->f_flags & O_NONBLOCK);
+       io->ci_lock_no_expand = fd->ll_lock_no_expand;
+
        if (iot == CIT_WRITE) {
                io->u.ci_rw.rw_append = !!(file->f_flags & O_APPEND);
                io->u.ci_rw.rw_sync   = !!(file->f_flags & O_SYNC ||
@@ -2435,6 +2438,189 @@ static int ll_file_futimes_3(struct file *file, const struct ll_futimes_3 *lfu)
        RETURN(rc);
 }
 
+static enum cl_lock_mode cl_mode_user_to_kernel(enum lock_mode_user mode)
+{
+       switch (mode) {
+       case MODE_READ_USER:
+               return CLM_READ;
+       case MODE_WRITE_USER:
+               return CLM_WRITE;
+       default:
+               return -EINVAL;
+       }
+}
+
+static const char *const user_lockname[] = LOCK_MODE_NAMES;
+
+/* Used to allow the upper layers of the client to request an LDLM lock
+ * without doing an actual read or write.
+ *
+ * Used for ladvise lockahead to manually request specific locks.
+ *
+ * \param[in] file     file this ladvise lock request is on
+ * \param[in] ladvise  ladvise struct describing this lock request
+ *
+ * \retval 0           success, no detailed result available (sync requests
+ *                     and requests sent to the server [not handled locally]
+ *                     cannot return detailed results)
+ * \retval LLA_RESULT_{SAME,DIFFERENT} - detailed result of the lock request,
+ *                                      see definitions for details.
+ * \retval negative    negative errno on error
+ */
+int ll_file_lock_ahead(struct file *file, struct llapi_lu_ladvise *ladvise)
+{
+       struct lu_env *env = NULL;
+       struct cl_io *io  = NULL;
+       struct cl_lock *lock = NULL;
+       struct cl_lock_descr *descr = NULL;
+       struct dentry *dentry = file->f_path.dentry;
+       struct inode *inode = dentry->d_inode;
+       enum cl_lock_mode cl_mode;
+       off_t start = ladvise->lla_start;
+       off_t end = ladvise->lla_end;
+       int result;
+       __u16 refcheck;
+
+       ENTRY;
+
+       CDEBUG(D_VFSTRACE, "Lock request: file=%.*s, inode=%p, mode=%s "
+              "start=%llu, end=%llu\n", dentry->d_name.len,
+              dentry->d_name.name, dentry->d_inode,
+              user_lockname[ladvise->lla_lockahead_mode], (__u64) start,
+              (__u64) end);
+
+       cl_mode = cl_mode_user_to_kernel(ladvise->lla_lockahead_mode);
+       if (cl_mode < 0)
+               GOTO(out, result = cl_mode);
+
+       /* Get IO environment */
+       result = cl_io_get(inode, &env, &io, &refcheck);
+       if (result <= 0)
+               GOTO(out, result);
+
+       result = cl_io_init(env, io, CIT_MISC, io->ci_obj);
+       if (result > 0) {
+               /*
+                * nothing to do for this IO. This currently happens when
+                * stripe sub-objects are not yet created.
+                */
+               result = io->ci_result;
+       } else if (result == 0) {
+               lock = vvp_env_lock(env);
+               descr = &lock->cll_descr;
+
+               descr->cld_obj   = io->ci_obj;
+               /* Convert byte offsets to pages */
+               descr->cld_start = cl_index(io->ci_obj, start);
+               descr->cld_end   = cl_index(io->ci_obj, end);
+               descr->cld_mode  = cl_mode;
+               /* CEF_MUST is used because we do not want to convert a
+                * lockahead request to a lockless lock */
+               descr->cld_enq_flags = CEF_MUST | CEF_LOCK_NO_EXPAND |
+                                      CEF_NONBLOCK;
+
+               if (ladvise->lla_peradvice_flags & LF_ASYNC)
+                       descr->cld_enq_flags |= CEF_SPECULATIVE;
+
+               result = cl_lock_request(env, io, lock);
+
+               /* On success, we need to release the lock */
+               if (result >= 0)
+                       cl_lock_release(env, lock);
+       }
+       cl_io_fini(env, io);
+       cl_env_put(env, &refcheck);
+
+       /* -ECANCELED indicates a matching lock with a different extent
+        * was already present, and -EEXIST indicates a matching lock
+        * on exactly the same extent was already present.
+        * We convert them to positive values for userspace to make
+        * recognizing true errors easier.
+        * Note we can only return these detailed results on async requests,
+        * as sync requests look the same as IO requests for locking. */
+       if (result == -ECANCELED)
+               result = LLA_RESULT_DIFFERENT;
+       else if (result == -EEXIST)
+               result = LLA_RESULT_SAME;
+
+out:
+       RETURN(result);
+}
+static const char *const ladvise_names[] = LU_LADVISE_NAMES;
+
+static int ll_ladvise_sanity(struct inode *inode,
+                            struct llapi_lu_ladvise *ladvise)
+{
+       enum lu_ladvise_type advice = ladvise->lla_advice;
+       /* Note the per-advice flags field is 32 bits wide, so per-advice
+        * flags must fit in the first 32 bits of enum ladvise_flags */
+       __u32 flags = ladvise->lla_peradvice_flags;
+       int rc = 0;
+
+       if (advice > LU_LADVISE_MAX || advice == LU_LADVISE_INVALID) {
+               rc = -EINVAL;
+               CDEBUG(D_VFSTRACE, "%s: advice with value '%d' not recognized, "
+                      "last supported advice is %s (value '%d'): rc = %d\n",
+                      ll_get_fsname(inode->i_sb, NULL, 0), advice,
+                      ladvise_names[LU_LADVISE_MAX-1], LU_LADVISE_MAX-1, rc);
+               GOTO(out, rc);
+       }
+
+       /* Per-advice checks */
+       switch (advice) {
+       case LU_LADVISE_LOCKNOEXPAND:
+               if (flags & ~LF_LOCKNOEXPAND_MASK) {
+                       rc = -EINVAL;
+                       CDEBUG(D_VFSTRACE, "%s: Invalid flags (%x) for %s: "
+                              "rc = %d\n",
+                              ll_get_fsname(inode->i_sb, NULL, 0), flags,
+                              ladvise_names[advice], rc);
+                       GOTO(out, rc);
+               }
+               break;
+       case LU_LADVISE_LOCKAHEAD:
+               /* Currently only READ and WRITE modes can be requested */
+               if (ladvise->lla_lockahead_mode >= MODE_MAX_USER ||
+                   ladvise->lla_lockahead_mode == 0) {
+                       rc = -EINVAL;
+                       CDEBUG(D_VFSTRACE, "%s: Invalid mode (%d) for %s: "
+                              "rc = %d\n",
+                              ll_get_fsname(inode->i_sb, NULL, 0),
+                              ladvise->lla_lockahead_mode,
+                              ladvise_names[advice], rc);
+                       GOTO(out, rc);
+               }
+       case LU_LADVISE_WILLREAD:
+       case LU_LADVISE_DONTNEED:
+       default:
+               /* Note fall through above - These checks apply to all advices
+                * except LOCKNOEXPAND */
+               if (flags & ~LF_DEFAULT_MASK) {
+                       rc = -EINVAL;
+                       CDEBUG(D_VFSTRACE, "%s: Invalid flags (%x) for %s: "
+                              "rc = %d\n",
+                              ll_get_fsname(inode->i_sb, NULL, 0), flags,
+                              ladvise_names[advice], rc);
+                       GOTO(out, rc);
+               }
+               if (ladvise->lla_start >= ladvise->lla_end) {
+                       rc = -EINVAL;
+                       CDEBUG(D_VFSTRACE, "%s: Invalid range (%llu to %llu) "
+                              "for %s: rc = %d\n",
+                              ll_get_fsname(inode->i_sb, NULL, 0),
+                              ladvise->lla_start, ladvise->lla_end,
+                              ladvise_names[advice], rc);
+                       GOTO(out, rc);
+               }
+               break;
+       }
+
+out:
+       return rc;
+}
+#undef ERRSIZE
+
 /*
  * Give file access advices
  *
@@ -2484,6 +2670,15 @@ static int ll_ladvise(struct inode *inode, struct file *file, __u64 flags,
        RETURN(rc);
 }
 
+static int ll_lock_noexpand(struct file *file, int flags)
+{
+       struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+
+       fd->ll_lock_no_expand = !(flags & LF_UNSET);
+
+       return 0;
+}
+
 int ll_ioctl_fsgetxattr(struct inode *inode, unsigned int cmd,
                        unsigned long arg)
 {
@@ -2885,53 +3080,81 @@ out:
                RETURN(ll_file_futimes_3(file, &lfu));
        }
        case LL_IOC_LADVISE: {
-               struct llapi_ladvise_hdr *ladvise_hdr;
+               struct llapi_ladvise_hdr *k_ladvise_hdr;
+               struct llapi_ladvise_hdr __user *u_ladvise_hdr;
                int i;
                int num_advise;
-               int alloc_size = sizeof(*ladvise_hdr);
+               int alloc_size = sizeof(*k_ladvise_hdr);
 
                rc = 0;
 
-               OBD_ALLOC_PTR(ladvise_hdr);
-               if (ladvise_hdr == NULL)
+               u_ladvise_hdr = (void __user *)arg;
+               OBD_ALLOC_PTR(k_ladvise_hdr);
+               if (k_ladvise_hdr == NULL)
                        RETURN(-ENOMEM);
 
-               if (copy_from_user(ladvise_hdr,
-                                  (const struct llapi_ladvise_hdr __user *)arg,
-                                  alloc_size))
+               if (copy_from_user(k_ladvise_hdr, u_ladvise_hdr, alloc_size))
                        GOTO(out_ladvise, rc = -EFAULT);
 
-               if (ladvise_hdr->lah_magic != LADVISE_MAGIC ||
-                   ladvise_hdr->lah_count < 1)
+               if (k_ladvise_hdr->lah_magic != LADVISE_MAGIC ||
+                   k_ladvise_hdr->lah_count < 1)
                        GOTO(out_ladvise, rc = -EINVAL);
 
-               num_advise = ladvise_hdr->lah_count;
+               num_advise = k_ladvise_hdr->lah_count;
                if (num_advise >= LAH_COUNT_MAX)
                        GOTO(out_ladvise, rc = -EFBIG);
 
-               OBD_FREE_PTR(ladvise_hdr);
-               alloc_size = offsetof(typeof(*ladvise_hdr),
+               OBD_FREE_PTR(k_ladvise_hdr);
+               alloc_size = offsetof(typeof(*k_ladvise_hdr),
                                      lah_advise[num_advise]);
-               OBD_ALLOC(ladvise_hdr, alloc_size);
-               if (ladvise_hdr == NULL)
+               OBD_ALLOC(k_ladvise_hdr, alloc_size);
+               if (k_ladvise_hdr == NULL)
                        RETURN(-ENOMEM);
 
                /*
                 * TODO: submit multiple advices to one server in a single RPC
                 */
-               if (copy_from_user(ladvise_hdr,
-                                  (const struct llapi_ladvise_hdr __user *)arg,
-                                  alloc_size))
+               if (copy_from_user(k_ladvise_hdr, u_ladvise_hdr, alloc_size))
                        GOTO(out_ladvise, rc = -EFAULT);
 
                for (i = 0; i < num_advise; i++) {
-                       rc = ll_ladvise(inode, file, ladvise_hdr->lah_flags,
-                                       &ladvise_hdr->lah_advise[i]);
+                       struct llapi_lu_ladvise *k_ladvise =
+                                       &k_ladvise_hdr->lah_advise[i];
+                       struct llapi_lu_ladvise __user *u_ladvise =
+                                       &u_ladvise_hdr->lah_advise[i];
+
+                       rc = ll_ladvise_sanity(inode, k_ladvise);
                        if (rc)
+                               GOTO(out_ladvise, rc);
+
+                       switch (k_ladvise->lla_advice) {
+                       case LU_LADVISE_LOCKNOEXPAND:
+                               rc = ll_lock_noexpand(file,
+                                              k_ladvise->lla_peradvice_flags);
+                               GOTO(out_ladvise, rc);
+                       case LU_LADVISE_LOCKAHEAD:
+
+                               rc = ll_file_lock_ahead(file, k_ladvise);
+
+                               if (rc < 0)
+                                       GOTO(out_ladvise, rc);
+
+                               if (put_user(rc,
+                                            &u_ladvise->lla_lockahead_result))
+                                       GOTO(out_ladvise, rc = -EFAULT);
+                               break;
+                       default:
+                               rc = ll_ladvise(inode, file,
+                                               k_ladvise_hdr->lah_flags,
+                                               k_ladvise);
+                               if (rc)
+                                       GOTO(out_ladvise, rc);
                                break;
+                       }
+
                }
 
 out_ladvise:
-               OBD_FREE(ladvise_hdr, alloc_size);
+               OBD_FREE(k_ladvise_hdr, alloc_size);
                RETURN(rc);
        }
        case LL_IOC_FSGETXATTR:
index d34be28..166fff0 100644 (file)
@@ -92,7 +92,7 @@ int cl_glimpse_lock(const struct lu_env *env, struct cl_io *io,
        CDEBUG(D_DLMTRACE, "Glimpsing inode "DFID"\n", PFID(fid));
 
        /* NOTE: this looks like DLM lock request, but it may
-        *       not be one. Due to CEF_ASYNC flag (translated
+        *       not be one. Due to CEF_GLIMPSE flag (translated
         *       to LDLM_FL_HAS_INTENT by osc), this is
         *       glimpse request, that won't revoke any
         *       conflicting DLM locks held. Instead,
@@ -107,14 +107,10 @@ int cl_glimpse_lock(const struct lu_env *env, struct cl_io *io,
        *descr = whole_file;
        descr->cld_obj = clob;
        descr->cld_mode = CLM_READ;
-       descr->cld_enq_flags = CEF_ASYNC | CEF_MUST;
+       descr->cld_enq_flags = CEF_GLIMPSE | CEF_MUST;
        if (agl)
-               descr->cld_enq_flags |= CEF_AGL;
+               descr->cld_enq_flags |= CEF_SPECULATIVE | CEF_NONBLOCK;
        /*
-        * CEF_ASYNC is used because glimpse sub-locks cannot
-        * deadlock (because they never conflict with other
-        * locks) and, hence, can be enqueued out-of-order.
-        *
         * CEF_MUST protects glimpse lock from conversion into
         * a lockless mode.
         */
@@ -140,7 +136,20 @@ int cl_glimpse_lock(const struct lu_env *env, struct cl_io *io,
        RETURN(result);
 }
 
-static int cl_io_get(struct inode *inode, struct lu_env **envout,
+/**
+ * Get an IO environment for special operations such as glimpse locks and
+ * manually requested locks (ladvise lockahead)
+ *
+ * \param[in]  inode   inode the operation is being performed on
+ * \param[out] envout  thread specific execution environment
+ * \param[out] ioout   client io description
+ * \param[out] refcheck        reference check
+ *
+ * \retval 1           on success
+ * \retval 0           not a regular file, cannot get environment
+ * \retval negative    negative errno on error
+ */
+int cl_io_get(struct inode *inode, struct lu_env **envout,
                     struct cl_io **ioout, __u16 *refcheck)
 {
        struct lu_env           *env;
index 9ebbaf7..a9ab610 100644 (file)
@@ -642,6 +642,7 @@ struct ll_file_data {
         * true: failure is known, not report again.
         * false: unknown failure, should report. */
        bool fd_write_failed;
+       bool ll_lock_no_expand;
        rwlock_t fd_lock; /* protect lcc list */
        struct list_head fd_lccs; /* list of ll_cl_context */
 };
@@ -1222,11 +1223,18 @@ static inline int cl_glimpse_size(struct inode *inode)
        return cl_glimpse_size0(inode, 0);
 }
 
+/* AGL is 'asynchronous glimpse lock', which is a speculative lock taken as
+ * part of statahead */
 static inline int cl_agl(struct inode *inode)
 {
        return cl_glimpse_size0(inode, 1);
 }
 
+int ll_file_lock_ahead(struct file *file, struct llapi_lu_ladvise *ladvise);
+
+int cl_io_get(struct inode *inode, struct lu_env **envout,
+             struct cl_io **ioout, __u16 *refcheck);
+
 static inline int ll_glimpse_size(struct inode *inode)
 {
        struct ll_inode_info *lli = ll_i2info(inode);
index 5299cf5..87159a0 100644 (file)
@@ -196,7 +196,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt,
                 RETURN(-ENOMEM);
         }
 
-        /* indicate the features supported by this client */
+       /* indicate MDT features supported by this client */
         data->ocd_connect_flags = OBD_CONNECT_IBITS    | OBD_CONNECT_NODEVOH  |
                                   OBD_CONNECT_ATTRFID  |
                                   OBD_CONNECT_VERSION  | OBD_CONNECT_BRW_SIZE |
@@ -388,6 +388,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt,
         * back its backend blocksize for grant calculation purpose */
        data->ocd_grant_blkbits = PAGE_SHIFT;
 
+       /* indicate OST features supported by this client */
        data->ocd_connect_flags = OBD_CONNECT_GRANT | OBD_CONNECT_VERSION |
                                  OBD_CONNECT_REQPORTAL | OBD_CONNECT_BRW_SIZE |
                                  OBD_CONNECT_CANCELSET | OBD_CONNECT_FID |
@@ -399,9 +400,26 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt,
                                  OBD_CONNECT_JOBSTATS | OBD_CONNECT_LVB_TYPE |
                                  OBD_CONNECT_LAYOUTLOCK |
                                  OBD_CONNECT_PINGLESS | OBD_CONNECT_LFSCK |
-                                 OBD_CONNECT_BULK_MBITS;
+                                 OBD_CONNECT_BULK_MBITS |
+                                 OBD_CONNECT_FLAGS2;
 
 
-       data->ocd_connect_flags2 = 0;
+/* The client currently advertises support for OBD_CONNECT_LOCKAHEAD_OLD so it
+ * can interoperate with an older version of lockahead which was released prior
+ * to landing in master. This support will be dropped when 2.13 development
+ * starts.  At that point, we should not just drop the connect flag (below); we
+ * should also remove the support in the code.
+ *
+ * Removing it means a few things:
+ * 1. Remove this section here
+ * 2. Remove CEF_NONBLOCK in ll_file_lockahead()
+ * 3. Remove function exp_connect_lockahead_old
+ * 4. Remove LDLM_FL_LOCKAHEAD_OLD_RESERVED in lustre_dlm_flags.h
+ */
+#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(2, 12, 50, 0)
+       data->ocd_connect_flags |= OBD_CONNECT_LOCKAHEAD_OLD;
+#endif
+
+       data->ocd_connect_flags2 = OBD_CONNECT2_LOCKAHEAD;
 
        if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_GRANT_PARAM))
                data->ocd_connect_flags |= OBD_CONNECT_GRANT_PARAM;
 
index 9de5f9b..c0dad2d 100644 (file)
@@ -541,6 +541,8 @@ static int vvp_io_rw_lock(const struct lu_env *env, struct cl_io *io,
 
        if (io->u.ci_rw.rw_nonblock)
                ast_flags |= CEF_NONBLOCK;
+       if (io->ci_lock_no_expand)
+               ast_flags |= CEF_LOCK_NO_EXPAND;
 
        result = vvp_mmap_locks(env, io);
        if (result == 0)
index f40dfa2..577e7d1 100644 (file)
@@ -123,6 +123,7 @@ static int lov_io_sub_init(const struct lu_env *env, struct lov_io *lio,
        sub_io->ci_no_srvlock = io->ci_no_srvlock;
        sub_io->ci_noatime = io->ci_noatime;
        sub_io->ci_pio = io->ci_pio;
+       sub_io->ci_lock_no_expand = io->ci_lock_no_expand;
 
        result = cl_io_sub_init(sub->sub_env, sub_io, io->ci_type, sub_obj);
 
index e92dbaf..83a3e8f 100644 (file)
@@ -200,7 +200,7 @@ int cl_lock_request(const struct lu_env *env, struct cl_io *io,
        if (rc < 0)
                RETURN(rc);
 
-       if ((enq_flags & CEF_ASYNC) && !(enq_flags & CEF_AGL)) {
+       if ((enq_flags & CEF_GLIMPSE) && !(enq_flags & CEF_SPECULATIVE)) {
                anchor = &cl_env_info(env)->clt_anchor;
                cl_sync_io_init(anchor, 1, cl_sync_io_end);
        }
index 18bd5e0..79c1413 100644 (file)
@@ -841,12 +841,13 @@ static const char *obd_connect_names[] = {
        "multi_mod_rpcs",
        "dir_stripe",
        "subtree",
-       "lock_ahead",
+       "lockahead",
        "bulk_mbits",
        "compact_obdo",
        "second_flags",
        /* flags2 names */
        "file_secctx",
+       "lockaheadv2",
        NULL
 };
 
index 88ccc89..17ce15f 100644 (file)
@@ -3260,6 +3260,13 @@ static int __init ofd_init(void)
                return(rc);
        }
 
+       rc = ofd_dlm_init();
+       if (rc) {
+               lu_kmem_fini(ofd_caches);
+               ofd_fmd_exit();
+               return rc;
+       }
+
        rc = class_register_type(&ofd_obd_ops, NULL, true, NULL,
                                 LUSTRE_OST_NAME, &ofd_device_type);
        return rc;
@@ -3274,6 +3281,7 @@ static int __init ofd_init(void)
 static void __exit ofd_exit(void)
 {
        ofd_fmd_exit();
+       ofd_dlm_exit();
        lu_kmem_fini(ofd_caches);
        class_unregister_type(LUSTRE_OST_NAME);
 }
index 4d86785..c18ade0 100644 (file)
 #include "ofd_internal.h"
 
 struct ofd_intent_args {
-       struct ldlm_lock        **victim;
+       struct list_head        gl_list;
        __u64                    size;
-       int                     *liblustre;
+       bool                    no_glimpse_ast;
+       int                     error;
 };
 
+int ofd_dlm_init(void)
+{
+       ldlm_glimpse_work_kmem = kmem_cache_create("ldlm_glimpse_work_kmem",
+                                            sizeof(struct ldlm_glimpse_work),
+                                            0, 0, NULL);
+       if (ldlm_glimpse_work_kmem == NULL)
+               return -ENOMEM;
+       else
+               return 0;
+}
+
+void ofd_dlm_exit(void)
+{
+       if (ldlm_glimpse_work_kmem) {
+               kmem_cache_destroy(ldlm_glimpse_work_kmem);
+               ldlm_glimpse_work_kmem = NULL;
+       }
+}
+
 /**
  * OFD interval callback.
  *
  * The interval_callback_t is part of interval_iterate_reverse() and is called
  * for each interval in tree. The OFD interval callback searches for locks
- * covering extents beyond the given args->size. This is used to decide if LVB
- * data is outdated.
+ * covering extents beyond the given args->size. This is used to decide if the
+ * size is too small and needs to be updated.  Note that we are only interested
+ * in growing the size, as truncate is the only operation which can shrink it,
+ * and it is handled differently.  This is why we only look at locks beyond the
+ * current size.
+ *
+ * It finds the highest lock (by starting point) in this interval, and adds it
+ * to the list of locks to glimpse.  We must glimpse a list of locks - rather
+ * than only the highest lock on the file - because lockahead creates extent
+ * locks in advance of IO, and so breaks the assumption that the holder of the
+ * highest lock knows the current file size.
+ *
+ * This assumption is normally true because locks which are created as part of
+ * IO - rather than in advance of it - are guaranteed to be 'active', i.e.,
+ * involved in IO, and the holder of the highest 'active' lock always knows the
+ * current file size, because the size is either not changing or the holder of
+ * that lock is responsible for updating it.
+ *
+ * So we need only glimpse until we find the first client with an 'active'
+ * lock.
+ *
+ * Unfortunately, there is no way to know if a manually requested/speculative
+ * lock is 'active' from the server side.  So when we see a potentially
+ * speculative lock, we must send a glimpse for that lock unless we have
+ * already sent a glimpse to the holder of that lock.
+ *
+ * However, *all* non-speculative locks are active.  So we can stop glimpsing
+ * as soon as we find a non-speculative lock.  Currently, all speculative PW
+ * locks have LDLM_FL_NO_EXPANSION set, and we use this to identify them.  This
+ * is enforced by an assertion in osc_lock_init, which references this comment.
+ *
+ * If that ever changes, we will either need to find a new way to identify
+ * active locks or we will need to consider all PW locks (we will still only
+ * glimpse one per client).
+ *
+ * Note that it is safe to glimpse only the 'top' lock from each interval
+ * because ofd_intent_cb is only called for PW extent locks, and for PW locks,
+ * there is only one lock per interval.
  *
  * \param[in] n                interval node
- * \param[in] args     intent arguments
+ * \param[in,out] args intent arguments, gl work list for identified locks
  *
  * \retval             INTERVAL_ITER_STOP if the interval is lower than
  *                     file size, caller stops execution
@@ -71,39 +127,89 @@ static enum interval_iter ofd_intent_cb(struct interval_node *n, void *args)
        struct ldlm_interval     *node = (struct ldlm_interval *)n;
        struct ofd_intent_args   *arg = args;
        __u64                     size = arg->size;
-       struct ldlm_lock        **v = arg->victim;
+       struct ldlm_lock         *victim_lock = NULL;
        struct ldlm_lock         *lck;
+       struct ldlm_glimpse_work *gl_work = NULL;
+       int rc = 0;
 
        /* If the interval is lower than the current file size, just break. */
        if (interval_high(n) <= size)
-               return INTERVAL_ITER_STOP;
+               GOTO(out, rc = INTERVAL_ITER_STOP);
 
 
+       /* Find the 'victim' lock from this interval */
        list_for_each_entry(lck, &node->li_group, l_sl_policy) {
-               /* Don't send glimpse ASTs to liblustre clients.
-                * They aren't listening for them, and they do
-                * entirely synchronous I/O anyways. */
-               if (lck->l_export == NULL || lck->l_export->exp_libclient)
-                       continue;
-
-               if (*arg->liblustre)
-                       *arg->liblustre = 0;
 
 
-               if (*v == NULL) {
-                       *v = LDLM_LOCK_GET(lck);
-               } else if ((*v)->l_policy_data.l_extent.start <
-                          lck->l_policy_data.l_extent.start) {
-                       LDLM_LOCK_RELEASE(*v);
-                       *v = LDLM_LOCK_GET(lck);
-               }
+               victim_lock = LDLM_LOCK_GET(lck);
 
                /* the same policy group - every lock has the
                 * same extent, so needn't do it any more */
                break;
        }
 
-       return INTERVAL_ITER_CONT;
-}
+       /* l_export can be null in a race with eviction - in that case, we
+        * will not find any locks in this interval */
+       if (!victim_lock)
+               GOTO(out, rc = INTERVAL_ITER_CONT);
+
+       /*
+        * This check is for lock taken in ofd_destroy_by_fid() that does
+        * not have l_glimpse_ast set. So the logic is: if there is a lock
+        * with no l_glimpse_ast set, this object is being destroyed already.
+        * Hence, if you are grabbing DLM locks on the server, always set
+        * non-NULL glimpse_ast (e.g., ldlm_request.c::ldlm_glimpse_ast()).
+        */
+       if (victim_lock->l_glimpse_ast == NULL) {
+               LDLM_DEBUG(victim_lock, "no l_glimpse_ast");
+               arg->no_glimpse_ast = true;
+               GOTO(out_release, rc = INTERVAL_ITER_STOP);
+       }
 
 
+       /* If NO_EXPANSION is not set, this is an active lock, and we don't need
+        * to glimpse any further once we've glimpsed the client holding this
+        * lock.  So set us up to stop.  See comment above this function. */
+       if (!(victim_lock->l_flags & LDLM_FL_NO_EXPANSION))
+               rc = INTERVAL_ITER_STOP;
+       else
+               rc = INTERVAL_ITER_CONT;
+
+       /* Check to see if we're already set up to send a glimpse to this
+        * client; if so, don't add this lock to the glimpse list - We need
+        * only glimpse each client once. (And if we know that client holds
+        * an active lock, we can stop glimpsing.  So keep the rc set in the
+        * check above.) */
+       list_for_each_entry(gl_work, &arg->gl_list, gl_list) {
+               if (gl_work->gl_lock->l_export == victim_lock->l_export)
+                       GOTO(out_release, rc);
+       }
+
+       if (!OBD_FAIL_CHECK(OBD_FAIL_OST_GL_WORK_ALLOC))
+               OBD_SLAB_ALLOC_PTR_GFP(gl_work, ldlm_glimpse_work_kmem,
+                                      GFP_ATOMIC);
+
+       if (!gl_work) {
+               arg->error = -ENOMEM;
+               GOTO(out_release, rc = INTERVAL_ITER_STOP);
+       }
+
+       /* Populate the gl_work structure. */
+       gl_work->gl_lock = victim_lock;
+       list_add_tail(&gl_work->gl_list, &arg->gl_list);
+       /* There is actually no need for a glimpse descriptor when glimpsing
+        * extent locks */
+       gl_work->gl_desc = NULL;
+       /* This tells ldlm_work_gl_ast_lock this was allocated from a slab and
+        * must be freed in a slab-aware manner. */
+       gl_work->gl_flags = LDLM_GL_WORK_SLAB_ALLOCATED;
+
+       GOTO(out, rc);
+
+out_release:
+       /* If the victim doesn't go on the glimpse list, we must release it */
+       LDLM_LOCK_RELEASE(victim_lock);
+
+out:
+       return rc;
+}
 /**
  * OFD lock intent policy
  *
@@ -124,20 +230,20 @@ static enum interval_iter ofd_intent_cb(struct interval_node *n, void *args)
  * \retval             ELDLM_LOCK_REPLACED if already granted lock was found
  *                     and placed in \a lockp
  * \retval             ELDLM_LOCK_ABORTED in other cases except error
- * \retval             negative value on error
+ * \retval             negative errno on error
  */
 int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
                      void *req_cookie, enum ldlm_mode mode, __u64 flags,
                      void *data)
 {
        struct ptlrpc_request *req = req_cookie;
-       struct ldlm_lock *lock = *lockp, *l = NULL;
+       struct ldlm_lock *lock = *lockp;
        struct ldlm_resource *res = lock->l_resource;
        ldlm_processing_policy policy;
        struct ost_lvb *res_lvb, *reply_lvb;
        struct ldlm_reply *rep;
        enum ldlm_error err;
-       int idx, rc, only_liblustre = 1;
+       int idx, rc;
        struct ldlm_interval_tree *tree;
        struct ofd_intent_args arg;
        __u32 repsize[3] = {
@@ -145,11 +251,12 @@ int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
                [DLM_LOCKREPLY_OFF]   = sizeof(*rep),
                [DLM_REPLY_REC_OFF]   = sizeof(*reply_lvb)
        };
-       struct ldlm_glimpse_work gl_work = {};
-       struct list_head gl_list;
+       struct ldlm_glimpse_work *pos, *tmp;
        ENTRY;
 
-       INIT_LIST_HEAD(&gl_list);
+       INIT_LIST_HEAD(&arg.gl_list);
+       arg.no_glimpse_ast = false;
+       arg.error = 0;
        lock->l_lvb_type = LVB_T_OST;
        policy = ldlm_get_processing_policy(res);
        LASSERT(policy != NULL);
@@ -195,13 +302,7 @@ int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
 
        /* The lock met with no resistance; we're finished. */
        if (rc == LDLM_ITER_CONTINUE) {
-               /* do not grant locks to the liblustre clients: they cannot
-                * handle ASTs robustly.  We need to do this while still
-                * holding ns_lock to avoid the lock remaining on the res_link
-                * list (and potentially being added to l_pending_list by an
-                * AST) when we are going to drop this lock ASAP. */
-               if (lock->l_export->exp_libclient ||
-                   OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_GLIMPSE, 2)) {
+               if (OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_GLIMPSE, 2)) {
                        ldlm_resource_unlink_lock(lock);
                        err = ELDLM_LOCK_ABORTED;
                } else {
@@ -233,74 +334,48 @@ int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
         *  res->lr_lvb_sem.
         */
        arg.size = reply_lvb->lvb_size;
-       arg.victim = &l;
-       arg.liblustre = &only_liblustre;
 
+       /* Check for PW locks beyond the size in the LVB, build the list
+        * of locks to glimpse (arg.gl_list) */
        for (idx = 0; idx < LCK_MODE_NUM; idx++) {
                tree = &res->lr_itree[idx];
                if (tree->lit_mode == LCK_PR)
                        continue;
 
                interval_iterate_reverse(tree->lit_root, ofd_intent_cb, &arg);
+               if (arg.error) {
+                       unlock_res(res);
+                       GOTO(out, rc = arg.error);
+               }
        }
        unlock_res(res);
 
        /* There were no PW locks beyond the size in the LVB; finished. */
-       if (l == NULL) {
-               if (only_liblustre) {
-                       /* If we discovered a liblustre client with a PW lock,
-                        * however, the LVB may be out of date!  The LVB is
-                        * updated only on glimpse (which we don't do for
-                        * liblustre clients) and cancel (which the client
-                        * obviously has not yet done).  So if it has written
-                        * data but kept the lock, the LVB is stale and needs
-                        * to be updated from disk.
-                        *
-                        * Of course, this will all disappear when we switch to
-                        * taking liblustre locks on the OST. */
-                       ldlm_res_lvbo_update(res, NULL, 1);
-               }
+       if (list_empty(&arg.gl_list))
                RETURN(ELDLM_LOCK_ABORTED);
-       }
 
-       /*
-        * This check is for lock taken in ofd_destroy_by_fid() that does
-        * not have l_glimpse_ast set. So the logic is: if there is a lock
-        * with no l_glimpse_ast set, this object is being destroyed already.
-        * Hence, if you are grabbing DLM locks on the server, always set
-        * non-NULL glimpse_ast (e.g., ldlm_request.c::ldlm_glimpse_ast()).
-        */
-       if (l->l_glimpse_ast == NULL) {
+       if (arg.no_glimpse_ast) {
                /* We are racing with unlink(); just return -ENOENT */
                rep->lock_policy_res1 = ptlrpc_status_hton(-ENOENT);
-               goto out;
+               GOTO(out, ELDLM_LOCK_ABORTED);
        }
 
-       /* Populate the gl_work structure.
-        * Grab additional reference on the lock which will be released in
-        * ldlm_work_gl_ast_lock() */
-       gl_work.gl_lock = LDLM_LOCK_GET(l);
-       /* The glimpse callback is sent to one single extent lock. As a result,
-        * the gl_work list is just composed of one element */
-       list_add_tail(&gl_work.gl_list, &gl_list);
-       /* There is actually no need for a glimpse descriptor when glimpsing
-        * extent locks */
-       gl_work.gl_desc = NULL;
-       /* the ldlm_glimpse_work structure is allocated on the stack */
-       gl_work.gl_flags = LDLM_GL_WORK_NOFREE;
-
-       rc = ldlm_glimpse_locks(res, &gl_list); /* this will update the LVB */
-
-       if (!list_empty(&gl_list))
-               LDLM_LOCK_RELEASE(l);
+       /* this will update the LVB */
+       ldlm_glimpse_locks(res, &arg.gl_list);
 
        lock_res(res);
        *reply_lvb = *res_lvb;
        unlock_res(res);
 
 out:
-       LDLM_LOCK_RELEASE(l);
+       /* If the list is not empty, we failed to glimpse some locks and
+        * must clean up.  Usually due to a race with unlink. */
+       list_for_each_entry_safe(pos, tmp, &arg.gl_list, gl_list) {
+               list_del(&pos->gl_list);
+               LDLM_LOCK_RELEASE(pos->gl_lock);
+               OBD_SLAB_FREE_PTR(pos, ldlm_glimpse_work_kmem);
+       }
 
-       RETURN(ELDLM_LOCK_ABORTED);
+       RETURN(rc < 0 ? rc : ELDLM_LOCK_ABORTED);
 }
 
index bab0381..611c8db 100644 (file)
@@ -418,6 +418,9 @@ int ofd_fid_fini(const struct lu_env *env, struct ofd_device *ofd);
 extern struct ldlm_valblock_ops ofd_lvbo;
 
 /* ofd_dlm.c */
+extern struct kmem_cache *ldlm_glimpse_work_kmem;
+int ofd_dlm_init(void);
+void ofd_dlm_exit(void);
 int ofd_intent_policy(struct ldlm_namespace *ns, struct ldlm_lock **lockp,
                      void *req_cookie, enum ldlm_mode mode, __u64 flags,
                      void *data);
index 60965c7..9d00c5c 100644 (file)
@@ -55,7 +55,8 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
                     struct ost_lvb *lvb, int kms_valid,
                     osc_enqueue_upcall_f upcall,
                     void *cookie, struct ldlm_enqueue_info *einfo,
-                    struct ptlrpc_request_set *rqset, int async, int agl);
+                    struct ptlrpc_request_set *rqset, int async,
+                    bool speculative);
 
 int osc_match_base(struct obd_export *exp, struct ldlm_res_id *res_id,
                   enum ldlm_type type, union ldlm_policy_data *policy,
index b17ebcc..6fd75da 100644 (file)
@@ -160,11 +160,13 @@ static __u64 osc_enq2ldlm_flags(__u32 enqflags)
 {
        __u64 result = 0;
 
+       CDEBUG(D_DLMTRACE, "flags: %x\n", enqflags);
+
        LASSERT((enqflags & ~CEF_MASK) == 0);
 
        if (enqflags & CEF_NONBLOCK)
                result |= LDLM_FL_BLOCK_NOWAIT;
-       if (enqflags & CEF_ASYNC)
+       if (enqflags & CEF_GLIMPSE)
                result |= LDLM_FL_HAS_INTENT;
        if (enqflags & CEF_DISCARD_DATA)
                result |= LDLM_FL_AST_DISCARD_DATA;
@@ -172,6 +174,10 @@ static __u64 osc_enq2ldlm_flags(__u32 enqflags)
                result |= LDLM_FL_TEST_LOCK;
        if (enqflags & CEF_LOCK_MATCH)
                result |= LDLM_FL_MATCH_LOCK;
+       if (enqflags & CEF_LOCK_NO_EXPAND)
+               result |= LDLM_FL_NO_EXPANSION;
+       if (enqflags & CEF_SPECULATIVE)
+               result |= LDLM_FL_SPECULATIVE;
        return result;
 }
 
@@ -350,8 +356,9 @@ static int osc_lock_upcall(void *cookie, struct lustre_handle *lockh,
        RETURN(rc);
 }
 
-static int osc_lock_upcall_agl(void *cookie, struct lustre_handle *lockh,
-                              int errcode)
+static int osc_lock_upcall_speculative(void *cookie,
+                                      struct lustre_handle *lockh,
+                                      int errcode)
 {
        struct osc_object       *osc = cookie;
        struct ldlm_lock        *dlmlock;
@@ -374,7 +381,7 @@ static int osc_lock_upcall_agl(void *cookie, struct lustre_handle *lockh,
        lock_res_and_lock(dlmlock);
        LASSERT(dlmlock->l_granted_mode == dlmlock->l_req_mode);
 
-       /* there is no osc_lock associated with AGL lock */
+       /* there is no osc_lock associated with speculative locks */
        osc_lock_lvb_update(env, osc, dlmlock, NULL);
 
        unlock_res_and_lock(dlmlock);
@@ -817,7 +824,7 @@ static bool osc_lock_compatible(const struct osc_lock *qing,
        struct cl_lock_descr *qed_descr = &qed->ols_cl.cls_lock->cll_descr;
        struct cl_lock_descr *qing_descr = &qing->ols_cl.cls_lock->cll_descr;
 
-       if (qed->ols_glimpse)
+       if (qed->ols_glimpse || qed->ols_speculative)
                return true;
 
        if (qing_descr->cld_mode == CLM_READ && qed_descr->cld_mode == CLM_READ)
@@ -935,6 +942,7 @@ static int osc_lock_enqueue(const struct lu_env *env,
        struct osc_io                   *oio   = osc_env_io(env);
        struct osc_object               *osc   = cl2osc(slice->cls_obj);
        struct osc_lock                 *oscl  = cl2osc_lock(slice);
+       struct obd_export               *exp   = osc_export(osc);
        struct cl_lock                  *lock  = slice->cls_lock;
        struct ldlm_res_id              *resname = &info->oti_resname;
        union ldlm_policy_data          *policy  = &info->oti_policy;
@@ -951,11 +959,22 @@ static int osc_lock_enqueue(const struct lu_env *env,
        if (oscl->ols_state == OLS_GRANTED)
                RETURN(0);
 
+       if ((oscl->ols_flags & LDLM_FL_NO_EXPANSION) &&
+           !(exp_connect_lockahead_old(exp) || exp_connect_lockahead(exp))) {
+               result = -EOPNOTSUPP;
+               CERROR("%s: server does not support lockahead/locknoexpand: "
+                      "rc = %d\n", exp->exp_obd->obd_name, result);
+               RETURN(result);
+       }
+
        if (oscl->ols_flags & LDLM_FL_TEST_LOCK)
                GOTO(enqueue_base, 0);
 
-       if (oscl->ols_glimpse) {
-               LASSERT(equi(oscl->ols_agl, anchor == NULL));
+       /* For glimpse and/or speculative locks, do not wait for reply from
+        * server on LDLM request */
+       if (oscl->ols_glimpse || oscl->ols_speculative) {
+               /* Speculative and glimpse locks do not have an anchor */
+               LASSERT(equi(oscl->ols_speculative, anchor == NULL));
                async = true;
                GOTO(enqueue_base, 0);
        }
@@ -981,25 +1000,31 @@ enqueue_base:
 
        /**
         * DLM lock's ast data must be osc_object;
-        * if glimpse or AGL lock, async of osc_enqueue_base() must be true,
+        * if glimpse or speculative lock, async of osc_enqueue_base()
+        * must be true
+        *
+        * For non-speculative locks:
         * DLM's enqueue callback set to osc_lock_upcall() with cookie as
         * osc_lock.
+        * For speculative locks:
+        * osc_lock_upcall_speculative & cookie is the osc object, since
+        * there is no osc_lock
         */
        ostid_build_res_name(&osc->oo_oinfo->loi_oi, resname);
        osc_lock_build_policy(env, lock, policy);
         */
        ostid_build_res_name(&osc->oo_oinfo->loi_oi, resname);
        osc_lock_build_policy(env, lock, policy);
-       if (oscl->ols_agl) {
+       if (oscl->ols_speculative) {
                oscl->ols_einfo.ei_cbdata = NULL;
                /* hold a reference for callback */
                cl_object_get(osc2cl(osc));
-               upcall = osc_lock_upcall_agl;
+               upcall = osc_lock_upcall_speculative;
                cookie = osc;
        }
-       result = osc_enqueue_base(osc_export(osc), resname, &oscl->ols_flags,
+       result = osc_enqueue_base(exp, resname, &oscl->ols_flags,
                                  policy, &oscl->ols_lvb,
                                  osc->oo_oinfo->loi_kms_valid,
                                  upcall, cookie,
                                  &oscl->ols_einfo, PTLRPCD_SET, async,
-                                 oscl->ols_agl);
+                                 oscl->ols_speculative);
        if (result == 0) {
                if (osc_lock_is_lockless(oscl)) {
                        oio->oi_lockless = 1;
@@ -1008,9 +1033,12 @@ enqueue_base:
                        LASSERT(oscl->ols_hold);
                        LASSERT(oscl->ols_dlmlock != NULL);
                }
-       } else if (oscl->ols_agl) {
+       } else if (oscl->ols_speculative) {
                cl_object_put(env, osc2cl(osc));
-               result = 0;
+               if (oscl->ols_glimpse) {
+                       /* hide error for AGL request */
+                       result = 0;
+               }
        }
 
 out:
@@ -1178,10 +1206,15 @@ int osc_lock_init(const struct lu_env *env,
        INIT_LIST_HEAD(&oscl->ols_wait_entry);
        INIT_LIST_HEAD(&oscl->ols_nextlock_oscobj);
 
+       /* Speculative lock requests must be either no_expand or glimpse
+        * request (CEF_GLIMPSE).  non-glimpse no_expand speculative extent
+        * locks will break ofd_intent_cb. (see comment there) */
+       LASSERT(ergo((enqflags & CEF_SPECULATIVE) != 0,
+               (enqflags & (CEF_LOCK_NO_EXPAND | CEF_GLIMPSE)) != 0));
+
        oscl->ols_flags = osc_enq2ldlm_flags(enqflags);
-       oscl->ols_agl = !!(enqflags & CEF_AGL);
-       if (oscl->ols_agl)
-               oscl->ols_flags |= LDLM_FL_BLOCK_NOWAIT;
+       oscl->ols_speculative = !!(enqflags & CEF_SPECULATIVE);
+
        if (oscl->ols_flags & LDLM_FL_HAS_INTENT) {
                oscl->ols_flags |= LDLM_FL_BLOCK_GRANTED;
                oscl->ols_glimpse = 1;
index 5526814..5f259e9 100644 (file)
@@ -100,7 +100,7 @@ struct osc_enqueue_args {
        void                    *oa_cookie;
        struct ost_lvb          *oa_lvb;
        struct lustre_handle    oa_lockh;
-       unsigned int            oa_agl:1;
+       bool                    oa_speculative;
 };
 
 static void osc_release_ppga(struct brw_page **ppga, size_t count);
@@ -2035,7 +2035,7 @@ static int osc_set_lock_data(struct ldlm_lock *lock, void *data)
 static int osc_enqueue_fini(struct ptlrpc_request *req,
                            osc_enqueue_upcall_f upcall, void *cookie,
                            struct lustre_handle *lockh, enum ldlm_mode mode,
-                           __u64 *flags, int agl, int errcode)
+                           __u64 *flags, bool speculative, int errcode)
 {
        bool intent = *flags & LDLM_FL_HAS_INTENT;
        int rc;
@@ -2052,7 +2052,7 @@ static int osc_enqueue_fini(struct ptlrpc_request *req,
                        ptlrpc_status_ntoh(rep->lock_policy_res1);
                if (rep->lock_policy_res1)
                        errcode = rep->lock_policy_res1;
-               if (!agl)
+               if (!speculative)
                        *flags |= LDLM_FL_LVB_READY;
        } else if (errcode == ELDLM_OK) {
                *flags |= LDLM_FL_LVB_READY;
@@ -2102,7 +2102,7 @@ static int osc_enqueue_interpret(const struct lu_env *env,
        /* Let CP AST to grant the lock first. */
        OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_CP_ENQ_RACE, 1);
 
-       if (aa->oa_agl) {
+       if (aa->oa_speculative) {
                LASSERT(aa->oa_lvb == NULL);
                LASSERT(aa->oa_flags == NULL);
                aa->oa_flags = &flags;
@@ -2114,7 +2114,7 @@ static int osc_enqueue_interpret(const struct lu_env *env,
                                   lockh, rc);
        /* Complete osc stuff. */
        rc = osc_enqueue_fini(req, aa->oa_upcall, aa->oa_cookie, lockh, mode,
-                             aa->oa_flags, aa->oa_agl, rc);
+                             aa->oa_flags, aa->oa_speculative, rc);
 
         OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_CP_CANCEL_RACE, 10);
 
@@ -2137,7 +2137,8 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
                     struct ost_lvb *lvb, int kms_valid,
                     osc_enqueue_upcall_f upcall, void *cookie,
                     struct ldlm_enqueue_info *einfo,
-                    struct ptlrpc_request_set *rqset, int async, int agl)
+                    struct ptlrpc_request_set *rqset, int async,
+                    bool speculative)
 {
        struct obd_device *obd = exp->exp_obd;
        struct lustre_handle lockh = { 0 };
@@ -2153,14 +2154,14 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
        policy->l_extent.start -= policy->l_extent.start & ~PAGE_MASK;
        policy->l_extent.end |= ~PAGE_MASK;
 
-        /*
-         * kms is not valid when either object is completely fresh (so that no
-         * locks are cached), or object was evicted. In the latter case cached
-         * lock cannot be used, because it would prime inode state with
-         * potentially stale LVB.
-         */
-        if (!kms_valid)
-                goto no_match;
+       /*
+        * kms is not valid when either object is completely fresh (so that no
+        * locks are cached), or object was evicted. In the latter case cached
+        * lock cannot be used, because it would prime inode state with
+        * potentially stale LVB.
+        */
+       if (!kms_valid)
+               goto no_match;
 
         /* Next, search for already existing extent locks that will cover us */
         /* If we're trying to read, we also search for an existing PW lock.  The
@@ -2177,7 +2178,10 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
         mode = einfo->ei_mode;
         if (einfo->ei_mode == LCK_PR)
                 mode |= LCK_PW;
-       if (agl == 0)
+       /* Normal lock requests must wait for the LVB to be ready before
+        * matching a lock; speculative lock requests do not need to,
+        * because they will not actually use the lock. */
+       if (!speculative)
                match_flags |= LDLM_FL_LVB_READY;
        if (intent != 0)
                match_flags |= LDLM_FL_BLOCK_GRANTED;
@@ -2190,13 +2194,22 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
                        RETURN(ELDLM_OK);
 
                matched = ldlm_handle2lock(&lockh);
-               if (agl) {
-                       /* AGL enqueues DLM locks speculatively. Therefore if
-                        * it already exists a DLM lock, it wll just inform the
-                        * caller to cancel the AGL process for this stripe. */
+               if (speculative) {
+                       /* This DLM lock request is speculative, and does not
+                        * have an associated IO request. Therefore if there
+                        * is already a DLM lock, it will just inform the
+                        * caller to cancel the request for this stripe.*/
+                       lock_res_and_lock(matched);
+                       if (ldlm_extent_equal(&policy->l_extent,
+                           &matched->l_policy_data.l_extent))
+                               rc = -EEXIST;
+                       else
+                               rc = -ECANCELED;
+                       unlock_res_and_lock(matched);
+
                        ldlm_lock_decref(&lockh, mode);
                        LDLM_LOCK_PUT(matched);
-                       RETURN(-ECANCELED);
+                       RETURN(rc);
                } else if (osc_set_lock_data(matched, einfo->ei_cbdata)) {
                        *flags |= LDLM_FL_LVB_READY;
 
@@ -2243,20 +2256,20 @@ no_match:
                        struct osc_enqueue_args *aa;
                        CLASSERT(sizeof(*aa) <= sizeof(req->rq_async_args));
                        aa = ptlrpc_req_async_args(req);
-                       aa->oa_exp    = exp;
-                       aa->oa_mode   = einfo->ei_mode;
-                       aa->oa_type   = einfo->ei_type;
+                       aa->oa_exp         = exp;
+                       aa->oa_mode        = einfo->ei_mode;
+                       aa->oa_type        = einfo->ei_type;
                        lustre_handle_copy(&aa->oa_lockh, &lockh);
-                       aa->oa_upcall = upcall;
-                       aa->oa_cookie = cookie;
-                       aa->oa_agl    = !!agl;
-                       if (!agl) {
+                       aa->oa_upcall      = upcall;
+                       aa->oa_cookie      = cookie;
+                       aa->oa_speculative = speculative;
+                       if (!speculative) {
                                aa->oa_flags  = flags;
                                aa->oa_lvb    = lvb;
                        } else {
-                               /* AGL is essentially to enqueue an DLM lock
-                                * in advance, so we don't care about the
-                                * result of AGL enqueue. */
+                               /* speculative locks are essentially to enqueue
+                                * a DLM lock  in advance, so we don't care
+                                * about the result of the enqueue. */
                                aa->oa_lvb    = NULL;
                                aa->oa_flags  = NULL;
                        }
@@ -2274,7 +2287,7 @@ no_match:
        }
 
        rc = osc_enqueue_fini(req, upcall, cookie, &lockh, einfo->ei_mode,
-                             flags, agl, rc);
+                             flags, speculative, rc);
        if (intent)
                ptlrpc_req_finished(req);
 
index 1f6afb7..1eb0078 100644
@@ -1300,8 +1300,8 @@ void lustre_assert_wire_constants(void)
                 OBD_CONNECT_DIR_STRIPE);
        LASSERTF(OBD_CONNECT_SUBTREE == 0x800000000000000ULL, "found 0x%.16llxULL\n",
                 OBD_CONNECT_SUBTREE);
-       LASSERTF(OBD_CONNECT_LOCK_AHEAD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
-                OBD_CONNECT_LOCK_AHEAD);
+       LASSERTF(OBD_CONNECT_LOCKAHEAD_OLD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
+                OBD_CONNECT_LOCKAHEAD_OLD);
        LASSERTF(OBD_CONNECT_BULK_MBITS == 0x2000000000000000ULL, "found 0x%.16llxULL\n",
                 OBD_CONNECT_BULK_MBITS);
        LASSERTF(OBD_CONNECT_OBDOPACK == 0x4000000000000000ULL, "found 0x%.16llxULL\n",
@@ -1310,6 +1310,8 @@ void lustre_assert_wire_constants(void)
                 OBD_CONNECT_FLAGS2);
        LASSERTF(OBD_CONNECT2_FILE_SECCTX == 0x1ULL, "found 0x%.16llxULL\n",
                 OBD_CONNECT2_FILE_SECCTX);
+       LASSERTF(OBD_CONNECT2_LOCKAHEAD == 0x2ULL, "found 0x%.16llxULL\n",
+                OBD_CONNECT2_LOCKAHEAD);
        LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
                (unsigned)OBD_CKSUM_CRC32);
        LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
index 33bb37a..33598d2 100644
@@ -77,7 +77,7 @@ noinst_PROGRAMS += write_time_limit rwv lgetxattr_size_check checkfiemap
 noinst_PROGRAMS += listxattr_size_check check_fhandle_syscalls badarea_io
 noinst_PROGRAMS += llapi_layout_test orphan_linkea_check llapi_hsm_test
 noinst_PROGRAMS += group_lock_test llapi_fid_test sendfile_grouplock mmap_cat
-noinst_PROGRAMS += swap_lock_test
+noinst_PROGRAMS += swap_lock_test lockahead_test
 
 bin_PROGRAMS = mcreate munlink
 testdir = $(libdir)/lustre/tests
@@ -100,6 +100,7 @@ swap_lock_test_LDADD=$(LIBLUSTREAPI)
 statmany_LDADD=$(LIBLUSTREAPI)
 statone_LDADD=$(LIBLUSTREAPI)
 rwv_LDADD=$(LIBCFS)
+lockahead_test_LDADD=$(LIBLUSTREAPI)
 
 ll_dirstripe_verify_SOURCES = ll_dirstripe_verify.c
 ll_dirstripe_verify_LDADD = $(LIBLUSTREAPI) $(LIBCFS) $(PTHREAD_LIBS)
diff --git a/lustre/tests/lockahead_test.c b/lustre/tests/lockahead_test.c
new file mode 100644
index 0000000..11cb843
--- /dev/null
+++ b/lustre/tests/lockahead_test.c
@@ -0,0 +1,1204 @@
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+
+/*
+ * Copyright 2016 Cray Inc. All rights reserved.
+ * Authors: Patrick Farrell, Frank Zago
+ *
+ * A few portions are extracted from llapi_layout_test.c
+ *
+ * The purpose of this test is to exercise the lockahead advice of ladvise.
+ *
+ * The program will exit as soon as a test fails.
+ */
+
+#include <stdlib.h>
+#include <errno.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <poll.h>
+#include <time.h>
+
+#include <lustre/lustreapi.h>
+#include <linux/lustre/lustre_idl.h>
+
+#define ERROR(fmt, ...)                                                        \
+       fprintf(stderr, "%s: %s:%d: %s: " fmt "\n",                     \
+               program_invocation_short_name, __FILE__, __LINE__,      \
+               __func__, ## __VA_ARGS__);
+
+#define DIE(fmt, ...)                          \
+       do {                                    \
+               ERROR(fmt, ## __VA_ARGS__);     \
+               exit(-1);               \
+       } while (0)
+
+#define ASSERTF(cond, fmt, ...)                                                \
+       do {                                                            \
+               if (!(cond))                                            \
+                       DIE("assertion '%s' failed: "fmt,               \
+                           #cond, ## __VA_ARGS__);                     \
+       } while (0)
+
+#define PERFORM(testfn) \
+       do {                                                            \
+               cleanup();                                              \
+               fprintf(stderr, "Starting test " #testfn " at %lld\n",  \
+                       (unsigned long long)time(NULL));                \
+               rc = testfn();                                          \
+               fprintf(stderr, "Finishing test " #testfn " at %lld\n", \
+                       (unsigned long long)time(NULL));                \
+               cleanup();                                              \
+       } while (0)
+
+/* Name of file/directory. Will be set once and will not change. */
+static char mainpath[PATH_MAX];
+static const char *mainfile = "lockahead_test_654";
+
+static char fsmountdir[PATH_MAX];      /* Lustre mountpoint */
+static char *lustre_dir;               /* Test directory inside Lustre */
+static int single_test;                        /* Number of a single test to execute*/
+
+/* Cleanup our test file. */
+static void cleanup(void)
+{
+       unlink(mainpath);
+}
+
+/* Trivial helper for one advice */
+void setup_ladvise_lockahead(struct llapi_lu_ladvise *advice, int mode,
+                            int flags, size_t start, size_t end, bool async)
+{
+       advice->lla_advice = LU_LADVISE_LOCKAHEAD;
+       advice->lla_lockahead_mode = mode;
+       if (async)
+               advice->lla_peradvice_flags = flags | LF_ASYNC;
+       else
+               advice->lla_peradvice_flags = flags;
+       advice->lla_start = start;
+       advice->lla_end = end;
+       advice->lla_value3 = 0;
+       advice->lla_value4 = 0;
+}
+
+/* Test valid single lock ahead request */
+static int test10(void)
+{
+       struct llapi_lu_ladvise advice;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                                 write_size - 1, true);
+
+       /* Manually set the result so we can verify it's being modified */
+       advice.lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0,
+               "cannot lockahead '%s': %s", mainpath, strerror(errno));
+       ASSERTF(advice.lla_lockahead_result == 0,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+
+       close(fd);
+
+       return 0;
+}
+
+/* Get lock, wait until lock is taken */
+static int test11(void)
+{
+       struct llapi_lu_ladvise advice;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       int enqueue_requests = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                                 write_size - 1, true);
+
+       /* Manually set the result so we can verify it's being modified */
+       advice.lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0,
+               "cannot lockahead '%s': %s", mainpath, strerror(errno));
+       ASSERTF(advice.lla_lockahead_result == 0,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       enqueue_requests++;
+
+       /* Ask again until we get the lock (status 1). */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice.lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, &advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice.lla_lockahead_result > 0)
+                       break;
+
+               enqueue_requests++;
+       }
+
+       ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       /* Again. This time it is always there. */
+       for (i = 0; i < 100; i++) {
+               advice.lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, &advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+               ASSERTF(advice.lla_lockahead_result > 0,
+                       "unexpected extent result: %d",
+                       advice.lla_lockahead_result);
+       }
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       close(fd);
+
+       return enqueue_requests;
+}
+
+/* Test with several times the same extent */
+static int test12(void)
+{
+       struct llapi_lu_ladvise *advice;
+       const int count = 10;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+       for (i = 0; i < count; i++) {
+               setup_ladvise_lockahead(&(advice[i]), MODE_WRITE_USER, 0, 0,
+                                         write_size - 1, true);
+               advice[i].lla_lockahead_result = 98674;
+       }
+
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == 0,
+               "cannot lockahead '%s': %s", mainpath, strerror(errno));
+       for (i = 0; i < count; i++) {
+               ASSERTF(advice[i].lla_lockahead_result >= 0,
+                       "unexpected extent result for extent %d: %d",
+                       i, advice[i].lla_lockahead_result);
+       }
+       /* Since all the requests are for the same extent, we should only have
+        * one lock at the end. */
+       expected_lock_count = 1;
+
+       /* Ask again until we get the locks. */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice[count-1].lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice[count-1].lla_lockahead_result > 0)
+                       break;
+       }
+
+       ASSERTF(advice[count-1].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice[count-1].lla_lockahead_result);
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       free(advice);
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Grow a lock forward */
+static int test13(void)
+{
+       struct llapi_lu_ladvise *advice = NULL;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       for (i = 0; i < 100; i++) {
+               if (advice)
+                       free(advice);
+               advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+               setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0,
+                                       i * write_size, (i+1)*write_size - 1,
+                                       true);
+               advice[0].lla_lockahead_result = 98674;
+
+               rc = llapi_ladvise(fd, 0, count, advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s' at offset %llu: %s",
+                       mainpath,
+                       advice[0].lla_end,
+                       strerror(errno));
+
+               ASSERTF(advice[0].lla_lockahead_result >= 0,
+                       "unexpected extent result for extent %d: %d",
+                       i, advice[0].lla_lockahead_result);
+
+               expected_lock_count++;
+       }
+
+       /* Ask again until we get the lock. */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice[0].lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice[0].lla_lockahead_result > 0)
+                       break;
+       }
+
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice[0].lla_lockahead_result);
+
+       free(advice);
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Grow a lock backward */
+static int test14(void)
+{
+       struct llapi_lu_ladvise *advice = NULL;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       const int num_blocks = 100;
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       for (i = 0; i < num_blocks; i++) {
+               size_t start = (num_blocks - i - 1) * write_size;
+               size_t end = (num_blocks - i) * write_size - 1;
+
+               if (advice)
+                       free(advice);
+               advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+               setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+                                       end, true);
+               advice[0].lla_lockahead_result = 98674;
+
+               rc = llapi_ladvise(fd, 0, count, advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s' at offset %llu: %s",
+                       mainpath,
+                       advice[0].lla_end,
+                       strerror(errno));
+
+               ASSERTF(advice[0].lla_lockahead_result >= 0,
+                       "unexpected extent result for extent %d: %d",
+                       i, advice[0].lla_lockahead_result);
+
+               expected_lock_count++;
+       }
+
+       /* Ask again until we get the lock. */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice[0].lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice[0].lla_lockahead_result > 0)
+                       break;
+       }
+
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice[0].lla_lockahead_result);
+
+       free(advice);
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Request many locks at 10MiB intervals */
+static int test15(void)
+{
+       struct llapi_lu_ladvise *advice;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+       for (i = 0; i < 5000; i++) {
+               /* The 'UL' designators are required to avoid undefined
+                * behavior which GCC turns in to an infinite loop */
+               __u64 start = i * 1024UL * 1024UL * 10UL;
+               __u64 end = start + 1;
+
+               setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+                                       end, true);
+
+               advice[0].lla_lockahead_result = 345678;
+
+               rc = llapi_ladvise(fd, 0, count, advice);
+
+               ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+                       mainpath, strerror(errno));
+               ASSERTF(advice[0].lla_lockahead_result >= 0,
+                       "unexpected extent result for extent %d: %d",
+                       i, advice[0].lla_lockahead_result);
+               expected_lock_count++;
+       }
+
+       /* Ask again until we get the lock. */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice[0].lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice[0].lla_lockahead_result > 0)
+                       break;
+       }
+
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice[0].lla_lockahead_result);
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+       /* The write should cancel the first lock (which was too small)
+        * and create one of its own, so the net effect on lock count is 0. */
+
+       free(advice);
+
+       close(fd);
+
+       /* We have to map our expected return in to the range of valid return
+        * codes, 0-255. */
+       expected_lock_count = expected_lock_count/1000;
+
+       return expected_lock_count;
+}
+
+/* Use lockahead to verify behavior of ladvise locknoexpand */
+static int test16(void)
+{
+       struct llapi_lu_ladvise *advice;
+       struct llapi_lu_ladvise *advice_noexpand;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       __u64 start = 0;
+       __u64 end = write_size - 1;
+       int rc;
+       char buf[write_size];
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       advice_noexpand = malloc(sizeof(struct llapi_lu_ladvise));
+
+       /* First ask for a read lock, which will conflict with the write */
+       setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, false);
+       advice[0].lla_lockahead_result = 345678;
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == 0,
+               "unexpected extent result for extent: %d",
+               advice[0].lla_lockahead_result);
+
+       /* Use an async request to verify we got the read lock we asked for */
+       setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, true);
+       advice[0].lla_lockahead_result = 345678;
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result for extent: %d",
+               advice[0].lla_lockahead_result);
+
+       /* Set noexpand */
+       advice_noexpand[0].lla_advice = LU_LADVISE_LOCKNOEXPAND;
+       advice_noexpand[0].lla_peradvice_flags = 0;
+       rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+
+       /* This write should generate a lock on exactly "write_size" bytes */
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+       /* Write should create one LDLM lock */
+       expected_lock_count++;
+
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+       advice[0].lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, advice);
+
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result for extent: %d",
+               advice[0].lla_lockahead_result);
+
+       /* Now, disable locknoexpand and try writing again. */
+       advice_noexpand[0].lla_peradvice_flags = LF_UNSET;
+       rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+       /* This write should get an expanded lock */
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+       /* Write should create one LDLM lock */
+       expected_lock_count++;
+
+       /* Verify it didn't get a lock on just the bytes it wrote.*/
+       usleep(100000); /* 0.1 second, plenty of time to get the lock */
+
+       start = start + write_size;
+       end = end + write_size;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+       advice[0].lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, advice);
+
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+               "unexpected extent result for extent %d",
+               advice[0].lla_lockahead_result);
+
+       free(advice);
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Use lockahead to verify behavior of ladvise locknoexpand, with O_NONBLOCK.
+ * There should be no change in behavior. */
+static int test17(void)
+{
+       struct llapi_lu_ladvise *advice;
+       struct llapi_lu_ladvise *advice_noexpand;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       __u64 start = 0;
+       __u64 end = write_size - 1;
+       int rc;
+       char buf[write_size];
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC | O_NONBLOCK,
+                 S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       advice_noexpand = malloc(sizeof(struct llapi_lu_ladvise));
+
+       /* First ask for a read lock, which will conflict with the write */
+       setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, false);
+       advice[0].lla_lockahead_result = 345678;
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == 0,
+               "unexpected extent result for extent: %d",
+               advice[0].lla_lockahead_result);
+
+       /* Use an async request to verify we got the read lock we asked for */
+       setup_ladvise_lockahead(advice, MODE_READ_USER, 0, start, end, true);
+       advice[0].lla_lockahead_result = 345678;
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result for extent: %d",
+               advice[0].lla_lockahead_result);
+
+       /* Set noexpand */
+       advice_noexpand[0].lla_advice = LU_LADVISE_LOCKNOEXPAND;
+       advice_noexpand[0].lla_peradvice_flags = 0;
+       rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+
+       /* This write should generate a lock on exactly "write_size" bytes */
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+       /* Write should create one LDLM lock */
+       expected_lock_count++;
+
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+       advice[0].lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, advice);
+
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result for extent: %d",
+               advice[0].lla_lockahead_result);
+
+       /* Now, disable locknoexpand and try writing again. */
+       advice_noexpand[0].lla_peradvice_flags = LF_UNSET;
+       rc = llapi_ladvise(fd, 0, 1, advice_noexpand);
+
+       /* This write should get an expanded lock */
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+       /* Write should create one LDLM lock */
+       expected_lock_count++;
+
+       /* Verify it didn't get a lock on just the bytes it wrote.*/
+       usleep(100000); /* 0.1 second, plenty of time to get the lock */
+
+       start = start + write_size;
+       end = end + write_size;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+
+       advice[0].lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, advice);
+
+       ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+               "unexpected extent result for extent %d",
+               advice[0].lla_lockahead_result);
+
+       free(advice);
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Test overlapping requests */
+static int test18(void)
+{
+       struct llapi_lu_ladvise *advice;
+       const int count = 1;
+       int fd;
+       int rc;
+       int i;
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+       /* Overlapping locks - Should only end up with 1 */
+       for (i = 0; i < 10; i++) {
+               __u64 start = i;
+               __u64 end = start + 4096;
+
+               setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+                                       end, true);
+
+               advice[0].lla_lockahead_result = 345678;
+
+               rc = llapi_ladvise(fd, 0, count, advice);
+
+               ASSERTF(rc == 0, "cannot lockahead '%s' : %s",
+                       mainpath, strerror(errno));
+               ASSERTF(advice[0].lla_lockahead_result >= 0,
+                       "unexpected extent result for extent %d: %d",
+                       i, advice[0].lla_lockahead_result);
+       }
+       expected_lock_count = 1;
+
+       /* Ask again until we get the lock. */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice[0].lla_lockahead_result = 456789;
+               setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, 0, 4096,
+                                       true);
+               rc = llapi_ladvise(fd, 0, count, advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice[0].lla_lockahead_result > 0)
+                       break;
+       }
+
+       ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice[0].lla_lockahead_result);
+
+       free(advice);
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Test that normal request blocks lock ahead requests */
+static int test19(void)
+{
+       struct llapi_lu_ladvise *advice;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       int expected_lock_count = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+
+       /* This should create a lock on the whole file, which will block lock
+        * ahead requests. */
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       expected_lock_count = 1;
+
+       /* These should all be blocked. */
+       for (i = 0; i < 10; i++) {
+               __u64 start = i * 4096;
+               __u64 end = start + 4096;
+
+               setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start,
+                                       end, true);
+
+               advice[0].lla_lockahead_result = 345678;
+
+               rc = llapi_ladvise(fd, 0, count, advice);
+
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+               ASSERTF(advice[0].lla_lockahead_result == LLA_RESULT_DIFFERENT,
+                       "unexpected extent result for extent %d: %d",
+                       i, advice[0].lla_lockahead_result);
+       }
+
+       free(advice);
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Test sync requests, and matching with async requests */
+static int test20(void)
+{
+       struct llapi_lu_ladvise advice;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       int expected_lock_count = 1;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       /* Async request */
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                               write_size - 1, true);
+
+       /* Manually set the result so we can verify it's being modified */
+       advice.lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0,
+               "cannot lockahead '%s': %s", mainpath, strerror(errno));
+       ASSERTF(advice.lla_lockahead_result == 0,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       /* Ask again until we get the lock (status 1). */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice.lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, &advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice.lla_lockahead_result > 0)
+                       break;
+       }
+
+       ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       /* Convert to a sync request on smaller range, should match and not
+        * cancel */
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                               write_size - 1 - write_size/2, false);
+
+       advice.lla_lockahead_result = 456789;
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+               mainpath, strerror(errno));
+       /* Sync requests cannot give detailed results */
+       ASSERTF(advice.lla_lockahead_result == 0,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       /* Use an async request to test original lock is still present */
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                               write_size - 1, true);
+
+       advice.lla_lockahead_result = 456789;
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Test sync requests, and conflict with async requests */
+static int test21(void)
+{
+       struct llapi_lu_ladvise advice;
+       const int count = 1;
+       int fd;
+       size_t write_size = 1024 * 1024;
+       int rc;
+       char buf[write_size];
+       int i;
+       int expected_lock_count = 1;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       /* Async request */
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                               write_size - 1, true);
+
+       /* Manually set the result so we can verify it's being modified */
+       advice.lla_lockahead_result = 345678;
+
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0,
+               "cannot lockahead '%s': %s", mainpath, strerror(errno));
+       ASSERTF(advice.lla_lockahead_result == 0,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       /* Ask again until we get the lock (status 1). */
+       for (i = 1; i < 100; i++) {
+               usleep(100000); /* 0.1 second */
+               advice.lla_lockahead_result = 456789;
+               rc = llapi_ladvise(fd, 0, count, &advice);
+               ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+                       mainpath, strerror(errno));
+
+               if (advice.lla_lockahead_result > 0)
+                       break;
+       }
+
+       ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       /* Convert to a sync request on larger range, should cancel existing
+        * lock */
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                               write_size*2 - 1, false);
+
+       advice.lla_lockahead_result = 456789;
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+               mainpath, strerror(errno));
+       /* Sync requests cannot give detailed results */
+       ASSERTF(advice.lla_lockahead_result == 0,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       /* Use an async request to test new lock is there */
+       setup_ladvise_lockahead(&advice, MODE_WRITE_USER, 0, 0,
+                               write_size*2 - 1, true);
+
+       advice.lla_lockahead_result = 456789;
+       rc = llapi_ladvise(fd, 0, count, &advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+               mainpath, strerror(errno));
+       ASSERTF(advice.lla_lockahead_result == LLA_RESULT_SAME,
+               "unexpected extent result: %d",
+               advice.lla_lockahead_result);
+
+       memset(buf, 0xaa, write_size);
+       rc = write(fd, buf, write_size);
+       ASSERTF(rc == sizeof(buf), "write failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       close(fd);
+
+       return expected_lock_count;
+}
+
+/* Test various valid and invalid inputs */
+static int test22(void)
+{
+       struct llapi_lu_ladvise *advice;
+       const int count = 1;
+       int fd;
+       int rc;
+       size_t start = 0;
+       size_t end = 0;
+
+       fd = open(mainpath, O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);
+       ASSERTF(fd >= 0, "open failed for '%s': %s",
+               mainpath, strerror(errno));
+
+       /* A valid async request first */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       start = 0;
+       end = 1024*1024;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+               mainpath, strerror(errno));
+       free(advice);
+
+       /* A valid sync request */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       start = 0;
+       end = 1024*1024;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, false);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == 0, "cannot lockahead '%s': %s",
+               mainpath, strerror(errno));
+       free(advice);
+
+       /* No actual block */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       start = 0;
+       end = 0;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == -1 && errno == EINVAL,
+               "unexpected return for no block lock: %d %s",
+               rc, strerror(errno));
+       free(advice);
+
+       /* end before start */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       start = 1024 * 1024;
+       end = 0;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0, start, end, true);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == -1 && errno == EINVAL,
+               "unexpected return for reversed block: %d %s",
+               rc, strerror(errno));
+       free(advice);
+
+       /* bogus lock mode - 0x65464 */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       start = 0;
+       end = 1024 * 1024;
+       setup_ladvise_lockahead(advice, 0x65464, 0, start, end, true);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == -1 && errno == EINVAL,
+               "unexpected return for bogus lock mode: %d %s",
+               rc, strerror(errno));
+       free(advice);
+
+       /* bogus flags, 0x80 */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       start = 0;
+       end = 1024 * 1024;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0x80, start, end,
+                               true);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == -1 && errno == EINVAL,
+               "unexpected return for bogus flags: %u %d %s",
+               0x80, rc, strerror(errno));
+       free(advice);
+
+       /* bogus flags, 0xff - CEF_MASK */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       end = 1024 * 1024;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0xff, start, end,
+                               true);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == -1 && errno == EINVAL,
+               "unexpected return for bogus flags: %u %d %s",
+               0xff, rc, strerror(errno));
+       free(advice);
+
+       /* bogus flags, 0xffffffff */
+       advice = malloc(sizeof(struct llapi_lu_ladvise)*count);
+       end = 1024 * 1024;
+       setup_ladvise_lockahead(advice, MODE_WRITE_USER, 0xffffffff, start,
+                               end, true);
+       rc = llapi_ladvise(fd, 0, count, advice);
+       ASSERTF(rc == -1 && errno == EINVAL,
+               "unexpected return for bogus flags: %u %d %s",
+               0xffffffff, rc, strerror(errno));
+       free(advice);
+
+       close(fd);
+
+       return 0;
+}
+
+static void usage(char *prog)
+{
+       fprintf(stderr, "Usage: %s [-d lustre_dir] [-t single_test]\n", prog);
+       exit(-1);
+}
+
+static void process_args(int argc, char *argv[])
+{
+       int c;
+
+       while ((c = getopt(argc, argv, "d:t:")) != -1) {
+               switch (c) {
+               case 'd':
+                       lustre_dir = optarg;
+                       break;
+               case 't':
+                       single_test = atoi(optarg);
+                       break;
+               case '?':
+               default:
+                       fprintf(stderr, "Invalid option '%c'\n", optopt);
+                       usage(argv[0]);
+                       break;
+               }
+       }
+}
+
+int main(int argc, char *argv[])
+{
+       char fsname[8];
+       int rc;
+
+       process_args(argc, argv);
+       if (lustre_dir == NULL)
+               lustre_dir = "/mnt/lustre";
+
+       rc = llapi_search_mounts(lustre_dir, 0, fsmountdir, fsname);
+       if (rc != 0) {
+               fprintf(stderr, "Error: '%s': not a Lustre filesystem\n",
+                       lustre_dir);
+               return -1;
+       }
+
+       /* Play nice with Lustre test scripts. Without line buffering,
+        * output under I/O redirection may appear out of order. */
+       setvbuf(stdout, NULL, _IOLBF, 0);
+
+       /* Create a test filename and reuse it. Remove possibly old files. */
+       rc = snprintf(mainpath, sizeof(mainpath), "%s/%s", lustre_dir,
+                     mainfile);
+       ASSERTF(rc > 0 && rc < sizeof(mainpath), "invalid name for mainpath");
+       cleanup();
+
+       atexit(cleanup);
+
+       switch (single_test) {
+       case 0:
+               PERFORM(test10);
+               PERFORM(test11);
+               PERFORM(test12);
+               PERFORM(test13);
+               PERFORM(test14);
+               PERFORM(test15);
+               PERFORM(test16);
+               PERFORM(test17);
+               PERFORM(test18);
+               PERFORM(test19);
+               PERFORM(test20);
+               PERFORM(test21);
+               PERFORM(test22);
+               /* When running all the test cases, we can't use the return
+                * from the last test case, as it might be non-zero to return
+                * info, rather than for an error.  Test cases assert and exit
+                * if an error occurs. */
+               rc = 0;
+               break;
+       case 10:
+               PERFORM(test10);
+               break;
+       case 11:
+               PERFORM(test11);
+               break;
+       case 12:
+               PERFORM(test12);
+               break;
+       case 13:
+               PERFORM(test13);
+               break;
+       case 14:
+               PERFORM(test14);
+               break;
+       case 15:
+               PERFORM(test15);
+               break;
+       case 16:
+               PERFORM(test16);
+               break;
+       case 17:
+               PERFORM(test17);
+               break;
+       case 18:
+               PERFORM(test18);
+               break;
+       case 19:
+               PERFORM(test19);
+               break;
+       case 20:
+               PERFORM(test20);
+               break;
+       case 21:
+               PERFORM(test21);
+               break;
+       case 22:
+               PERFORM(test22);
+               break;
+       default:
+               fprintf(stderr, "impossible value of single_test %d\n",
+                       single_test);
+               rc = -1;
+               break;
+       }
+
+       return rc;
+}
index e9ed875..a9c7b1e 100755 (executable)
@@ -14837,6 +14837,87 @@ test_255b() {
 }
 run_test 255b "check 'lfs ladvise -a dontneed'"
 
+test_255c() {
+       local count
+       local new_count
+       local difference
+       local i
+       local rc
+       test_mkdir -p $DIR/$tdir
+       $SETSTRIPE -i 0 $DIR/$tdir
+
+       #test 10 returns only success/failure
+       i=10
+       lockahead_test -d $DIR/$tdir -t $i
+       rc=$?
+       if [ $rc -eq 255 ]; then
+               error "Ladvise test${i} failed, ${rc}"
+       fi
+
+       #test 11 counts lock enqueue requests, all others count new locks
+       i=11
+       count=$(do_facet ost1 \
+               $LCTL get_param -n ost.OSS.ost.stats)
+       count=$(echo "$count" | grep ldlm_extent_enqueue | awk '{ print $2 }')
+
+       lockahead_test -d $DIR/$tdir -t $i
+       rc=$?
+       if [ $rc -eq 255 ]; then
+               error "Ladvise test${i} failed, ${rc}"
+       fi
+
+       new_count=$(do_facet ost1 \
+               $LCTL get_param -n ost.OSS.ost.stats)
+       new_count=$(echo "$new_count" | grep ldlm_extent_enqueue | \
+                  awk '{ print $2 }')
+
+       difference="$((new_count - count))"
+       if [ $difference -ne $rc ]; then
+               error "Ladvise test${i}, bad enqueue count, returned " \
+                     "${rc}, actual ${difference}"
+       fi
+
+       for i in $(seq 12 21); do
+               # If we do not do this, we run the risk of having too many
+               # locks and starting lock cancellation while we are checking
+               # lock counts.
+               cancel_lru_locks osc
+
+               count=$($LCTL get_param -n \
+                      ldlm.namespaces.$FSNAME-OST0000*osc-f*.lock_unused_count)
+
+               lockahead_test -d $DIR/$tdir -t $i
+               rc=$?
+               if [ $rc -eq 255 ]; then
+                       error "Ladvise test ${i} failed, ${rc}"
+               fi
+
+               new_count=$($LCTL get_param -n \
+                      ldlm.namespaces.$FSNAME-OST0000*osc-f*.lock_unused_count)
+               difference="$((new_count - count))"
+
+               # Test 15 output is divided by 1000 to map down to valid return
+               if [ $i -eq 15 ]; then
+                       rc="$((rc * 1000))"
+               fi
+
+               if [ $difference -ne $rc ]; then
+                       error "Ladvise test ${i}, bad lock count, returned " \
+                             "${rc}, actual ${difference}"
+               fi
+       done
+
+       #test 22 returns only success/failure
+       i=22
+       lockahead_test -d $DIR/$tdir -t $i
+       rc=$?
+       if [ $rc -eq 255 ]; then
+               error "Ladvise test${i} failed, ${rc}"
+       fi
+
+}
+run_test 255c "suite of ladvise lockahead tests"
+
 test_256() {
        local cl_user
        local cat_sl
index 44ce851..dc07fb9 100644 (file)
@@ -400,8 +400,9 @@ command_t cmdlist[] = {
        {"ladvise", lfs_ladvise, 0,
         "Provide servers with advice about access patterns for a file.\n"
         "usage: ladvise [--advice|-a ADVICE] [--start|-s START[kMGT]]\n"
-        "               [--background|-b]\n"
+        "               [--background|-b] [--unset|-u]\n"
         "               {[--end|-e END[kMGT]] | [--length|-l LENGTH[kMGT]]}\n"
+        "               [--mode|-m {READ,WRITE}]\n"
         "               <file> ...\n"},
        {"help", Parser_help, 0, "help"},
        {"exit", Parser_quit, 0, "quit"},
@@ -5218,6 +5219,28 @@ static int lfs_swap_layouts(int argc, char **argv)
 
 static const char *const ladvise_names[] = LU_LADVISE_NAMES;
 
+static const char *const lock_mode_names[] = LOCK_MODE_NAMES;
+
+static const char *const lockahead_results[] = {
+       [LLA_RESULT_SENT] = "Lock request sent",
+       [LLA_RESULT_DIFFERENT] = "Different matching lock found",
+       [LLA_RESULT_SAME] = "Matching lock on identical extent found",
+};
+
+static int lfs_get_mode(const char *string)
+{
+       enum lock_mode_user mode;
+
+       for (mode = 0; mode < ARRAY_SIZE(lock_mode_names); mode++) {
+               if (lock_mode_names[mode] == NULL)
+                       continue;
+               if (strcmp(string, lock_mode_names[mode]) == 0)
+                       return mode;
+       }
+
+       return -EINVAL;
+}
+
 static enum lu_ladvise_type lfs_get_ladvice(const char *string)
 {
        enum lu_ladvise_type advice;
@@ -5240,9 +5263,11 @@ static int lfs_ladvise(int argc, char **argv)
        { .val = 'b',   .name = "background",   .has_arg = no_argument },
        { .val = 'e',   .name = "end",          .has_arg = required_argument },
        { .val = 'l',   .name = "length",       .has_arg = required_argument },
+       { .val = 'm',   .name = "mode",         .has_arg = required_argument },
        { .val = 's',   .name = "start",        .has_arg = required_argument },
+       { .val = 'u',   .name = "unset",        .has_arg = no_argument },
        { .name = NULL } };
-       char                     short_opts[] = "a:be:l:s:";
+       char                     short_opts[] = "a:be:l:m:s:u";
        int                      c;
        int                      rc = 0;
        const char              *path;
@@ -5254,6 +5279,7 @@ static int lfs_ladvise(int argc, char **argv)
        unsigned long long       length = 0;
        unsigned long long       size_units;
        unsigned long long       flags = 0;
+       int                      mode = 0;
 
        optind = 0;
        while ((c = getopt_long(argc, argv, short_opts,
@@ -5282,6 +5308,9 @@ static int lfs_ladvise(int argc, char **argv)
                case 'b':
                        flags |= LF_ASYNC;
                        break;
+               case 'u':
+                       flags |= LF_UNSET;
+                       break;
                case 'e':
                        size_units = 1;
                        rc = llapi_parse_size(optarg, &end,
@@ -5312,6 +5341,15 @@ static int lfs_ladvise(int argc, char **argv)
                                return CMD_HELP;
                        }
                        break;
+               case 'm':
+                       mode = lfs_get_mode(optarg);
+                       if (mode < 0) {
+                               fprintf(stderr, "%s: bad mode '%s', valid "
+                                                "modes are READ or WRITE\n",
+                                       argv[0], optarg);
+                               return CMD_HELP;
+                       }
+                       break;
                case '?':
                        return CMD_HELP;
                default:
@@ -5334,6 +5372,13 @@ static int lfs_ladvise(int argc, char **argv)
                return CMD_HELP;
        }
 
+       if (advice_type == LU_LADVISE_LOCKNOEXPAND) {
+               fprintf(stderr, "%s: lock-no-expand is a per-file-descriptor "
+                                "advice, so it has no effect when called "
+                                "from lfs\n", argv[0]);
+               return CMD_HELP;
+       }
+
        if (argc <= optind) {
                fprintf(stderr, "%s: please give one or more file names\n",
                        argv[0]);
@@ -5355,6 +5400,18 @@ static int lfs_ladvise(int argc, char **argv)
                return CMD_HELP;
        }
 
+       if (advice_type != LU_LADVISE_LOCKAHEAD && mode != 0) {
+               fprintf(stderr, "%s: mode is only valid with lockahead\n",
+                       argv[0]);
+               return CMD_HELP;
+       }
+
+       if (advice_type == LU_LADVISE_LOCKAHEAD && mode == 0) {
+               fprintf(stderr, "%s: mode is required with lockahead\n",
+                       argv[0]);
+               return CMD_HELP;
+       }
+
        while (optind < argc) {
                int rc2;
 
@@ -5375,6 +5432,11 @@ static int lfs_ladvise(int argc, char **argv)
                advice.lla_value2 = 0;
                advice.lla_value3 = 0;
                advice.lla_value4 = 0;
+               if (advice_type == LU_LADVISE_LOCKAHEAD) {
+                       advice.lla_lockahead_mode = mode;
+                       advice.lla_peradvice_flags = flags;
+               }
+
                rc2 = llapi_ladvise(fd, flags, 1, &advice);
                close(fd);
                if (rc2 < 0) {
@@ -5382,7 +5444,10 @@ static int lfs_ladvise(int argc, char **argv)
                                "'%s': %s\n", argv[0],
                                ladvise_names[advice_type],
                                path, strerror(errno));
+
+                       goto next;
                }
+
 next:
                if (rc == 0 && rc2 < 0)
                        rc = rc2;
index 445c145..c098889 100644 (file)
@@ -52,8 +52,9 @@
 int llapi_ladvise(int fd, unsigned long long flags, int num_advise,
                  struct llapi_lu_ladvise *ladvise)
 {
-       int rc;
        struct llapi_ladvise_hdr *ladvise_hdr;
+       int rc;
+       int i;
 
        if (num_advise < 1 || num_advise >= LAH_COUNT_MAX) {
                errno = EINVAL;
@@ -79,6 +80,18 @@ int llapi_ladvise(int fd, unsigned long long flags, int num_advise,
                llapi_error(LLAPI_MSG_ERROR, -errno, "cannot give advice");
                return -1;
        }
+
+       /* Copy results back into the caller-provided structs */
+       for (i = 0; i < num_advise; i++) {
+               struct llapi_lu_ladvise *ladvise_iter;
+
+               ladvise_iter = &ladvise_hdr->lah_advise[i];
+
+               if (ladvise_iter->lla_advice == LU_LADVISE_LOCKAHEAD)
+                       ladvise[i].lla_lockahead_result =
+                                       ladvise_iter->lla_lockahead_result;
+       }
+
        return 0;
 }
 
index 91afaee..dc3e115 100644 (file)
@@ -589,11 +589,12 @@ check_obd_connect_data(void)
        CHECK_DEFINE_64X(OBD_CONNECT_MULTIMODRPCS);
        CHECK_DEFINE_64X(OBD_CONNECT_DIR_STRIPE);
        CHECK_DEFINE_64X(OBD_CONNECT_SUBTREE);
-       CHECK_DEFINE_64X(OBD_CONNECT_LOCK_AHEAD);
+       CHECK_DEFINE_64X(OBD_CONNECT_LOCKAHEAD_OLD);
        CHECK_DEFINE_64X(OBD_CONNECT_BULK_MBITS);
        CHECK_DEFINE_64X(OBD_CONNECT_OBDOPACK);
        CHECK_DEFINE_64X(OBD_CONNECT_FLAGS2);
        CHECK_DEFINE_64X(OBD_CONNECT2_FILE_SECCTX);
+       CHECK_DEFINE_64X(OBD_CONNECT2_LOCKAHEAD);
 
        CHECK_VALUE_X(OBD_CKSUM_CRC32);
        CHECK_VALUE_X(OBD_CKSUM_ADLER);
index 1788512..3851c24 100644 (file)
@@ -1319,8 +1319,8 @@ void lustre_assert_wire_constants(void)
                 OBD_CONNECT_DIR_STRIPE);
        LASSERTF(OBD_CONNECT_SUBTREE == 0x800000000000000ULL, "found 0x%.16llxULL\n",
                 OBD_CONNECT_SUBTREE);
-       LASSERTF(OBD_CONNECT_LOCK_AHEAD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
-                OBD_CONNECT_LOCK_AHEAD);
+       LASSERTF(OBD_CONNECT_LOCKAHEAD_OLD == 0x1000000000000000ULL, "found 0x%.16llxULL\n",
+                OBD_CONNECT_LOCKAHEAD_OLD);
        LASSERTF(OBD_CONNECT_BULK_MBITS == 0x2000000000000000ULL, "found 0x%.16llxULL\n",
                 OBD_CONNECT_BULK_MBITS);
        LASSERTF(OBD_CONNECT_OBDOPACK == 0x4000000000000000ULL, "found 0x%.16llxULL\n",
@@ -1329,6 +1329,8 @@ void lustre_assert_wire_constants(void)
                 OBD_CONNECT_FLAGS2);
        LASSERTF(OBD_CONNECT2_FILE_SECCTX == 0x1ULL, "found 0x%.16llxULL\n",
                 OBD_CONNECT2_FILE_SECCTX);
+       LASSERTF(OBD_CONNECT2_LOCKAHEAD == 0x2ULL, "found 0x%.16llxULL\n",
+                OBD_CONNECT2_LOCKAHEAD);
        LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
                (unsigned)OBD_CKSUM_CRC32);
        LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",