Whamcloud - gitweb
fs/lustre-release.git
LU-14051 utils: flush alr batch file in thread 10/40310/7
John L. Hammond [Thu, 17 Sep 2020 20:56:55 +0000 (15:56 -0500)]
LU-14051 utils: flush alr batch file in thread

In ofd_access_log_reader, move flushing of the batch file to the
sort and print thread.

Test-Parameters: trivial
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: Id1e008ede6c05e24ea2e2459520d6585007acc7d
Reviewed-on: https://review.whamcloud.com/40310
Tested-by: Maloo <maloo@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14050 utils: fix fraction output logic 34/39934/6
Alex Zhuravlev [Wed, 16 Sep 2020 19:14:42 +0000 (22:14 +0300)]
LU-14050 utils: fix fraction output logic

Fix the fraction output logic so that it doesn't suppress
single-line reports.

Test-Parameters: trivial
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I9e374f8e769afcadc5cecbf529fa403deb544544
Reviewed-on: https://review.whamcloud.com/39934
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-12546 mdt: abort recovery between MDTs 27/36027/10
Hongchao Zhang [Sun, 12 Apr 2020 12:40:27 +0000 (20:40 +0800)]
LU-12546 mdt: abort recovery between MDTs

Add an option to abort recovery between MDTs in case there is a
problem during recovery (e.g. MDT is missing or has broken logs),
but don't abort recovery between MDT and clients.

Change-Id: Id88f2b2ebae5cfa722dcac67c087b9b9a448721e
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36027
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
LU-13765 osd-ldiskfs: Rename dt_declare_falloc to dt_declare_fallocate 09/40509/2
Arshad Hussain [Sun, 1 Nov 2020 03:33:13 +0000 (09:03 +0530)]
LU-13765 osd-ldiskfs: Rename dt_declare_falloc to dt_declare_fallocate

This patch is a follow-up to patch 93f700ca24
(LU-13765 osd-ldiskfs: Extend credit correctly for fallocate) and
makes these changes:

01. Rename dt_declare_falloc() to dt_declare_fallocate()
for better readability.

02. Remove the fallocate mode check from osd_fallocate(),
as the mode is already checked in the declare phase.

03. Minor space/tab changes.

Test-Parameters: trivial testlist=sanity ostsizegb=12 env=ONLY="150e"
Test-Parameters: testlist=sanity-quota
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Change-Id: If911a59a9c944e660e9926f4c436a4aeb2919284
Reviewed-on: https://review.whamcloud.com/40509
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
LU-10810 test: test lseek support in tools 02/40502/10
Mikhail Pershin [Sat, 31 Oct 2020 09:03:33 +0000 (12:03 +0300)]
LU-10810 test: test lseek support in tools

Check that SEEK_HOLE/SEEK_DATA work as expected in external
tools.

'cp' version 8.33+ and 'tar' version 1.29+ are needed, so check the
tool versions and measure the runtime of sparse file handling if
applicable.
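
The tests themselves are shell scripts; purely as an illustration, the
version gate above amounts to a numeric MAJOR.MINOR comparison. A
minimal sketch, where version_at_least() is a hypothetical helper and
parsing the real `cp --version` output is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Return true if a "MAJOR.MINOR" version string is at least
 * min_major.min_minor.  A missing minor part counts as 0 and an
 * unparsable string is treated as "too old".
 */
static bool version_at_least(const char *ver, int min_major, int min_minor)
{
	int major = 0, minor = 0;

	if (sscanf(ver, "%d.%d", &major, &minor) < 1)
		return false;
	return major > min_major ||
	       (major == min_major && minor >= min_minor);
}
```

The comparison must be numeric: as strings, "8.4" sorts after "8.33",
but as a version it is older.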

Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I1424bf57c88f69d054c1646be66e10dd7fde8a1a
Reviewed-on: https://review.whamcloud.com/40502
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
LU-10810 clio: SEEK_HOLE/SEEK_DATA on client side 08/39708/16
Mikhail Pershin [Fri, 21 Aug 2020 16:22:38 +0000 (19:22 +0300)]
LU-10810 clio: SEEK_HOLE/SEEK_DATA on client side

This patch introduces basic support for the lseek SEEK_HOLE/SEEK_DATA
whence values in the Lustre client.

- introduce new IO type CIT_LSEEK in the CLIO stack
- LOV splits the request across all stripes involved and merges
  the results back.
- OSC sends the OST LSEEK RPC asynchronously
- if the target doesn't support the LSEEK RPC, OSC assumes the
  whole related object is data with a virtual hole at the end
- lseek restores released files, assuming it is done prior to
  copying the file.
- a tool is added to request the needed lseek on a file
- basic tests are added in sanity, sanityn and sanity-hsm
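
From user space, this support is consumed through plain lseek(2). A
minimal sketch (not Lustre code) that sums the data bytes of a sparse
file by alternating SEEK_DATA and SEEK_HOLE, assuming Linux/glibc,
where _GNU_SOURCE exposes the two whence values:

```c
#define _GNU_SOURCE		/* SEEK_DATA / SEEK_HOLE in glibc <unistd.h> */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum the bytes lseek() reports as data, skipping holes; -1 on error. */
static off_t count_data_bytes(int fd)
{
	off_t total = 0;
	off_t pos = 0;

	for (;;) {
		/* Start of the next data segment at or after pos. */
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)
			return errno == ENXIO ? total : -1; /* ENXIO: no more data */

		/* That segment ends where the next hole begins (EOF counts). */
		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;

		total += hole - data;
		pos = hole;
	}
}
```

Per lseek(2), a filesystem without hole support may report the whole
file as data with a hole at EOF, in which case this returns the file
size; hole granularity (page or block) also varies by filesystem.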

Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I0728329d4bce71c441de581a439cde1aa873fd46
Reviewed-on: https://review.whamcloud.com/39708
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Tested-by: Oleg Drokin <green@whamcloud.com>
LU-10810 ptlrpc: introduce OST_SEEK RPC 07/39707/9
Mikhail Pershin [Mon, 17 Aug 2020 11:06:30 +0000 (14:06 +0300)]
LU-10810 ptlrpc: introduce OST_SEEK RPC

For the purposes of SEEK_HOLE/SEEK_DATA support, introduce a
new OST_SEEK RPC.

The patch adds the RPC layout, a unified handler and a connect flag
for compatibility.
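
The compatibility case described in the client-side commit above (a
target without LSEEK RPC support is treated as all data, followed by a
virtual hole at the end) can be modelled as a pure function. A sketch
with hypothetical names, not the OSC implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for SEEK_DATA / SEEK_HOLE. */
enum seek_kind { SEEK_KIND_DATA, SEEK_KIND_HOLE };

/*
 * Answer a data/hole query for an object of `size` bytes without asking
 * the target: the whole object is data and the hole starts at `size`.
 * Returns the resulting offset, or -ENXIO.
 */
static int64_t seek_without_server(int64_t offset, int64_t size,
				   enum seek_kind kind)
{
	/* At or beyond EOF there is no data and no hole, as in lseek(2);
	 * this sketch also rejects negative offsets. */
	if (offset < 0 || offset >= size)
		return -ENXIO;
	return kind == SEEK_KIND_DATA ? offset : size;
}
```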

Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I1580902b6b773d9a6d6f9beaa1ee1da60fbc20f8
Reviewed-on: https://review.whamcloud.com/39707
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14116 autoconf: check if DES3 enctype is supported 54/40554/3
Jian Yu [Fri, 6 Nov 2020 06:31:28 +0000 (22:31 -0800)]
LU-14116 autoconf: check if DES3 enctype is supported

krb5 releases 1.18 and later completely remove support for
all DES3 enctypes (des3-cbc-raw, des3-hmac-sha1, des3-cbc-sha1-kd).

This patch adds HAVE_DES3_SUPPORT to check if DES3 enctype
is supported.

Change-Id: Ibb51ec7961e8c775ea92dec6119f4de01e2d9b1d
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40554
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
LU-13182 llite: Avoid eternel retry loops with MAP_POPULATE 21/40221/6
Oleg Drokin [Mon, 12 Oct 2020 20:10:15 +0000 (16:10 -0400)]
LU-13182 llite: Avoid eternel retry loops with MAP_POPULATE

Kernels 5.4+ can enter an infinite retry loop with the MAP_POPULATE
mmap option. Use FAULT_FLAG_RETRY_NOWAIT to instruct filemap_fault
not to drop mmap_sem, so that if the call fails, we can use the slow
path and prevent the loop from forming.
(Idea by Neil Brown)

Test-Parameters: testlist=sanity-hsm env=ONLY=1 clientdistro=ubuntu2004
Change-Id: I320ab9ca447282aea15ef2030ef8671c4260d895
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40221
Reviewed-by: Neil Brown <neilb@suse.de>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Tested-by: Maloo <maloo@whamcloud.com>
LU-13651 hsm: call hsm_find_compatible_cb() only for cancel 67/38867/18
Kirill Malkin [Sun, 17 May 2020 03:17:43 +0000 (20:17 -0700)]
LU-13651 hsm: call hsm_find_compatible_cb() only for cancel

The HSM action queue is scanned linearly in hsm_find_compatible_cb()
for existing requests on the same file, so that duplicate or
conflicting requests are not added and cancel requests are assigned
the correct cookie. However, this can cause a large delay in adding
new requests when the action queue is very large, as access to it is
locked for the duration of the search.

Scanning the queue does not guarantee that duplicate or conflicting
requests are not added: scanning (in hsm_find_compatible_cb()) and
adding requests (in mdt_agent_record_add()) are distinct operations
that are not serialized by a lock, so a race window exists between
these two function calls within which duplicate or conflicting
requests can be added. This is hopefully not a big problem, though:
the CDT thread will not send duplicate archive requests to a copytool
serving a different HSM backend (and it could probably be prevented
from sending duplicate archive requests to a copytool serving the
same backend with a small change in mdt_hsm_is_action_compat()), and
duplicate restore requests are serialized by taking the layout lock
on the file before they are added to the action queue (although this
blocks the caller, e.g. lfs, so it might not be ideal).

Since calling hsm_find_compatible_cb() does not fully protect against
this issue and can cause large delays in adding new requests, we skip
calling it for all requests apart from cancel requests that don't
specify a cookie (which should be all cancel requests in current
code), hopefully safely.
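
The rule above reduces to a simple predicate in front of an O(n) scan.
A minimal sketch with hypothetical types and names; the real queue and
hsm_find_compatible_cb() are not modelled here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified action types and queue entries. */
enum req_action { ACT_ARCHIVE, ACT_RESTORE, ACT_REMOVE, ACT_CANCEL };

struct queued_req {
	uint64_t fid;		/* file the request applies to */
	uint64_t cookie;	/* cookie assigned when it was queued */
};

/* Only a cancel that carries no cookie needs the queue scan. */
static bool needs_queue_scan(enum req_action action, uint64_t cookie)
{
	return action == ACT_CANCEL && cookie == 0;
}

/* Linear scan for a queued request on the same file; 0 = not found. */
static uint64_t find_cookie(const struct queued_req *q, size_t n,
			    uint64_t fid)
{
	for (size_t i = 0; i < n; i++)
		if (q[i].fid == fid)
			return q[i].cookie;
	return 0;
}
```

Every other request type skips find_cookie() entirely, so its cost no
longer grows with the queue length.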

Test-Parameters: testlist=sanity-hsm
Signed-off-by: Kirill Malkin <kirill.malkin@hpe.com>
Signed-off-by: Nikitas Angelinas <nikitas.angelinas@hpe.com>
Signed-off-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Cray-bug-id: LUS-8717
Change-Id: Id82b2a0720e46a9c12c4d9df323ce5a7bd7aff37
Reviewed-on: https://review.whamcloud.com/38867
Reviewed-by: Ben Evans <beevans@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Nathan Rutman <nrutman@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13949 build: add autogen.sh into distribution tarball 25/40425/2
Jian Yu [Tue, 27 Oct 2020 17:33:03 +0000 (10:33 -0700)]
LU-13949 build: add autogen.sh into distribution tarball

This patch adds autogen.sh and config/lustre-version.m4 into
Lustre distribution tarball so that customers can regenerate
aclocal.m4, config.h.in, autoMakefile.in and configure in
their build environments.

Change-Id: Ic6c5430b9a8b504ebc6a7618e141f1ea23b046a2
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40425
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shuichi Ihara <sihara@ddn.com>
LU-14073 build: fix autoconf test for clean_bdev_aliases() 95/40395/3
Mr NeilBrown [Mon, 26 Oct 2020 02:38:21 +0000 (13:38 +1100)]
LU-14073 build: fix autoconf test for clean_bdev_aliases()

From 5.9, buffer_head.h no longer provides a declaration for
'struct block_device' so the code fragment fails because the compiler
doesn't know the size of that structure.

Instead, simply pass NULL rather than the address of a real structure.
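
The underlying C rule: a pointer to an incomplete struct type can be
passed around (including as NULL), but an object of that type cannot be
declared. A standalone sketch; probe_target() is a hypothetical
stand-in, not the kernel's clean_bdev_aliases(), and the real check is
an autoconf test compiled against kernel headers:

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration only: the type is incomplete in this file. */
struct block_device;

/* Hypothetical stand-in: a block_device pointer plus a block range. */
static int probe_target(struct block_device *bdev, long block, long len)
{
	(void)block;
	(void)len;
	return bdev == NULL;
}

static int probe_fragment(void)
{
	/*
	 * `struct block_device bdev; probe_target(&bdev, 1, 1);` would not
	 * compile here, because the size of an incomplete type is unknown.
	 * Passing NULL needs only the pointer type.
	 */
	return probe_target(NULL, 1, 1);
}
```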

Test-Parameters: trivial
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: I1775572fbd56d22822b6e440fe95bd105042e7b8
Reviewed-on: https://review.whamcloud.com/40395
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Aurelien Degremont <degremoa@amazon.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
LU-14070 tgt: check obd_recovering in tgt_brw_unlock() 82/40382/4
Mikhail Pershin [Fri, 23 Oct 2020 09:30:58 +0000 (12:30 +0300)]
LU-14070 tgt: check obd_recovering in tgt_brw_unlock()

Since tgt_brw_lock() never takes a lock during recovery,
tgt_brw_unlock() should also check for recovery to prevent
a false-positive assertion failure.

Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: Ic6b6f6fa16678622460101d26df14f523e56a47a
Reviewed-on: https://review.whamcloud.com/40382
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14057 ptlrpc: don't log connection 'restored' inappropriately 31/40331/3
Aurelien Degremont [Fri, 16 Oct 2020 13:42:42 +0000 (13:42 +0000)]
LU-14057 ptlrpc: don't log connection 'restored' inappropriately

Reverse imports maintain a target->client connection which
does not support recovery, as clients don't run recovery.
At every connection, the reverse import state goes from
NEW to RECOVER to FULL, which triggers a `Connection restored`
log message, even if this is the first connection from
this client.

Suppress this log message for reverse imports to avoid
this misleading output.

Test-Parameters: trivial
Signed-off-by: Aurelien Degremont <degremoa@amazon.com>
Change-Id: I6f35b8d916a4ae535d55ba39b491f57e1553986c
Reviewed-on: https://review.whamcloud.com/40331
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13728 utils: add missing global parameters 98/40298/2
Cyril Bordage [Mon, 19 Oct 2020 16:10:23 +0000 (18:10 +0200)]
LU-13728 utils: add missing global parameters

Make lnetctl export show the complete set of global parameters,
as lnetctl global does.

Signed-off-by: Cyril Bordage <cbordage@whamcloud.com>
Change-Id: I4d864fb4734679106ac6c49ec7f57f5e00ba3434
Reviewed-on: https://review.whamcloud.com/40298
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-10728 utils: fix str length in error string 68/40268/3
Cyril Bordage [Fri, 16 Oct 2020 13:39:06 +0000 (15:39 +0200)]
LU-10728 utils: fix str length in error string

sizeof was applied to pointers to get the length of the string.
Instead, use the string length passed in the function arguments.
Also remove needless uses of snprintf and terminating null bytes.
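
This bug class is easy to reproduce in isolation. A standalone sketch
with hypothetical helper names (not the lnetctl code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* In a callee, a `char *` parameter is just a pointer: sizeof yields the
 * pointer size (typically 8 on 64-bit), not the caller's buffer size. */
static size_t size_seen_by_callee(char *buf)
{
	return sizeof(buf);
}

/*
 * Fixed pattern: the caller passes the real length.  snprintf() always
 * NUL-terminates when len > 0, so a manual buf[len - 1] = '\0' is
 * redundant.  Returns the untruncated length so callers can detect
 * truncation.
 */
static int format_error(char *buf, size_t len, const char *msg)
{
	return snprintf(buf, len, "error: %s", msg);
}
```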

Test-Parameters: @lnet
Signed-off-by: Cyril Bordage <cbordage@whamcloud.com>
Change-Id: I7053f39828ababd5782b360ef5c27c607ddb740d
Reviewed-on: https://review.whamcloud.com/40268
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14029 kernel: new kernel [SLES15 SP2 5.3.18-24.24.1] 65/40265/3
Jian Yu [Thu, 15 Oct 2020 19:58:01 +0000 (12:58 -0700)]
LU-14029 kernel: new kernel [SLES15 SP2 5.3.18-24.24.1]

This patch makes changes to support new SLES15 SP2 release
with kernel 5.3.18-24.24.1 for Lustre client.

Test-Parameters: trivial \
env=SANITY_EXCEPT="100 130 136 817" \
clientdistro=sles15sp2 serverdistro=el7.8 \
testlist=sanity

Change-Id: Icf97678ebb0c6495d956f13d57e0cea65a20b108
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40265
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13989 ldlm: BL AST vs failed lock enqueue race 46/40046/3
Andriy Skulysh [Tue, 11 Feb 2020 12:00:18 +0000 (14:00 +0200)]
LU-13989 ldlm: BL AST vs failed lock enqueue race

failed_lock_cleanup() marks the lock with LDLM_FL_LOCAL_ONLY,
so a cancel request isn't sent.

Mark a failed lock with LDLM_FL_LOCAL_ONLY only
if a BL AST wasn't received.
Add the server's lock handle to the BL AST RPC,
so the client will be able to cancel the lock
even if the enqueue fails.
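
The flag decision can be sketched in a few lines. The flag names and
values below are hypothetical, not the real LDLM_FL_* bits:

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only. */
#define FL_LOCAL_ONLY  0x1u	/* cancel locally, send no cancel RPC */
#define FL_BL_AST_SEEN 0x2u	/* a blocking AST arrived for this lock */

/* Flags for a lock whose enqueue failed. */
static unsigned int failed_lock_flags(unsigned int flags)
{
	/* If a BL AST was received, the server presumably expects a
	 * cancel, so the lock must not become local-only. */
	if (!(flags & FL_BL_AST_SEEN))
		flags |= FL_LOCAL_ONLY;
	return flags;
}
```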

Change-Id: I3201bc29abd877cddc334ca27a9d208cb55c5d8f
HPE-bug-id: LUS-8493, LUS-8830
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Vladimir Saveliev <c17830@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-on: https://review.whamcloud.com/40046
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13984 ptlrpc: throttle RPC resend if network error 20/40020/9
Aurelien Degremont [Wed, 23 Sep 2020 19:20:08 +0000 (19:20 +0000)]
LU-13984 ptlrpc: throttle RPC resend if network error

When sending a callback AST to a non-responding client, the server
retries endlessly until the client is eventually evicted. When using
ksocklnd, it retries after each AST timeout until the socket is
eventually closed after sock_timeout seconds; from then on each retry
fails immediately with -110 (-ETIMEDOUT), as no socket can be
established.

The thread spins retrying and failing until the client is eventually
evicted. This causes high thread CPU usage and possible resource
denial.

To work around that, this patch avoids retrying a callback resend if:
 - the request is flagged with a network error and a timeout
 - the last try was less than 1 sec ago

In the worst case, the retry happens after a timeout based on
req->rq_deadline. If there is nothing else to handle, the thread
sleeps during that time, removing the CPU overhead.
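
The resend condition above can be sketched as a predicate over a
simplified request. The struct and field names are hypothetical and
do not correspond to struct ptlrpc_request:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified view of a callback request. */
struct cb_request {
	bool   net_err;		/* last send failed with a network error */
	bool   timed_out;	/* last send timed out */
	time_t last_try;	/* time of the last send attempt, seconds */
};

/* True if the resend should be skipped for now, letting the thread
 * sleep until the request deadline instead of spinning. */
static bool defer_resend(const struct cb_request *req, time_t now)
{
	return req->net_err && req->timed_out && now - req->last_try < 1;
}
```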

Signed-off-by: Aurelien Degremont <degremoa@amazon.com>
Change-Id: Ie5028761c978b26e833fd0a5d30d313addf57984
Reviewed-on: https://review.whamcloud.com/40020
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13961 kernel: kernel update RHEL8.2 [4.18.0-193.19.1.el8_2] 87/39987/4
Jian Yu [Tue, 27 Oct 2020 22:52:32 +0000 (15:52 -0700)]
LU-13961 kernel: kernel update RHEL8.2 [4.18.0-193.19.1.el8_2]

Update RHEL8.2 kernel to 4.18.0-193.19.1.el8_2.

Test-Parameters: trivial \
clientdistro=el8.2 serverdistro=el8.2 \
testlist=sanity

Change-Id: I32d65790adcd5829cdc4447e9b116a83bf1efd63
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/39987
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14049 utils: manage thread resources in alr_batch_print() 09/40309/5
John L. Hammond [Mon, 14 Sep 2020 17:58:31 +0000 (12:58 -0500)]
LU-14049 utils: manage thread resources in alr_batch_print()

In alr_batch_print(), create the sort and print thread with the
detached attribute. Protect against concurrent writes to the batch
output file. Ensure that memory is freed in error cases.

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: Ibf1b299bd15f5d189a2302ce476bf2ef986a85b1
Reviewed-on: https://review.whamcloud.com/40309
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13665 tests: skip sanity subtests for new features 90/39890/10
Andreas Dilger [Fri, 11 Sep 2020 23:01:57 +0000 (17:01 -0600)]
LU-13665 tests: skip sanity subtests for new features

Skip sanity.sh test_165 (OAL) and part of test_56oc (btime) during
interop testing for features that were added recently.

Skip test_56oc timestamp parsing test to avoid timezone issues in
test environment.

Fixes: 3f7853b31ef6 ("LU-10934 llite: integrate statx() API with Lustre")
Fixes: 66172e3274ca ("LU-13238 ofd: add OFD access logs")
Test-Parameters: trivial
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ib09b60dccb563fcedadd1da55eea11ddca6ecde5
Reviewed-on: https://review.whamcloud.com/39890
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14031 ptlrpc: decrease time between reconnection 44/40244/4
Alexander Boyko [Wed, 14 Oct 2020 08:20:58 +0000 (04:20 -0400)]
LU-14031 ptlrpc: decrease time between reconnection

When a connection gets a timeout or an error reply from a server,
the next attempt happens after PING_INTERVAL, which is equal to
obd_timeout/4. When the first reconnection fails, the second goes to
the failover pair, and the third goes back to the original server.
That leaves only 3 reconnection attempts before the server evicts the
client based on the blocking AST timeout. Sometimes the first attempt
fails and the last is a bit late, so the client is evicted. It is
better to retry the connection with a timeout equal to the connection
request deadline; for a large obd_timeout this would increase the
number of attempts fivefold. For example,
    obd_timeout=200
     - [ 1597902357, CONNECTING ]
     - [ 1597902357, FULL ]
     - [ 1597902422, DISCONN ]
     - [ 1597902422, CONNECTING ]
     - [ 1597902433, DISCONN ]
     - [ 1597902473, CONNECTING ]
     - [ 1597902473, DISCONN ] <- ENODEV from a failover pair
     - [ 1597902523, CONNECTING ]
     - [ 1597902539, DISCONN ]

The patch adds logic to wake up the pinger for a failed connection
request with ETIMEDOUT or ENODEV. It adds imp_next_ping processing to
the time_to_next_wake calculation in ptlrpc_pinger_main(), and fixes
the setting of the imp_next_ping value.
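
One way to picture the time_to_next_wake change: the pinger sleeps
until the earliest per-import next-ping time rather than a full ping
interval. A sketch with hypothetical names, not ptlrpc_pinger_main();
the interval obd_timeout/4 is taken from the description above (50 s
for obd_timeout=200):

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Seconds the pinger may sleep: the time until the earliest next_ping
 * among all imports, clamped to [0, ping_interval].
 */
static time_t time_to_next_wake(const time_t *next_ping, size_t n,
				time_t now, time_t ping_interval)
{
	time_t wait = ping_interval;

	for (size_t i = 0; i < n; i++) {
		time_t delta = next_ping[i] - now;

		if (delta < 0)
			delta = 0;	/* already overdue: wake now */
		if (delta < wait)
			wait = delta;
	}
	return wait;
}
```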

HPE-bug-id: LUS-8520
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: Ia0891a8ead1922810037f7d71092cd57c061dab9
Reviewed-on: https://review.whamcloud.com/40244
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Reviewed-by: Vitaly Fertman <vitaly.fertman@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13645 ldlm: extra checks for DOM locks 78/39878/6
Vitaly Fertman [Wed, 2 Sep 2020 17:14:06 +0000 (20:14 +0300)]
LU-13645 ldlm: extra checks for DOM locks

A couple of checks are added:
- only a DOM lock can be a group lock;
- the DOM bit must be the only mandatory one, or optional.
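
One reading of these two checks, as a bitmask sketch. The bit values
are hypothetical, and since DOM as an optional bit is always accepted,
only the mandatory bits are examined:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inodebit values, for illustration only. */
#define IB_LOOKUP 0x01u
#define IB_UPDATE 0x02u
#define IB_DOM    0x40u

/* bits: mandatory inodebits of the request; group: group lock? */
static bool dom_lock_request_ok(uint64_t bits, bool group)
{
	/* If DOM is mandatory, it must be the only mandatory bit. */
	if ((bits & IB_DOM) && bits != IB_DOM)
		return false;
	/* Only a DOM lock can be a group lock. */
	if (group && bits != IB_DOM)
		return false;
	return true;
}
```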

Signed-off-by: Vitaly Fertman <c17818@cray.com>
HPE-bug-id: LUS-8987
Change-Id: Iaf7a14a66eb0f125d2f6f7d06f5de0add387e101
Reviewed-on: https://review.whamcloud.com/39878
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13645 ldlm: group locks for DOM IBIT lock 06/39406/7
Vitaly Fertman [Tue, 4 Aug 2020 12:12:04 +0000 (15:12 +0300)]
LU-13645 ldlm: group locks for DOM IBIT lock

A group lock is supposed to be taken for operations such as layout
swap, used e.g. for HSM, and it should be taken for DOM locks as well.

HPE-bug-id: LUS-8987
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Change-Id: I97888e1aee853d7fe04548681b2ed6805cb494ae
Reviewed-on: https://review.whamcloud.com/39406
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13666 lod: update .do_index_ops on layout detach 26/39226/2
Lai Siyao [Wed, 1 Jul 2020 14:03:19 +0000 (22:03 +0800)]
LU-13666 lod: update .do_index_ops on layout detach

Directory migration detaches stripes from the source, and then
attaches them to the target if the source is a striped directory.
This converts the source from a striped directory to a plain
directory, so .do_index_ops needs to be updated from
lod_striped_index_ops to lod_index_ops to avoid triggering an
assertion in the index ops.

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: Ia8f66a8a3fd5e96f0dba4d60eb2443107d320418
Reviewed-on: https://review.whamcloud.com/39226
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-11631 mdd: migrate symlink for cross-MDT rename 97/39897/3
Lai Siyao [Wed, 9 Sep 2020 21:50:11 +0000 (05:50 +0800)]
LU-11631 mdd: migrate symlink for cross-MDT rename

If a symlink rename crosses MDTs, it is inefficient to turn the
symlink into a remote object; instead, migrate it to the target
MDT. The following changes are made:
* change the migration code to allow the source and target to have
  different names.
* if a symlink is renamed to another MDT, has no other hard links,
  and the target doesn't exist, migrate it to the target MDT.
* remove mdd_rename_order(), which is obsolete.

Add sanity 24G.
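
The conditions in the second bullet can be written as one predicate.
A sketch with a hypothetical struct, not the mdd code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the rename being processed. */
struct rename_info {
	bool     is_symlink;
	bool     cross_mdt;	/* source and target parents on different MDTs */
	unsigned nlink;		/* hard links on the source object */
	bool     target_exists;
};

/* Migrate the symlink itself instead of making it a remote object? */
static bool should_migrate_symlink(const struct rename_info *r)
{
	return r->is_symlink && r->cross_mdt && r->nlink == 1 &&
	       !r->target_exists;
}
```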

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: Ib5fafe3122172ac582bbcc907c72a9f391baf0e1
Reviewed-on: https://review.whamcloud.com/39897
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
LU-13692 ldlm: Ensure we reprocess the resource on ast error 98/39598/8
Oleg Drokin [Fri, 7 Aug 2020 07:38:51 +0000 (03:38 -0400)]
LU-13692 ldlm: Ensure we reprocess the resource on ast error

When we are trying to grant a lock and hit an AST error, rerunning
the policy is pointless, since it cannot grant a potentially
now-eligible lock and our lock is already in all the queues. Instead,
behave like all the other handlers on an ERESTART return and run a
full resource reprocess.

Change-Id: I3edb37bf084b2e26ba03cf2079d3358779c84b6e
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/39598
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
LU-14069 ldlm: Fix unbounded OBD_FAIL_LDLM_CANCEL_BL_CB_RACE wait 75/40375/3
Oleg Drokin [Fri, 23 Oct 2020 06:56:04 +0000 (02:56 -0400)]
LU-14069 ldlm: Fix unbounded OBD_FAIL_LDLM_CANCEL_BL_CB_RACE wait

In ldlm_handle_cp_callback(), the while loop is clearly supposed
to be limited by the "to" value of 1 second, but it is not.
This seems to have been broken by the Solaris porting work in HEAD
all the way back in 2008.
Restore the "to" assignment so that it does not hang indefinitely.
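
The same idea, a wait loop with an explicit time bound, as a
standalone user-space sketch. This is not the ldlm code, which runs in
the kernel; the flag-based interface is an assumption for illustration:

```c
#define _POSIX_C_SOURCE 200809L	/* clock_gettime(), CLOCK_MONOTONIC */
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Seconds elapsed since *start on the monotonic clock. */
static double elapsed_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (double)(now.tv_sec - start->tv_sec) +
	       (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/* Wait for *done to become true, giving up after timeout_sec.
 * Returns true if the condition was met, false on timeout. */
static bool wait_bounded(const volatile bool *done, double timeout_sec)
{
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (!*done) {
		/* The bound the broken loop lost: stop once time is up. */
		if (elapsed_since(&start) >= timeout_sec)
			return false;
		/* Kernel code would sleep briefly here instead of spinning. */
	}
	return true;
}
```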

Change-Id: I449bfd7f585ab7db475fb3fd4cbbd876126ff789
Fixes: adde80ff ("Land b_head_libcfs onto HEAD")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40375
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
LU-1742 o2iblnd: 'Timed out tx' error message 22/3622/4
Brian Behlendorf [Mon, 13 Aug 2012 23:58:20 +0000 (16:58 -0700)]
LU-1742 o2iblnd: 'Timed out tx' error message

Trivial fix to report the total RDMA time outstanding rather
than the number of seconds past the deadline.

Change-Id: I0ef9b7b9b31a4d27adf4f3f33da46c503e5ca49e
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-on: https://review.whamcloud.com/3622
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14010 build: Ensure dkms installs all Lustre modules 25/40225/2
Amin Tootoonchian [Mon, 12 Oct 2020 21:07:50 +0000 (16:07 -0500)]
LU-14010 build: Ensure dkms installs all Lustre modules

Add --force to dkms install in:
  debian/lustre-client-modules-dkms.postinst

Without it, modules older than the already available ones are skipped.

Signed-off-by: Amin Tootoonchian <amint@openai.com>
Change-Id: I1d549e7d48d60294810e11ed2588a512f1527eda
Reviewed-on: https://review.whamcloud.com/40225
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13437 llite: pass name in getattr by FID 19/40219/6
Lai Siyao [Mon, 12 Oct 2020 14:22:07 +0000 (22:22 +0800)]
LU-13437 llite: pass name in getattr by FID

Now that the parent FID is packed in the getattr_by_FID request
(see https://review.whamcloud.com/39290), llite should also pass in
the name so that lmv can replace fid1 with the stripe FID; otherwise
the MDS may treat sub files under a striped directory as remote
objects.

Note that the name is not packed in the request, because if it were,
the MDS would getattr by name instead of by FID.

Fixes: 5f2c44bf6 ("LU-13437 llite: pack parent FID in getattr")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: If8215667bcb10ea3c4c5cd2c9034d81fd1cda3b5
Reviewed-on: https://review.whamcloud.com/40219
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13437 mdc: remote object support getattr from cache 18/40218/4
Lai Siyao [Sat, 10 Oct 2020 14:34:19 +0000 (22:34 +0800)]
LU-13437 mdc: remote object support getattr from cache

For historical reasons, IT_GETATTR lock revalidation matches
LOOKUP|UPDATE|PERM lock bits, because for MDS < 2.4 permission is
protected by the LOOKUP lock. However, this prevents a remote object
from matching the cached lock, because its LOOKUP and UPDATE locks
are fetched separately.

Add sanity 803b, and rename 803 to 803a.
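
Why the historical three-bit requirement fails for a remote object can
be seen with plain bit arithmetic: if LOOKUP sits in one cached lock
and UPDATE in another, no single lock holds all the wanted bits. A
sketch with hypothetical bit values; the split of PERM into the second
lock is an assumption for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inodebit values, for illustration only. */
#define IB_LOOKUP 0x01u
#define IB_UPDATE 0x02u
#define IB_PERM   0x10u

/* A cached lock can be reused only if it holds every wanted bit. */
static bool lock_covers(uint64_t cached_bits, uint64_t wanted_bits)
{
	return (cached_bits & wanted_bits) == wanted_bits;
}
```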

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: I3ac38fe34472736849307bb7f1eebb5de9343a5c
Reviewed-on: https://review.whamcloud.com/40218
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14016 libcfs: use atomic64_t for libcfs_kmem 68/40168/5
Amir Shehata [Wed, 7 Oct 2020 21:27:14 +0000 (14:27 -0700)]
LU-14016 libcfs: use atomic64_t for libcfs_kmem

libcfs_kmem keeps track of LNet's memory usage. It uses an
int type, so it could wrap around if usage grows beyond 2 GiB
(INT_MAX bytes). Use atomic64_t to avoid this issue.
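
A user-space analogue, where C11 `_Atomic int64_t` plays the role of
the kernel's atomic64_t. A minimal sketch with hypothetical helper
names (not the libcfs code) showing a counter that stays correct past
INT_MAX:

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdint.h>

/* 64-bit atomic byte counter; a 32-bit int would overflow past INT_MAX. */
static _Atomic int64_t kmem_bytes;

static void kmem_account_alloc(int64_t n) { atomic_fetch_add(&kmem_bytes, n); }
static void kmem_account_free(int64_t n)  { atomic_fetch_sub(&kmem_bytes, n); }
static int64_t kmem_bytes_in_use(void)    { return atomic_load(&kmem_bytes); }
```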

Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Change-Id: If96fb8391c6ffb1924e47cef3dfca02eabc5f912
Reviewed-on: https://review.whamcloud.com/40168
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13498 tests: remove tests from ALWAYS_EXCEPT with SSK 61/40161/8
Sebastien Buisson [Wed, 7 Oct 2020 06:36:49 +0000 (08:36 +0200)]
LU-13498 tests: remove tests from ALWAYS_EXCEPT with SSK

A number of tests had previously been added to ALWAYS_EXCEPT
when SHARED_KEY was in use.
These tests are now passing with SSK, so remove them from the
exception list.

Test-Parameters: trivial
Test-Parameters: env=SHARED_KEY=true mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,replay-dual,replay-single,sanity-hsm,conf-sanity
Test-Parameters: env=SHARED_KEY=true mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,replay-dual,replay-single,sanity-hsm,conf-sanity
Test-Parameters: env=SHARED_KEY=true mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,replay-dual,replay-single,sanity-hsm,conf-sanity
Test-Parameters: env=SHARED_KEY=true mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,replay-dual,replay-single,sanity-hsm,conf-sanity
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: If72b212a23b915afdb723acf7254908e1c043e07
Reviewed-on: https://review.whamcloud.com/40161
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
LU-13824 test: test sanity 230q with fewer files on ZFS 28/39528/4
Lai Siyao [Tue, 28 Jul 2020 13:44:09 +0000 (21:44 +0800)]
LU-13824 test: test sanity 230q with fewer files on ZFS

sanity test_230q may time out on the ZFS backend, so test it with fewer files.

Test-Parameters: trivial fstype=zfs testlist=sanity mdscount=2 mdtcount=4 env=ONLY=230q,ONLY_REPEAT=100
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: Iaf9e4e6d68244937819305af72df33e59df19f1f
Reviewed-on: https://review.whamcloud.com/39528
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-14031 ptlrpc: remove unused code at pinger 43/40243/3
Alexander Boyko [Wed, 14 Oct 2020 07:45:21 +0000 (03:45 -0400)]
LU-14031 ptlrpc: remove unused code at pinger

The timeout_list was previously used for grant shrinking,
but is now dead code.

HPE-bug-id: LUS-8520
Fixes: fc915a43786e ("LU-8708 osc: depart grant shrinking from pinger")
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: Ia7a77b4ac19da768ebe1b0879d7123941f4490b5
Reviewed-on: https://review.whamcloud.com/40243
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Aurelien Degremont <degremoa@amazon.com>
Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13740 tests: improve sanity-sec test_45 46/40146/2
Sebastien Buisson [Tue, 6 Oct 2020 09:43:19 +0000 (11:43 +0200)]
LU-13740 tests: improve sanity-sec test_45

Improve sanity-sec test_45 by referencing the entire mmap-ed region
with multiop.
Also make sure the encryption tests pass on the newly supported
Ubuntu 20.04 distro.

Test-Parameters: trivial
Test-Parameters: testlist=sanity-sec envdefinitions=ONLY="36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53 54" clientdistro=ubuntu2004 fstype=ldiskfs mdscount=2 mdtcount=4
Test-Parameters: testlist=sanity-sec envdefinitions=ONLY="36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53 54" clientdistro=ubuntu2004 fstype=zfs mdscount=2 mdtcount=4
Test-Parameters: testlist=sanity-sec envdefinitions=ONLY="36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53 54" clientdistro=el8.1 fstype=ldiskfs mdscount=2 mdtcount=4
Test-Parameters: testlist=sanity-sec envdefinitions=ONLY="36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53 54" clientdistro=el8.1 fstype=zfs mdscount=2 mdtcount=4
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: I983b6bf94d9f51486fd6b688267af46ed4188a98
Reviewed-on: https://review.whamcloud.com/40146
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
3 years agoLU-13765 osd-ldiskfs: Extend credit correctly for fallocate 42/39342/18
Arshad Hussain [Wed, 9 Sep 2020 23:18:13 +0000 (04:48 +0530)]
LU-13765 osd-ldiskfs: Extend credit correctly for fallocate

In the OSD layer, before calling ->fallocate(), Lustre has already
created a journal handle for the fallocate transaction. In
ldiskfs/ext4, a very large fallocate range may be split into
multiple transactions, calling journal start/stop multiple times
inside fallocate. However, a nested journal start ignores the
requested credits, which results in running out of credits
at the end.

As we cannot predict the total number of credits needed in
advance, especially for a large fallocate, this patch moves the
fallocate logic into the Lustre OSD so that it can reserve
credits correctly. It extends the credits of the current
transaction when the remaining buffer credits are fewer than
needed, and restarts the transaction when required.
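The per-chunk credit check described above can be sketched as a toy model. This is a minimal illustration under stated assumptions: struct demo_handle, demo_extend(), demo_restart() and demo_fallocate() are hypothetical stand-ins, not the jbd2/ldiskfs API, and the per-transaction cap is an invented parameter.

```c
#include <assert.h>

/* Hypothetical toy model of a journal handle; not the jbd2/ldiskfs API. */
struct demo_handle {
	int reserved;	/* credits reserved by the running transaction */
	int used;	/* credits consumed so far */
	int restarts;	/* number of transaction restarts */
};

/* Grow the reservation unless it would exceed the per-transaction cap. */
static int demo_extend(struct demo_handle *h, int nblocks, int max_credits)
{
	if (h->reserved + nblocks > max_credits)
		return -1;
	h->reserved += nblocks;
	return 0;
}

/* Commit the running transaction and start a new one with nblocks credits. */
static void demo_restart(struct demo_handle *h, int nblocks)
{
	h->restarts++;
	h->reserved = nblocks;
	h->used = 0;
}

/* Allocate nchunks extents, checking credits before each chunk instead
 * of trusting a single up-front reservation. */
static void demo_fallocate(struct demo_handle *h, int nchunks,
			   int per_chunk, int max_credits)
{
	int i;

	assert(per_chunk <= max_credits);
	for (i = 0; i < nchunks; i++) {
		int left = h->reserved - h->used;

		/* try to extend first; restart only if extending fails */
		if (left < per_chunk &&
		    demo_extend(h, per_chunk - left, max_credits) != 0)
			demo_restart(h, per_chunk);
		h->used += per_chunk;
		assert(h->used <= h->reserved);
	}
}
```

Starting from 4 reserved credits with a cap of 8 and 2 credits per chunk, ten chunks fit only with two restarts; the model never overruns its reservation.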

Test cases sanity/150e and sanity-quota/1h are added to verify the
fix.

Test-Parameters: trivial testlist=sanity ostsizegb=12 env=ONLY="150e"
Test-Parameters: testlist=sanity-quota
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: Qian Yingjin <qian@ddn.com>
Change-Id: Ib7565ed2c1ae72eef4832fbcb710e0ee70c53aec
Reviewed-on: https://review.whamcloud.com/39342
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13719 lov: doesn't check lov_refcount 02/39702/2
Hongchao Zhang [Fri, 21 Aug 2020 10:17:12 +0000 (18:17 +0800)]
LU-13719 lov: doesn't check lov_refcount

In lov_cleanup(), the check of each OSC is protected by
lov_tgt_getrefs(), which increments "lov_refcount", so
"lov_refcount" should not be checked inside, because it is
always greater than 0 there.

Change-Id: I21423d4345190b3e02eb00734c127e35cbc9b1af
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/39702
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-12214 build: fix suse require krb5 14/40214/6
Minh Diep [Sun, 11 Oct 2020 22:50:46 +0000 (15:50 -0700)]
LU-12214 build: fix suse require krb5

Test-Parameters: trivial clientdistro=sles15sp1

Change-Id: If5bbe77bda84381b363c733f763cfc81e29aedb7
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40214
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
3 years agoLU-13745 tests: skip sanity test_426 for 4.15+ 66/40366/2
John L. Hammond [Thu, 22 Oct 2020 22:09:03 +0000 (17:09 -0500)]
LU-13745 tests: skip sanity test_426 for 4.15+

Add sanity test_426 to ALWAYS_EXCEPT for newer client kernels because
it has crashed on every run since "LU-13745 test: add splice test for
lustre" was landed.

Test-Parameters: trivial clientdistro=ubuntu1804
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: I84a722a27c3e8a572c20b46ca9daaf44e8720b54
Reviewed-on: https://review.whamcloud.com/40366
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
3 years agoLU-14067 test: skip compile tests on aarch64 65/40365/3
John L. Hammond [Thu, 22 Oct 2020 21:57:47 +0000 (16:57 -0500)]
LU-14067 test: skip compile tests on aarch64

aarch64 gcc segfaults when compiling our headers, so skip sanity
400a, 400b and sanity-lnet 300 on aarch64.

Test-Parameters: trivial clientdistro=el8.1 clientarch=aarch64
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: I8322107919084c86a0cb6fc15730a49f96c03b22
Reviewed-on: https://review.whamcloud.com/40365
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
3 years agoLU-13636 osd: create agent inode with explicit owner 42/38842/5
Alex Zhuravlev [Fri, 5 Jun 2020 05:16:32 +0000 (08:16 +0300)]
LU-13636 osd: create agent inode with explicit owner

to avoid quota misaccounting.

Test-Parameters: fstype=ldiskfs
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I5a02e6e7de71821a10704ac3516ee087998c9c21
Reviewed-on: https://review.whamcloud.com/38842
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13498 gss: update sequence in case of target disconnect 22/40122/5
Sebastien Buisson [Fri, 2 Oct 2020 12:05:55 +0000 (21:05 +0900)]
LU-13498 gss: update sequence in case of target disconnect

Client to OST connections can go idle, leading to target disconnect.
In this event, maintaining the correct sequence number ensures that GSS
does not erroneously consider requests to be replays.
The sequence is normally updated on export destroy, but this can occur
too late, i.e. after a new target connect request has been processed.
So explicitly update the sec context at disconnect time.

Test-Parameters: env=SHARED_KEY=true,SK_FLAVOR=skn mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,sanity-sec
Test-Parameters: env=SHARED_KEY=true,SK_FLAVOR=ska mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,sanity-sec
Test-Parameters: env=SHARED_KEY=true,SK_FLAVOR=ski mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,sanity-sec
Test-Parameters: env=SHARED_KEY=true,SK_FLAVOR=skpi mdscount=2 mdtcount=4 osscount=1 ostcount=8 clientcount=2 testlist=sanity,recovery-small,sanity-sec
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: I65c27e1ab459b2a29670580121ef6e1a00f18918
Reviewed-on: https://review.whamcloud.com/40122
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13745 tests: skip sanity test_426 for 4.18+ 26/40326/3
Andreas Dilger [Wed, 21 Oct 2020 02:42:29 +0000 (20:42 -0600)]
LU-13745 tests: skip sanity test_426 for 4.18+

Add sanity test_426 to ALWAYS_EXCEPT for newer client kernels because
it has crashed on every run since "LU-13745 test: add splice test for
lustre" was landed.

 Unable to handle NULL pointer dereference at address 0000000000000004
 user pgtable: 64k pages, 48-bit VAs, pgdp = 000000009f14b2d0
 Internal error: Oops: 96000005 [#1] SMP
 CPU: 1 PID: 11273 Comm: ptlrpcd_01_01 4.18.0-147.8.1.el8_1.aarch64 #1
 Process ptlrpcd_01_01 (pid: 11273, stack limit = 0x00000000f9135a93)
 Call trace:
  mempool_free+0x24/0xe0
  llcrypt_free_bounce_page.part.1+0x38/0x48 [libcfs]
  llcrypt_free_bounce_page+0x24/0x30 [libcfs]
  brw_interpret+0x124/0x10c8 [osc]
  ptlrpc_check_set+0x688/0x3318 [ptlrpc]
  ptlrpcd_check+0x470/0x820 [ptlrpc]
  ptlrpcd+0x3d4/0x5c8 [ptlrpc]
  kthread+0x130/0x138

Test-Parameters: trivial clientdistro=el8.1 clientarch=aarch64
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I8f7b1d5e3ee69a3e0a6dfe3944949741a74cb62a
Reviewed-on: https://review.whamcloud.com/40326
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Faccini Bruno <bruno.faccini@intel.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13699 mdt: Improve message reporting for mdt_identity.c 03/39203/5
Arshad Hussain [Sat, 27 Jun 2020 11:27:06 +0000 (16:57 +0530)]
LU-13699 mdt: Improve message reporting for mdt_identity.c

This patch improves error handling and message reporting
for the file lustre/mdt/mdt_identity.c.

It also replaces ERR_PTR(PTR_ERR()) with ERR_CAST(),
as reported by coccinelle.
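To show why ERR_CAST() is preferred, here is a userspace sketch of the kernel's error-pointer idiom. It is an assumption-laden re-implementation for illustration only (the real helpers live in the kernel's <linux/err.h>), and demo_lookup()/demo_get_identity() are hypothetical, not code from mdt_identity.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace re-implementation of the kernel error-pointer helpers,
 * for illustration only. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* the top MAX_ERRNO addresses encode negative errno values */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
/* Re-type an error pointer directly, without decoding and re-encoding. */
static inline void *ERR_CAST(const void *ptr) { return (void *)ptr; }

struct demo_entry { int key; };		/* hypothetical */
struct demo_identity { int uid; };	/* hypothetical */

static struct demo_entry *demo_lookup(int fail)
{
	static struct demo_entry e = { 42 };

	if (fail)
		return ERR_PTR(-ENOENT);
	return &e;
}

static struct demo_identity *demo_get_identity(int fail)
{
	static struct demo_identity id;
	struct demo_entry *e = demo_lookup(fail);

	if (IS_ERR(e))
		return ERR_CAST(e);	/* previously: ERR_PTR(PTR_ERR(e)) */
	id.uid = e->key;
	return &id;
}
```

Both spellings yield the same pointer; ERR_CAST() simply states the intent (propagating an error under a different pointer type) in one step.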

Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Change-Id: I59e262e743e7b1926fe75920239d6086c183b30f
Reviewed-on: https://review.whamcloud.com/39203
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Tested-by: Maloo <maloo@whamcloud.com>
3 years agoLU-12542 osd: consolidate RCU handling 04/40204/2
Andreas Dilger [Thu, 3 Sep 2020 22:37:47 +0000 (16:37 -0600)]
LU-12542 osd: consolidate RCU handling

Consolidate lu_object_header_fini() and kfree_rcu() into a
single lu_object_header_free() function so that callers
which do not need a container for lu_object_header can
avoid duplicating common RCU and OBD_FREE handling code.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I0fa68f11b5008ede5a498d38b69ccaeecf3ebbe5
Reviewed-on: https://review.whamcloud.com/40204
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Ben Evans <beevans@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-14020 ldiskfs: SUSE 15 SP2 and mainline 5.4.21 and newer 79/40179/2
Shaun Tancheff [Thu, 8 Oct 2020 20:09:51 +0000 (15:09 -0500)]
LU-14020 ldiskfs: SUSE 15 SP2 and mainline 5.4.21 and newer

The updated pdirop patch for the Ubuntu 5.4.0-42 kernel also aligns
with Linux LTS 5.4.21 through the current 5.4.60.

In addition, this series is suitable for the current
SUSE 15 SP2 kernel, tested with 5.3.18-24.15.

Test-Parameters: trivial
HPE-bug-id: LUS-9092
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: I75ee53c4096b31c3a2d13748401ed1a12306215b
Reviewed-on: https://review.whamcloud.com/40179
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-14012 lod: properly initialize lcm in lod_layout_convert() 53/40153/2
John L. Hammond [Tue, 6 Oct 2020 19:14:29 +0000 (14:14 -0500)]
LU-14012 lod: properly initialize lcm in lod_layout_convert()

In lod_layout_convert() zero out lcm and lcme before constructing the
converted layout.

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: I40f96d51cb63816a9bfc34217f02ff7c450de974
Reviewed-on: https://review.whamcloud.com/40153
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
3 years agoLU-14010 build: Add missing deps for dkms client 38/40138/3
Amin Tootoonchian [Mon, 5 Oct 2020 08:42:54 +0000 (03:42 -0500)]
LU-14010 build: Add missing deps for dkms client

dkms installs also need libyaml-dev and zlib1g to build and
install successfully.

Test-Parameters: trivial
HPE-bug-id: LUS-9369
Signed-off-by: Amin Tootoonchian <amint@openai.com>
Change-Id: Idbd79295590987b7dace07b4da3ee731d0714a8d
Reviewed-on: https://review.whamcloud.com/40138
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-10753 osd-zfs: initialize obj attr correctly 62/40062/4
Lai Siyao [Thu, 24 Sep 2020 16:05:15 +0000 (00:05 +0800)]
LU-10753 osd-zfs: initialize obj attr correctly

mdt_thread_info.mti_attr is used to initialize the object attributes
in create, and it is currently copied to object.oo_attr directly.
However, some fields in mti_attr may contain bogus data because it is
not cleared on each use. Although la_valid is set correctly, la_flags
is used without checking la_valid in __mdd_permission_internal().

Another minor fix in osd_create(): set size/nlink to zero since they
are set in valid.

Test-Parameters: testlist=sanity env=ONLY=300,ONLY_REPEAT=100

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: I64816b66a0b3c7aa50e62680d5251141697a8e0f
Reviewed-on: https://review.whamcloud.com/40062
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13983 llite: rmdir releases inode on client 11/40011/4
Lai Siyao [Tue, 22 Sep 2020 11:05:51 +0000 (19:05 +0800)]
LU-13983 llite: rmdir releases inode on client

As with file unlink, rmdir should release the inode on the client. To
achieve this, ll_rmdir() updates the inode's i_nlink after rmdir, so
the last iput() will release the inode.

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: I9181151de1830b48986afec30c83120e9f112a85
Reviewed-on: https://review.whamcloud.com/40011
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13975 sec: require enc key in case of O_CREAT only 83/39983/6
Sebastien Buisson [Mon, 21 Sep 2020 12:45:49 +0000 (12:45 +0000)]
LU-13975 sec: require enc key in case of O_CREAT only

In ll_atomic_open(), do not return -ENOKEY when trying to open
either a directory or a file without the encryption key, unless
the O_CREAT flag is specified.
Indeed, listing directory content is allowed even without the key,
and for a regular file, ll_file_open() already checks for the
presence of an encryption key.

Improve sanity-sec test_54 to verify this is working properly.

Test-Parameters: testlist=sanity-sec envdefinitions=ONLY="36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53 54" clientdistro=el8.1 fstype=ldiskfs mdscount=2 mdtcount=4
Test-Parameters: testlist=sanity-sec envdefinitions=ONLY="36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52 53 54" clientdistro=el8.1 fstype=zfs mdscount=2 mdtcount=4
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: I813d4ec938e00c8463b1d3ee9766d180806b40ba
Reviewed-on: https://review.whamcloud.com/39983
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13972 o2iblnd: Don't retry indefinitely 81/39981/6
Amir Shehata [Sat, 19 Sep 2020 08:38:07 +0000 (01:38 -0700)]
LU-13972 o2iblnd: Don't retry indefinitely

If the peer is down, don't retry indefinitely. Use the retry_count
parameter to restrict the number of retries, after which the
connection fails and the error is propagated up.

This prevents long timeouts when mounting a file system whose
configuration includes the NIDs of nodes that have been taken
offline.
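The bounded-retry behaviour can be sketched as follows. This is a hypothetical illustration, not o2iblnd code: demo_connect(), demo_offline_peer() and the callback type are invented, and whether the real retry_count includes the first attempt is not stated in the commit; here it counts only the retries.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical connect callback: returns 0 on success or a negative errno. */
typedef int (*demo_connect_fn)(void *peer);

/* Try once, then retry at most retry_count more times; on failure,
 * propagate the last error instead of looping forever. */
static int demo_connect(demo_connect_fn try_connect, void *peer,
			int retry_count)
{
	int rc = try_connect(peer);
	int tries;

	for (tries = 0; rc != 0 && tries < retry_count; tries++)
		rc = try_connect(peer);
	return rc;
}

/* Stand-in for an offline peer: counts attempts and always times out. */
static int demo_offline_peer(void *arg)
{
	int *attempts = arg;

	(*attempts)++;
	return -ETIMEDOUT;
}
```

With retry_count set to 3, an offline peer is contacted four times in total before -ETIMEDOUT is returned to the caller.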

Test-Parameters: trivial
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Change-Id: I04faf690ed13357e3ed50c2adaadee265db269c7
Reviewed-on: https://review.whamcloud.com/39981
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13946 build: OpenZFS 2.0 compatibility 22/39822/5
Brian Behlendorf [Thu, 3 Sep 2020 16:23:17 +0000 (09:23 -0700)]
LU-13946 build: OpenZFS 2.0 compatibility

* Update zfs_refcount_add() configure check.  OpenZFS 2.0 has renamed
  the sys/refcount.h header to sys/zfs_refcount.h to avoid a conflict
  with this header on FreeBSD.  Since Lustre never directly includes
  this header by name, adjust the configure check to include it
  indirectly through sys/dnode.h.  This way we don't need a separate
  check to determine the expected header name.

* Add db->db_dirty_records check.  The db->db_last_dirty field was
  replaced by a proper db->db_dirty_records list_t.  Detect the code
  change and add an osd_db_dirty_txg() helper function which returns
  the largest dirty txg for the dbuf, or zero when clean.
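The helper's contract ("largest dirty txg, or zero when clean") can be sketched with simplified types. This is an assumption-based illustration: the real code works on the OpenZFS dbuf type and its list_t, whereas here a plain singly linked list and the hypothetical names demo_dbuf, demo_dirty_record and demo_db_dirty_txg() stand in. The scan makes no assumption about record ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the OpenZFS structures. */
struct demo_dirty_record {
	uint64_t dr_txg;
	struct demo_dirty_record *dr_next;
};

struct demo_dbuf {
	struct demo_dirty_record *db_dirty_records;	/* NULL when clean */
};

/* Return the largest dirty txg for the dbuf, or 0 when it is clean. */
static uint64_t demo_db_dirty_txg(const struct demo_dbuf *db)
{
	const struct demo_dirty_record *dr;
	uint64_t txg = 0;

	for (dr = db->db_dirty_records; dr != NULL; dr = dr->dr_next)
		if (dr->dr_txg > txg)
			txg = dr->dr_txg;
	return txg;
}
```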

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Change-Id: Ifc9573ec410f50f46f2c601368639453e78b291d
Reviewed-on: https://review.whamcloud.com/39822
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
3 years agoLU-13498 sec: fix credentials with nodemap and SSK 40/40140/2
Sebastien Buisson [Mon, 5 Oct 2020 12:14:09 +0000 (21:14 +0900)]
LU-13498 sec: fix credentials with nodemap and SSK

When SSK is enabled, credentials are evaluated in new_init_ucred().
In case a nodemap entry is defined with squash UID/GID, it must
prevail over normally mapped UID/GID.

Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: I1adfd98759e5b98ec78f0477846e1820fed5d8b3
Reviewed-on: https://review.whamcloud.com/40140
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13740 build: announce Ubuntu20 support 88/40088/3
James Simmons [Tue, 29 Sep 2020 14:25:24 +0000 (10:25 -0400)]
LU-13740 build: announce Ubuntu20 support

Now that Ubuntu20 is supported for both servers and clients, let's
list it in the Changelog.

Test-Parameters: trivial
Change-Id: Id413dc5bf5f8e418f379b8b9c8efc4e2d5521311
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/40088
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Peter Jones <pjones@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Tested-by: Sebastien Buisson <sbuisson@ddn.com>
3 years agoLU-13745 test: add splice test for lustre 95/39695/6
Wang Shilong [Sun, 6 Sep 2020 21:56:56 +0000 (17:56 -0400)]
LU-13745 test: add splice test for lustre

Copied from xfstests, with adjustments for PAGE_SIZE alignment
of DIO and code style cleanup.

Test-Parameters: trivial envdefinitions=ONLY=425 testlist=sanity
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Change-Id: I28e83abb4a1181db20a2564a10f40ca208bb2756
Reviewed-on: https://review.whamcloud.com/39695
Reviewed-by: James Simmons <jsimmons@infradead.org>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
3 years agoLU-13845 utils: Quota id 0xFFFFFFFF is invalid 59/39559/5
Etienne AUJAMES [Fri, 31 Jul 2020 18:29:43 +0000 (20:29 +0200)]
LU-13845 utils: Quota id 0xFFFFFFFF is invalid

"lfs setquota" and "lfs quota" should consider as invalid quota id
value 0xFFFFFFFF (aka. (uid_t)-1)
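A quota ID check with this rule might look like the sketch below. demo_parse_quota_id() is hypothetical, not the lfs implementation; accepting hexadecimal input via strtoull() base 0 is also an assumption. (uid_t)-1 is a reserved value (chown(2), for example, treats it as "leave unchanged"), so it cannot name a real user or group.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a numeric quota ID, rejecting 0xFFFFFFFF. */
static int demo_parse_quota_id(const char *str, uint32_t *id)
{
	unsigned long long val;
	char *end;

	/* strtoull() accepts "-1" and negates it, so reject signs outright */
	if (strchr(str, '-') != NULL)
		return -EINVAL;
	errno = 0;
	val = strtoull(str, &end, 0);	/* base 0: decimal, 0x hex, 0 octal */
	if (errno != 0 || end == str || *end != '\0')
		return -EINVAL;
	/* 0xFFFFFFFF is (uid_t)-1; reject it and anything larger */
	if (val >= UINT32_MAX)
		return -EINVAL;
	*id = (uint32_t)val;
	return 0;
}
```

The largest accepted ID is therefore 4294967294 (0xFFFFFFFE).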

Fixes: 3d9900e78e ("LU-12549 utils: Check range of quota ID for lfs")
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Change-Id: Idbd5970a6f53a544c15bdf22bcf24a7aeba772a8
Reviewed-on: https://review.whamcloud.com/39559
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-12931 libcfs: skip cfs_time_seconds() indirection 45/40145/3
Andreas Dilger [Thu, 3 Sep 2020 22:45:02 +0000 (16:45 -0600)]
LU-12931 libcfs: skip cfs_time_seconds() indirection

Avoid one level of indirection when calling nsecs_to_jiffies64().

Test-Parameters: trivial
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ib5ff2aae5352bc86d75a8ae2a2a9f1b406887376
Reviewed-on: https://review.whamcloud.com/40145
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13590 kernel: RHEL 7.9 server support 60/40160/2
Jian Yu [Wed, 7 Oct 2020 01:25:11 +0000 (18:25 -0700)]
LU-13590 kernel: RHEL 7.9 server support

This patch makes changes to support new RHEL 7.9 release
for Lustre server (kernel 3.10.0-1160.2.1.el7).

Test-Parameters: trivial clientdistro=el7.9 serverdistro=el7.9

Change-Id: I7653091f2bd6a579447edb12045984d2829a8235
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40160
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13590 kernel: new kernel [RHEL 7.9 3.10.0-1160.2.1.el7] 42/40142/3
Jian Yu [Wed, 7 Oct 2020 00:09:32 +0000 (17:09 -0700)]
LU-13590 kernel: new kernel [RHEL 7.9 3.10.0-1160.2.1.el7]

This patch makes changes to support new RHEL 7.9 release
for Lustre client.

Test-Parameters: trivial clientdistro=el7.9

Change-Id: I7a2846de48a6710d6d720d6ccc3176dba4afc6bb
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40142
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
3 years agoLU-13614 ldlm: revert LU-11762 32/39532/7
Vladimir Saveliev [Tue, 28 Jul 2020 22:28:22 +0000 (01:28 +0300)]
LU-13614 ldlm: revert LU-11762

Commit fe5c801657 introduced a problem for recovery.

When the recovery timeout reaches the hard recovery timeout,
target_recovery_overseer() leaves the obd_recovery_expired flag set.
That makes check_for_next_transno() not wait until the next replay
request arrives, which leads to the assertion:
LASSERT(atomic_read(&obd->obd_req_replay_clients) == 0);

A test to illustrate the issue is added.

replay-single.sh:test_59 is added to the ALWAYS_EXCEPT list:
  it was harmlessly broken before this patch, and this patch made that
  test really fail due to that defect.

Fixes: fe5c80165 ("LU-11762 ldlm: ensure the recovery timer is armed")
HPE-bug-id: LUS-8299
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Change-Id: Ia694a519b5d73620be3014e92fd671d388550979
Reviewed-on: https://review.whamcloud.com/39532
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-14006 o2ib: raise bind cap before resolving address 27/40127/2
John L. Hammond [Fri, 2 Oct 2020 18:55:01 +0000 (13:55 -0500)]
LU-14006 o2ib: raise bind cap before resolving address

In kiblnd_resolve_addr(), ensure that the current task has
CAP_NET_BIND_SERVICE before calling rdma_resolve_addr() with a
protected source port.

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: I0552fdd64648ccb8c74667bd93852697f99f0c33
Reviewed-on: https://review.whamcloud.com/40127
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Tested-by: Amir Shehata <ashehata@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
3 years agoLU-13998 tests: sanityn test_104 incorrect MDT times shown 61/40061/3
Serguei Smirnov [Sat, 26 Sep 2020 17:23:21 +0000 (10:23 -0700)]
LU-13998 tests: sanityn test_104 incorrect MDT times shown

sanityn test_104 prints STAT times instead of MDT times.
Fix it so it prints both for comparison.

Test-Parameters: trivial testlist=sanityn
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: I4ae0c1fe704c02c3463043830783cd1d6cd46b98
Reviewed-on: https://review.whamcloud.com/40061
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
3 years agoLU-13992 llite: ASSERTION( last_oap_count > 0 ) failed 50/40050/2
Andriy Skulysh [Tue, 16 Jun 2020 09:49:07 +0000 (12:49 +0300)]
LU-13992 llite: ASSERTION( last_oap_count > 0 ) failed

Punch uses o_blocks to send the end of the region, so on error
that value can be mistaken for the real blocks count.

Update the blocks count only on success.

Change-Id: I86241c4e5723079b20401805b853d356130f58d9
HPE-bug-id: LUS-7407
Test-Parameters: fstype=zfs
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Tested-by: Elena Gryaznova <c17455@cray.com>
Reviewed-on: https://review.whamcloud.com/40050
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13986 target: fix possible liveloop in distribute_txn thd 43/40043/3
Mr NeilBrown [Fri, 25 Sep 2020 05:02:42 +0000 (15:02 +1000)]
LU-13986 target: fix possible liveloop in distribute_txn thd

A recent patch to update_trans.c changed how distribute_txn_thread()
waited for more work to do.

It previously had an explicit "wait_event()" which listed all the
conditions to wait for.  It would then recheck each condition and
possibly perform an appropriate action.

It was changed to check each condition only once (per loop).  If a
condition was true, the action would be performed and a flag set.  If
no conditions were true (indicated by the flag), it would wait;
otherwise it would loop and recheck all conditions.

One of the "if (condition) { do work }" stanzas in the loop tested a
condition that was *not* a condition that should wake up the loop.
"batchid" was not tested at all in the wait_event().  The flag
mentioned above was, however, set when that condition tested true.
This can cause the loop to spin indefinitely.

So remove the "__set_current_state(TASK_RUNNING);" so that the value
of batchid cannot stop the loop from sleeping (calling 'schedule()').
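The live-loop can be reproduced in a simplified userspace model of the "set state, check conditions, schedule()" idiom. This is a hypothetical sketch, not the update_trans.c code: the task state is a plain enum, demo_pass() stands for one loop iteration, and the batchid condition is modelled as staying true to show the spin.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model: schedule() only sleeps if nothing set the task
 * back to RUNNING during the pass. */
enum demo_state { DEMO_RUNNING, DEMO_IDLE };

struct demo_thread {
	enum demo_state state;
	int pending_work;	/* a real wake-up condition */
	bool batchid_stale;	/* checked each pass but NOT a wake-up condition */
};

/* One pass of the loop; returns true if the thread would go to sleep.
 * 'buggy' reproduces marking the task RUNNING in the batchid stanza. */
static bool demo_pass(struct demo_thread *t, bool buggy)
{
	t->state = DEMO_IDLE;			/* like set_current_state() */
	if (t->pending_work > 0) {
		t->pending_work--;
		t->state = DEMO_RUNNING;	/* there may be more to do */
	}
	if (t->batchid_stale) {
		/* ... act on batchid here ... */
		if (buggy)
			t->state = DEMO_RUNNING;	/* the line the patch removed */
	}
	return t->state == DEMO_IDLE;
}

/* Run passes until the thread sleeps; -1 means it never slept. */
static int demo_run(struct demo_thread *t, bool buggy, int max_passes)
{
	int n;

	for (n = 1; n <= max_passes; n++)
		if (demo_pass(t, buggy))
			return n;
	return -1;
}
```

With two units of pending work the fixed loop sleeps on its third pass, while the buggy variant never sleeps as long as the batchid condition stays true.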

Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: I124ee3e8250dc63fa927f72dc4d29ed3e7b53005
Reviewed-on: https://review.whamcloud.com/40043
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13701 utils: fix usage of getopt_long in portals 36/40036/4
Serguei Smirnov [Thu, 24 Sep 2020 23:49:05 +0000 (16:49 -0700)]
LU-13701 utils: fix usage of getopt_long in portals

Because char is unsigned by default on arm, using this type to store
the int return value of getopt_long() causes parsing issues on that
platform, as seen with lctl net_drop_add options. Fix it.

Test-Parameters: trivial testgroup=review-ldiskfs-arm
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: Ibef6ba347a90d9ed3cd4bd9346ed9f79f0120b87
Reviewed-on: https://review.whamcloud.com/40036
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13985 lustre: seq_file .next functions must update *pos 35/40035/3
Mr NeilBrown [Thu, 24 Sep 2020 23:46:24 +0000 (09:46 +1000)]
LU-13985 lustre: seq_file .next functions must update *pos

A seq_file ->next function must update *pos on EOF to a value which
will cause a subsequent ->start to also return EOF.
If it doesn't, the last record of the file can be returned
twice to a read(), and the seq_file code will also generate
a warning.

This patch fixes various ->next functions to always update
*pos.
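The ->start/->next contract can be illustrated with a simplified userspace model. This is not kernel code: demo_start(), demo_next(), demo_buggy_next() and demo_read() are hypothetical stand-ins, and demo_read() models only the resume-from-*pos behaviour of successive read() calls, not the real seq_read() buffering.

```c
#include <assert.h>
#include <stddef.h>

static const char *demo_records[] = { "a", "b", "c" };
#define DEMO_NREC ((long long)(sizeof(demo_records) / sizeof(demo_records[0])))

typedef const char *(*demo_next_fn)(long long *pos);

/* ->start: return the record at *pos, or NULL at EOF. */
static const char *demo_start(long long *pos)
{
	return (*pos >= 0 && *pos < DEMO_NREC) ? demo_records[*pos] : NULL;
}

/* Correct ->next: always advance *pos, even when it then returns NULL,
 * so a later ->start(*pos) also reports EOF. */
static const char *demo_next(long long *pos)
{
	++*pos;
	return demo_start(pos);
}

/* Buggy ->next: leaves *pos on the last record at EOF. */
static const char *demo_buggy_next(long long *pos)
{
	if (*pos + 1 >= DEMO_NREC)
		return NULL;
	++*pos;
	return demo_start(pos);
}

/* One simplified read() pass: emit records from *pos until EOF and
 * return how many were emitted. */
static int demo_read(long long *pos, demo_next_fn next)
{
	const char *v;
	int n = 0;

	for (v = demo_start(pos); v != NULL; v = next(pos))
		n++;
	return n;
}
```

With the buggy variant, a second pass resumes at the last record and emits it again, which is the duplicate described in the commit.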

Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: Ia0c432cd50550ecde6b308cbc554b316fa03adae
Reviewed-on: https://review.whamcloud.com/40035
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13982 tests: fix infinite loop in sanity test_184c 07/40007/4
Mr NeilBrown [Wed, 23 Sep 2020 05:23:46 +0000 (15:23 +1000)]
LU-13982 tests: fix infinite loop in sanity test_184c

If the dd in test_184c fails to create the file, for example due to
ENOSPC, the subsequent "while" loops indefinitely.

So add a loop-count to ensure it stops eventually.

Also change the test from "-f" to "-s" so we wait for the file to be
non-empty.

Test-Parameters: trivial
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: I5051adf101f08856b97fa994f687b976fda84df4
Reviewed-on: https://review.whamcloud.com/40007
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
3 years agoLU-13981 tests: use $TRUNCATE consistently and correctly. 05/40005/2
Mr NeilBrown [Wed, 23 Sep 2020 05:11:53 +0000 (15:11 +1000)]
LU-13981 tests: use $TRUNCATE consistently and correctly.

Two changes:
 1/ tests should always use $TRUNCATE rather than just
    "truncate" to ensure the tests/truncate program is used
    rather than anything in /bin or /usr/bin.  This might not
    be needed in practice, but it is good to be consistent.

 2/ The arguments provided should match what tests/truncate
    expects - "filename size".  Using "-s size filename"
    as expected by /usr/bin/truncate is confusing and
    error prone.

Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: Ic0adc0148ea0ff625695dbfc4cf339d0eca70241
Reviewed-on: https://review.whamcloud.com/40005
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13810 tests: increase limit for 1g 17/39917/2
Sergey Cheremencev [Tue, 15 Sep 2020 11:34:29 +0000 (14:34 +0300)]
LU-13810 tests: increase limit for 1g

With wide striping, a file has objects on all
OSTs. As ZFS acquires several KB for each
inode, each OST reserves the minimum qunit of 1MB
even without any writes. On a clean system with
8 OSTs, 8MB is acquired right after file creation.
Increase the pool limit from 10MB to 20MB to
make sanity-quota test_1g pass on ZFS with 8 OSTs.

HPE-bug-id: LUS-9349
Test-Parameters: env=ONLY=1g testlist=sanity-quota
Change-Id: I1286e57aadebdd665c51d965220961b1f758c6f5
Signed-off-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/39917
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13948 tests: load modules after reboot_node 29/39829/3
Elena Gryaznova [Fri, 4 Sep 2020 11:47:04 +0000 (14:47 +0300)]
LU-13948 tests: load modules after reboot_node

LOAD_MODULES_REMOTE should be taken into account
for FAILURE_MODE=HARD.

Test-parameters: envdefinitions="LOAD_MODULES_REMOTE=true"

Signed-off-by: Elena Gryaznova <elena.gryaznova@hpe.com>
HPE-bug-id:  LUS-9283
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Vladimir Saveliev <c17830@cray.com>
Change-Id: I85f0a2812ac3be4ac9645d3b165b7371504969f0
Reviewed-on: https://review.whamcloud.com/39829
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13056 utils: Default MGS device llog_print/llog_catlist 40/38940/8
Etienne AUJAMES [Fri, 12 Jun 2020 16:06:43 +0000 (18:06 +0200)]
LU-13056 utils: Default MGS device llog_print/llog_catlist

Add the default device "MGS" for lctl llog_catlist, llog_print,
llog_check, llog_remove, llog_cancel and llog_info.

Example:
The two lines below are equivalent.

$ lctl --device MGS llog_catlist
$ lctl llog_catlist

conf-sanity tests 123xx have been modified to mix the old
and new syntax.

The lctl-llog_* man pages have been modified to match the new syntax.

Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Change-Id: If988dfc5d8171b2ca47bb68dd06f9d2953548cb2
Reviewed-on: https://review.whamcloud.com/38940
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
3 years agoLU-12232 test: call dt_sync after dd 72/36772/2
Hongchao Zhang [Mon, 18 Nov 2019 02:56:12 +0000 (21:56 -0500)]
LU-12232 test: call dt_sync after dd

In test_6 of replay-ost-single, call wait_mds_ost_sync
after "dd" to sync the write transactions.

Test-Parameters: trivial alwaysuploadlogs \
mdtfilesystemtype=zfs ostfilesystemtype=zfs \
testlist=replay-ost-single,replay-ost-single,replay-ost-single

Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Change-Id: I26f6b1fe903c8b7b3f4e5fb698baffc4c3b85130
Reviewed-on: https://review.whamcloud.com/36772
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Reviewed-by: Wei Liu <sarah@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-8465 tests: enable pfsck if possible 70/40070/4
Wang Shilong [Mon, 28 Sep 2020 06:55:14 +0000 (14:55 +0800)]
LU-8465 tests: enable pfsck if possible

To test pfsck widely, try to enable pfsck by default
for Lustre tests.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Change-Id: I271422cf9d1f9cd0cc25c228c1f2df003e4f73f9
Reviewed-on: https://review.whamcloud.com/40070
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
3 years agoLU-13017 tests: disable statahead_agl for sanity test_56ra 67/39667/5
Mr NeilBrown [Thu, 13 Aug 2020 05:26:54 +0000 (15:26 +1000)]
LU-13017 tests: disable statahead_agl for sanity test_56ra

The sanity test_56ra can fail because statahead_agl can cause an extra
glimpse request.

If a stat() system call is made after an AGL glimpse request is sent,
but before the reply has been received, the code handling the stat
cannot see that glimpse request and so will send another.  This
elevates the number of requests counted.

There is a parameter (statahead_agl) which makes it easy to disable
AGL, but it isn't implemented properly.  Specifically, inodes can
still be added to the sai_agls list when AGL is disabled.  They will
never be removed, which causes an assertion to fail.

To clean this up, remove the sai_agl_valid flag, and use a test on
sai_task being non-NULL instead.  Also check agl_should_run() while
locked against ->sai_task changing, and before adding anything
to lli_agl_list.

We don't need the 'added' variable.  It is perfectly OK to wake_up the
sai_agl_task *before* adding to the list as long as that is all done
under the lock.  The task will wait for the lock before checking the
list, so it won't see it being empty.
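Here is a minimal pthread sketch of the locking rule described above. It assumes a mutex and condition variable in place of the kernel lock and wake_up(). `struct agl_queue`, `agl_add()` and `agl_take()` are hypothetical names, and `running` stands in for `sai_task` being non-NULL. The producer checks `running` and appends under the same lock. It signals before appending, which is safe because the worker must re-acquire the lock before it can inspect the list.

```c
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP 16

/* Hypothetical stand-in for the statahead AGL list. */
struct agl_queue {
	pthread_mutex_t lock;
	pthread_cond_t  wake;
	int             items[QUEUE_CAP];
	int             count;
	bool            running;	/* plays the role of sai_task != NULL */
};

void agl_queue_init(struct agl_queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->wake, NULL);
	q->count = 0;
	q->running = true;
}

/* Check "should run" and add the item under the same lock, so nothing
 * can be queued after the worker has stopped and then be stranded. */
bool agl_add(struct agl_queue *q, int item)
{
	bool added = false;

	pthread_mutex_lock(&q->lock);
	if (q->running && q->count < QUEUE_CAP) {
		/* Waking before adding is fine: the worker must re-take
		 * q->lock before it can look at the list. */
		pthread_cond_signal(&q->wake);
		q->items[q->count++] = item;
		added = true;
	}
	pthread_mutex_unlock(&q->lock);
	return added;
}

void agl_stop(struct agl_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->running = false;
	pthread_cond_broadcast(&q->wake);
	pthread_mutex_unlock(&q->lock);
}

/* Worker side: returns an item, or -1 once stopped and drained. */
int agl_take(struct agl_queue *q)
{
	int item = -1;

	pthread_mutex_lock(&q->lock);
	while (q->running && q->count == 0)
		pthread_cond_wait(&q->wake, &q->lock);
	if (q->count > 0)
		item = q->items[--q->count];	/* LIFO; order is not the point */
	pthread_mutex_unlock(&q->lock);
	return item;
}
```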

Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: I12c4e447104a86b3f48eaf57b6cf7ce4b41cc6de
Reviewed-on: https://review.whamcloud.com/39667
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13920 hsm: process hsm_actions only after mdd setup 28/40028/3
Sergey Cheremencev [Thu, 24 Sep 2020 12:57:47 +0000 (15:57 +0300)]
LU-13920 hsm: process hsm_actions only after mdd setup

There is no guarantee that MDD setup is finished
at the moment when the coordinator is started by
config params processing. If MDD setup is not finished,
the hsm actions llog is not initialized (mdd_hsm_actions_llog_init).
Hence hsm_pending_restore will not be called, i.e.
RESTORE requests will be sent to agents without taking
layout locks. I believe this may cause various problems.
I hit at least a kernel panic when the llog included a
RESTORE request in ARS_WAITING that hadn't been sent to
the agent before failover, and a second RESTORE request
for the same FID was resent after recovery. Finally, the
agent handled two RESTOREs of the same FID in parallel,
which resulted in a panic with the following backtrace:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffffc0b03bec>] thandle_get_sub_by_dt+0x14c/0x420 [ptlrpc]
...
[<ffffffffc1202732>] lod_sub_get_thandle+0x2f2/0x400 [lod]
[<ffffffffc1205021>] lod_sub_declare_xattr_set+0x61/0x300 [lod]
[<ffffffffc11db0d5>] lod_obj_stripe_replace_parent_fid_cb+0x245/0x450 [lod]
[<ffffffffc11eae0e>] lod_obj_for_each_stripe+0x11e/0x2d0 [lod]
[<ffffffffc11ebfe2>] lod_replace_parent_fid+0x2a2/0x390 [lod]
[<ffffffffc11dae90>] ? lod_attr_get+0x110/0x110 [lod]
[<ffffffffc11f8faf>] lod_declare_xattr_set+0x24f/0xf70 [lod]
[<ffffffffc077b251>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
[<ffffffffc1091ff4>] mdo_declare_xattr_set+0x74/0x2b0 [mdd]
[<ffffffffc077b129>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[<ffffffffc0fa2f7b>] ? osd_trans_create+0xbb/0x620 [osd_ldiskfs]
[<ffffffffc1094903>] mdd_declare_xattr_set+0x33/0x70 [mdd]
[<ffffffffc1094b4e>] mdd_object_pfid_replace+0x7e/0x1e0 [mdd]
[<ffffffffc109c2c6>] mdd_swap_layouts+0xa76/0x1dc0 [mdd]
[<ffffffffc10a5e1a>] ? mdd_trans_stop+0x3a/0x174 [mdd]
[<ffffffffc114b489>] hsm_cdt_request_completed.isra.14+0xc89/0xf50 [mdt]
[<ffffffffc077b129>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[<ffffffffc114d844>] mdt_hsm_update_request_state+0x544/0x7b0 [mdt]
[<ffffffffc0a82277>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
[<ffffffffc10fec92>] ? ucred_set_audit_enabled.isra.15+0x22/0x60 [mdt]
[<ffffffffc112b98f>] mdt_hsm_progress+0x1ef/0x3f0 [mdt]
[<ffffffffc0aefcfa>] tgt_request_handle+0x96a/0x1640 [ptlrpc]
[<ffffffffc06cca9e>] ? libcfs_nid2str_r+0xfe/0x130 [lnet]
[<ffffffffc0a91466>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[<ffffffffc0a95fbc>] ptlrpc_main+0xb3c/0x14d0 [ptlrpc]
[<ffffffffc0a95480>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[<ffffffffa04c1c31>] kthread+0xd1/0xe0
[<ffffffffa04c1b60>] ? insert_kthread_work+0x40/0x40
[<ffffffffa0b76c37>] ret_from_fork_nospec_begin+0x21/0x21
[<ffffffffa04c1b60>] ? insert_kthread_work+0x40/0x40
Code: 74 29 4c 3b a0 50 ff ff ff 75 e4 4d 85 ed 74 1b bf 01 00 00 00 e8 c5 b8 ff ff 85 c0 0f 85 98 00 00 00 49 8b 45 00 e9 04 ff ff ff <49> 8b 44 24 40 48 8b 40 08 48 85 c0 0f 84 b3 02 00 00 4c 89 e6
RIP  [<ffffffffc0b03bec>] thandle_get_sub_by_dt+0x14c/0x420 [ptlrpc]

Note: I hit this panic while testing https://review.whamcloud.com/#/c/38867/;
however, I believe the same issue may exist even without 38867.

The patch makes mdt_hsm_cdt_start() wait until MDT initialization has
finished. Without this fix, the error below appears in dmesg
after each MDS restart when HSM is enabled:
mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2

Test-Parameters: testlist=sanity-hsm
Change-Id: I4c4edaa72a562576ea71d89a4b60618d099ec4f5
Signed-off-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/40028
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Ben Evans <beevans@whamcloud.com>
Reviewed-by: Nathan Rutman <nrutman@gmail.com>
Reviewed-by: Nikitas Angelinas <nikitas.angelinas@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13006 jobid: enhance tests to check per-session jobids. 56/39656/4
Mr NeilBrown [Mon, 10 Aug 2020 22:45:44 +0000 (08:45 +1000)]
LU-13006 jobid: enhance tests to check per-session jobids.

Lustre allows per-session jobids by writing to "jobid_this_session".

The upstream-linux client does *not* support jobids using the process
environment.

Update the sanity tests to recognize this.

1/ Allow setting jobid_var=USER to fail - if it fails, don't
   test use of environment variables.
2/ Check if "jobid_this_session" exists - if it does, check
   that it is used for generating jobids.
3/ Some tests use jobid_var to test general assignment
   to config values.  Change those to use jobid_name if
   it is available.

Also if jobid_var is set to 'session' and jobid_name contains %j -
use jobid_name as is done when jobid_var is a variable name.

Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: Ia7d01b8ffb9c6c910d2ce8f0615c802485604bf9
Reviewed-on: https://review.whamcloud.com/39656
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13977 tests: fix float comparison in sanity test_255a 01/40001/2
Mr NeilBrown [Wed, 23 Sep 2020 01:16:06 +0000 (11:16 +1000)]
LU-13977 tests: fix float comparison in sanity test_255a

  [ $a -gt $b ]
compares integers.

In ladvise_willread_performance(), average_ladvise and lowest_speedup
are calculated to 2 decimals, so they probably aren't integers.  So
this (nearly) always fails, but as the failure is not reported when a
VM is detected, the failure goes unnoticed.

The bash [[ ]] command can be used instead.  Its '>' operator compares
its operands as strings rather than requiring integers, so non-integer
values like these no longer make the test fail.
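Both pitfalls can be reproduced in C. This sketch is an illustration rather than part of the test suite. `int_operand_ok()` mimics the integer-only operands that `[ -gt ]` requires. `compare_numeric()` parses both values with strtod(). `compare_lexical()` is a byte-wise string comparison, like bash's `[[ > ]]` in the C locale. The last two cases show that a string comparison orders these values numerically only while their integer parts have the same number of digits.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Like the operands of `[ $a -gt $b ]`: the whole string must be an integer. */
bool int_operand_ok(const char *s, long *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (end == s || *end != '\0' || errno == ERANGE)
		return false;	/* "2.50" stops at the '.', so it is rejected */
	*out = v;
	return true;
}

/* Numeric comparison of decimal strings: <0, 0 or >0. */
int compare_numeric(const char *a, const char *b)
{
	double x = strtod(a, NULL);
	double y = strtod(b, NULL);

	return (x > y) - (x < y);
}

/* Byte-wise string comparison, as `[[ $a > $b ]]` does in the C locale. */
int compare_lexical(const char *a, const char *b)
{
	int r = strcmp(a, b);

	return (r > 0) - (r < 0);
}
```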

Test-Parameters: trivial
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: Ifc674d5804b1269cebce82893a08eace9ffd9be4
Reviewed-on: https://review.whamcloud.com/40001
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
3 years agoLU-13919 kernel: kernel update RHEL7.8 [3.10.0-1127.19.1.el7] 64/39964/2
Jian Yu [Thu, 17 Sep 2020 17:35:00 +0000 (10:35 -0700)]
LU-13919 kernel: kernel update RHEL7.8 [3.10.0-1127.19.1.el7]

Update RHEL7.8 kernel to 3.10.0-1127.19.1.el7.

The extent status tree shrinker patches are removed from the
RHEL 7.8 kernel patch series because the following bug fix has
been included since version 3.10.0-1127.18.2.el7:

ext4: change LRU to round-robin in extent status tree shrinker (BZ#1847343)

Test-Parameters: clientdistro=el7.8 serverdistro=el7.8

Change-Id: I8172e79de239681c540ab0644b63f8172790a027
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/39964
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13511 obdclass: don't initialize obj for zero FID 92/39792/2
Lai Siyao [Sun, 30 Aug 2020 16:59:40 +0000 (00:59 +0800)]
LU-13511 obdclass: don't initialize obj for zero FID

An object with a zero FID is used in stripe allocation, and it's
meaningless to initialize such an object via lu_object_find_at(), so
return an error early to avoid an assertion in lu_object_put().

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: Ia1bda3d01ff7552e94f31a9c928868652937d559
Reviewed-on: https://review.whamcloud.com/39792
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Stephane Thiell <sthiell@stanford.edu>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-11548 llite: increase readahead default values 00/33400/6
Andreas Dilger [Fri, 19 Oct 2018 07:31:41 +0000 (01:31 -0600)]
LU-11548 llite: increase readahead default values

It is commonly recommended to increase the readahead tunables
for clients to increase performance, since the current defaults
are too small, having been set several years ago for slower
networks and servers.

Increase the readahead defaults to better match values that are
recommended today:
- read_ahead_max_mb increased from 64MB to 1GB by default,
  or 1/32 RAM, whichever is less
- read_ahead_per_file_max_mb is increased from 64MB to 256MB,
  or 1/4 of read_ahead_max_mb, whichever is less
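The new defaults reduce to two min() computations. This sketch works in whole MiB for readability, which is an assumption since the real tunables may be stored in pages. The function and macro names are hypothetical.

```c
#include <stdint.h>

#define RA_MAX_MB_CAP		1024ULL	/* 1GB */
#define RA_PER_FILE_MB_CAP	256ULL	/* 256MB */

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* read_ahead_max_mb: 1GB or 1/32 of RAM, whichever is less. */
uint64_t default_ra_max_mb(uint64_t ram_mb)
{
	return min_u64(RA_MAX_MB_CAP, ram_mb / 32);
}

/* read_ahead_per_file_max_mb: 256MB or 1/4 of read_ahead_max_mb,
 * whichever is less. */
uint64_t default_ra_per_file_max_mb(uint64_t ra_max_mb)
{
	return min_u64(RA_PER_FILE_MB_CAP, ra_max_mb / 4);
}
```

For example, a client with 16GB of RAM would get read_ahead_max_mb = 512 and read_ahead_per_file_max_mb = 128, while a 64GB client reaches the 1GB and 256MB caps.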

Modify the constant names to better match the variable and /proc
filenames.

Fix sanity test_101g to allow readahead to generate extra read
RPCs, as long as they are the expected size or larger.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Iec864788fa1979c27adad42e613d1bf03f3ebbe5
Reviewed-on: https://review.whamcloud.com/33400
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13745 llite: switch generic_file_splice_read() to use of ->read_iter() 72/39272/13
Al Viro [Sun, 6 Sep 2020 21:38:17 +0000 (17:38 -0400)]
LU-13745 llite: switch generic_file_splice_read() to use of ->read_iter()

... and kill the ->splice_read() instances that can be switched to it

* Special Note * Using generic_file_splice_read() breaks Lustre,
but in my testing on RHEL7, not providing a ->splice_read, which
results in default_file_splice_read() being used, seems to work.

Linux-commit: 82c156f853840645604acd7c2cebcb75ed1b6652

Change-Id: I28bd2fe43fec25e881c78b41c58c59ac5e74eb49
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-on: https://review.whamcloud.com/39272
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoNew tag 2.13.56 2.13.56 v2_13_56
Oleg Drokin [Fri, 25 Sep 2020 17:27:52 +0000 (13:27 -0400)]
New tag 2.13.56

Change-Id: Iea1ae97f08b59719917e57bc1b2a554a010f76c4

3 years agoLU-13969 - Updates to lustre-release yaml.sh 52/39952/5
Lee Ochoa [Wed, 16 Sep 2020 21:02:33 +0000 (15:02 -0600)]
LU-13969 - Updates to lustre-release yaml.sh

Updated output of release() function to standardize the node.yml file
os_distribution parameter. Changes as follows:

RHEL   - use redhat-release first and os-release as backup as the latter may
         not include the full version (major/minor)
CENTOS - use centos-release first and os-release as backup, same as RHEL
SUSE   - use os-release instead of suse-release as the latter is deprecated
UBUNTU - use os-release

Removed parsing system-release and *-release as neither
option correctly outputs desired info

Removed "lustre_" references in node.yml file attributes, the default in Maloo
is to look for non-lustre prefixes first

Change-Id: Ia011f944aae53f31fcd3a539e846ea5aba7ec7c4
Signed-off-by: Lee Ochoa <lochoa@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/39952
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13941 test: Restrict create_count to valid range 98/39798/3
Shaun Tancheff [Tue, 15 Sep 2020 21:35:08 +0000 (16:35 -0500)]
LU-13941 test: Restrict create_count to valid range

create_count range is restricted to:
  [OST_MIN_PRECREATE=32, OST_MAX_PRECREATE=20000]

Enforce this restriction rather than halt with an error.
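The actual change lives in a shell test script. This C sketch only illustrates the clamping logic, using the bounds quoted above. `clamp_create_count()` is a hypothetical name.

```c
/* Bounds quoted in the commit message above. */
#define OST_MIN_PRECREATE 32
#define OST_MAX_PRECREATE 20000

/* Clamp a requested create_count into the accepted range instead of
 * treating an out-of-range value as an error. */
int clamp_create_count(int requested)
{
	if (requested < OST_MIN_PRECREATE)
		return OST_MIN_PRECREATE;
	if (requested > OST_MAX_PRECREATE)
		return OST_MAX_PRECREATE;
	return requested;
}
```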

Test-Parameters: trivial testlist=parallel-scale
HPE-bug-id: LUS-5960
Fixes: 21501dedf64e ("LU-9780 tests: Testing Round-Robin allocation")
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: I125f2121a57d16a9031ed21938ce2f0b99b9dea8
Reviewed-on: https://review.whamcloud.com/39798
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
3 years agoLU-13935 ofd: object removal is not handled properly 65/39765/3
Andrew Perepechko [Mon, 31 Aug 2020 08:49:59 +0000 (11:49 +0300)]
LU-13935 ofd: object removal is not handled properly

We should not check for object existence in
ofd_version_get_check() when we haven't got the lock yet.
ofd_attr_set() will check it later.

In ofd_destroy(), we should unlock if the object
does not exist; otherwise the oti_w_locks assertion will fail.
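The ofd_destroy() fix follows the usual single-exit unlock pattern. This sketch is hypothetical throughout. A plain counter models the per-thread write-lock count that oti_w_locks tracks, and -ENOENT is an assumed error code. The point is that the "object does not exist" path still drops the lock, so the count returns to zero.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-thread write-lock counter. */
struct thread_info_sketch { int w_locks; };
struct object_sketch { bool exists; bool destroyed; };

static void obj_write_lock(struct thread_info_sketch *ti)   { ti->w_locks++; }
static void obj_write_unlock(struct thread_info_sketch *ti) { ti->w_locks--; }

int destroy_sketch(struct thread_info_sketch *ti, struct object_sketch *obj)
{
	int rc = 0;

	obj_write_lock(ti);
	if (!obj->exists) {
		rc = -ENOENT;		/* assumed error code */
		goto unlock;		/* returning here without unlocking was the bug */
	}
	obj->exists = false;
	obj->destroyed = true;
unlock:
	obj_write_unlock(ti);
	return rc;
}
```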

Change-Id: I00f67d15c3268bcf55aafa88c088f2dbf55a470c
Signed-off-by: Andrew Perepechko <andrew.perepechko@hpe.com>
HPE-bug-id: LUS-9063,LUS-9282
Reviewed-on: https://review.whamcloud.com/39765
Reviewed-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-1538 tests: standardize test script init – full group 59/36359/9
James Nunez [Thu, 3 Oct 2019 15:42:18 +0000 (09:42 -0600)]
LU-1538 tests: standardize test script init – full group

Standardize the initial Lustre test script initialization for
clarity and consistency for test suites in the full test group.

The LUSTRE path is already normalized in init_test_env(), so this
doesn't need to be done in the caller.  Use $(...) subshells instead
of `...` in the affected lines.  Remove NAME, SRCDIR, PATH, MULTIOP,
SETUP, CLEANUP, CHECKSTAT, TMP and SAVE_PWD variable initialization,
since it is already done in init_test_env() or not needed in the test
scripts.

Move all definitions of ALWAYS_EXCEPT and SLOW to after
init_test_env() and init_logging() and call build_test_filter()
immediately after the ALWAYS_EXCEPT and SLOW definitions.

Test-Parameters: trivial
Test-Parameters: envdefinitions=SLOW=yes,ENABLE_QUOTA=yes clientcount=2 osscount=1 ostcount=7 mdscount=1 mdtcount=1 austeroptions=-R iscsi=1 testlist=metadata-updates,performance-sanity,racer
Test-Parameters: envdefinitions=SLOW=yes,ENABLE_QUOTA=yes clientcount=2 osscount=1 ostcount=7 mdscount=1 mdtcount=1 austeroptions=-R testlist=obdfilter-survey,sanity-benchmark,sanity-lsnapshot

Signed-off-by: James Nunez <jnunez@whamcloud.com>
Change-Id: I6e21a4a5d9e9215d5b452c4fd30467d9c007b5a5
Reviewed-on: https://review.whamcloud.com/36359
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-12233 lnet: deadlock on LNet shutdown 33/39933/3
Serguei Smirnov [Wed, 16 Sep 2020 16:46:48 +0000 (09:46 -0700)]
LU-12233 lnet: deadlock on LNet shutdown

Release ln_api_mutex during LNet shutdown while waiting
for zombie LNIs, to allow other threads to read the LNet
state updated by the shutdown and fall through, avoiding
the deadlock.
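The shape of the fix can be sketched with pthreads. All names here are hypothetical. `api_mutex` plays the role of ln_api_mutex, and `peer_thread()` represents a thread that needs the mutex to observe the shutdown before it can release the zombie. The waiter drops the mutex around each short sleep so the peer can make progress. If `wait_for_zombies()` held the mutex for the whole wait, the two threads would deadlock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins; names do not match LNet's internals. */
struct lnet_state_sketch {
	pthread_mutex_t api_mutex;	/* plays the role of ln_api_mutex */
	bool shutting_down;		/* state published by the shutdown */
	bool zombies_gone;		/* set once the zombie NI is released */
};

void lnet_state_init(struct lnet_state_sketch *s)
{
	pthread_mutex_init(&s->api_mutex, NULL);
	s->shutting_down = false;
	s->zombies_gone = false;
}

static void short_nap(void)
{
	struct timespec ts = { 0, 1000000 };	/* 1 ms */

	nanosleep(&ts, NULL);
}

/* Called with api_mutex held.  Dropping the mutex on every iteration lets
 * other threads read the shutdown state and finish their work. */
void wait_for_zombies(struct lnet_state_sketch *s)
{
	while (!s->zombies_gone) {
		pthread_mutex_unlock(&s->api_mutex);
		short_nap();
		pthread_mutex_lock(&s->api_mutex);
	}
}

/* A thread that must take api_mutex to notice the shutdown before it
 * can drop its reference to the zombie NI. */
void *peer_thread(void *arg)
{
	struct lnet_state_sketch *s = arg;

	for (;;) {
		pthread_mutex_lock(&s->api_mutex);
		if (s->shutting_down) {
			s->zombies_gone = true;
			pthread_mutex_unlock(&s->api_mutex);
			return NULL;
		}
		pthread_mutex_unlock(&s->api_mutex);
		short_nap();
	}
}
```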

Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: Iaba11624d5b79bd0acb4add39f6153c55770440a
Reviewed-on: https://review.whamcloud.com/39933
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13967 quota: change warning to cdebug 21/39921/2
Sergey Cheremencev [Tue, 15 Sep 2020 18:42:41 +0000 (21:42 +0300)]
LU-13967 quota: change warning to cdebug

Changing CWARN to CDEBUG as it doesn't
point to any known problem.

Change-Id: Ia05a316a6ce1cc61899dd6662a9895946eee46aa
Signed-off-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/39921
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
3 years agoLU-11631 obdclass: nlink is not set in struct obdo 96/39896/2
Lai Siyao [Sat, 12 Sep 2020 02:23:00 +0000 (10:23 +0800)]
LU-11631 obdclass: nlink is not set in struct obdo

Traditionally, the nlink field is not set in obdo_from_la() and vice
versa, because these functions are used for communication between MDT
and OST, where this field is not used.  However, when DNE is enabled,
this function is also used for getattr between MDTs, which will then
always see a zero nlink.

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: I2a8e6a7b3f98864ec11bca7f7c2070af90b64ade
Reviewed-on: https://review.whamcloud.com/39896
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
3 years agoLU-13740 ldiskfs: add Ubuntu 20.04 LTS support 79/39779/7
James Simmons [Fri, 11 Sep 2020 13:59:13 +0000 (09:59 -0400)]
LU-13740 ldiskfs: add Ubuntu 20.04 LTS support

The latest Ubuntu LTS is based on a 5.4 Linux kernel. We can use
the previous work of generic 5.4 ldiskfs support with a small
change to the ext-pdirop.patch to make ldiskfs available to
this latest Ubuntu LTS.

Change-Id: I6d1106c3da984b508ccc77364e13adc692bddb60
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/39779
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
3 years agoLU-13840 utils: --pool instead of -o with lfs setquota 15/39715/5
Sergey Cheremencev [Mon, 24 Aug 2020 18:28:04 +0000 (21:28 +0300)]
LU-13840 utils: --pool instead of -o with lfs setquota

The use of "lfs setquota -o" as the short option for
specifying the quota pool is confusing, because "lfs quota -o"
is the short option for getting quota on an OST. Use only
long option --pool for lfs setquota.

The patch also replaces -o with --pool for setquota in
sanity-quota.sh.

Change-Id: I18a695dcbbcd668611b1301c48f4e10d354b2686
Signed-off-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/39715
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-12397 osp: always set opd_new_connection 78/35078/6
Sergey Cheremencev [Mon, 25 Mar 2019 22:06:00 +0000 (01:06 +0300)]
LU-12397 osp: always set opd_new_connection

The opd_got_disconnected flag could be set back to 0
due to a race between osp_precreate_thread and osp_import_event.
The next ACTIVE event then doesn't set opd_new_connection, as
opd_got_disconnected is also 0 (i.e. the import hasn't disconnected).
This race causes osp_precreate_thread to sleep forever
in wait even though the connection state is FULL.

The patch always sets the opd_new_connection flag on an ACTIVE event
regardless of the value of opd_got_disconnected.

The patch adds conf-sanity test_101b to race DISCON and ACTIVE
events. Without the fix, the test hangs osp_precreate_thread
and, as a result, the osp_precreate_reserve threads.

Change-Id: Iff41a2743f108679d5f70aca8e1c2108e979ac09
Cray-bug-id: LUS-7178
Signed-off-by: Sergey Cheremencev <c17829@cray.com>
Reviewed-on: https://es-gerrit.dev.cray.com/154883
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Tested-by: Elena Gryaznova <c17455@cray.com>
Reviewed-on: https://review.whamcloud.com/35078
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-11518 osc: Do ELC on locks with no OSC object 84/34584/5
Patrick Farrell [Wed, 3 Apr 2019 14:19:44 +0000 (10:19 -0400)]
LU-11518 osc: Do ELC on locks with no OSC object

Currently, osc_ldlm_weigh_ast weighs locks with no OSC
object in their ast data as "1", meaning the lock is not
considered for ELC.

This doesn't make much sense, since if there is no OSC
object, it's unlikely there's any data under the lock, so
it's actually a good candidate for ELC.
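The weighing change can be sketched as follows. This is a hypothetical model, not osc_ldlm_weigh_ast() itself. It assumes, as the text implies, that a weight of 0 marks a lock as cheap to cancel and therefore a candidate for ELC, while a non-zero weight keeps it. `struct lock_sketch` and its fields are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock descriptor. */
struct lock_sketch {
	bool          has_osc_object;	/* OSC object present in ast data */
	unsigned long cached_pages;	/* pages covered by the lock */
};

/* 0 = nothing cached under the lock, so it can be offered for ELC.
 * Before the change, a lock without an OSC object was weighed 1 and
 * therefore never offered. */
unsigned int weigh_lock(const struct lock_sketch *lk)
{
	if (!lk->has_osc_object)
		return 0;
	return lk->cached_pages > 0 ? 1 : 0;
}

size_t count_elc_candidates(const struct lock_sketch *locks, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (weigh_lock(&locks[i]) == 0)
			count++;
	return count;
}
```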

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: Ie832afbf2479f3a348e44d2c21992696830000ae
Reviewed-on: https://review.whamcloud.com/34584
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-13922 osd-ldiskfs: no need to add OI cache in readdir 82/39782/3
Lai Siyao [Sat, 29 Aug 2020 21:53:18 +0000 (05:53 +0800)]
LU-13922 osd-ldiskfs: no need to add OI cache in readdir

It's a waste of time to call osd_add_oi_cache() in osd_it_ea_rec(),
because each dirent read will override it.

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: Iec701bf66153fdf2ba7a3f3b89565381215abf33
Reviewed-on: https://review.whamcloud.com/39782
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Stephane Thiell <sthiell@stanford.edu>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
3 years agoLU-6142 lustre: don't take spinlock to read a 'long'. 43/39743/2
Mr NeilBrown [Thu, 27 Aug 2020 06:30:57 +0000 (16:30 +1000)]
LU-6142 lustre: don't take spinlock to read a 'long'.

Reading a 'long' (or unsigned long) is always an atomic operation.
There is never a need to take a spinlock to just read a single 'long'.

There are several procfs/debugfs/sysfs handlers which needlessly take
a spinlock for this purpose.

This patch:
 - removes the taking of the spinlock
 - changes the printf to scnprintf() as appropriate
 - directly returns the value returned by scnprintf rather than
   storing it in a variable
 - accesses the 'long' as an arg to the scnprintf(), rather than
   introducing a variable to hold it.
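Here is a userspace sketch of the resulting handler shape. In the kernel, the value would simply be read without the spinlock. Here a relaxed C11 atomic load plays that role, because a plain concurrent read is formally a data race in ISO C. `scnprintf_sketch()` is a hypothetical helper that reproduces scnprintf()'s return-value rule on top of vsnprintf(): it returns the characters actually written, never more than size - 1. The handler returns that value directly.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>

/* Userspace approximation of the kernel's scnprintf(). */
static int scnprintf_sketch(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;
	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);
	if (n < 0)
		return 0;
	return (size_t)n >= size ? (int)(size - 1) : n;
}

/* Hypothetical stats structure holding a single word-sized counter. */
struct ra_stats_sketch {
	atomic_ulong max_pages;
};

/* Show handler: one load, no spinlock, no temporaries. */
int max_pages_show(struct ra_stats_sketch *st, char *buf, size_t size)
{
	return scnprintf_sketch(buf, size, "%lu\n",
		atomic_load_explicit(&st->max_pages, memory_order_relaxed));
}
```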

Test-Parameters: trivial
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Change-Id: If4a6454b46844864e1177536a9c7b91e4c97de86
Reviewed-on: https://review.whamcloud.com/39743
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
3 years agoLU-11518 ldlm: pool recalc forceful call 64/39564/6
Vitaly Fertman [Tue, 4 Aug 2020 17:45:12 +0000 (20:45 +0300)]
LU-11518 ldlm: pool recalc forceful call

Allow pool recalc to be called forcefully, independently of
the last recalc time.

Call the pool recalc forcefully on the lock decref instead of LRU
cancel to take into account the fresh SLV obtained from the server.

Call LRU recalc from after_reply if a significant SLV change occurs.
Add a sysfs attribute to control what 'a significant SLV change' is.
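What counts as "a significant SLV change" can be expressed as a relative threshold. This is a minimal sketch under an explicit assumption: the name and unit of the real sysfs attribute are not given here, so `threshold_pct` is a hypothetical percentage knob, and `slv_change_significant()` is an invented helper that after_reply could consult before forcing an LRU recalc.

```c
#include <stdbool.h>
#include <stdint.h>

/* threshold_pct is a hypothetical knob standing in for the new sysfs
 * attribute; its real name and unit may differ. */
bool slv_change_significant(uint64_t old_slv, uint64_t new_slv,
			    unsigned int threshold_pct)
{
	uint64_t diff = old_slv > new_slv ? old_slv - new_slv
					  : new_slv - old_slv;

	if (old_slv == 0)
		return new_slv != 0;
	/* diff / old_slv >= threshold_pct / 100, evaluated in floating
	 * point so that diff * 100 cannot overflow for large SLVs. */
	return (double)diff * 100.0 >= (double)old_slv * threshold_pct;
}
```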

Signed-off-by: Vitaly Fertman <c17818@cray.com>
Change-Id: Iffeb8d73effdfc494f412422f285921aa4eb9811
HPE-bug-id: LUS-8678
Reviewed-on: https://es-gerrit.dev.cray.com/157134
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Tested-by: Jenkins Build User <nssreleng@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/39564
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>