Whamcloud - gitweb
fs/lustre-release.git
2 years ago LU-16184 o2iblnd: fix deadline for tx on peer queue
Serguei Smirnov [Fri, 23 Sep 2022 19:29:59 +0000 (12:29 -0700)]
LU-16184 o2iblnd: fix deadline for tx on peer queue

In o2iblnd, the deadline is checked for txs on the peer queue,
but it is not set before the tx is added to the queue. This
may cause the tx to be dropped unnecessarily with a
"Timed out tx for ..." warning.

Fix it by setting the tx_deadline when adding tx to peer queue.
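
A minimal Python sketch of the idea, not the o2iblnd C code. The names Tx,
queue_tx, timed_out and TX_TIMEOUT are hypothetical, and an unset deadline
is modelled as 0.0. It shows why a tx queued without a deadline looks
expired on the very first check, and how setting the deadline at queue
time avoids that.

```python
from collections import deque
from dataclasses import dataclass

TX_TIMEOUT = 5.0  # hypothetical timeout in seconds


@dataclass
class Tx:
    name: str
    deadline: float = 0.0  # 0.0 models a deadline that was never set


def queue_tx(queue, tx, now, timeout=TX_TIMEOUT):
    """Set the deadline first, then append the tx to the peer queue."""
    tx.deadline = now + timeout
    queue.append(tx)


def timed_out(queue, now):
    """Return the txs whose deadline has already passed."""
    return [tx for tx in queue if tx.deadline <= now]


peer_queue = deque()
peer_queue.append(Tx("no-deadline"))           # models the old behaviour
queue_tx(peer_queue, Tx("with-deadline"), now=100.0)
expired_now = [tx.name for tx in timed_out(peer_queue, now=100.0)]
```

Here only the tx that was appended without a deadline is reported as
expired at time 100.0; the other one survives until its timeout elapses.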

Lustre-change: https://review.whamcloud.com/48640
Lustre-commit: 4c89ee7d7b098c7f1e6566f49fa2940db577518d

Test-Parameters: trivial
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: Ie7cf5590b440b60f71527049953a64bb31d53578
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48641
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
2 years ago LU-16160 osc: take ldlm lock when queue sync pages
Bobi Jam [Thu, 15 Sep 2022 06:46:34 +0000 (14:46 +0800)]
LU-16160 osc: take ldlm lock when queue sync pages

osc_queue_sync_pages() adds an osc_extent to the osc_object's IO extent
list without taking ldlm locks, and then it calls
osc_io_unplug_async() to queue the IO work for the client.

This patch makes sync page queuing take the ldlm lock in the
osc_extent.

Lustre-change: https://review.whamcloud.com/48557
Lustre-commit: 67aca1fcc6bed20794832decdba590a758d67d8f

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Change-Id: Idefa2981e62a2a6e10d8b8a7692c0337b61b9052
Reviewed-on: https://review.whamcloud.com/48597
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-5932 lipe: stratagem-hp-config.sh has wrong MDTLIST
Alexandre Ioffe [Wed, 21 Sep 2022 19:15:03 +0000 (12:15 -0700)]
EX-5932 lipe: stratagem-hp-config.sh has wrong MDTLIST

stratagem-hp-config.sh doesn't pick up the proper MDTLIST
if snapshot agents are running. Fix MDTLIST, which is used
to configure lpurge.

Test-Parameters: trivial
Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Change-Id: Ic1d58d56f1acae140122d0b582410c140759e89e
Reviewed-on: https://review.whamcloud.com/48619
Reviewed-by: Shuichi Ihara <sihara@ddn.com>
Reviewed-by: Colin Faber <cfaber@ddn.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16154 obdclass: free inst_name correctly
Emoly Liu [Thu, 15 Sep 2022 01:42:47 +0000 (09:42 +0800)]
LU-16154 obdclass: free inst_name correctly

In function class_config_llog_handler(), inst_name should be freed
correctly before break.

Lustre-change: https://review.whamcloud.com/48542
Lustre-commit: e7f17c5e0c95dba3b80e192e4ca3628cc42e64b9

Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Change-Id: I6adc0ed62c3c637237834b799f25666d0e7e1ecb
Reviewed-on: https://review.whamcloud.com/48670
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16050 build: replace ofed_info with dpkg/rpm
Jian Yu [Mon, 26 Sep 2022 18:22:56 +0000 (11:22 -0700)]
LU-16050 build: replace ofed_info with dpkg/rpm

After installing MLNX_OFED by running the mlnxofedinstall command,
the mlnx-ofed-kernel-modules package is not listed by ofed_info,
which causes Lustre configure to fail as follows:

checking whether to use Compat RDMA... /usr/bin/ofed_info
dpkg-query: error: --listfiles needs at least one package name argument

This patch fixes the above issue by replacing ofed_info with
"dpkg -l" and "rpm -qa" commands to find OFED package.

Lustre-change: https://review.whamcloud.com/48047
Lustre-commit: 3a7930e63c15b0fbe51ac73db81a1186939115bb

Test-Parameters: trivial
Fixes: ec03c9628cae ("LU-15417 build: find the new path for MOFED 5.5")
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Change-Id: Ia3c2d6bf10e147ca2761221741eff6f93008556c
Reviewed-by: Gaurang Tapase <gtapase@ddn.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48662
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-6014 tests: Revert "EX-4093 tests: hot-pools don't recreate pools"
Jian Yu [Wed, 28 Sep 2022 16:54:33 +0000 (09:54 -0700)]
EX-6014 tests: Revert "EX-4093 tests: hot-pools don't recreate pools"

This reverts commit 116cbacc52d8 to resolve the hot-pools
regression test failures.

After running sub-test 1, the OST pools were destroyed by
the following stack_trap in create_pool():

  stack_trap "destroy_test_pools $fsname" EXIT

If the pools are not recreated in the successive sub-tests,
then they will fail. We have to revert commit 116cbacc52d8
until we find a way to avoid triggering the stack_trap
between sub-tests.

Test-Parameters: trivial mdscount=2 mdtcount=4 \
testlist=parallel-scale-nfsv4,hot-pools

Fixes: 116cbacc52d8 ("EX-4093 tests: hot-pools don't recreate pools")
Change-Id: I464a1f9f380c55e70b78a0dd7e52723d5b0a298d
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48690
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago RM-620 build: New tag 2.14.0-ddn62
Andreas Dilger [Fri, 23 Sep 2022 22:24:58 +0000 (16:24 -0600)]
RM-620 build: New tag 2.14.0-ddn62

New tag 2.14.0-ddn62

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I21b71b04905a70acbaada6d5a7fbab6c9184ca51

2 years ago Revert "EX-4141 lipe: lamigo should detect dead OST and restart ALR"
Andreas Dilger [Fri, 23 Sep 2022 19:36:53 +0000 (19:36 +0000)]
Revert "EX-4141 lipe: lamigo should detect dead OST and restart ALR"

This reverts commit 028bee14d2c6d8feb5eb418302f8751643e731c6 due to build error.

Change-Id: I6193f3e99192b618a3e6616524e28b230659fc0b
Reviewed-on: https://review.whamcloud.com/48639
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago RM-620 build: New tag 2.14.0-ddn61
Andreas Dilger [Fri, 23 Sep 2022 17:19:23 +0000 (11:19 -0600)]
RM-620 build: New tag 2.14.0-ddn61

New tag 2.14.0-ddn61

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I34c78bc6ce2fbac65e4e8b017cad1da05c78d53a

2 years ago LU-16183 tests: sanity-hsm/70 should detect python
Minh Diep [Thu, 15 Sep 2022 03:41:37 +0000 (20:41 -0700)]
LU-16183 tests: sanity-hsm/70 should detect python

Check for python2 and python3 explicitly, since the
generic python command does not exist in newer distros.
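
As a hedged illustration of the same check written in Python rather than
in the test script's shell, the sketch below probes for explicit
interpreter names. The function name find_python and the candidate order
(python3 before python2) are assumptions, not taken from the patch.

```python
import shutil


def find_python(candidates=("python3", "python2"), which=shutil.which):
    """Return the first interpreter name found on PATH, or None."""
    for name in candidates:
        if which(name):
            return name
    return None
```

The which parameter exists only so that the lookup can be replaced in a
test; by default the real PATH is searched with shutil.which().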

Test-Parameters: env=SLOW=yes,ENABLE_QUOTA=yes \
clientdistro=sles15sp3 testlist=sanity-hsm
Test-Parameters: env=SLOW=yes,ENABLE_QUOTA=yes \
clientdistro=el7.9 testlist=sanity-hsm
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Change-Id: I2251be461129310868868277bf9d46015545ffe2
Reviewed-on: https://review.whamcloud.com/48577
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Alex Deiter <alex.deiter@gmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-4141 lipe: lamigo should detect dead OST and restart ALR
Alexandre Ioffe [Tue, 29 Mar 2022 07:48:35 +0000 (00:48 -0700)]
EX-4141 lipe: lamigo should detect dead OST and restart ALR

Use the #keepalive message and ssh read with timeout
to detect that an OST is down and restart ALR.
Add stats for the ALR last seen message.
Duplicate ofd_access_log_reader from lustre/utils into
lipe/src/es_ofd_access_log_reader.
Use a common lamigo_hash.h for lamigo and
es_ofd_access_log_reader.

Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Test-Parameters: trivial testlist=hot-pools
Change-Id: I26dc631a8663046821e049fc6e091108b2a62f87
Reviewed-on: https://review.whamcloud.com/46944
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: John Hammond <jhammond@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
2 years ago LU-14962 lnet: Check for -ESHUTDOWN in lnet_parse
Chris Horn [Tue, 24 Aug 2021 16:16:17 +0000 (11:16 -0500)]
LU-14962 lnet: Check for -ESHUTDOWN in lnet_parse

The fix for LU-8106, http://review.whamcloud.com/19993, no longer
works because rc does not have the return value from
lnet_nid2peerni_locked(). Use PTR_ERR to get the return value and
restore the LU-8106 fix.

Lustre-change: https://review.whamcloud.com/44743
Lustre-commit: cce82630cbf2c7badbbdd16a8ca9c8c0065ded13

Test-Parameters: trivial
HPE-bug-id: LUS-10333
Fixes: fa8b4e6357 ("LU-7734 lnet: peer/peer_ni handling adjustments")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I9cc2bc2d6e675d38cf06d99c524bdd95110bf0e9
Reviewed-on: https://review.whamcloud.com/48487
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
Chris Horn [Thu, 3 Mar 2022 07:12:32 +0000 (01:12 -0600)]
LU-15618 lnet: Return ESHUTDOWN in lnet_parse()

If the peer NI lookup in lnet_parse() fails with ESHUTDOWN then we
should return that value back to the LNDs so that they can treat the
failed call the same way as other lnet_parse() failures.

Returning zero results in at least one bug in socklnd where a
reference on a ksock_conn can be leaked which prevents socklnd from
shutting down.

Lustre-change: https://review.whamcloud.com/46711
Lustre-commit: 4fbd0705a3d25bbc85e953f81e697e5006b215ce

Fixes: 47b7b31978 ("LU-8106 lnet: Do not drop message when shutting down LNet")
Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-15794
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ic403619c6dccf3921c46a674808c404adad7a30e
Reviewed-on: https://review.whamcloud.com/48485
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-15616 lnet: ln_api_mutex deadlocks
Chris Horn [Mon, 7 Mar 2022 17:03:50 +0000 (11:03 -0600)]
LU-15616 lnet: ln_api_mutex deadlocks

LNetNIFini() acquires the ln_api_mutex and holds onto it throughout
various shutdown routines. Meanwhile, LND threads (via
lnet_nid2peerni_locked()) or the discovery thread (via
lnet_peer_data_present()) may need to acquire this mutex in order to
progress.

Address these potential deadlocks by setting the_lnet.ln_state to
LNET_STATE_STOPPING earlier in LNetNIFini(), and by releasing the mutex
prior to any call into an LND module or before any wait.

LNetNIInit() is modified to return -ESHUTDOWN if it finds that there
is a concurrent shutdown in progress.
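
The locking pattern can be sketched in Python as follows. This is a toy
model under stated assumptions, not the LNet code: the class Net, its
methods, and the state strings are hypothetical, and a single
non-reentrant lock stands in for ln_api_mutex.

```python
import errno
import threading

RUNNING, STOPPING, STOPPED = "running", "stopping", "stopped"


class Net:
    def __init__(self):
        self.api_mutex = threading.Lock()  # stands in for ln_api_mutex
        self.state = RUNNING

    def init(self):
        """Return 0, or -ESHUTDOWN if a shutdown is in progress."""
        with self.api_mutex:
            if self.state == STOPPING:
                return -errno.ESHUTDOWN
            self.state = RUNNING
            return 0

    def fini(self, lnd_shutdown):
        """Mark the state early, then call the LND hook without the mutex."""
        with self.api_mutex:
            self.state = STOPPING
        # The mutex is released here, so the hook may take it itself.
        lnd_shutdown()
        with self.api_mutex:
            self.state = STOPPED


net = Net()
results = []
# The hook calls back into init(), which needs the mutex.
net.fini(lambda: results.append(net.init()))
```

If fini() kept the mutex held across lnd_shutdown(), the callback into
init() in this example would block forever, which is the shape of the
deadlock described above.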

Lustre-change: https://review.whamcloud.com/46727
Lustre-commit: 22de0bd145b649768b16dd42559d326af3c13200

Test-Parameters: trivial
HPE-bug-id: LUS-10681
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ia8b28cc95ff71e66a0f99aed4f2c22ec9d44ce1e
Reviewed-on: https://review.whamcloud.com/48384
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-13806 lnet: Ensure proper peer, peer NI, peer net hierarchy
Chris Horn [Fri, 11 Dec 2020 18:04:32 +0000 (12:04 -0600)]
LU-13806 lnet: Ensure proper peer, peer NI, peer net hierarchy

The MR design dictates that the peer nets and peer NIs are ordered
such that the peer net and peer NI for a peer's primary NID appear
first, followed by other peer NIs in the primary NID's peer net,
followed by other peer nets/NIs. This ordering is broken and it can
result in tripping an assertion if the primary NID of a peer is
deleted. Modify lnet_peer_attach_peer_ni() to check whether the
NI being attached is the peer's primary, and place it, and its
associated peer net, appropriately.

Modify lnet_peer_set_primary_nid() so that it updates the
lp_primary_nid before calling lnet_peer_add_nid() so that
lnet_peer_attach_peer_ni() can detect the situation where the
primary is changing and act appropriately.

Finally, modify lnet_peer_merge_data() to enforce the hierarchy
after it has finished merging the contents of the ping buffer. This
ensures we maintain the correct hierarchy in certain edge cases where
we've needed to reconcile two peers. e.g. if a peer adds a new
interface, the discovery push may arrive from that new interface
which will result in a second peer object being created which will
need to be reconciled with the original peer object.
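
The required ordering can be illustrated with a small Python sketch. It
flattens the peer / peer net / peer NI hierarchy into a single list of
NID strings of the form address@net, which is a simplification, and the
function name order_peer_nids is hypothetical.

```python
def order_peer_nids(nids, primary):
    """Order NIDs: primary first, then its net's other NIDs, then the rest."""
    primary_net = primary.split("@", 1)[1]

    def rank(nid):
        if nid == primary:
            return 0
        if nid.split("@", 1)[1] == primary_net:
            return 1
        return 2

    # sorted() is stable, so NIDs with equal rank keep their input order.
    return sorted(nids, key=rank)


ordered = order_peer_nids(
    ["10.0.0.2@o2ib", "10.0.0.3@tcp", "10.0.0.1@tcp"],
    primary="10.0.0.1@tcp",
)
```

With 10.0.0.1@tcp as the primary NID, the result lists the primary first,
then the other tcp NID, then the o2ib NID.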

Lustre-change: https://review.whamcloud.com/40985
Lustre-commit: 9eb9474c41c823c70f34e6bb102a8861ca21a3d1

HPE-bug-id: LUS-9630
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I8397a24ba1ba0bba33846e7e97b8d60a8f26a1be
Reviewed-on: https://review.whamcloud.com/48508
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-15538 lnet: DLC sets map_on_demand incorrectly
Chris Horn [Sat, 5 Feb 2022 23:15:30 +0000 (23:15 +0000)]
LU-15538 lnet: DLC sets map_on_demand incorrectly

When any NET or LND tunable is specified via CLI or yaml, then the
whole tunables struct gets memset to 0, or in the case of yaml config,
0 gets assigned to any tunable that isn't specified in the yaml. This
causes a problem for map_on_demand because 0 is a valid value for that
parameter, and ko2iblnd cannot know whether the user specified that 0
should be used or if DLC is specifying that the parameter was unset.

Rather than setting this parameter to 0 in the LND tunables struct,
have DLC set it to UINT_MAX to indicate that ko2iblnd should use the
value of the kernel module parameter.
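
A short Python sketch of the sentinel approach, under the assumption of a
32-bit unsigned int; the function name effective_map_on_demand is
hypothetical and the real logic lives in C.

```python
UINT_MAX = 0xFFFFFFFF  # sentinel meaning "not set by the user" (32-bit assumed)


def effective_map_on_demand(tunable, module_param):
    """Use the module parameter when the tunable carries the sentinel."""
    return module_param if tunable == UINT_MAX else tunable
```

Because 0 is no longer overloaded to mean "unset", an explicit 0 from the
user is honoured while an unspecified value falls back to the module
parameter.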

Lustre-change: https://review.whamcloud.com/46492
Lustre-commit: 896f4a082b93453f5e7168f685faff4fba594ff3

Test-Parameters: trivial
HPE-bug-id: LUS-10740
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I303e64d4d402ba61b5ae3e3910873f192a4a2845
Reviewed-on: https://review.whamcloud.com/48491
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-4093 tests: hot-pools don't recreate pools
Alex Zhuravlev [Wed, 21 Sep 2022 00:40:46 +0000 (17:40 -0700)]
EX-4093 tests: hot-pools don't recreate pools

The test can save some time by skipping pool recreation in every
subtest.

before: 1371 seconds
after:  1058 seconds

Test-Parameters: trivial testlist=hot-pools

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I9304e29b6fc59dd68626b44844dc81500009a80f
Reviewed-on: https://review.whamcloud.com/48614
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-5824 test: hot-pools test_57: data copy failed: mirror failed
Alexandre Ioffe [Thu, 8 Sep 2022 08:37:31 +0000 (01:37 -0700)]
EX-5824 test: hot-pools test_57: data copy failed: mirror failed

Add debug prints in hot-pools test_57

Test-Parameters: trivial env=FAIL_ON_ERROR=false,ONLY=56-57 testlist=hot-pools

Change-Id: I863b580f5483c14c24c6f79ebdddbc782b65e945
Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-on: https://review.whamcloud.com/48477
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
2 years ago LU-14992 tests: sanity/replay-vbr mkdir on MDT0
James Nunez [Mon, 13 Sep 2021 16:35:30 +0000 (10:35 -0600)]
LU-14992 tests: sanity/replay-vbr mkdir on MDT0

Replace mkdir with mkdir_on_mdt0() for sanity test 133a
and replay-vbr test 7a.  These tests expect the newly
created directory to be on MDT0.

Lustre-change: https://review.whamcloud.com/44902/
Lustre-commit: TBD

Test-Parameters: trivial mdscount=2 mdtcount=4 testlist=sanity
Test-Parameters: env=SLOW=yes mdscount=2 mdtcount=4 testlist=replay-vbr
Change-Id: Icea2923a8d8d3a3aa0ddf0401f0a025480b2f6f0
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48606
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-13358 libcfs: add timeout to cfs_race() to fix race
Alex Zhuravlev [Tue, 30 Mar 2021 05:57:14 +0000 (08:57 +0300)]
LU-13358 libcfs: add timeout to cfs_race() to fix race

There is no guarantee that the branches in cfs_race() are executed
in strict order, so it is possible that the second branch (with
cfs_race_state=1) is executed before the first branch, and then another
thread executing the first branch gets stuck.

This construction is used for testing only, and as a
workaround it is enough to time out.
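
The hang and the timeout workaround can be modelled in Python as below.
This is a sketch, not the libcfs macro: the class Race and its methods are
hypothetical, and it assumes that the waiting branch resets the state
before it sleeps, which is what makes an early wakeup get lost.

```python
import threading


class Race:
    """Two-branch rendezvous in which the waiter resets the state first."""

    def __init__(self):
        self._cond = threading.Condition()
        self._state = 0

    def first_branch(self, timeout):
        """Reset the state and wait for the second branch, up to timeout."""
        with self._cond:
            self._state = 0
            return self._cond.wait_for(lambda: self._state != 0, timeout)

    def second_branch(self):
        """Mark the race point as reached and wake any waiter."""
        with self._cond:
            self._state = 1
            self._cond.notify_all()


race = Race()
race.second_branch()                    # runs too early, nobody is waiting
woken = race.first_branch(timeout=0.05)  # would hang without the timeout
```

In this out-of-order case first_branch() gives up after the timeout and
returns False instead of blocking indefinitely.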

Lustre-change: https://review.whamcloud.com/43161
Lustre-commit: 2d2d381f35ee004319a20f5d2d8e70d13480d6c7

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: Ie1cc0accedb3e1a198d4b17d1ab00ce298c560f2
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48553
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-14875 import: fix bad CPT read
Cyril Bordage [Thu, 17 Feb 2022 11:49:16 +0000 (12:49 +0100)]
LU-14875 import: fix bad CPT read

When importing, CPT was read from the tunables field, but in fact it is
at the same level in the YAML file generated during export.

Lustre-change: https://review.whamcloud.com/46541
Lustre-commit: 9ad5c43f4a53f8679cfa1a60f8161b08d3dcfa66

Test-parameters: trivial testlist=sanity-lnet

Signed-off-by: Cyril Bordage <cbordage@whamcloud.com>
Change-Id: Iea7b6189ad1a25b95ae6416d75ee2cbe4dca2fbf
Reviewed-on: https://review.whamcloud.com/48490
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-5798 tests: add a version check to conf-sanity.sh test_133
Emoly Liu [Fri, 9 Sep 2022 10:18:24 +0000 (18:18 +0800)]
EX-5798 tests: add a version check to conf-sanity.sh test_133

The patch at https://review.whamcloud.com/47334 has been in
b_es6_0 since 2.14.0-ddn46, so a version check is added to
conf-sanity.sh test_133 to avoid interop failure.
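
For illustration only, a version gate of this kind could look like the
following Python sketch. The real check is written with the test
framework's shell helpers; parse_version and server_has_fix are
hypothetical names, and the parser only understands versions of the form
X.Y.Z or X.Y.Z-ddnN.

```python
import re


def parse_version(version):
    """Turn '2.14.0-ddn46' into (2, 14, 0, 46); a missing suffix counts as 0."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-ddn(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def server_has_fix(server_version, first_fixed="2.14.0-ddn46"):
    """Return True if the server is at least the first release with the fix."""
    return parse_version(server_version) >= parse_version(first_fixed)
```

A test would then be skipped when server_has_fix() is false, for example
for the 2.14.0-ddn23 server named in the Test-Parameters line.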

Test-Parameters: trivial testlist=conf-sanity serverversion=2.14.0-ddn23

Change-Id: I4bfc2986abddfd3a5a606f5586a29311582fca42
Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48501
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Sarah Liu <sarah@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16131 build: Do not depend on libmount during --enable-dist
Shaun Tancheff [Wed, 7 Sep 2022 04:35:51 +0000 (21:35 -0700)]
LU-16131 build: Do not depend on libmount during --enable-dist

Defer the libmount requirement when using --enable-dist to
generate the lustre-src.rpm.

This allows mock and/or yum build-deps to resolve
dependencies and pick up the libmount requirement without changing
the existing minimal build.

Lustre-change: https://review.whamcloud.com/48407
Lustre-commit: 819c8b169325045ae8bac9c4f38a58c75e22d099

Test-Parameters: trivial
HPE-bug-id: LUS-11091
Fixes: f21b944127 ("LU-15940 build: add a required dependency for libmount")
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: I20a7a097f9b651b6ea5519f79efda6c96b6f2199
Reviewed-on: https://review.whamcloud.com/48448
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16085 llite: fix stat attributes_mask
Sebastien Buisson [Fri, 12 Aug 2022 07:59:02 +0000 (09:59 +0200)]
LU-16085 llite: fix stat attributes_mask

Fix stat attributes_mask to return STATX_ATTR_ENCRYPTED whenever it is
possible. Also fix sanityn test_106c to expect at least the 0x30 flag
for attributes_mask.

Lustre-change: https://review.whamcloud.com/48208
Lustre-commit: 0e48653c27eacad29dbff1589da771ad4f5d1014
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
LU-16085 tests: fix sanityn test_106c

Fix sanityn test_106c after modification introduced when fixing
stat attributes_mask.

Lustre-change: https://review.whamcloud.com/48435
Lustre-commit: b843e8f89fe9b697ceec4657dde445aa60c200d0

Test-Parameters: trivial testlist=sanityn env=ONLY=106c
Fixes: 0e48653c27 ("LU-16085 llite: fix stat attributes_mask")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: Icd16beff058c42d77e9b04ad1a287ec2ac04dfed
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48520
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
2 years ago LU-16052 llog: handle -EBADR for catalog processing
Mikhail Pershin [Fri, 29 Jul 2022 08:24:15 +0000 (11:24 +0300)]
LU-16052 llog: handle -EBADR for catalog processing

Llog catalog processing might retry to get the last llog block
to check for new records if any. That might return -EBADR code
which should be considered as valid. Previously -EIO was
returned in all cases.

Run conf-sanity test_106 several times as specific test

Lustre-change: https://review.whamcloud.com/48070
Lustre-commit: e260f751f2a21fa126eeb4bc9e94250ba3e815f1

Test-Parameters: testlist=conf-sanity env=ONLY=106,SLOW=yes,ONLY_REPEAT=10
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I30e04ba2c91c8bdce72c95675a1209639e9f0570
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Etienne AUJAMES <eaujames@ddn.com>
Reviewed-on: https://review.whamcloud.com/48540
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
2 years ago LU-16084 tests: fix lustre-patched filefrag check
Andreas Dilger [Wed, 10 Aug 2022 18:27:56 +0000 (12:27 -0600)]
LU-16084 tests: fix lustre-patched filefrag check

Fix sanity test_130b thru test_130g to check for "filefrag -l"
instead of "filefrag -e", since the "-e" option has been in
upstream e2fsprogs since commit v1.42.6-50-g2508eaa7.  The "-l"
option (logical extent ordering) is really what is needed to
handle Lustre-striped files anyway.

While there, fix the code style in these subtests:
- use "local" and lower-case names for local variables
- use $(...) for subshells
- use (( ... )) for numeric comparisons
- use preferred "check || action" style checks
- use "skip_env" for environment configuration checks (e2fsprogs)
- use "skip" for test-related checks that can't be "fixed"
- use pre-defined $ost1_FSTYPE for checking OST filesystem type

Lustre-change: https://review.whamcloud.com/48188
Lustre-commit: fef1db004c4230e1051f9266f34a658501bf5d03

Test-Parameters: trivial
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I8eb7f17a9532796ab0274247194dd52cbc8a141c
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48555
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16082 ldiskfs: old-style EA inode fix for el8.5/el8.6
Andreas Dilger [Tue, 20 Sep 2022 18:58:35 +0000 (11:58 -0700)]
LU-16082 ldiskfs: old-style EA inode fix for el8.5/el8.6

Add the rhel8/ext4-old_ea_inodes_handling_fix.patch to the ldiskfs
series for el8.5 and el8.6 kernels.

Lustre-change: https://review.whamcloud.com/48496
Lustre-commit: ba9845274c8ea5c55f57b7fa0e839f18d76031ea

Test-Parameters: trivial testlist=sanity clientdistro=el8.6 serverdistro=el8.6
Fixes: 76c3fa96dc30 ("LU-16082 ldiskfs: old-style EA inode handling fix")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ifb66a0b7d78e5153d7897bee45fbf1d0e58fbc5c
Reviewed-on: https://review.whamcloud.com/48612
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
2 years ago EX-5978 scripts: remove zfsobj2fid
Jian Yu [Wed, 21 Sep 2022 20:28:43 +0000 (13:28 -0700)]
EX-5978 scripts: remove zfsobj2fid

The zfsobj2fid utility is not needed on EXA cluster.

Test-Parameters: trivial clientdistro=el9.0 \
env=SANITY_EXCEPT="101j 130 244a" testlist=sanity

Change-Id: I40993c7c4ddef3f389c002076f5c118a9f610758
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48621
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Gaurang Tapase <gtapase@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-5975 build: check OS type before using dpkg
Jian Yu [Wed, 21 Sep 2022 07:41:33 +0000 (00:41 -0700)]
EX-5975 build: check OS type before using dpkg

Bright Cluster Manager by default installs dpkg
on its CentOS/RHEL installation, presumably to
allow provisioning Debian nodes in the cluster,
so dpkg is in the path and can't be removed.

This patch fixes LB_USES_DPKG to check OS type
before checking if dpkg is installed.
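
A rough Python sketch of the ordering of the two checks. The patch itself
changes an autoconf macro, and how it determines the OS type is not stated
here; reading the ID and ID_LIKE fields of /etc/os-release is an
assumption, as is the function name uses_dpkg.

```python
def uses_dpkg(os_release_text, dpkg_in_path):
    """Return True only on a Debian-family OS that also has dpkg installed."""
    fields = {}
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    # Collect the distribution ID plus every family it claims to resemble.
    family = {fields.get("ID", "")} | set(fields.get("ID_LIKE", "").split())
    return bool(family & {"debian", "ubuntu"}) and dpkg_in_path
```

On a CentOS host that merely has dpkg on the PATH, the OS check fails
first and the dpkg presence is never the deciding factor.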

Test-Parameters: trivial clientdistro=el8.6
Test-Parameters: trivial clientdistro=ubuntu2204 env=SANITY_EXCEPT="130 244a"

Change-Id: Idc9f6edc91f9c89b40f259421b088287e08bfe9c
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48616
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Gaurang Tapase <gtapase@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16090 build: Module.symvers lookup by flavor on SUSE
Shaun Tancheff [Wed, 14 Sep 2022 07:48:16 +0000 (00:48 -0700)]
LU-16090 build: Module.symvers lookup by flavor on SUSE

When multiple kernel flavors are found we need to select only
the Module.symvers for the flavor that is being built.

Lustre-change: https://review.whamcloud.com/48195
Lustre-commit: f3a9921ae4f9c3e48328f2c682e0c7e61221e0d3

HPE-bug-id: LUS-11149
Test-Parameters: trivial
Fixes: 1f4aaefe1aae ("LU-15962 build: add in-kernel Module.symvers to symbol path")
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: I1c9af91108534d3a67f816077756fded4cd0b653
Reviewed-on: https://review.whamcloud.com/48329
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16059 build: Installation of dkms server builds
Shaun Tancheff [Mon, 19 Sep 2022 19:11:55 +0000 (12:11 -0700)]
LU-16059 build: Installation of dkms server builds

The lustre-zfs-dkms package is passing the wrong paths
for zfs [and spl] causing the dkms build to fail.

ZFS_VERSION is not parsed correctly from 'dkms status'.

The splver and zfsver check can match against the wrong
package(s).

lustre-zfs-dkms provides: kmod-lustre-osd-zfs, and
                          lustre-osd-zfs-mount
lustre-ldiskfs-dkms provides: kmod-lustre-osd-ldiskfs and
                              lustre-osd-ldiskfs-mount

In the case of multiple zfs versions installed, build lustre
osd against the highest version number.
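
Picking the highest version needs a numeric comparison rather than a
string comparison. A minimal Python sketch, assuming purely numeric dotted
version strings (the function name highest_version is hypothetical and
the parsing of 'dkms status' output is not shown):

```python
def highest_version(versions):
    """Return the numerically highest dotted version string."""
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


best = highest_version(["2.1.5", "2.1.11", "0.8.6"])
```

A plain string max() over the same list would pick "2.1.5", because "5"
sorts after "1" character by character.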

Lustre-change: https://review.whamcloud.com/48083
Lustre-commit: c3dc67b2c5bf1974d792b3701d932bd04c756bd8

HPE-bug-id: LUS-11113
Test-Parameters: trivial
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: Ic154ca045427bf26cb7e6a44b8c467675e987aad
Reviewed-on: https://review.whamcloud.com/48594
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16089 kernel: kernel update RHEL 7.9 [3.10.0-1160.76.1.el7]
Jian Yu [Mon, 22 Aug 2022 02:11:08 +0000 (19:11 -0700)]
LU-16089 kernel: kernel update RHEL 7.9 [3.10.0-1160.76.1.el7]

Update RHEL 7.9 kernel to 3.10.0-1160.76.1.el7.

Lustre-change: https://review.whamcloud.com/48202
Lustre-commit: 94955bbc6dc82b43fd77150b82834132bc56f565

Test-Parameters: trivial clientdistro=el7.9 serverdistro=el7.9

Change-Id: I97d087a5d5bb27996a5c0caf382c011928c651b4
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48277
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-16000 utils: align updatelog parameters in llog_reader
Etienne AUJAMES [Wed, 14 Sep 2022 20:17:24 +0000 (13:17 -0700)]
LU-16000 utils: align updatelog parameters in llog_reader

Parameters in update log records are aligned on 64 bits. llog_reader
does not align these parameters: if a parameter's size is not a multiple
of 8, the next parameter's size will be read incorrectly.
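
The alignment arithmetic can be sketched in Python as follows. The helper
names align8 and param_offsets are hypothetical, and the layout is
simplified to payloads only, without the per-parameter headers of the
real record format.

```python
def align8(size):
    """Round size up to the next multiple of 8 bytes (64 bits)."""
    return (size + 7) & ~7


def param_offsets(sizes, start=0):
    """Return the start offset of each parameter, 8-byte aligned."""
    offsets = []
    offset = start
    for size in sizes:
        offsets.append(offset)
        offset += align8(size)
    return offsets


offsets = param_offsets([5, 8, 13])
```

A reader that advanced by the raw size would look for the second
parameter at offset 5 instead of 8, which is the misread described above.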

Lustre-change: https://review.whamcloud.com/47913
Lustre-commit: 6d74b759634355e7f6647ccaefef519a1ff208e2

Test-Parameters: trivial
Fixes: 9962d6f ("LU-14617 utils: llog_reader updatelog support")
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Signed-off-by: Etienne AUJAMES <etienne.aujames@cea.fr>
Change-Id: I6871614ab4ea79d59c3c3b4644b377de395bad56
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48551
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
2 years ago LU-15724 tests: MDT failover hang reproducer
Alexander Boyko [Wed, 14 Sep 2022 20:13:58 +0000 (13:13 -0700)]
LU-15724 tests: MDT failover hang reproducer

The patch adds recovery-small 144a test to reproduce
MDT failover hang when precreate threads are blocked on objects.

LustreError: 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID
namespace with 46 resources in use, (rc=-110)

Lustre-change: https://review.whamcloud.com/47006
Lustre-commit: aa6250b7412e7baf6760fe4010a81f4f22187127

Test-Parameters: trivial testlist=recovery-small env=ONLY=144a
HPE-bug-id: LUS-10750
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: I2743a1b5c8911d6982b527f7e7b7bbbaf310cd04
Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Reviewed-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/48550
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-15724 osp: wakeup all precreate threads
Alexander Boyko [Wed, 14 Sep 2022 19:56:07 +0000 (12:56 -0700)]
LU-15724 osp: wakeup all precreate threads

A number of threads could sleep in osp_precreate_reserve() and
wait for objects from the OST. When the MDT stops, Lustre should wake up
all of these threads. When opd_pre_recovering is set, any wakeup of
opd_pre_user_waitq is useless. Failover of the MDT does not produce
a disconnect event, only an inactive one, so osp_precreate_cleanup_orphans()
cannot be awakened.

LustreError: 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID
namespace with 46 resources in use, (rc=-110)

 schedule_timeout at ffffffff8e551cd3
 osp_precreate_reserve at ffffffffc17d2d83 [osp]
 osp_declare_create at ffffffffc17c7eb9 [osp]
 lod_sub_declare_create at ffffffffc156415b [lod]
 lod_qos_declare_object_on at ffffffffc155bf42 [lod]
 lod_ost_alloc_rr.constprop.23 at ffffffffc155db2f [lod]
 lod_qos_prep_create at ffffffffc15630a6 [lod]
 lod_declare_instantiate_components at ffffffffc154b237 [lod]

Lustre-change: https://review.whamcloud.com/47005
Lustre-commit: e55fc043679cdfadfff6874ef78e2e0128ec37ac

HPE-bug-id: LUS-10750
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: If0164cfbecb1e358d9857421cb234559dc8cecbc
Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Reviewed-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/48546
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15555 ldiskfs: large directory causes htree corruption
Andrew Perepechko [Wed, 14 Sep 2022 19:50:51 +0000 (12:50 -0700)]
LU-15555 ldiskfs: large directory causes htree corruption

When creating a lot of files in a single directory, it can
get corrupted because of a typo in ext4-kill-dx-root.patch.

Lustre-change: https://review.whamcloud.com/46526
Lustre-commit: ea3ee9337f9bcd42360e4523f1e34bcd04d3bf41

Change-Id: Ia36278580741e1eb905e24a3a6231ba7daaa882a
Fixes: 20a6d32 ("LU-12637 kernel: RHEL 8.1 server support")
HPE-bug-id: LUS-10730
Signed-off-by: Andrew Perepechko <c17827@cray.com>
Signed-off-by: Alexander Zarochentsev <c17826@cray.com>
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-on: https://review.whamcloud.com/48545
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-5380 lipe: wait longer before restarting the access log reader
John L. Hammond [Tue, 14 Jun 2022 13:46:45 +0000 (08:46 -0500)]
EX-5380 lipe: wait longer before restarting the access log reader

In lamigo_alr_data_collection_thread() if the access log reader exits
with status zero then it means that no OSTs are mounted on the
host. In this case we should wait longer before restarting the access
log reader.

Lustre-change: https://review.whamcloud.com/47627
Lustre-commit: 27c05f8cb39a8bf8d9e9386841fc7ecd700cf0fb

Test-Parameters: trivial testlist=hot-pools
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: I282c6b8e251c432664bc3b4eb202351a5bd7fe5b
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48380
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Colin Faber <cfaber@ddn.com>
2 years agoLU-14305 ldiskfs: add parameters for mb_c123_threshold
Artem Blagodarenko [Thu, 8 Sep 2022 03:13:07 +0000 (23:13 -0400)]
LU-14305 ldiskfs: add parameters for mb_c123_threshold

Add mount options for /sys/fs/ldiskfs/*/mb_c[123]_threshold values
so that they can be set persistently via mount options.

The /sys/fs/ldiskfs/*/mb_c[123]_threshold values are always shown
rounded down to the next lower percentage value due to integer
division, since internal values are stored as blocks for efficiency.

Round up the values shown to the next percent to match what was
used to originally set these parameters.
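
The rounding effect can be illustrated with a small sketch. The helper names below are hypothetical and this is not the ldiskfs code; it only shows why a percentage stored as blocks reads back one lower with truncating division and matches the original setting when rounded up. Overflow of the intermediate product is ignored for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a percentage threshold into blocks, as when the value is set. */
uint64_t pct_to_blocks(uint64_t total_blocks, unsigned int pct)
{
	assert(pct <= 100);
	return total_blocks * pct / 100; /* integer division truncates */
}

/* Read back by rounding down: may show one percent less than was set. */
unsigned int blocks_to_pct_floor(uint64_t total_blocks, uint64_t blocks)
{
	assert(total_blocks > 0);
	return (unsigned int)(blocks * 100 / total_blocks);
}

/* Read back by rounding up to the next percent. */
unsigned int blocks_to_pct_ceil(uint64_t total_blocks, uint64_t blocks)
{
	assert(total_blocks > 0);
	return (unsigned int)((blocks * 100 + total_blocks - 1) / total_blocks);
}
```

For example, with 1030 total blocks a setting of 25% is stored as 257 blocks, which reads back as 24% when rounded down and as 25% when rounded up.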

Lustre-change: https://review.whamcloud.com/41193
Lustre-commit: c2fd5297b46c4973aeda4d4d02cbc7ca2faa0d50

Fixes: 95f8ae567749 ("LU-12103 ldiskfs: don't search large block range if disk full")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Artem Blagodarenko <ablagodarenko@whamcloud.com>
Change-Id: Ie36a6667f8bca7481aa8179ab5b97c85d449d619
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/41955
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48499

2 years agoLU-15003 sec: use enc pool for bounce pages
Sebastien Buisson [Fri, 25 Mar 2022 08:24:32 +0000 (09:24 +0100)]
LU-15003 sec: use enc pool for bounce pages

Take pages from the enc pool so that they can be used for
encryption, instead of letting llcrypt allocate a bounce page
for every call to the encryption primitives.
Pages are taken from the enc pool a whole array at a time.

This requires modifying the llcrypt API, so that new functions
llcrypt_encrypt_page() and llcrypt_decrypt_page() are exported.
These functions take a destination page parameter.
Until this change is pushed in upstream fscrypt, this performance
optimization is not available when Lustre is built and run against
the in-kernel fscrypt lib.

Using enc pool for bounce pages is a worthwhile performance win. Here
are performance penalties incurred by encryption, without this patch,
and with this patch:

||===================|======================|=====================||
||                   | Performance penalty  | Performance penalty ||
||                   |    without patch     |     with patch      ||
||===================|======================|=====================||
|| Bandwidth – write |        30%-35%       |   5%-10% large IOs  ||
||                   |                      |    15% small IOs    ||
||-------------------|----------------------|---------------------||
|| Bandwidth – read  |         20%          |    less than 10%    ||
||-------------------|----------------------|---------------------||
||      Metadata     |         N/A          |         5%          ||
|| creat,stat,remove |                      |                     ||
||===================|======================|=====================||

Lustre-change: https://review.whamcloud.com/47149
Lustre-commit: f3fe144b8572e9e75bb55076e29057227476ebf5

Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Change-Id: I3078d0a3349b3d24acc5e61ab53ac434b5f9d0e3
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/47513
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
2 years agoLU-14719 osp: add inode watermark
Lai Siyao [Fri, 1 Apr 2022 19:58:08 +0000 (15:58 -0400)]
LU-14719 osp: add inode watermark

* move block watermark from debugfs to sysfs.
* add inode watermark for OSP.

Lustre-change: https://review.whamcloud.com/47128
Lustre-commit: 336eb696299e1c9731bd1443f05e5d814314ed36

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: I7c768fa2ebfb4b8c2f75255f9e9c061d4c15cf66
Reviewed-on: https://review.whamcloud.com/47866
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-16161 kernel: kernel update RHEL8.6 [4.18.0-372.26.1.el8_6]
Jian Yu [Fri, 16 Sep 2022 06:49:21 +0000 (23:49 -0700)]
LU-16161 kernel: kernel update RHEL8.6 [4.18.0-372.26.1.el8_6]

Update RHEL8.6 kernel to 4.18.0-372.26.1.el8_6.

Lustre-change: https://review.whamcloud.com/48564
Lustre-commit: TBD (from 66b1b4469d6e5e65b450702c6cb68ec14a51e9b0)

Test-Parameters: trivial fstype=ldiskfs \
clientdistro=el8.6 serverdistro=el8.6 testlist=sanity

Test-Parameters: trivial fstype=zfs \
clientdistro=el8.6 serverdistro=el8.6 testlist=sanity

Change-Id: I45bf6dbff5061407e1109732b6d466d0f7a8376c
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48575
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-4359 build: add bio-integrity patch to rhel8 series
Andreas Dilger [Thu, 30 Jun 2022 23:36:07 +0000 (17:36 -0600)]
EX-4359 build: add bio-integrity patch to rhel8 series

Add bio-integrity-unbound-concurrency patch to the rhel8.5 and
rhel8.6 series to ensure balanced T10-PI core usage.

Test-Parameters: trivial serverdistro=el8.5 clientdistro=el8.5 testlist=sanity,conf-sanity
Test-Parameters: trivial serverdistro=el8.6 clientdistro=el8.6 testlist=sanity,conf-sanity

Fixes: 97fba9aa48ca ("DDN-2042 bio: allow BIO integrity to run on any core")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I31f9ced4eadad105466556183e2b9e9e0419164d
Reviewed-on: https://review.whamcloud.com/47848
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
2 years agoLU-15795 lbuild: enable KABI
Minh Diep [Thu, 8 Sep 2022 19:54:56 +0000 (12:54 -0700)]
LU-15795 lbuild: enable KABI

Enable the KABI build and clean up the kmodtool patch.

Lustre-change: https://review.whamcloud.com/47507
Lustre-commit: TBD (from 03fc87a2ba08e5c4b8b8787f19b4e736d2752fae)

Test-Parameters: trivial fstype=ldiskfs clientdistro=el8.5 serverdistro=el8.5
Test-Parameters: trivial fstype=ldiskfs clientdistro=el8.6 serverdistro=el8.6

Change-Id: I16d54af0004c4ddc1cc5e6acca81e4aa89a1a1c1
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48486
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14642 flr: allow layout version update from client/MDS
Bobi Jam [Wed, 13 Apr 2022 15:15:22 +0000 (23:15 +0800)]
LU-14642 flr: allow layout version update from client/MDS

A client write/punch request always carries its layout version so
that the OFD can reject the request if the carried layout version
is stale.

This patch allows the MDS as well as the client to push a new layout
version to OST objects. During a resync write, all OST objects get
their layout version updated.

Lustre-change: https://review.whamcloud.com/45443
Lustre-commit: fa6574150b6f745a668fe69b2d6d970068

Fixes: 7d97777a5d ("LU-14642 flr: abolish MDS transfer layout version to OST")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Change-Id: I9f27af354875d48adda3361f6c8ea5a5f6def73b
Reviewed-on: https://review.whamcloud.com/47097
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-9699 osp: don't assert on OSP duplicating
Jadhav Vikram [Tue, 25 Jul 2017 07:01:37 +0000 (12:31 +0530)]
LU-9699 osp: don't assert on OSP duplicating

Writeconf on an MDT with index > 0000 will cause
"add mdc" to be added to $FSNAME-client config
and "add osp" to be added to $FSNAME-MDTXXXX configs.

However, the configs may already contain these
directives. Duplicating the OSP device will
cause an assertion failure in osp_obd_connect():
ASSERTION( osp->opd_connects == 1 ) failed

Duplicating the MDC just returns -EEXIST in a similar
situation.

A possible solution is to check configs for duplicates
before writing to them. However, sometimes we
would like to change nids which are part of
"add mdc" and "add osp".

Another solution is to mark previous entries with
SKIP flags. This patch implements this approach.
Since after revoking the config lock, the clients
and the MDTs will receive the updated log and
apply its newer entries, we still have to handle
OSP duplication, but this is only an issue
immediately after writeconf processing.
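
The SKIP approach can be sketched as follows. The record layout, the REC_SKIP flag and cfg_skip_previous() are hypothetical names for illustration and do not reflect the on-disk config log format: before a new directive is appended, every earlier live record with the same directive is flagged so that log consumers ignore it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REC_SKIP 0x1u /* hypothetical "ignore this record" flag */

/* Hypothetical in-memory view of one config log record. */
struct cfg_rec {
	const char  *directive; /* e.g. "add osp" */
	unsigned int flags;
};

/*
 * Flag every live record that carries the given directive.
 * Returns the number of records newly marked as skipped.
 */
size_t cfg_skip_previous(struct cfg_rec *log, size_t n, const char *directive)
{
	size_t i, skipped = 0;

	assert(directive != NULL);
	for (i = 0; i < n; i++) {
		if (!(log[i].flags & REC_SKIP) &&
		    strcmp(log[i].directive, directive) == 0) {
			log[i].flags |= REC_SKIP;
			skipped++;
		}
	}
	return skipped;
}
```

Because old entries are skipped rather than compared, a rewritten entry may carry different NIDs without tripping a duplicate check.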

Lustre-change: https://review.whamcloud.com/27753
Lustre-commit: 98f107b53e4daa3bfaf026c379c0a9c41cb5f161

Seagate-bug-id: MRP-2634, MRP-3865
Change-Id: Idd7ad43c78d50e6bbe715850503aa0b01fcbf071
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48515
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
2 years agoLU-15262 osd: bio_integrity_prep_fn return value processing
Alexey Lyashkov [Fri, 16 Sep 2022 20:41:42 +0000 (13:41 -0700)]
LU-15262 osd: bio_integrity_prep_fn return value processing

osd_bio_integrity_handle() in lustre/osd-ldiskfs/osd_io.c checks the
return code of bio_integrity_prep_fn(). However, the kernel integrity
API changed between mainline Linux 4.12 and 4.13: in 4.13+ (as well as
in every RHEL8 kernel, including the first beta),
bio_integrity_prep() returns boolean true on success.
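
The mismatch can be shown with a standalone sketch. The functions below are hypothetical stand-ins, not kernel or Lustre APIs, and the old-style convention (0 on success, negative errno on failure) is stated here as an assumption about the pre-4.13 behaviour. The point is that a "nonzero means error" test misreads a boolean success.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed old-style convention: 0 on success, negative errno on failure. */
int prep_old_style(bool ok)
{
	return ok ? 0 : -5;
}

/* New-style convention from the text above: boolean true on success. */
bool prep_new_style(bool ok)
{
	return ok;
}

/* Correct failure test for an old-style return value. */
bool failed_old(int rc)
{
	assert(rc <= 0);
	return rc != 0;
}

/* Correct failure test for a new-style return value. */
bool failed_new(bool rc)
{
	return !rc;
}

/* The bug class: a new-style result checked with the old-style test. */
bool misread_new_as_old(bool rc)
{
	return (int)rc != 0;
}
```

Here misread_new_as_old(prep_new_style(true)) reports a failure even though the call succeeded, which is the kind of inversion the patch addresses.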

Lustre-change: https://review.whamcloud.com/45646
Lustre-commit: 41c813d14ec9b353f9cf5ac82638996dcb5273d7

HPE-bug-id: LUS-10443
Signed-off-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Change-Id: I973aa8ccae024157ad863d26afc7b1264a5c7149
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com>
Reviewed-on: https://review.whamcloud.com/48582
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
2 years agoRM-620 build: New tag 2.14.0-ddn60
Andreas Dilger [Fri, 9 Sep 2022 01:46:21 +0000 (19:46 -0600)]
RM-620 build: New tag 2.14.0-ddn60

New tag 2.14.0-ddn60

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ib500a2a5f4677f496380750ff0ca3eee7eff1b57

2 years agoLU-15860 socklnd: Duplicate ksock_conn_cb
Chris Horn [Thu, 12 May 2022 18:16:10 +0000 (13:16 -0500)]
LU-15860 socklnd: Duplicate ksock_conn_cb

If two threads enter ksocknal_add_peer(), the first one to acquire
the ksnd_global_lock will create a ksock_peer_ni and associate a
ksock_conn_cb with it.

When the second thread acquires the ksnd_global_lock it will find the
existing ksock_peer_ni, but it does not check for an existing
ksock_conn_cb. As a result, it overwrites the existing ksock_conn_cb
(ksock_peer_ni::ksnp_conn_cb) and the ksock_conn_cb from the first
thread becomes stranded.

Modify ksocknal_add_peer() to check whether the peer_ni has an
existing ksock_conn_cb associated with it.
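
A userspace sketch of the check described above, assuming a pthread mutex in place of ksnd_global_lock. The types and peer_attach_conn_cb() are hypothetical and are not the socklnd code: the second caller finds the existing conn_cb under the lock and discards its own copy instead of overwriting the pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct conn_cb { int id; };                   /* stands in for ksock_conn_cb */
struct peer_ni { struct conn_cb *conn_cb; };  /* stands in for ksock_peer_ni */

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Attach new_cb to the peer unless one is already attached.
 * Returns the conn_cb that is attached afterwards; an unused new_cb
 * is freed so that it cannot be stranded.
 */
struct conn_cb *peer_attach_conn_cb(struct peer_ni *peer, struct conn_cb *new_cb)
{
	struct conn_cb *attached;

	assert(peer != NULL && new_cb != NULL);
	pthread_mutex_lock(&global_lock);
	if (peer->conn_cb == NULL) {
		peer->conn_cb = new_cb;
		new_cb = NULL; /* ownership moved to the peer */
	}
	attached = peer->conn_cb;
	pthread_mutex_unlock(&global_lock);

	free(new_cb); /* free(NULL) is a no-op */
	return attached;
}
```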

Lustre-change: https://review.whamcloud.com/47361
Lustre-commit: 0c91d49a44e1214b5c65d4a557f6969b3d217881

Fixes: 7766f01e89 ("LU-13641 socklnd: replace route construct")
HPE-bug-id: LUS-10956
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I6c0190a0c1d3321ddd85c763b86ad1f0d32cf2b9
Reviewed-on: https://review.whamcloud.com/48259
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15234 lnet: Race on discovery queue
Chris Horn [Mon, 29 Nov 2021 17:38:48 +0000 (11:38 -0600)]
LU-15234 lnet: Race on discovery queue

If the discovery thread clears the LNET_PEER_DISCOVERING bit then a
race window opens when the discovery thread drops the
lnet_peer.lp_lock spinlock and closes when the discovery thread
acquires the lnet_net_lock. If another thread queues the peer for
discovery during this window then the LNET_PEER_DISCOVERING bit is
added back to the peer state, but since the peer is already on the
lnet.ln_dc_working queue, it does not get added to the
lnet.ln_dc_request queue.

When the discovery thread acquires the lnet_net_lock/EX, it sees that
the LNET_PEER_DISCOVERING bit has not been cleared, so it does not
call lnet_peer_discovery_complete() which is responsible for sending
messages on the peer's discovery pending queue.

At this point, the peer is stuck on the lnet.ln_dc_working queue, and
messages may continue to accumulate on the peer's
lnet_peer.lp_dc_pendq.

Fix the issue by re-working the main discovery thread loop so that we
do not release the lnet_peer.lp_lock until after we've determined
whether we need to call lnet_peer_discovery_complete().
This ensures that the lnet_peer is correctly removed from the
discovery work queue and any messages on the peer's
lnet_peer.lp_dc_pendq are sent or finalized.

It is also possible for the lnet_peer.lp_dc_error to be cleared
during the aforementioned window, as well as during the time when
lnet_peer_discovery_complete() is processing the contents of the
lnet_peer.lp_dc_pendq. This could prevent messages on the
lnet_peer.lp_dc_pendq from being correctly finalized. To fix this
issue, the responsibilities of lnet_peer_discovery_error() were
incorporated into lnet_peer_discovery_complete().

Lustre-change: https://review.whamcloud.com/45670
Lustre-commit: 852a4b264a984979dcef1fbd4685cab1350010ca

Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-10615
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I3779a342de7108105c2fd2bc41373560e8e5ef14
Reviewed-on: https://review.whamcloud.com/48313
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14941 lnet: Fix source specified to routed destination
Chris Horn [Thu, 12 Aug 2021 21:16:05 +0000 (16:16 -0500)]
LU-14941 lnet: Fix source specified to routed destination

If a source NI is specified for a send then we should not modify the
destination NID that was passed to lnet_send().

Lustre-change: https://review.whamcloud.com/44730
Lustre-commit: 98da4ace43a6c4c59e7981bf0fb649005237d12f

Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-10301
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ie47558d5bce97a0dca30ff7d072dcd39eb903324
Reviewed-on: https://review.whamcloud.com/48441
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14940 lnet: Fix source specified send to different net
Chris Horn [Thu, 12 Aug 2021 21:08:44 +0000 (16:08 -0500)]
LU-14940 lnet: Fix source specified send to different net

The destination NI is fixed for all source-specified sends. Thus, in
order for a source-specified send to be considered "local", i.e. a
send that does not require a route, the destination NID must be on
the same net as the specified source.

Lustre-change: https://review.whamcloud.com/44728
Lustre-commit: 3e3563f719ce89de28d276f3de1add064932506b

HPE-bug-id: LUS-10303
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I4847db1d393bbc36def65123f260b928ebbf944e
Reviewed-on: https://review.whamcloud.com/48440
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14660 lnet: Fix destination NID for discovery PUSH
Chris Horn [Fri, 29 Jan 2021 14:08:08 +0000 (17:08 +0300)]
LU-14660 lnet: Fix destination NID for discovery PUSH

If we're sending a discovery PUSH after receiving a discovery
REPLY then we want to send via the same NID that the reply was
sent to. This introduces a challenge in selecting an appropriate
destination NID for the PUSH because lnet_select_pathway() will not
run the MR selection algorithm for choosing a peer NI if the source
NI has been specified.

It is reasonable to assume that the NID used by the message
originator in sending the REPLY is a suitable destination for the
discovery PUSH. Thus, we record this NID in the same location we
currently record the lp_disc_src_nid, and use it when sending the
PUSH. With this change, the only other user of lnet_peer_select_nid()
is lnet_peer_send_ping(). In the ping case we do not set a source NID,
so lnet_select_pathway() is free to choose any peer NI. So this change
allows us to get rid of lnet_peer_select_nid() altogether.

Alternatively, we would need to reproduce a lot of the path selection
algorithm inside lnet_peer_select_nid() in order to avoid sending to
unhealthy NIDs. It seems undesirable and unnecessary to duplicate that
logic.

Lustre-change: https://review.whamcloud.com/43507
Lustre-commit: dce2f7d1987711dfdced903b13e67091cffe9628

Test-Parameters: trivial
HPE-bug-id: LUS-9333
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I47ef856075f049d71c395565974204b8f6fa9003
Reviewed-on: https://review.whamcloud.com/48439
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13950 lnet: do not crash if lnet_sock_getaddr returns error
Artem Blagodarenko [Tue, 25 Aug 2020 10:01:11 +0000 (06:01 -0400)]
LU-13950 lnet: do not crash if lnet_sock_getaddr returns error

Some network issues lead to a panic in ksocknal_accept():

rc = lnet_sock_getaddr(sock, true, &peer_ip, &peer_port);
LASSERT(rc == 0); /* we succeeded before */

Let's pass this error to the caller.

Lustre-change: https://review.whamcloud.com/39834
Lustre-commit: 48a9ea82eb30bbbf66cce527c1205d13fbd4eb58

Test-Parameters: trivial testlist=sanity-lnet
Change-Id: I34d43c19b4e75422db50e7abb02cac3510882b0d
HPE-bug-id: LUS-9256
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-on: https://review.whamcloud.com/48443
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14206 lnet: Router ping timeout with discovery disabled
Chris Horn [Wed, 9 Dec 2020 20:38:57 +0000 (14:38 -0600)]
LU-14206 lnet: Router ping timeout with discovery disabled

Discovery pings are used to determine the health of gateways and
associated routes. Ping replies from gateways with dynamic discovery
(DD) disabled (or if DD is disabled locally) are handled in
a special routine, lnet_router_discovery_ping_reply(), but this
function and related code doesn't handle the case where a discovery
ping hits the response tracker timeout and is unlinked by the
monitor thread. In this case, an UNLINK event is generated and we
do not call the lnet_router_discovery_ping_reply(). For gateways
with DD enabled (and DD enabled locally), we handle this case
in lnet_router_discovery_complete(). If discovery failed then
lp_dc_error is set and we mark all routes down for the gateway. We
can simply extend this logic to the case of gateways w/DD disabled
(or DD disabled locally).

Lustre-change: https://review.whamcloud.com/40923
Lustre-commit: 173d86c6e9a704a84de36ae57a337a3fdae7b1ed

Test-Parameters: trivial
Fixes: 9f337d94e7 ("LU-13029 lnet: fix asym routing with multi-hop")
HPE-bug-id: LUS-9612
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I009c69d4f8990b72d83d9426c782c0e55c1023a4
Reviewed-on: https://review.whamcloud.com/48382
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15275 lnet: Skip router discovery on send path
Chris Horn [Tue, 30 Nov 2021 16:57:34 +0000 (10:57 -0600)]
LU-15275 lnet: Skip router discovery on send path

When the router checker is enabled, routes are regularly marked as out
of date w.r.t. discovery. This can cause upper level messages to be
delayed while the router undergoes discovery. We can avoid delaying
messages by relying on the router checker to initiate discovery of
routers. If we happen to send a message to a router before it has
been discovered then the worst case scenario is that the route is
actually down or we end up utilizing a subset of a multi-rail router's
interfaces. Both situations can be remedied by utilizing the
check_routers_before_use parameter.

Change the logic in lnet_handle_find_routed_path() so that we only
initiate discovery if the alive_router_check_interval is <= 0 (i.e.
router checker pings are disabled).

Lustre-change: https://review.whamcloud.com/45684
Lustre-commit: c8e74c395d5634dbb0d9d8a86605bb36ab2b8233

Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: If0332c21f6157117598b7b908fe17f2d2690fc1d
Reviewed-on: https://review.whamcloud.com/48383
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13781 lnet: Local NI must be on same net as next-hop
Chris Horn [Sun, 12 Jul 2020 15:47:55 +0000 (10:47 -0500)]
LU-13781 lnet: Local NI must be on same net as next-hop

When sending to a remote peer we need to restrict our selection of a
local NI to those on the same peer net as the next-hop.

The code currently selects a local NI on the peer net specified by the
lr_lnet field of the lnet_route returned by lnet_find_route_locked().
However, lnet_find_route_locked() may select a next-hop peer NI on any
local peer net - not just lr_lnet.

A redundant assignment to sd->sd_msg->msg_src_nid_param is also
removed. That variable is always set appropriately in
lnet_select_pathway().

Lustre-change: https://review.whamcloud.com/39352
Lustre-commit: 031c087f3847777c0099cbfae13f0b6fee54452b

Test-Parameters: trivial
HPE-bug-id: LUS-9095
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: If1bec26d6646b9e66b99656d7db2dc538d631a34
Reviewed-on: https://review.whamcloud.com/48381
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13714 lnet: only update gateway NI status on discovery
Chris Horn [Mon, 14 Feb 2022 20:37:05 +0000 (20:37 +0000)]
LU-13714 lnet: only update gateway NI status on discovery

Move the NI status from DOWN to UP only when receiving
a discovery PING. The discovery PING should be the only
message which should update the NI status since it's used
as the gateway NI keep alive mechanism.

This is done to avoid the following scenario:

The gateway itself can push its updates to peers that have
removed it from their routing tables. The peers respond
to the PUSH with an ACK, and the ACK brings the
gateway's NI status to up. Therefore other peers which have
avoid_asym_router_failure=1 will see their route status
remain up even though the symmetrical route is gone.

Note: there is no way for the gateway to differentiate between
a keep alive discovery and a manually triggered discovery or ping.
However, this is a narrow case which will not be handled.

net_last_alive is converted to use ktime_get_seconds() instead of
ktime_get_real_seconds() since the NTP adjustment is not needed.
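
A userspace analogue of the clock choice, assuming POSIX clock_gettime(). CLOCK_MONOTONIC plays the role of ktime_get_seconds() here, and the helper names are hypothetical: an aliveness timestamp only needs elapsed time, so it should not jump when the wall clock is stepped.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Seconds from a monotonic clock; unaffected by wall-clock steps. */
long long monotonic_seconds(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;
	return (long long)ts.tv_sec;
}

/* A peer counts as alive if it was heard from within the timeout. */
bool is_alive(long long last_alive, long long now, long long timeout)
{
	return now - last_alive <= timeout;
}
```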

Lustre-change: https://review.whamcloud.com/39176
Lustre-commit: 3e3f70eb1ec95f32d9a97795d7fdf02cca82b5a0

Test-Parameters: trivial
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ifd5b06d4cf783b68b36413ada63f0a1d0095fb5b
Reviewed-on: https://review.whamcloud.com/48379
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15039 lnet: Fix reference leak in lnet_parse
Chris Horn [Wed, 5 Aug 2020 16:19:35 +0000 (11:19 -0500)]
LU-15039 lnet: Fix reference leak in lnet_parse

We need to drop the reference taken by lnet_nid2peerni_locked() if we
determine that we need to drop the message because of asymmetric
route.

Lustre-change: https://review.whamcloud.com/45067
Lustre-commit: e69eca08bce47bf85b3c011598e360a2468019b5

Test-Parameters: trivial
HPE-bug-id: LUS-9186
Fixes: 955080c3ae ("LU-13779 lnet: Correct asymmetric route detection")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I799c9522b1ce5f4caffc5848a829995e5b5484e7
Reviewed-on: https://review.whamcloud.com/48378
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14945 lnet: don't use hops to determine the route state
Serguei Smirnov [Mon, 16 Aug 2021 23:37:30 +0000 (16:37 -0700)]
LU-14945 lnet: don't use hops to determine the route state

NodeA <-tcp1-> GW1 <-tcp2-> GW2 <-tcp3-> NodeB

Assuming GW1 knows how to reach tcp3 network and GW2 knows
how to reach tcp1 network, it should be possible to add routes
without specifying hop=2 on nodes A and B to reach tcp3 and tcp1
respectively and then be able to lnetctl ping between them.
Changes introduced by LU-13785 interpret default hops to be
equivalent to hop=1 set explicitly for the purpose of determining
route aliveness, which results in the routes created as described
above being considered "down".

Fix it so that default hop setting doesn't prevent
the multi-hop scenario from working.

Lustre-change: https://review.whamcloud.com/44674
Lustre-commit: 3f2844dc9333c86452c37bd7b4519729b1351371

Test-Parameters: trivial
Fixes: 2e07619477 ("LU-13785 lnet: Use lr_hops for avoid_asym_router_failure")
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: I341ccdfe156434b0cb306359acc91a9193b44f7b
Reviewed-on: https://review.whamcloud.com/48337
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13780 lnet: Leverage peer aliveness more efficiently
Chris Horn [Fri, 10 Jul 2020 18:52:01 +0000 (13:52 -0500)]
LU-13780 lnet: Leverage peer aliveness more efficiently

When an LNet router is revived after going down, remote peers may
discover it is alive before we do. Thus, remote peers may use it
as a next-hop, and we may start receiving messages from it while we
still consider it to be dead. We should mark router peers as alive
when we receive a message from them.

If an LNet router does not respond to a discovery ping, then we
currently mark all of its NIs as DOWN. This can actually slow down
the process of returning a route to service. If we receive a message
from a router, in the manner described above, then we can safely
return the router to service. We already set the status of the router
NI we received the message from to UP, but the remote NIs will still
be DOWN and thus the route will be considered down until we get a
reply to the next discovery ping.

When selecting a route, we only consider the aliveness of a gateway's
remote NIs if avoid_asym_router_failure is enabled and the route is
single-hop. In this case, as long as the gateway has at least one
alive NI on the remote network then the route is considered UP. In
the situation described above, we know the router has at least one
NI alive because it was used to forward a message from a remote peer.
Thus, when we receive a forwarded message from a router, we can
reasonably set the NI status of all of its NIs that are on the same
peer net as the message originator to UP. This does not impact the
route status of any multi-hop routes because we do not consider the
aliveness of remote NIs for multi-hop routes.

Similarly, we can set the cached lr_alive value to up for any routes
whose lr_net matches the net ID of the message originator NID. This
variable is converted to an atomic_t to get rid of the need for
global locking when updating it.
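
A sketch of the lock-free update using C11 atomics as a userspace analogue of atomic_t. The route struct and routes_mark_alive() are hypothetical and are not the LNet data structures: each matching route's flag is stored atomically, so no global lock is needed around the update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry with a cached, atomically updated alive flag. */
struct route {
	uint32_t   net;   /* plays the role of lr_net */
	atomic_int alive; /* plays the role of lr_alive */
};

/*
 * Mark alive every route whose net matches the message originator's net.
 * Returns the number of routes that matched.
 */
size_t routes_mark_alive(struct route *routes, size_t n, uint32_t originator_net)
{
	size_t i, marked = 0;

	assert(routes != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (routes[i].net == originator_net) {
			atomic_store(&routes[i].alive, 1);
			marked++;
		}
	}
	return marked;
}
```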

Lustre-change: https://review.whamcloud.com/39350
Lustre-commit: 886e34ce56c491e8844cf892f32b08807cdf2bff

Test-Parameters: trivial
HPE-bug-id: LUS-9088
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I0170762d78d80e4b70724799cd1ee1301118f25c
Reviewed-on: https://review.whamcloud.com/48335
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13785 lnet: Use lr_hops for avoid_asym_router_failure
Chris Horn [Tue, 14 Jul 2020 04:08:28 +0000 (23:08 -0500)]
LU-13785 lnet: Use lr_hops for avoid_asym_router_failure

In order for the asymmetric route failure avoidance feature to work
properly it needs to know what the hop count of a route should be.
This information is defined by the lr_hops field of the lnet_route.
The lr_single_hop field records what discovery determined the hop
count to actually be (single or multi), based on the last ping reply.
If a remote interface on a router goes missing, the route may be
classified as multi-hop by discovery, but it should be considered
single-hop for the purposes of avoiding asymmetric route failure.
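
A rough sketch of the distinction, assuming that a configured hop
count of 1 denotes a single-hop route (the real code may also need to
handle an unset hop count). The struct is a hypothetical reduction to
the two fields named above, and the function name is illustrative.

```c
#include <stdbool.h>

/* Hypothetical reduced route: only the two fields discussed above. */
struct sketch_route {
	unsigned int lr_hops;       /* configured hop count */
	bool         lr_single_hop; /* what discovery last observed */
};

/*
 * Decide whether remote NI aliveness should be consulted for a route.
 * The configured hop count is used, not the discovered one, so a
 * router that lost a remote interface is still treated as single-hop.
 */
bool sketch_use_remote_ni_status(const struct sketch_route *route,
				 bool avoid_asym_router_failure)
{
	return avoid_asym_router_failure && route->lr_hops == 1;
}
```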

Lustre-change: https://review.whamcloud.com/39362
Lustre-commit: 2e07619477684f287a2399ccdbbde0a71289574b

HPE-bug-id: LUS-9099
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I9c255f9a2175d964661850277808dae96ff7735c
Reviewed-on: https://review.whamcloud.com/48336
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13779 lnet: Correct asymmetric route detection
Chris Horn [Fri, 10 Jul 2020 17:33:50 +0000 (12:33 -0500)]
LU-13779 lnet: Correct asymmetric route detection

Failure to lookup the remote net for LNET_NIDNET(src_nid) indicates an
asymmetric route, but we do not drop the message in this case. Another
problem with this code is that there is no guarantee that we'll have a
route->lr_lnet that matches the net of ni->ni_nid.

We can move the asymmetric route detection to after we have looked up
the lpni of from_nid. Then, we can look at just the routes associated
with the gateway that owns the lpni. If one of those routes has
lr_net == LNET_NIDNET(src_nid), then the route is symmetrical.
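
The check described above amounts to a scan over the gateway's routes.
A minimal sketch, in which struct sketch_gw_route and the function
name are hypothetical and src_net stands for the result of
LNET_NIDNET(src_nid):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced route entry owned by one gateway. */
struct sketch_gw_route {
	uint32_t lr_net; /* remote net reachable through this route */
};

/*
 * The route is symmetrical if one of the gateway's routes leads to
 * the net the message came from. Returns false for an empty list.
 */
bool sketch_route_symmetric(const struct sketch_gw_route *routes,
			    size_t n, uint32_t src_net)
{
	for (size_t i = 0; i < n; i++)
		if (routes[i].lr_net == src_net)
			return true;
	return false;
}
```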

Lustre-change: https://review.whamcloud.com/39349
Lustre-commit: 955080c3ae3f33c98e068f52a096761ea28624b7

Fixes: 4932febc12 ("LU-11894 lnet: check for asymmetrical route messages")
Test-Parameters: trivial
HPE-bug-id: LUS-9087
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I8044d3f53e6f000c1e4d7c4e34b3b21afe0f9711
Reviewed-on: https://review.whamcloud.com/48334
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13708 lnet: lnet_notify sets route aliveness incorrectly
Chris Horn [Tue, 23 Jun 2020 18:02:51 +0000 (13:02 -0500)]
LU-13708 lnet: lnet_notify sets route aliveness incorrectly

lnet_notify() modifies route aliveness in two ways:
1. By setting lp_alive field of the lnet_peer struct.
2. By setting lr_alive field of the lnet_route struct (via call to
   lnet_set_route_aliveness())

In both cases, the aliveness value assigned is determined by a call
to lnet_is_peer_ni_alive(), but that value only reflects the aliveness
of a particular peer NI. A gateway may have multiple peer NIs, so the
aliveness of a gateway peer (lp_alive) is not necessarily equivalent
to the aliveness of one of its NIs. Furthermore, the lr_alive field
is only used to determine route aliveness for path selection if
discovery is disabled locally or on the gateway (see
lnet_find_route_locked() and lnet_is_route_alive()).

In general, we should not set lp_alive based on an lnet_notify()
call, and we should only set lr_alive if discovery is disabled. For
lr_alive specifically, we should only set it for those routes that
have the peer NI as a next-hop.

An exception to the above exists when the reset argument to
lnet_notify() is set. The gnilnd uses this flag in its calls to
lnet_notify() because gnilnd receives out-of-band notifications of
node up and down events. Thus, when gnilnd calls lnet_notify() we
actually know whether the gateway peer is up or down and we can set
lp_alive appropriately.

net lock/EX is held by other callers of lnet_set_route_aliveness, so
we do the same in lnet_notify().

Lustre-change: https://review.whamcloud.com/39160
Lustre-commit: e24471a722a6f23fb0051b4511f3fee2662d0e4e

Fixes: e35be987da ("LU-12422 lnet: discovery off route state update")
Fixes: ebc9835a97 ("LU-12941 lnet: Add peer level aliveness information")
Test-Parameters: trivial
HPE-bug-id: LUS-9034
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I2927e5f5ef849e45c233c92d2a6deca765e496eb
Reviewed-on: https://review.whamcloud.com/48290
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-16012 sec: fix detection of SELinux enforcement
Sebastien Buisson [Fri, 2 Sep 2022 18:09:59 +0000 (11:09 -0700)]
LU-16012 sec: fix detection of SELinux enforcement

On newer distros (e.g. RHEL 9.0), on which selinux_is_enabled() does
not exist anymore, the only way to find out if SELinux is enforced
when initializing the security context is to fetch the length of the
security attribute name. If it is 0, we conclude SELinux is disabled.

Lustre-change: https://review.whamcloud.com/48049
Lustre-commit: 155cbc22ba4f758cf9eec415f36f940ca2b23de9

Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: Ifcdcb8ffbb7f9ad50d16d7d3317e94d0d212fa42
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48422
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Qian Yingjin <qian@ddn.com>
2 years agoEX-5815 lipe: do not print in lpcc signal handler
Lei Feng [Wed, 7 Sep 2022 07:26:13 +0000 (15:26 +0800)]
EX-5815 lipe: do not print in lpcc signal handler

Do not print from the lpcc signal handler.
Printing there is not safe in a Python script.

Signed-off-by: Lei Feng <flei@whamcloud.com>
Test-Parameters: trivial testlist=sanity-pcc env=ONLY=210
Change-Id: I61eb80ff1d59453dc12855fd2f1ac4f1e6e40757
Reviewed-on: https://review.whamcloud.com/48449
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alex Deiter <alex.deiter@gmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-3442 tests: use wait_file_resync in hot-pools test 15
James Nunez [Wed, 24 Aug 2022 07:40:23 +0000 (00:40 -0700)]
EX-3442 tests: use wait_file_resync in hot-pools test 15

This patch replaces "$LFS mirror resync" with
"wait_file_resync" in hot-pools test 15 to avoid
racing with lamigo's "$LFS mirror resync".

Test-Parameters: trivial testlist=hot-pools,hot-pools

Change-Id: I48ffb7d6a33b664359f227d1f693369feffa70b6
Signed-off-by: James Nunez <jnunez@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/47233
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-2010 scsi: requeue aborted commands for el8
Andreas Dilger [Wed, 24 Aug 2022 23:21:07 +0000 (17:21 -0600)]
EX-2010 scsi: requeue aborted commands for el8

If the underlying SCSI command returns an abort, rather than retry
it quickly in a loop, which can finish within a few milliseconds,
requeue it with delay so that the hardware has a chance to recover.

The command requeue will take several seconds each time and allows
more chance for the problem to be resolved at the SCSI layer instead
of returning an error to the filesystem and causing server failover.

This patch is no longer required with SFAOS 11.8.3 and later, as SFAOS
will change ABORT to busy (SFAP-71972).  Patch can be removed once we
are certain of SFA version and/or have removed other kernel patches.

Test-Parameters: trivial clientdistro=el8.6 serverdistro=el8.6
Test-Parameters: trivial clientdistro=el8.5 serverdistro=el8.5

Signed-off-by: Trung Nguyen <trunguyen@ddn.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ibdf1b3a52dd0a1b388c7f5f97aa7a516203ebbe5
Reviewed-on: https://review.whamcloud.com/48340
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
2 years agoLU-16138 kernel: preserve RHEL8.x server kABI for block integrity
Jian Yu [Wed, 7 Sep 2022 03:41:10 +0000 (20:41 -0700)]
LU-16138 kernel: preserve RHEL8.x server kABI for block integrity

Currently there are two kernel patches supporting SCSI T10-PI feature
left in the RHEL8.x series:

- block-integrity-allow-optional-integrity-functions-rhel8.patch
- block-pass-bio-into-integrity_processing_fn-rhel8.patch

The changes in the patches modified "struct bio_integrity_payload"
and "struct blk_integrity_iter", which caused kABI breakage.

This patch fixes the patches to preserve kABI by using
RH-supplied compatibility macros.

Test-Parameters: trivial fstype=ldiskfs clientdistro=el8.5 serverdistro=el8.5
Test-Parameters: trivial fstype=ldiskfs clientdistro=el8.6 serverdistro=el8.6

Change-Id: If547e1cd4ae4ff1affd315bbfefaeeff4f1dea81
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48445
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-16075 kernel: kernel update RHEL8.6 [4.18.0-372.19.1.el8_6]
Jian Yu [Fri, 26 Aug 2022 19:16:25 +0000 (12:16 -0700)]
LU-16075 kernel: kernel update RHEL8.6 [4.18.0-372.19.1.el8_6]

Update RHEL8.6 kernel to 4.18.0-372.19.1.el8_6.

Lustre-change: https://review.whamcloud.com/48116
Lustre-commit: TBD (077f4b13e7fbe564a79c35487e8208e8381fc833)

Test-Parameters: trivial fstype=ldiskfs \
clientdistro=el8.6 serverdistro=el8.6 testlist=sanity

Test-Parameters: trivial fstype=zfs \
clientdistro=el8.6 serverdistro=el8.6 testlist=sanity

Change-Id: I8e0fbdab54d36512c4c4cbdbc97c580994ebcbd3
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48319
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-16082 ldiskfs: old-style EA inode handling fix
Alexander Zarochentsev [Thu, 1 Sep 2022 17:19:15 +0000 (10:19 -0700)]
LU-16082 ldiskfs: old-style EA inode handling fix

The upstream version of EA inode support that comes
with RHEL8 (Linux kernel 4.18+) has a slightly different
implementation and also has compatibility code to
recognize old-style Lustre-only EAs.
Unfortunately the compatibility code is broken and makes
old xattr data inaccessible due to a wrong hash value check.

Lustre-change: https://review.whamcloud.com/48174
Lustre-commit: 76c3fa96dc30f21e95d80f9119972d7358975258

HPE-bug-id: LUS-11133
Signed-off-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com>
Change-Id: Icd6f93d4ebb33dcd03b58f9eb364905c18ae81dc
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48413
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
2 years agoLU-14719 utils: dir migration stop on error
Lai Siyao [Tue, 29 Mar 2022 23:41:23 +0000 (19:41 -0400)]
LU-14719 utils: dir migration stop on error

Once directory migration fails, it should stop immediately since
current migration won't succeed, and subsequent migration may
fail on the same error.
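
The stop-on-first-error behaviour can be sketched as a loop that breaks
on the first failure. This is an illustration only: the callback type,
sketch_migrate_all() and the demo callback are hypothetical and are not
taken from the lfs utility.

```c
#include <stddef.h>

/* Hypothetical migration step: 0 on success, negative errno on failure. */
typedef int (*sketch_migrate_fn)(const char *path);

/* Demo stand-in: fails with -5 (EIO) for paths starting with '!'. */
int sketch_demo_migrate(const char *path)
{
	return path[0] == '!' ? -5 : 0;
}

/*
 * Migrate entries in order and stop at the first failure, returning
 * its error code. *done receives the number of entries that were
 * migrated successfully before the stop.
 */
int sketch_migrate_all(const char *const *paths, size_t n,
		       sketch_migrate_fn migrate, size_t *done)
{
	size_t i;
	int rc = 0;

	for (i = 0; i < n; i++) {
		rc = migrate(paths[i]);
		if (rc != 0)
			break;
	}
	if (done)
		*done = i;
	return rc;
}
```

Entries after the failing one are never attempted, so the same error is
not reported repeatedly.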

Lustre-change: https://review.whamcloud.com/47040/
Lustre-commit: 9ca348e8769d2c613082eeaeaf2775e22625e970

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: I96c1693d1b1da0856c925b9b22c1ab7f3181f0d8
Reviewed-on: https://review.whamcloud.com/47868
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Qian Yingjin <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15694 quota: keep grace time while setting default limits
Hongchao Zhang [Thu, 28 Jul 2022 13:54:00 +0000 (21:54 +0800)]
LU-15694 quota: keep grace time while setting default limits

The quota grace time should only be changed by "lfs setquota -t",
and it should be kept while setting default quota limits.

This patch also fixes an issue of not saving the grace time while
writing global quota record.

Lustre-change: https://review.whamcloud.com/46935
Lustre-commit: d4978678b49102226a79a6c8e5d10075d416977d

Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Change-Id: I89ca49d09dc41deffe4bc77e53721b5bb4f4be37
Reviewed-on: https://review.whamcloud.com/48416
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14472 tests: modify version_code in sanity test
Xing Huang [Mon, 22 Aug 2022 12:07:52 +0000 (20:07 +0800)]
LU-14472 tests: modify version_code in sanity test

Modify version_code in sanity-quota.sh and sanity-sec.sh according to
DDN version.  There is no need to modify version_code on other branch.

The version_code in test_59c() and test_49 in sanity-sec.sh was
changed to 2.14.0.50 according to Sébastien's suggestion.

Signed-off-by: Xing Huang <hxing@ddn.com>
Change-Id: Ie448cbf60f6bdacdbba39ab0a1a86c6953d51ecb
Reviewed-on: https://review.whamcloud.com/48282
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-5014 pcc: minor fixes for parameter checks
Andreas Dilger [Tue, 6 Sep 2022 18:47:56 +0000 (18:47 +0000)]
EX-5014 pcc: minor fixes for parameter checks

Improve console message when out-of-range pcc_dio_attach_size_mb
values are supplied.

Fix sanity-pcc test_49b to allow future limit changes.

Test-Parameters: trivial testlist=sanity-pcc
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I2bf7d0bf564c954318980f7a09d8713a70f37db9
Reviewed-on: https://review.whamcloud.com/48438
Reviewed-by: Qian Yingjin <qian@ddn.com>
2 years agoEX-5014 pcc: avoid deadlock during DIO open attach on rhel7
Qian Yingjin [Thu, 1 Sep 2022 06:54:23 +0000 (14:54 +0800)]
EX-5014 pcc: avoid deadlock during DIO open attach on rhel7

The Maloo testing fails with sanity-pcc/45 due to the following
deadlock on rhel7 kernel:

ll_fid_path_cop D ffff9a32db5eb180     0 10783  10782 0x00000080
Call Trace:
schedule_preempt_disabled+0x29/0x70
__mutex_lock_slowpath+0xc7/0x1d0
mutex_lock+0x1f/0x2f
lookup_slow+0x33/0xa7
link_path_walk+0x80f/0x8b0
path_openat+0xae/0x5a0
do_filp_open+0x4d/0xb0
do_sys_open+0x124/0x220
SyS_open+0x1e/0x20

dd              D ffff9a32fb5b6300     0 10779  10755 0x00000080
Call Trace:
wait_for_completion+0xfd/0x140
call_usermodehelper_exec+0x179/0x1a0
call_usermodehelper+0x40/0x60
pcc_copy_data_dio+0x267/0x340 [lustre]
pcc_attach_data_archive+0x6ff/0xe80 [lustre]
pcc_readonly_attach+0x3d2/0xad0 [lustre]
pcc_readonly_attach_sync+0x205/0x260 [lustre]
pcc_file_open+0x798/0xdd0 [lustre]
ll_atomic_open+0xd80/0x1780 [lustre]
do_last+0xa53/0x1340
path_openat+0xcd/0x5a0
do_filp_open+0x4d/0xb0
do_sys_open+0x124/0x220
SyS_open+0x1e/0x20

This bug only happens on the el7 kernel, which uses a mutex for inode
locking.
During ->ll_atomic_open(), the kernel takes this mutex on the
parent inode. However, when copying data via the user space helper
program ll_fid_path_copy, the helper also tries to obtain this mutex
on the parent inode during lookup, resulting in deadlock.

Test-Parameters: clientdistro=el7.9 testlist=sanity-pcc
Test-Parameters: clientdistro=el8.5 mdscount=2 mdtcount=4 testlist=sanity-pcc env=ONLY=45,ONLY_REPEAT=10
Change-Id: I384c7b1979d93183b86bbde311d29a50346a8d56
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/48405
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-5014 pcc: Add dio support for data copy during attach
Patrick Farrell [Mon, 11 Jul 2022 17:58:08 +0000 (13:58 -0400)]
EX-5014 pcc: Add dio support for data copy during attach

PCC attach performance is bottlenecked by single threaded
buffered I/O performance.  We could do multi-threading, but
multi-threaded buffered I/O to one file has a very low
performance ceiling.  In order to significantly speed up
PCC attach performance, we need to switch to DIO.

DIO cannot be done from kernel memory due to various
restrictions, so we call out to a usermode helper.

Note that the helper uses open by fid because given a
file pointer, it's not possible to reliably generate the
path to a file on Lustre due to container namespace issues.
Specifically, the path used by the user may not work for
our helper program due to namespace differences.  So we
must use open by fid for the Lustre side of the copy.

This patch improves attach performance from about 1 GiB/s
to about 5 GiB/s.  This performance figure includes time to
read the data from Lustre *and* to write it out to PCC.

Temporarily disable sanity-pcc/45 until the deadlock problem is
fixed.

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: Idb2a12296c3e4778763c9b576bbb0ecd2570a458
Reviewed-on: https://review.whamcloud.com/47158
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-5014 pcc: Readability cleanups
Patrick Farrell [Mon, 25 Apr 2022 16:50:33 +0000 (12:50 -0400)]
EX-5014 pcc: Readability cleanups

It's really hard to remember what 'inode' and 'file' mean
when there's more than one in a function, so I've redone
some of the names here.

Test-parameters: trivial

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: Ia00d0bda216a26f285f0fda8bc8edd3c51d66ce4
Reviewed-on: https://review.whamcloud.com/47157
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoRM-620 build: New tag 2.14.0-ddn59
Andreas Dilger [Fri, 26 Aug 2022 17:14:32 +0000 (11:14 -0600)]
RM-620 build: New tag 2.14.0-ddn59

New tag 2.14.0-ddn59

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I36134a0a38fd0f6778bfdf533d0eac2b0f662121

2 years agoLU-15811 llite: Refactor DIO/AIO free code
Patrick Farrell [Mon, 1 Aug 2022 17:28:58 +0000 (13:28 -0400)]
LU-15811 llite: Refactor DIO/AIO free code

Refactor the DIO/AIO free code and add some asserts.

This removes a potential use-after-free in the freeing
code.

Lustre-change: https://review.whamcloud.com/48115/
Lustre-commit: 0358bd41174176cbfc9d6786bffb6dc95b68adcf (tbd)

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: I335b18fc7a28fc426a25675e2449d3d192cba596
Reviewed-on: https://review.whamcloud.com/48103
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15811 llite: Unify range unlock
Patrick Farrell [Wed, 20 Jul 2022 16:46:21 +0000 (12:46 -0400)]
LU-15811 llite: Unify range unlock

Correct parallel_dio condition and unify range unlock code
block.

Lustre-change: https://review.whamcloud.com/48000/
Lustre-commit: 84064c8e8112aed2e49d2dcd6b4f1c6a21770261 (tbd)

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: Ib66e8def571054df5117c279e238894bc3b58bce
Reviewed-on: https://review.whamcloud.com/47999
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15811 llite: clarify 'nofree' usage
Patrick Farrell [Thu, 14 Jul 2022 18:50:31 +0000 (14:50 -0400)]
LU-15811 llite: clarify 'nofree' usage

The 'nofree' value is confusing, and was backwards in master.
It's correct here in ES6, but this patch clarifies status a bit.
(No master equivalent, this was rolled in to:
https://review.whamcloud.com/47187 on master)

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: I2dbd0c68250da17e982f04a566a5d77bd56796ef
Reviewed-on: https://review.whamcloud.com/47954
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoRM-620 build: New tag 2.14.0-ddn58
Andreas Dilger [Fri, 26 Aug 2022 17:01:44 +0000 (11:01 -0600)]
RM-620 build: New tag 2.14.0-ddn58

New tag 2.14.0-ddn58

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ica9b3ae537093d173e06790fea1b2c664e842a57

2 years agoLU-13991 ldlm: speedup flock reprocess
Andriy Skulysh [Wed, 19 Feb 2020 20:06:33 +0000 (22:06 +0200)]
LU-13991 ldlm: speedup flock reprocess

We only need to check for deadlock against the first
conflicting lock; the remaining deadlock checks
will be performed after cancellation of the
first conflicting lock.

Lustre-change: https://review.whamcloud.com/40048
Lustre-commit: dadec10251090ba88c1b39517943e6603ba6d682

Change-Id: I18359db405ab021a4f32ac833de203254097142d
HPE-bug-id: LUS-8509
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/48320
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15402 ldlm: speedup RD flock enqueue
Andriy Skulysh [Wed, 24 Nov 2021 11:33:47 +0000 (13:33 +0200)]
LU-15402 ldlm: speedup RD flock enqueue

Scanning of lr_granted can be done until
covering granted RD lock is reached.

Lustre-change: https://review.whamcloud.com/45957
Lustre-commit: b07a57027ee5cc1afa82cc4c82be73a2c4894502

Change-Id: I907cff002d9765c5f8496d377eddd5e62795d89c
HPE-bug-id: LUS-10623
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/48323
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13929 lnet: modify assertion in lnet_post_send_locked
Serguei Smirnov [Wed, 25 Nov 2020 00:05:48 +0000 (16:05 -0800)]
LU-13929 lnet: modify assertion in lnet_post_send_locked

Check that the pointer to the local interface is not NULL
before asserting. While checking if local ni is the destination,
the assertion may attempt to dereference pointer to local
interface after it has already been cleaned up on shutdown.
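
A minimal sketch of the guarded condition, assuming a reduced NI
structure. struct sketch_ni and the function name are hypothetical;
the actual assertion in lnet_post_send_locked() is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced local interface: only its NID. */
struct sketch_ni {
	uint64_t nid;
};

/*
 * Condition suitable for an assertion: it holds when the local NI has
 * already been cleaned up (NULL), or when the destination is not the
 * local NI. The pointer is tested before it is dereferenced.
 */
bool sketch_not_local_dest(const struct sketch_ni *local_ni,
			   uint64_t dest_nid)
{
	return local_ni == NULL || local_ni->nid != dest_nid;
}
```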

Lustre-change: https://review.whamcloud.com/40749
Lustre-commit: e5a8f3fc12840aee97fca03d76b1ae9b4572acb8

Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: I0f4be04a728a7243823bec70f9efbe52bcb104b3
Reviewed-on: https://review.whamcloud.com/48265
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15446 lnet: Don't use pref NI for reserved portal
Chris Horn [Wed, 12 Jan 2022 19:19:21 +0000 (19:19 +0000)]
LU-15446 lnet: Don't use pref NI for reserved portal

Don't use the preferred NI when sending traffic on the LNet reserved
portal. This allows local recovery pings to utilize any local NI as
source in the case where we do not have a multi-rail peer entry for
the local host. This is typically the case when MR is not being
configured statically (i.e. when discovery is being used for MR
configuration).

lnet_get_best_ni() was modified to include health values of the NIs
being compared in its debug output.

Lustre-change: https://review.whamcloud.com/46078
Lustre-commit: a2815441381cb6cee8eb9865d9279541ea04828e

HPE-bug-id: LUS-10658
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I38f5760bf034f698b7f44ffa89aa91c4f5d4b9ea
Reviewed-on: https://review.whamcloud.com/48312
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14661 lnet: Check if discovery toggled off in ping reply
Chris Horn [Wed, 27 Jan 2021 18:22:09 +0000 (12:22 -0600)]
LU-14661 lnet: Check if discovery toggled off in ping reply

If a peer is initially discovered and found to have discovery
enabled, but the peer later reloads LNet with discovery disabled,
then we can delete the peer and re-create it the next time the peer
is discovered.

It is safe to delete and re-create the peer as long as it wasn't
configured manually.

In lnet_peer_deletion(), we need to use list_del_init() when removing
the peer from the discovery queue because the lnet_peer_del() code
path can result in a call to lnet_peer_queue_for_discovery() where
we check if the lp_dc_list is empty.
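
Why the choice of removal helper matters for a later emptiness check
can be shown with a small stand-alone list. The code below is a
minimal re-implementation written for illustration, with hypothetical
sketch_ names; it mirrors the semantics of the kernel's list helpers
but is not the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, for illustration only. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

void sketch_list_init(struct sketch_list_head *h)
{
	h->next = h;
	h->prev = h;
}

bool sketch_list_empty(const struct sketch_list_head *h)
{
	return h->next == h;
}

void sketch_list_add(struct sketch_list_head *entry,
		     struct sketch_list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink only: the entry's own pointers are left stale. */
void sketch_list_del(struct sketch_list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Unlink and re-initialize, so the entry tests as empty afterwards. */
void sketch_list_del_init(struct sketch_list_head *entry)
{
	sketch_list_del(entry);
	sketch_list_init(entry);
}
```

An entry removed with the plain unlink still looks queued to a later
emptiness test on that entry, whereas the re-initializing variant makes
the test report that it is no longer on any list.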

Lustre-change: https://review.whamcloud.com/43508
Lustre-commit: 143893381d428466d4c71e075a041a9cbbd28818

Test-Parameters: trivial
HPE-bug-id: LUS-9178
Fixes: aa7de0af69 ("LU-13895 lnet: Prevent discovery on peer marked deletion")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I0b43d7541711a3b94c492082d4a29487ebe72b09
Reviewed-on: https://review.whamcloud.com/48296
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-15512 lnet: Stop discovery on deleted peer NI
Chris Horn [Wed, 2 Feb 2022 18:37:00 +0000 (18:37 +0000)]
LU-15512 lnet: Stop discovery on deleted peer NI

lnet_discover_peer_locked() needs to check whether the peer NI that is
undergoing discovery has been deleted (i.e. its associated peer has
LNET_PEER_MARK_DELETED state). Otherwise, we may enter an infinite
loop because this peer will never be considered up to date.

Lustre-change: https://review.whamcloud.com/46429
Lustre-commit: 94f4e1f517d71ffd6662fb4a82e3dee9aa8f6796

Test-Parameters: trivial testlist=sanity-lnet
Fixes: fd32cd817c ("LU-13895 lnet: Prevent discovery on deleted peer")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I43d276fc460241c1724c8e30913bb6c5cbb7c8f4
Reviewed-on: https://review.whamcloud.com/48295
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13883 lnet: Lookup lpni after discovery
Chris Horn [Thu, 6 Aug 2020 21:24:57 +0000 (16:24 -0500)]
LU-13883 lnet: Lookup lpni after discovery

The lpni for a nid can change as part of the discovery process (see
lnet_peer_add_nid()). As such, callers of lnet_discover_peer_locked()
need to lookup the lpni again after discovery completes to make sure
they get the correct peer.

An exception is lnet_check_routers() which doesn't do anything with
the peer or peer NI after the call to lnet_discover_peer_locked().
If the router list is changed then lnet_check_routers() will already
repeat discovery.

Lustre-change: https://review.whamcloud.com/39747
Lustre-commit: 584d9e46053234d02a3290822317552785e44e76

HPE-bug-id: LUS-9167
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I8bdfcb957e87f65ce65bfad81858a4ce3362298e
Reviewed-on: https://review.whamcloud.com/48294
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13894 lnet: Transfer disc src NID when merging peers
Chris Horn [Thu, 6 Aug 2020 21:39:27 +0000 (16:39 -0500)]
LU-13894 lnet: Transfer disc src NID when merging peers

If we're merging two peers in lnet_peer_data_present() then we need
to transfer the src NID stored in the peer whose ping buffer we are
processing to the peer that actually owns the NIDs in the ping
buffer. Otherwise it is possible that the subsequent push to the peer
that is being discovered will go out over an interface that the peer
does not know about and it will be dropped.

Lustre-change: https://review.whamcloud.com/39607
Lustre-commit: e65d8ba583858ae10f2d53fd270b19d13e423634

Test-Parameters: trivial
HPE-bug-id: LUS-9193
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I050c7c1c2c0eddb8d5ff12f40342a8a02efacb9c
Reviewed-on: https://review.whamcloud.com/48293
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13895 lnet: Prevent discovery on deleted peer
Chris Horn [Thu, 6 Aug 2020 21:21:29 +0000 (16:21 -0500)]
LU-13895 lnet: Prevent discovery on deleted peer

We needn't perform any discovery activities on a peer that has had
lnet_peer_del() called on it.

Lustre-change: https://review.whamcloud.com/39605
Lustre-commit: fd32cd817cba336c684fe3ab7aac79705061e8b5

Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-9192
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I5c89dc89038d2c8bf4d2a29029af7720963b81a2
Reviewed-on: https://review.whamcloud.com/48292
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-13895 lnet: Prevent discovery on peer marked deletion
Chris Horn [Fri, 7 Aug 2020 16:02:10 +0000 (11:02 -0500)]
LU-13895 lnet: Prevent discovery on peer marked deletion

If a peer has been marked for deletion then we needn't perform any
other discovery operation on it. Integrate this peer state into the
top level of the discovery state machine so that it is checked before
any other state.

Lustre-change: https://review.whamcloud.com/39604
Lustre-commit: aa7de0af6969df77a896e3a2e90c971a5081e324

HPE-bug-id: LUS-9192
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ie9de5b0d38d720f4f49d7e4a0673a6b52f9d3d80
Reviewed-on: https://review.whamcloud.com/48291
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoLU-14939 lnet: Allow specifying a source NID for lnetctl ping
Chris Horn [Thu, 12 Aug 2021 16:26:07 +0000 (11:26 -0500)]
LU-14939 lnet: Allow specifying a source NID for lnetctl ping

Add a new --source option for lnetctl ping command. This allows the
user to specify a local NI from which to send the ping. This also
ensures that the specified destination NID is used. Otherwise,
pings to multi-rail peers may end up going to a different peer NI
based on the multi-rail selection algorithm. The ability to specify
a source NI, and thus fix the destination NI, is a great help in
troubleshooting communication issues between multi-rail peers.

Add test to exercise lnetctl ping --source option.

Lustre-change: https://review.whamcloud.com/44727
Lustre-commit: 48ef9982c474a02c460293bce17c9e45f9829eab

HPE-bug-id: LUS-10296
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I454217b30a92414de537880f076a11a693b1f0b3
Reviewed-on: https://review.whamcloud.com/48297
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-4697 lipe: Define statistics fields for lpurge / lamigo
Alexandre Ioffe [Fri, 5 Aug 2022 05:49:40 +0000 (22:49 -0700)]
EX-4697 lipe: Define statistics fields for lpurge / lamigo

Added JSON stats output in lamigo
Extended JSON stats output in lpurge

Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Test-Parameters: trivial testlist=hot-pools
Change-Id: Ib367022dd073c1699d75e3ea7cfa3b586e7b8877
Reviewed-on: https://review.whamcloud.com/48125
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Colin Faber <cfaber@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years agoEX-5505 lipe: JSON statistics crashes lpurge
Alexandre Ioffe [Wed, 6 Jul 2022 02:58:58 +0000 (19:58 -0700)]
EX-5505 lipe: JSON statistics crashes lpurge

Use json_object_get() before json_object_put();
otherwise the json_object_put() call causes a crash.

Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Test-Parameters: trivial testlist=hot-pools
Change-Id: Id5e8f05dd010f6626835176bf854344cd2b58a93
Reviewed-on: https://review.whamcloud.com/47885
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-13813 tests: fix stack_trap in conf-sanity test 110/111
Jian Yu [Sat, 23 Jul 2022 07:29:59 +0000 (00:29 -0700)]
LU-13813 tests: fix stack_trap in conf-sanity test 110/111

This patch fixes the stack_trap usage in conf-sanity tests 110 and 111
so that the test environment is restored.

Lustre-change: https://review.whamcloud.com/48022
Lustre-commit: 0109cee2610b8dfeaaca25c3eb1e805e033c593d

Test-Parameters: trivial env=SLOW=yes,ENABLE_QUOTA=yes clientdistro=el8.5 serverdistro=el8.5 testlist=conf-sanity
Test-Parameters: env=SLOW=yes,ENABLE_QUOTA=yes fstype=zfs clientdistro=el8.5 serverdistro=el8.5 testlist=conf-sanity
Test-Parameters: env=SLOW=yes,ENABLE_QUOTA=yes mdscount=2 mdtcount=4 clientdistro=el8.5 serverdistro=el8.5 testlist=conf-sanity
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Change-Id: I540d96e8ad2c4990e7da18fe22256b44e9a19c72
Reviewed-on: https://review.whamcloud.com/48023
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-15018 o2iblnd: treat cmid->device == NULL as an error
Serguei Smirnov [Fri, 17 Sep 2021 21:06:26 +0000 (14:06 -0700)]
LU-15018 o2iblnd: treat cmid->device == NULL as an error

Even if rdma_bind_addr is successful, kiblnd_dev_failover should
treat cmid->device == NULL as an error, to avoid later calling
kiblnd_set_ni_fatal_on with a dev->ibd_hdev that may be NULL.
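
The check described above can be sketched as follows. This is a
simplified illustration, not the o2iblnd code: toy_device, toy_cm_id
and toy_check_bind() are hypothetical names, and the -ENODEV return
value is an assumption chosen for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the RDMA CM types. */
struct toy_device {
	int id;
};

struct toy_cm_id {
	struct toy_device *device;
};

/*
 * Validate the result of a bind step. A zero return code alone is not
 * enough: the id must also have a device attached before later code
 * dereferences it.
 */
static int toy_check_bind(int bind_rc, const struct toy_cm_id *cmid)
{
	if (bind_rc != 0)
		return bind_rc;		/* the bind itself failed */
	if (cmid->device == NULL)
		return -ENODEV;		/* assumed error code, for illustration */
	return 0;
}
```

Failing early here means callers further down the failover path never
see a half-initialized device.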

Lustre-change: https://review.whamcloud.com/44981
Lustre-commit: abd0ce62e96523193bfc2e2a3f574bc59d6c9f7c

Test-Parameters: trivial testlist=sanity-lnet
Fixes: 4668283cd1 ("LU-14806 o2iblnd: clear fatal error on successful failover")
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: Iefbe030b25d2dc543461cf98afeacd734fd64cf8
Reviewed-on: https://review.whamcloud.com/48258
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-5014 pcc: Limit attach queue depth
Patrick Farrell [Wed, 13 Apr 2022 15:31:15 +0000 (11:31 -0400)]
EX-5014 pcc: Limit attach queue depth

The existing async attach code does not attempt to limit
the number of async attaches that can be requested at once.
This is a problem because we could theoretically create too
many kthreads and overwhelm the system.

When the attach queue depth is exceeded, we stop allowing
new items to be queued by switching over to sync attach.
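
The fallback described above can be sketched in user-space C11. This
is only a model of the idea, not the PCC implementation: the limit
TOY_ATTACH_QUEUE_DEPTH and the helpers toy_attach_try_async() and
toy_attach_done() are hypothetical, and the real code would use kernel
primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical fixed limit for the number of in-flight async attaches. */
#define TOY_ATTACH_QUEUE_DEPTH 4

static atomic_int toy_attach_inflight;

/*
 * Try to reserve a slot for an async attach. Returns false when the
 * limit is reached, in which case the caller attaches synchronously.
 */
static bool toy_attach_try_async(void)
{
	int cur = atomic_load(&toy_attach_inflight);

	while (cur < TOY_ATTACH_QUEUE_DEPTH) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&toy_attach_inflight,
						 &cur, cur + 1))
			return true;
	}
	return false;
}

/* Release a slot once an async attach has finished. */
static void toy_attach_done(void)
{
	atomic_fetch_sub(&toy_attach_inflight, 1);
}
```

A caller would try toy_attach_try_async() first and run the attach
inline when it returns false, so the number of worker threads stays
bounded by the limit.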

Ideally we would rebuild the attach code to generate a
queue of attach requests and have the attach thread code
pull items from the queue until it's exhausted, but that's
a much more substantial change and is left for later.

NB: This patch is incomplete - there's no way to adjust the
queue depth at runtime and there's no test for it.  Both
need to be added.

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: Ib00dfb67f5245a28b722278d031ee8cdf5e190d6
Reviewed-on: https://review.whamcloud.com/47061
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago EX-5014 pcc: Change PCC commands to use constants
Patrick Farrell [Tue, 22 Mar 2022 17:45:31 +0000 (13:45 -0400)]
EX-5014 pcc: Change PCC commands to use constants

PCC command names are written out as string literals, making
them hard to track.  Change them all to use named constants.
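
As a rough illustration of the pattern, and not the actual PCC
definitions, the sketch below replaces scattered string literals with
named constants. The command strings, the TOY_* names and
toy_cmd_parse() are all hypothetical.

```c
#include <string.h>

/* Illustrative command names; not taken from the PCC source. */
#define TOY_CMD_ADD	"add"
#define TOY_CMD_DEL	"del"

enum toy_cmd {
	TOY_CMD_T_UNKNOWN,
	TOY_CMD_T_ADD,
	TOY_CMD_T_DEL,
};

/* Map a command token to its enum value using the named constants. */
static enum toy_cmd toy_cmd_parse(const char *token)
{
	if (strcmp(token, TOY_CMD_ADD) == 0)
		return TOY_CMD_T_ADD;
	if (strcmp(token, TOY_CMD_DEL) == 0)
		return TOY_CMD_T_DEL;
	return TOY_CMD_T_UNKNOWN;
}
```

With each name defined once, a search for the constant finds every
user, and a misspelled command becomes a compile error instead of a
silent mismatch.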

This also includes a few minor debug and structural changes
as part of prep for the main patch against EX-5014.

Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: Icad8dfdb44ed2562a95b2aaa0432cba221e4a1bc
Reviewed-on: https://review.whamcloud.com/46894
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
2 years ago LU-15874 kernel: new kernel [RHEL 9.0 5.14.0-70.22.1.el9_0]
Jian Yu [Wed, 24 Aug 2022 06:07:36 +0000 (23:07 -0700)]
LU-15874 kernel: new kernel [RHEL 9.0 5.14.0-70.22.1.el9_0]

This patch adds support for the new RHEL 9.0 release
for the Lustre client.

Fix lbuild to include the modified find-requires.ksyms.

Lustre-change: https://review.whamcloud.com/47847
Lustre-commit: bbe5e9818053e43ebf97e2d3fa240917bfbd8336

Test-Parameters: trivial clientdistro=el9.0 \
env=SANITY_EXCEPT="101j 130 244a" testlist=sanity

Test-Parameters: trivial clientdistro=el9.0 \
env=LIPE_FIND_VERBOSE=true testlist=sanity-lipe

Test-Parameters: clientdistro=el9.0 testlist=sanity-pcc
Test-Parameters: clientdistro=el8.6 testlist=sanity-pcc

Change-Id: Ib7fdf9d3946df626759d395b5000b375391da344
Co-Authored-By: Minh Diep <mdiep@whamcloud.com>
Co-Authored-By: Alex Deiter <alex.deiter@gmail.com>
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Signed-off-by: Alex Deiter <alex.deiter@gmail.com>
Reviewed-on: https://review.whamcloud.com/47880
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>