Whamcloud - gitweb
Etienne AUJAMES [Fri, 21 Jan 2022 14:49:18 +0000 (15:49 +0100)]
LU-15467 tests: fix sanity-hsm test_103a timeout issue
Add an MDS version check to "sanity-hsm test_103a" for interop testing.
Limit the number of parallel hsm restore requests to
max_rpcs_in_flight.
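The throttling idea can be sketched as follows. This is a Python illustration, not the sanity-hsm shell code: run_limited, worker and max_in_flight are hypothetical names, with max_in_flight standing in for the client's max_rpcs_in_flight value.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def run_limited(items, worker, max_in_flight):
    """Run worker(item) for every item, never more than max_in_flight at once.

    Returns (results in input order, peak number of concurrent workers).
    All names here are assumptions made for the sketch.
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def tracked(item):
        # Count how many workers are running so the cap can be observed.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        try:
            return worker(item)
        finally:
            with lock:
                state["active"] -= 1

    # The pool size is the cap: extra items wait in the executor's queue.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        results = list(pool.map(tracked, items))
    return results, state["peak"]
```

With a cap of 4, twenty requests are all served, but at most four run at the same time.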
Lustre-change: https://review.whamcloud.com/46252
Lustre-commit: 98e1e41ce47c95155a8c8d452eef5074492d22f0
Fixes: b449f3d ("LU-15145 hsm: unlock the restore layout lock for a cancel")
Test-Parameters: trivial
Test-Parameters: testlist=sanity-hsm env=ONLY=103a,ONLY_REPEAT=20
Test-Parameters: testlist=sanity-hsm
Signed-off-by: Etienne AUJAMES <etienne.aujames@cea.fr>
Change-Id: I78098042d1316cdcc9d2e25860099a0ffdba2503
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48960
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Mikhail Pershin [Mon, 17 Oct 2022 19:53:44 +0000 (12:53 -0700)]
LU-15646 llog: correct llog FID and path output
- fix wrong LLOG_ID-to-FID conversion to output llog FID by
introducing PLOGID macro to expand llog ID for DFID format
- stop printing lgl_ogen along with llog FID as it is always zero
since 2.3.51 and is not used anymore
- output correct path for update llog in llog_reader
- always print header info in llog_reader if available
- print llog flags in header info
Lustre-change: https://review.whamcloud.com/48430
Lustre-commit: e28f3ee185b2ef7bad8046f46444772fac214a40
Fixes: 5a8e47d0a1a7 ("LU-9153 llog: update llog print format to use FIDs")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I7ba49e8101a67d2d80c204a5fc629bfd0bce89ad
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48896
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Bruno Faccini [Mon, 17 Oct 2022 19:46:25 +0000 (12:46 -0700)]
LU-6612 utils: strengthen llog_reader vs wrong format/header
The following snippet shows that llog_reader can be confused by
an invalid record count of 0 when parsing what is expected to be an
LLOG file header:
root# dd if=/dev/zero bs=4096 count=1 of=/tmp/zeroes
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000263962 s, 15.5 MB/s
root# llog_reader /tmp/zeroes
Memory Alloc for recs_buf error.
Could not pack buffer; rc=-12
Lustre-change: https://review.whamcloud.com/15654
Lustre-commit: 45291b8c06eebf33d3654db3a7d3cfc5836004a6
Test-Parameters: trivial testlist=sanity,sanity-hsm
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Change-Id: I12be79e6c6a5da384a5fd81878a76a7ea8aa5834
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48895
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Etienne AUJAMES [Mon, 17 Oct 2022 19:37:39 +0000 (12:37 -0700)]
LU-15000 llog: read canceled records in llog_backup
llog_backup() does not reproduce index "holes" in the generated copy.
This could result in an llog copy whose indexes differ from the source,
which might confuse the configuration update mechanism that relies on
indexes between the MGS source and the target copy.
These index gaps can be caused by "lctl --device MGS llog_cancel".
This patch adds a "raw" read mode to llog_process* to read canceled
records, so llog_backup is now able to reproduce an exact copy of
the original.
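The difference between the two copy modes can be modelled in a few lines. This is a simplified Python sketch under stated assumptions: records are (index, payload, canceled) tuples, and both function names are hypothetical; the real llog format and llog_process* API are not shown.

```python
def backup_skipping_canceled(records):
    """Model of the old behaviour: canceled records are dropped, so the
    surviving records are renumbered and indexes drift from the source."""
    live = [payload for _, payload, canceled in records if not canceled]
    return [(new_idx, payload) for new_idx, payload in enumerate(live, start=1)]


def backup_raw(records):
    """Model of the "raw" mode: every record, canceled or not, is copied
    with its original index, so the holes are preserved."""
    return [(idx, payload, canceled) for idx, payload, canceled in records]


# Record 2 was canceled, leaving a hole in the source.
source = [(1, "a", False), (2, "b", True), (3, "c", False)]
```

In the first mode record "c" moves from index 3 to index 2; in raw mode the copy matches the source exactly.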
Lustre-change: https://review.whamcloud.com/46552
Lustre-commit: d8e2723b4e9409954846939026c599b0b1170e6e
Signed-off-by: Etienne AUJAMES <etienne.aujames@cea.fr>
Change-Id: I811e23de8f4545bed36a44fedc2638d7418830dd
Reviewed-by: Dominique Martinet <qhufhnrynczannqp.f@noclue.notk.org>
Reviewed-by: DELBARY Gael <gael.delbary@cea.fr>
Reviewed-by: Stephane Thiell <sthiell@stanford.edu>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48894
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alex Zhuravlev [Mon, 17 Oct 2022 19:31:56 +0000 (12:31 -0700)]
LU-14098 obdclass: try to skip corrupted llog records
If an llog's header or record is found to be corrupted, then
ignore the remaining records and try with the next one.
Lustre-change: https://review.whamcloud.com/40754
Lustre-commit: 910eb97c1b43a44a9da2ae14c3b83e28ca6342fc
Fixes: 186f083722 ("LU-11924 osp: combine llog cancel operations")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: If47ec1fc1e2eaf64be7ba08d3aa9c2b93903c0cf
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48893
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Yang Sheng [Mon, 17 Oct 2022 18:53:47 +0000 (11:53 -0700)]
LU-14044 llog: check fid after convert
We should convert from llog_id and then check the fid. Also
change fid-lookup to an error check instead of LASSERT.
Lustre-change: https://review.whamcloud.com/40294
Lustre-commit: 6df76d3357fc5896b6902399ed7ce6d7c7835f58
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Change-Id: I673d8f16ff9e57a0482d6a3ec3ee3db33699f57f
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48892
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Andreas Dilger [Fri, 14 Oct 2022 22:04:53 +0000 (16:04 -0600)]
EX-5909 tests: clean up in sanity-quota/16a
Clean up the test file in sanity-quota test_16a. If test_16b is
run (DNE config) then the filesystem is reformatted, but in the
non-DNE config test_17 will fail if there is used quota.
Test-Parameters: trivial testlist=sanity-quota
Fixes: b54b7ce43929 ("LU-14472 quota: skip non-exist or inact tgt for lfs_quota")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Id1faeab9df246d8010bf114582ab17a75846db68
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48899
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Andreas Dilger [Fri, 14 Oct 2022 20:06:26 +0000 (14:06 -0600)]
RM-620 build: New tag 2.14.0-ddn64
New tag 2.14.0-ddn64
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ia86edfc375e1dda7205db1a32c8c1933153a3e92
Hongchao Zhang [Fri, 22 Jul 2022 15:02:24 +0000 (23:02 +0800)]
LU-15738 test: check lfsck status before starting
If LFSCK is already running when "lfsck_start" is called
to start it, the test should not fail for starting LFSCK.
Lustre-change: https://review.whamcloud.com/48018/
Lustre-commit: 29aaf679afac89359e1b116b8de0480f24b4e8ac
Test-Parameters: trivial testlist=sanity-lfsck
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Change-Id: I266d9e2b9c5f37eb9e08b489fab428268b90d895
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48841
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alex Zhuravlev [Mon, 19 Sep 2022 16:00:15 +0000 (19:00 +0300)]
EX-5964 lamigo: disable idle disconnects
on the connections lamigo uses locally to avoid storms
of reconnects.
Test-Parameters: trivial testlist=hot-pools
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I3bc2742853e9636e38fbd8f7c2f238b3af55e0ba
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48840
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Alex Zhuravlev [Fri, 6 Aug 2021 06:34:31 +0000 (09:34 +0300)]
EX-3142 tests: changelog processing verification
add extra counter to lamigo stats to catch gaps in changelog
processing. add a new test (hot-pools/60) to verify that no
gaps happen (i.e. lamigo gets all changelog records), verify
that the changelog is purged properly.
Test-Parameters: trivial testlist=hot-pools mdscount=2 mdtcount=4
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I34d9d6f6f7f5766d945df43ae7d43dab7c70cef1
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48434
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
John L. Hammond [Wed, 8 Jun 2022 02:15:39 +0000 (19:15 -0700)]
LU-13578 test: sleep longer in sanity test_39
In sanity test_39r(), sleep for 2 * atime_diff rather than atime_diff + 1.
Lustre-change: https://review.whamcloud.com/47346
Lustre-commit: be2525ffddb4bf55fde77e97b00d1c349119daed
Test-Parameters: trivial testlist=sanity env=ONLY=39r,ONLY_REPEAT=50
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: Ied508e12c848f6935d2317fb86bddc5341a6156e
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48831
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Andriy Skulysh [Fri, 5 Nov 2021 10:55:08 +0000 (12:55 +0200)]
LU-15472 ldlm: optimize flock reprocess
Resource reprocess on flock unlock can be done once
after all pending unlock requests.
This reduces spinlock contention.
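The batching idea is shown here as a toy Python model, not the ldlm code. Resource, unlock_each and unlock_batched are hypothetical names, and the reprocess counter stands in for the work done under the resource spinlock.

```python
class Resource:
    """Toy resource that counts how often it is reprocessed."""

    def __init__(self):
        self.reprocess_calls = 0

    def reprocess(self):
        self.reprocess_calls += 1


def unlock_each(resource, pending):
    """One reprocess per unlock request."""
    for _ in pending:
        resource.reprocess()


def unlock_batched(resource, pending):
    """Handle every pending unlock first, then reprocess the resource once."""
    handled = 0
    for _ in pending:
        handled += 1  # release the lock; no reprocess yet
    if handled:
        resource.reprocess()
```

For N pending unlocks the first variant reprocesses N times, the second once.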
Lustre-change: https://review.whamcloud.com/46257
Lustre-commit: 42f377db4a24cefa7a041fcd3106dd58771eb319
Change-Id: I2809070f27fe3af7e1fc34e2b4b22603931f3dff
HPE-bug-id: LUS-10471, LUS-10909
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48818
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Etienne AUJAMES [Mon, 2 May 2022 12:27:17 +0000 (14:27 +0200)]
LU-15132 mdc: Use early cancels for hsm requests
HSM RELEASE and RESTORE requests take EX layout lock on the MDT side.
So the client can use early cancel for its local lock on the resource
to limit the contention (mdt side).
This patch does not pack ldlm request inside the hsm request because
the field (RMF_DLM_REQ) does not exist in the request. Adding this
field inside the request would break compatibility with _old_ servers.
Lustre-change: https://review.whamcloud.com/47181
Lustre-commit: 60d2a4b0efa4a944b558bd9b63b6334f7e70419b
Signed-off-by: Xing Huang <hxing@ddn.com>
Change-Id: I30a57b4855c28eef9c55a9645d3b6c491f962b13
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48652
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Qian Yingjin <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Serguei Smirnov [Thu, 8 Sep 2022 22:27:12 +0000 (15:27 -0700)]
LU-15885 o2iblnd: fix handling of RDMA_CM_EVENT_UNREACHABLE
RDMA_CM_EVENT_UNREACHABLE may be received not only when a connection
is being established, but also when it is being closed. Fix handling
of this event accordingly.
Lustre-change: https://review.whamcloud.com/48492
Lustre-commit: 3925b1669d519e6c038ecce1287c1ced3de623d3
Test-Parameters: trivial
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: I79428188c159b2d80d36326589b2977db065d4a7
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48827
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alex Zhuravlev [Wed, 12 Oct 2022 06:35:42 +0000 (09:35 +0300)]
LU-14428 libcfs: discard cfs_trace_copyin_string()
Instead of cfs_trace_copyin_string(), use memdup_user_nul().
This combines the allocation with the copyin, and nul-terminates.
The resulting code is a lot simpler.
Lustre-change: https://review.whamcloud.com/41490
Lustre-commit: 67af976c806994cec27414d24b43f6519d72c240
LU-14788 lnet: check memdup_user_nul using IS_ERR
Crash in __proc_lnet_portal_rotor. memdup_user_nul returns an ERR_PTR
on error, not a NULL pointer. IS_ERR and PTR_ERR functions have to be
used to check and return the correct error code. The fix has been
applied in other locations having the wrong check.
Lustre-change: https://review.whamcloud.com/44091
Lustre-commit: 449d046e55a42cc4d1c4ab0217551cded1864bc4
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I089c5da96b59ec62d177aea2f3d170bf751c6fec
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48835
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alexander Boyko [Tue, 24 Nov 2020 09:05:36 +0000 (04:05 -0500)]
LU-13974 tests: update log corruption
The test case reproduces a missing object for a sub-transaction during
a set xattr operation.
The first setattr got -2; the second had already started but had not
called llog_add yet. In this case the llog osp object is stale after
top_trans_start, so the declaration phase cannot refresh llogs. At
llog_osd_write_rec the osp object changes from stale to
valid (dt_attr_get), but the llog handle and llog header are invalid.
A new record would be added to the update log with the wrong index.
In that case processing of update log fails with
fs1-MDT0001-osp-MDT0003: [0x2:0x400024d0:0x2] Invalid record: index
112926 but expected 112925
lod_sub_recovery_thread()) fs1-MDT0001-osp-MDT0003 get update log
failed: rc = -34
Recovery aborted, and clients are evicted.
Lustre-change: https://review.whamcloud.com/40743
Lustre-commit: 562837124ec7bffeba7edb4b4b899bc271833374
HPE-bug-id: LUS-9030
Test-Parameters: testlist=sanity envdefinitions=ONLY="427"
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: I6a47fed1bc01f4be62216d1d0787adc413df0cf5
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48832
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Aleksei Alyaev [Thu, 23 Dec 2021 08:48:22 +0000 (11:48 +0300)]
LU-8621 utils: cmd help to stdout or short cmd error
- Changed to print command help to stdout
- Changed to output short error message for an unrecognized command
Lustre-change: https://review.whamcloud.com/47162/
Lustre-commit: bc69a8d058f5bcdb75e062df57a6ccd23243d1e0
Test-Parameters: trivial
Signed-off-by: Aleksei Alyaev <aalyaev@ddn.com>
Change-Id: I67616ddb576e3347a2da130b3a731a6bf8730185
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48851
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Shaun Tancheff [Thu, 13 Oct 2022 19:19:47 +0000 (12:19 -0700)]
LU-16233 build: Add always target for SUSE15 SP3 LTSS
SUSE 15 SP3 LTSS kernel version 5.3.18-150300.59.93
(and later) breaks lustre build tests which expect
conftest.i to be generated.
Lustre-change: https://review.whamcloud.com/48833
Lustre-commit: TBD (from 274b34c4d3a20937ebb17d139dbde0eaaed503b2)
HPE-bug-id: LUS-11286
Test-Parameters: trivial
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: If23e9b31b537878a43075ffff62a99906f47fd9a
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48863
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jian Yu [Wed, 28 Sep 2022 07:00:22 +0000 (00:00 -0700)]
LU-16174 kernel: kernel update SLES15 SP4 [5.14.21-150400.24.21.2]
Update SLES15 SP4 kernel to 5.14.21-150400.24.21.2 for Lustre client.
Lustre-change: https://review.whamcloud.com/48604
Lustre-commit: TBD (from 896fd88c35b6685a586c1279c83c739b48cbe846)
Test-Parameters: trivial clientdistro=sles15sp4 \
env=SANITY_EXCEPT="27J 101j 244a" testlist=sanity
Change-Id: Ia68e1c960c79f40d0f725b0f440cd562b820a19f
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48689
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jian Yu [Wed, 28 Sep 2022 06:46:30 +0000 (23:46 -0700)]
LU-16177 kernel: kernel update RHEL9.0 [5.14.0-70.26.1.el9_0]
Update RHEL9.0 kernel to 5.14.0-70.26.1.el9_0 for Lustre client.
Lustre-change: https://review.whamcloud.com/48676
Lustre-commit: TBD (from 9951a56c26b1ce6639cd2db350fdf6b81b3b4707)
Test-Parameters: trivial clientdistro=el9.0 \
env=SANITY_EXCEPT="101j 130 244a" testlist=sanity
Change-Id: I9da2ccdf419d6490fdba80199eda69f4f19361be
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48687
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alexandre Ioffe [Wed, 12 Oct 2022 00:53:20 +0000 (17:53 -0700)]
EX-6130 lipe: s_volume_name not NUL terminated
The s_volume_name field stores a string, but the field may have no
terminating NUL if the string size equals the size of the field.
Therefore on some target systems the definition of
struct ext2_super_block s_volume_name in
/usr/include/ext2fs/ext2_fs.h may have the
attribute "nonstring". In that case it conflicts with calls
that require a NUL-terminated string.
The fix replaces NUL-terminated string calls with calls that take a
limited string size (e.g. strlen() -> strnlen()).
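The bounded-length behaviour that strnlen() provides can be modelled as follows. This is a Python sketch on raw bytes, not the lipe C code; the 16-byte field width and the helper names are assumptions made for illustration.

```python
VOLUME_NAME_SIZE = 16  # assumed field width for this sketch


def bounded_len(field: bytes, size: int = VOLUME_NAME_SIZE) -> int:
    """Like strnlen(): count bytes up to the first NUL, never past `size`."""
    window = field[:size]
    nul = window.find(b"\0")
    return len(window) if nul < 0 else nul


def volume_name(raw: bytes) -> str:
    """Extract the label from a buffer whose name field may lack a NUL."""
    return raw[:bounded_len(raw)].decode("ascii", errors="replace")
```

A label that fills the whole field has no terminator, so an unbounded scan would run into whatever bytes follow it; the bounded version stops at the field boundary.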
Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Change-Id: Ieb1921a289328a8f9bfae9bb658c6c74f8ec43b7
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48829
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Andreas Dilger [Tue, 11 Oct 2022 08:04:59 +0000 (02:04 -0600)]
RM-620 build: New tag 2.14.0-ddn63
New tag 2.14.0-ddn63
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I5e4b8d3d863cbd504fc7470b413d2083bb15e371
Etienne AUJAMES [Wed, 5 Oct 2022 07:10:05 +0000 (00:10 -0700)]
LU-15481 llog: Add LLOG_SKIP_PLAIN to skip llog plain
Add the catalog callback return LLOG_SKIP_PLAIN to conditionally skip
an entire llog plain.
This could speed up catalog processing for specific usages where a
record needs to be accessed in the "middle" of the catalog. This could
be useful for changelogs with several users or for HSM.
This patch modifies chlg_read_cat_process_cb() to use LLOG_SKIP_PLAIN.
The main idea came from:
d813c75d ("LU-14688 mdt: changelog purge deletes plain llog")
**Performance test:**
* Environment:
2474195 changelog records stored on mds0 (40 plain llogs):
mds# lctl get_param -n mdd.lustrefs-MDT0000.changelog_users
current index: 2474195
ID index (idle seconds)
cl1 0 (3509)
* Test
Access to records at the end of the catalog (offset: 2474194):
client# time lfs changelog lustrefs-MDT0000 2474194 >/dev/null
* Results
- with the patch: real 0m0.592s
- without the patch: real 0m17.835s (x30)
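Why skipping whole plain llogs helps can be seen in a small model. This is a Python sketch under assumptions: each plain llog is reduced to a first/last index range, and catalog_cb and read_from are hypothetical names; how chlg_read_cat_process_cb() actually decides is not shown in this log.

```python
PROCESS, SKIP_PLAIN = "process", "skip_plain"


def catalog_cb(plain, start_index):
    """Skip a plain llog entirely if all its records precede start_index."""
    return SKIP_PLAIN if plain["last"] < start_index else PROCESS


def read_from(catalog, start_index):
    """Return (indexes >= start_index, number of records examined)."""
    out, examined = [], 0
    for plain in catalog:
        if catalog_cb(plain, start_index) == SKIP_PLAIN:
            continue  # no record of this plain llog is read at all
        for idx in range(plain["first"], plain["last"] + 1):
            examined += 1
            if idx >= start_index:
                out.append(idx)
    return out, examined


catalog = [
    {"first": 1, "last": 100},
    {"first": 101, "last": 200},
    {"first": 201, "last": 300},
]
```

Reading from index 250 examines only the 100 records of the last plain llog rather than all 300.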
Lustre-change: https://review.whamcloud.com/46310
Lustre-commit: aa22a6826ee521ab14994a4533b0dbffb529aab0
Signed-off-by: Etienne AUJAMES <etienne.aujames@cea.fr>
Change-Id: I887d5bef1f3a6a31c46bc58959e0f508266c53d2
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48774
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Gaurang Tapase [Thu, 29 Sep 2022 05:57:45 +0000 (23:57 -0600)]
EX-6033 lipe: Add note to developers for HP scripts
stratagem-hp-* scripts are moved to EMF repo as
they are tightly coupled with EXA release because of
HA configuration. They are kept in lustre repo so that
hotfixes should not delete them.
Test-Parameters: trivial
Change-Id: I33eecaa4ed0c9342a83973bac313322a007d72d0
Signed-off-by: Gaurang Tapase <gtapase@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48698
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Qian Yingjin [Fri, 23 Sep 2022 09:34:09 +0000 (05:34 -0400)]
EX-5936 pcc: dont take UPDATE lock when set lustre.pin xattr
In this patch, we do not take the UPDATE lock when setting the lustre.pin
XATTR during the PCC pin command.
The reason is that it may revoke the combined UPDATE|LAYOUT lock
cached on the client namespace, and invalidate the layout and PCC
cache.
Since caching of the lustre.pin xattr in the client XATTR cache is
disabled, setting the lustre.pin XATTR without the UPDATE lock bit
does not cause problems.
Add test case: sanity-pcc/204d.
Change-Id: I35a0e399294020efdb0e4710500e8f7b846c290f
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48638
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alexander Boyko [Wed, 5 Oct 2022 07:06:59 +0000 (00:06 -0700)]
LU-14599 osp: limit allocation at osp_sync_process_committed
Sometimes osp cancels a very large cookie list with 64K elements.
In this case osp_sync_process_committed() tries to allocate 64 pages
and uses vmalloc.
The fix limits the memory allocation size to 4 pages with kmalloc, and
reuses it in a loop.
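The "small buffer reused in a loop" pattern looks roughly like this. It is a Python sketch, not the osp code: cancel_in_chunks and cancel_batch are hypothetical names, and chunk_size counts entries rather than pages.

```python
def cancel_in_chunks(cookies, cancel_batch, chunk_size=4):
    """Cancel cookies using one reusable buffer of at most chunk_size entries.

    Returns the number of batches passed to cancel_batch.
    """
    buf = []  # allocated once, reused for every batch
    batches = 0
    for cookie in cookies:
        buf.append(cookie)
        if len(buf) == chunk_size:
            cancel_batch(list(buf))
            buf.clear()
            batches += 1
    if buf:  # flush the final partial batch
        cancel_batch(list(buf))
        buf.clear()
        batches += 1
    return batches
```

The working set stays bounded by chunk_size no matter how long the cookie list is, at the cost of more cancel calls.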
Lustre-change: https://review.whamcloud.com/43250
Lustre-commit: 9b692e2e7d105f4926649ea46007ac65b24c4b6d
HPE-bug-id: LUS-9815
Fixes: 6d7332102 ("LU-11924 osp: combine llog cancel operations")
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: Ic875335a28f78494fdb3cbc4b0145e5a43831ee8
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48773
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Mikhail Pershin [Mon, 5 Sep 2022 07:41:37 +0000 (10:41 +0300)]
LU-16135 lod: prohibit DoM pattern in plain layout
A DoM pattern can be set as the default directory plain layout by
an older LFS version. It misses the DoM component sanity checks if a
plain layout is used. Such a layout is not allowed and causes
a later crash when a file is created under that directory.
LFS can prevent this, but not in all Lustre versions,
so LOD should do the check as well.
Lustre-change: https://review.whamcloud.com/48433
Lustre-commit: a8272168e3888ec4ced18035182159a8ee56a51a
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: Ic58fdda2ab3e63083128cb6cf949fcb43ccd2c02
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48514
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Etienne AUJAMES [Thu, 21 Oct 2021 14:31:01 +0000 (16:31 +0200)]
LU-15132 hsm: Protect against parallel HSM restore requests
Multiple parallel accesses (read/write) to the same released file
could cause multiple HSM restore requests to be sent.
On the MDT side, each restore request waits for the first one to complete
before grabbing the MDS_INODELOCK_LAYOUT LCK_EX and registering the
llog record.
This could cause several MDT threads to hang for the same restore
request sent in parallel. In the worst case, all MDT threads can
hang and the MDS is no longer able to handle requests.
This patch checks if an HSM restore handle exists before taking the
lock.
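The "check for an existing handle first" idea can be modelled as below. This is a Python sketch, not the MDT code: RestoreTable and its methods are hypothetical, and a counter stands in for taking the layout lock.

```python
import threading


class RestoreTable:
    """Tracks in-progress restores so duplicates return early."""

    def __init__(self):
        self._guard = threading.Lock()
        self._in_progress = set()
        self.layout_lock_takes = 0

    def request_restore(self, fid):
        """Return True if a new restore was registered, False if one for
        this fid was already running (no layout lock is requested then)."""
        with self._guard:
            if fid in self._in_progress:
                return False  # existing handle: do not queue on the lock
            self._in_progress.add(fid)
            self.layout_lock_takes += 1  # stands in for the EX layout lock
            return True

    def complete(self, fid):
        """Mark the restore for this fid as finished."""
        with self._guard:
            self._in_progress.discard(fid)
```

Duplicate requests for the same file return immediately instead of each tying up a service thread while waiting on the lock.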
Lustre-change: https://review.whamcloud.com/45367
Lustre-commit: 66b3e74bccf1451d135b7f331459b6af1c06431b
Test-Parameters: testlist=sanity-hsm,sanity-hsm
Test-Parameters: testlist=sanity-hsm env=ONLY=12s,ONLY_REPEAT=50
Signed-off-by: Xing Huang <hxing@ddn.com>
Change-Id: I9584edc2c7411aa41b2e318e55f57c117d1c3dfb
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48650
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Qian Yingjin <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Vitaly Fertman [Tue, 4 Oct 2022 17:30:08 +0000 (10:30 -0700)]
LU-16062 ldlm: improve bl_timeout for prolong
If there is a client RPC in hand, we can do a better job of
calculating the lock callback timeout, as the RPC carries what the
client thinks this RPC's timeout is. Let's use it.
Lustre-change: https://review.whamcloud.com/48094
Lustre-commit: 34b2246e4a6c8ce827c404cb4e52f7c6a0a1b90b
HPE-bug-id: LUS-8866, LUS-11074
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Change-Id: Ibd67d37c1073d0d3cb2e08b532c801af0de116fe
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48762
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Vitaly Fertman [Tue, 4 Oct 2022 17:24:31 +0000 (10:24 -0700)]
LU-14183 ldlm: wrong ldlm_add_waiting_lock usage
exp_bl_lock_at originally accounted for the period from when the BLAST
was sent until the cancel RPC reached the server. LU-6032 started to
update l_blast_sent for expired locks which are still busy, i.e. locks
prolonged when the timeout expired. In fact, it is a good idea to cover
not the whole period but only the time until any involved RPC arrives:
it avoids excessively large lock callback timeouts, and the IO which
does the lock prolong is also able to re-start the AT cycle by updating
l_blast_sent.
Unfortunately, the change seems to have been made accidentally, as the
main prolong code was not adjusted accordingly.
Lustre-change: https://review.whamcloud.com/40868
Lustre-commit: af07c9a79e263f940fea06a911803097b57b55f4
Fixes: 292aa42e08 ("LU-6032 ldlm: don't disable softirq for exp_rpc_lock")
HPE-bug-id: LUS-9278
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Change-Id: Idc598508fc13aa33ac9fce56f13310ca6fc819d4
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48761
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Lei Feng [Thu, 30 Jun 2022 02:46:31 +0000 (10:46 +0800)]
LU-15986 ptlrpc: protect rq_repmsg in ptlrpc_req_drop_rs()
There is a race condition: on the server side, one thread has sent an
early reply and is deleting the reply message, while another is
searching for an existing request and prints some debug information
in _debug_req() if there is a duplicated request. They both operate on
req->rq_repmsg but it is not protected in ptlrpc_req_drop_rs().
So protect it with req->rq_early_free_lock.
Lustre-change: https://review.whamcloud.com/47839
Lustre-commit: aaef545cff2dd958418ec9fb364d4bbe1408edb9
Signed-off-by: Lei Feng <flei@whamcloud.com>
Change-Id: Ied55427ee15c3ef84bdd2d579844eba398dbf010
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/47860
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Yang Sheng [Mon, 19 Sep 2022 05:46:27 +0000 (13:46 +0800)]
LU-16166 ptlrpc: lower the message level in no resend case
Don't report the wrong generation as an error message in the
rq_no_resend case.
Lustre-change: https://review.whamcloud.com/48585
Lustre-commit: d13cca56a5ae2ad44d8083025e37263e408b8f62
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Change-Id: I534cadc916fcd1eb6840439b6507e646d0e5d974
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48807
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Artem Blagodarenko [Wed, 28 Sep 2022 14:28:11 +0000 (10:28 -0400)]
EX-6069 ldiskfs: ext4-simple-blockalloc.patch small fixes
LU-14305 requires some cleanup.
MB_DEFAULT_MAX_CX_BYTES #defines are not used anymore,
and should be removed.
Also, in the el8 version of the patch for b_es6_0,
the THRESHOLD_BLOCKS() function should explicitly take "sbi"
as a parameter.
Test-Parameters: trivial
Fixes: d5d5cfdde2 ("add persistent tuning for mb_c3_threshold")
Change-Id: Idcb93432fdfa7694b4e7cabbf46a0bf21a412f87
Signed-off-by: Artem Blagodarenko <ablagodarenko@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48714
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Serguei Smirnov [Fri, 23 Sep 2022 19:29:59 +0000 (12:29 -0700)]
LU-16184 o2iblnd: fix deadline for tx on peer queue
In o2iblnd, deadline is checked for txs on peer queue,
but not set prior to adding the tx to the queue. This
may cause the tx to be dropped unnecessarily with
"Timed out tx for ..." warning.
Fix it by setting the tx_deadline when adding tx to peer queue.
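The effect of stamping the deadline at queue time is shown here in a toy model. This is a Python sketch, not the o2iblnd code: the dict layout, the function names, and the use of 0 as the "never set" value are all assumptions.

```python
def queue_tx(peer_queue, tx, now, timeout):
    """Stamp the deadline before the tx becomes visible to the checker."""
    tx["deadline"] = now + timeout
    peer_queue.append(tx)


def timed_out(peer_queue, now):
    """Return the queued txs whose deadline has passed."""
    return [tx for tx in peer_queue if now >= tx["deadline"]]
```

A tx queued without a deadline (modelled here as 0) would be reported as timed out on the very first check, which corresponds to the unnecessary drop the message describes.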
Lustre-change: https://review.whamcloud.com/48640
Lustre-commit: 4c89ee7d7b098c7f1e6566f49fa2940db577518d
Test-Parameters: trivial
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: Ie7cf5590b440b60f71527049953a64bb31d53578
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48641
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Bobi Jam [Thu, 15 Sep 2022 06:46:34 +0000 (14:46 +0800)]
LU-16160 osc: take ldlm lock when queue sync pages
osc_queue_sync_pages() adds an osc_extent to the osc_object's IO extent
list without taking ldlm locks, and then it calls
osc_io_unplug_async() to queue the IO work for the client.
This patch makes sync page queuing take the ldlm lock in the
osc_extent.
Lustre-change: https://review.whamcloud.com/48557
Lustre-commit: 67aca1fcc6bed20794832decdba590a758d67d8f
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Change-Id: Idefa2981e62a2a6e10d8b8a7692c0337b61b9052
Reviewed-on: https://review.whamcloud.com/48597
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alexandre Ioffe [Wed, 21 Sep 2022 19:15:03 +0000 (12:15 -0700)]
EX-5932 lipe: stratagem-hp-config.sh has wrong MDTLIST
stratagem-hp-config.sh doesn't pick up proper MDTLIST
if snapshot agents are running. Fix MDTLIST which is used
to configure lpurge
Test-Parameters: trivial
Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Change-Id: Ic1d58d56f1acae140122d0b582410c140759e89e
Reviewed-on: https://review.whamcloud.com/48619
Reviewed-by: Shuichi Ihara <sihara@ddn.com>
Reviewed-by: Colin Faber <cfaber@ddn.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Emoly Liu [Thu, 15 Sep 2022 01:42:47 +0000 (09:42 +0800)]
LU-16154 obdclass: free inst_name correctly
In function class_config_llog_handler(), inst_name should be freed
correctly before break.
Lustre-change: https://review.whamcloud.com/48542
Lustre-commit: e7f17c5e0c95dba3b80e192e4ca3628cc42e64b9
Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Change-Id: I6adc0ed62c3c637237834b799f25666d0e7e1ecb
Reviewed-on: https://review.whamcloud.com/48670
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jian Yu [Mon, 26 Sep 2022 18:22:56 +0000 (11:22 -0700)]
LU-16050 build: replace ofed_info with dpkg/rpm
After installing MLNX_OFED by running the mlnxofedinstall command, the
mlnx-ofed-kernel-modules package is not listed by ofed_info,
which causes Lustre configure to fail as follows:
checking whether to use Compat RDMA... /usr/bin/ofed_info
dpkg-query: error: --listfiles needs at least one package name argument
This patch fixes the above issue by replacing ofed_info with
"dpkg -l" and "rpm -qa" commands to find OFED package.
Lustre-change: https://review.whamcloud.com/48047
Lustre-commit: 3a7930e63c15b0fbe51ac73db81a1186939115bb
Test-Parameters: trivial
Fixes: ec03c9628cae ("LU-15417 build: find the new path for MOFED 5.5")
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Change-Id: Ia3c2d6bf10e147ca2761221741eff6f93008556c
Reviewed-by: Gaurang Tapase <gtapase@ddn.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48662
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jian Yu [Wed, 28 Sep 2022 16:54:33 +0000 (09:54 -0700)]
EX-6014 tests: Revert "EX-4093 tests: hot-pools don't recreate pools"
This reverts commit
116cbacc52d8 to resolve the hot-pools
regression test failures.
After running sub-test 1, the OST pools were destroyed by
the following stack_trap in create_pool():
stack_trap "destroy_test_pools $fsname" EXIT
If the pools are not recreated in the successive sub-tests,
then they will fail. We have to revert commit
116cbacc52d8
until we find a way to avoid triggering the stack_trap
between sub-tests.
Test-Parameters: trivial mdscount=2 mdtcount=4 \
testlist=parallel-scale-nfsv4,hot-pools
Fixes:
116cbacc52d8 ("EX-4093 tests: hot-pools don't recreate pools")
Change-Id: I464a1f9f380c55e70b78a0dd7e52723d5b0a298d
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48690
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Andreas Dilger [Fri, 23 Sep 2022 22:24:58 +0000 (16:24 -0600)]
RM-620 build: New tag 2.14.0-ddn62
New tag 2.14.0-ddn62
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I21b71b04905a70acbaada6d5a7fbab6c9184ca51
Andreas Dilger [Fri, 23 Sep 2022 19:36:53 +0000 (19:36 +0000)]
Revert "EX-4141 lipe: lamigo should detect dead OST and restart ALR"
This reverts commit
028bee14d2c6d8feb5eb418302f8751643e731c6 due to build error.
Change-Id: I6193f3e99192b618a3e6616524e28b230659fc0b
Reviewed-on: https://review.whamcloud.com/48639
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Andreas Dilger <adilger@whamcloud.com>
Andreas Dilger [Fri, 23 Sep 2022 17:19:23 +0000 (11:19 -0600)]
RM-620 build: New tag 2.14.0-ddn61
New tag 2.14.0-ddn61
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I34c78bc6ce2fbac65e4e8b017cad1da05c78d53a
Minh Diep [Thu, 15 Sep 2022 03:41:37 +0000 (20:41 -0700)]
LU-16183 tests: sanity-hsm/70 should detect python
Check for python2 and python3 explicitly, since the
generic python command does not exist in newer distros.
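The explicit interpreter check described above can be sketched as follows. This is only an illustration in Python, not the shell code used by sanity-hsm; the helper name find_python and the candidate order are assumptions.

```python
import shutil


def find_python(candidates=("python3", "python2"), which=shutil.which):
    """Return the path of the first interpreter found, or None.

    Hypothetical helper: the candidate order is an assumption, and the
    generic "python" name is deliberately not tried.
    """
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None
```

A caller would skip the test when find_python() returns None.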
Test-Parameters: env=SLOW=yes,ENABLE_QUOTA=yes \
clientdistro=sles15sp3 testlist=sanity-hsm
Test-Parameters: env=SLOW=yes,ENABLE_QUOTA=yes \
clientdistro=el7.9 testlist=sanity-hsm
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Change-Id: I2251be461129310868868277bf9d46015545ffe2
Reviewed-on: https://review.whamcloud.com/48577
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Alex Deiter <alex.deiter@gmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alexandre Ioffe [Tue, 29 Mar 2022 07:48:35 +0000 (00:48 -0700)]
EX-4141 lipe: lamigo should detect dead OST and restart ALR
Use #keepalive message and ssh read with timeout
to detect OST is down and restart ALR.
Add stats for ALR last seen message
Duplicate ofd_access_log_reader from lustre/utils into
lipe/src/es_ofd_access_log_reader
Use common lamigo_hash.h for lamigo and
es_ofd_access_log_reader
Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Test-Parameters: trivial testlist=hot-pools
Change-Id: I26dc631a8663046821e049fc6e091108b2a62f87
Reviewed-on: https://review.whamcloud.com/46944
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: John Hammond <jhammond@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Chris Horn [Tue, 24 Aug 2021 16:16:17 +0000 (11:16 -0500)]
LU-14962 lnet: Check for -ESHUTDOWN in lnet_parse
The fix for LU-8106, http://review.whamcloud.com/19993, no longer
works because rc does not have the return value from
lnet_nid2peerni_locked(). Use PTR_ERR to get the return value and
restore the LU-8106 fix.
Lustre-change: https://review.whamcloud.com/44743
Lustre-commit:
cce82630cbf2c7badbbdd16a8ca9c8c0065ded13
Test-Parameters: trivial
HPE-bug-id: LUS-10333
Fixes:
fa8b4e6357 ("LU-7734 lnet: peer/peer_ni handling adjustments")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I9cc2bc2d6e675d38cf06d99c524bdd95110bf0e9
Reviewed-on: https://review.whamcloud.com/48487
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Thu, 3 Mar 2022 07:12:32 +0000 (01:12 -0600)]
LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
If the peer NI lookup in lnet_parse() fails with ESHUTDOWN then we
should return that value back to the LNDs so that they can treat the
failed call the same way as other lnet_parse() failures.
Returning zero results in at least one bug in socklnd where a
reference on a ksock_conn can be leaked which prevents socklnd from
shutting down.
Lustre-change: https://review.whamcloud.com/46711
Lustre-commit:
4fbd0705a3d25bbc85e953f81e697e5006b215ce
Fixes:
47b7b31978 ("LU-8106 lnet: Do not drop message when shutting down LNet")
Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-15794
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ic403619c6dccf3921c46a674808c404adad7a30e
Reviewed-on: https://review.whamcloud.com/48485
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Mon, 7 Mar 2022 17:03:50 +0000 (11:03 -0600)]
LU-15616 lnet: ln_api_mutex deadlocks
LNetNIFini() acquires the ln_api_mutex and holds onto it throughout
various shutdown routines. Meanwhile, LND threads (via
lnet_nid2peerni_locked()) or the discovery thread (via
lnet_peer_data_present()) may need to acquire this mutex in order to
progress.
Address these potential deadlocks by setting the_lnet.ln_state to
LNET_STATE_STOPPING earlier in LNetNIFini(), and release the mutex
prior to any call into LND module or before any wait.
LNetNIInit() is modified to return -ESHUTDOWN if it finds that there
is a concurrent shutdown in progress.
Lustre-change: https://review.whamcloud.com/46727
Lustre-commit:
22de0bd145b649768b16dd42559d326af3c13200
Test-Parameters: trivial
HPE-bug-id: LUS-10681
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ia8b28cc95ff71e66a0f99aed4f2c22ec9d44ce1e
Reviewed-on: https://review.whamcloud.com/48384
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Fri, 11 Dec 2020 18:04:32 +0000 (12:04 -0600)]
LU-13806 lnet: Ensure proper peer, peer NI, peer net hierarchy
The MR design dictates that the peer nets and peer NIs are ordered
such that the peer net and peer NI for a peer's primary NID appears
first, followed by other peer NIs in the primary NID's peer net,
followed by other peer nets/NIs. This ordering is broken and it can
result in tripping an assertion if the primary NID of a peer is
deleted. Modify lnet_peer_attach_peer_ni() to check whether the
NI being attached is the peer's primary, and place it, and its
associated peer net, appropriately.
Modify lnet_peer_set_primary_nid() so that it updates the
lp_primary_nid before calling lnet_peer_add_nid() so that
lnet_peer_attach_peer_ni() can detect the situation where the
primary is changing and act appropriately.
Finally, modify lnet_peer_merge_data() to enforce the hierarchy
after it has finished merging the contents of the ping buffer. This
ensures we maintain the correct hierarchy in certain edge cases where
we've needed to reconcile two peers. e.g. if a peer adds a new
interface, the discovery push may arrive from that new interface
which will result in a second peer object being created which will
need to be reconciled with the original peer object.
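The ordering rule described above (primary NID first, then other NIs in the primary's net, then everything else) can be sketched as below. This is a simplified Python model over NID strings, not the kernel list handling in lnet_peer_attach_peer_ni(); the function names are hypothetical.

```python
def net_of(nid):
    """Return the network part of a NID such as '10.0.0.1@tcp'."""
    return nid.split("@", 1)[1]


def order_peer_nis(nids, primary):
    """Order NIDs: primary first, then its net, then all other nets.

    sorted() is stable, so NIDs with the same rank keep their
    original relative order.
    """
    primary_net = net_of(primary)

    def rank(nid):
        if nid == primary:
            return 0
        if net_of(nid) == primary_net:
            return 1
        return 2

    return sorted(nids, key=rank)
```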
Lustre-change: https://review.whamcloud.com/40985
Lustre-commit:
9eb9474c41c823c70f34e6bb102a8861ca21a3d1
HPE-bug-id: LUS-9630
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I8397a24ba1ba0bba33846e7e97b8d60a8f26a1be
Reviewed-on: https://review.whamcloud.com/48508
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Sat, 5 Feb 2022 23:15:30 +0000 (23:15 +0000)]
LU-15538 lnet: DLC sets map_on_demand incorrectly
When any NET or LND tunable is specified via CLI or yaml, then the
whole tunables struct gets memset to 0, or in the case of yaml config,
0 gets assigned to any tunable that isn't specified in the yaml. This
causes a problem for map_on_demand because 0 is a valid value for that
parameter, and ko2iblnd cannot know whether the user specified that 0
should be used or if DLC is specifying that the parameter was unset.
Rather than setting this parameter to 0 in the LND tunables struct,
have DLC set it to UINT_MAX to indicate that ko2iblnd should use the
value of the kernel module parameter.
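The sentinel approach can be illustrated with a small sketch. It is written in Python for brevity and is not the ko2iblnd code; the function name is hypothetical and UINT_MAX is assumed to be the 32-bit value.

```python
UINT_MAX = 0xFFFFFFFF  # assumed 32-bit unsigned int maximum


def effective_map_on_demand(tunable_value, module_param):
    """Pick the value to use for map_on_demand.

    UINT_MAX means "not set by the user", so fall back to the kernel
    module parameter; any other value, including 0, is taken as given.
    """
    if tunable_value == UINT_MAX:
        return module_param
    return tunable_value
```

With this convention an explicit 0 from the user is no longer confused with an unset tunable.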
Lustre-change: https://review.whamcloud.com/46492
Lustre-commit:
896f4a082b93453f5e7168f685faff4fba594ff3
Test-Parameters: trivial
HPE-bug-id: LUS-10740
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I303e64d4d402ba61b5ae3e3910873f192a4a2845
Reviewed-on: https://review.whamcloud.com/48491
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alex Zhuravlev [Wed, 21 Sep 2022 00:40:46 +0000 (17:40 -0700)]
EX-4093 tests: hot-pools don't recreate pools
The test can save some time by skipping pool re-creation in every
subtest.
before: 1371 seconds
after: 1058 seconds
Test-Parameters: trivial testlist=hot-pools
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: I9304e29b6fc59dd68626b44844dc81500009a80f
Reviewed-on: https://review.whamcloud.com/48614
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alexandre Ioffe [Thu, 8 Sep 2022 08:37:31 +0000 (01:37 -0700)]
EX-5824 test: hot-pools test_57: data copy failed: mirror failed
Add debug prints in hot-pools test_57
Test-Parameters: trivial env=FAIL_ON_ERROR=false,ONLY=56-57 testlist=hot-pools
Change-Id: I863b580f5483c14c24c6f79ebdddbc782b65e945
Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-on: https://review.whamcloud.com/48477
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
James Nunez [Mon, 13 Sep 2021 16:35:30 +0000 (10:35 -0600)]
LU-14992 tests: sanity/replay-vbr mkdir on MDT0
Replace mkdir with mkdir_on_mdt0() for sanity test 133a
and replay-vbr test 7a. These tests expect the newly
created directory to be on MDT0.
Lustre-change: https://review.whamcloud.com/44902/
Lustre-commit: TBD
Test-Parameters: trivial mdscount=2 mdtcount=4 testlist=sanity
Test-Parameters: env=SLOW=yes mdscount=2 mdtcount=4 testlist=replay-vbr
Change-Id: Icea2923a8d8d3a3aa0ddf0401f0a025480b2f6f0
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48606
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alex Zhuravlev [Tue, 30 Mar 2021 05:57:14 +0000 (08:57 +0300)]
LU-13358 libcfs: add timeout to cfs_race() to fix race
There is no guarantee that the branches in cfs_race() are executed
in strict order, thus it's possible that the second branch (with
cfs_race_state=1) is executed before the first branch and then another
thread executing the first branch gets stuck.
This construction is used for testing only, so as a
workaround it's enough to time out.
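The ordering problem and the timeout workaround can be modelled with a short sketch. This is a Python analogue using threading.Condition, not the libcfs macro; the class and method names are hypothetical.

```python
import threading


class RacePoint:
    """Toy model of a two-branch race point with a bounded wait."""

    def __init__(self):
        self._cond = threading.Condition()
        self._state = 1

    def first_branch(self, timeout):
        """Reset the state and wait for the second branch.

        Returns True if woken by the second branch, False on timeout.
        """
        with self._cond:
            self._state = 0
            return self._cond.wait_for(lambda: self._state != 0,
                                       timeout=timeout)

    def second_branch(self):
        """Set the state and wake any waiter."""
        with self._cond:
            self._state = 1
            self._cond.notify_all()
```

If second_branch() runs before first_branch(), the wakeup is lost; without the timeout the first branch would wait forever.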
Lustre-change: https://review.whamcloud.com/43161
Lustre-commit:
2d2d381f35ee004319a20f5d2d8e70d13480d6c7
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Change-Id: Ie1cc0accedb3e1a198d4b17d1ab00ce298c560f2
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48553
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Cyril Bordage [Thu, 17 Feb 2022 11:49:16 +0000 (12:49 +0100)]
LU-14875 import: fix bad CPT read
When importing, CPT was read from the tunables field, but in fact it is
at the same level as that field in the YAML file generated during export.
Lustre-change: https://review.whamcloud.com/46541
Lustre-commit:
9ad5c43f4a53f8679cfa1a60f8161b08d3dcfa66
Test-parameters: trivial testlist=sanity-lnet
Signed-off-by: Cyril Bordage <cbordage@whamcloud.com>
Change-Id: Iea7b6189ad1a25b95ae6416d75ee2cbe4dca2fbf
Reviewed-on: https://review.whamcloud.com/48490
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Emoly Liu [Fri, 9 Sep 2022 10:18:24 +0000 (18:18 +0800)]
EX-5798 tests: add a version check to conf-sanity.sh test_133
The patch at https://review.whamcloud.com/47334 has been ported
to b_es6_0 since 2.14.0-ddn46, so a version check is added to
conf-sanity.sh test_133 to avoid interop failure.
Test-Parameters: trivial testlist=conf-sanity serverversion=2.14.0-ddn23
Change-Id: I4bfc2986abddfd3a5a606f5586a29311582fca42
Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48501
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Sarah Liu <sarah@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Shaun Tancheff [Wed, 7 Sep 2022 04:35:51 +0000 (21:35 -0700)]
LU-16131 build: Do not depend on libmount during --enable-dist
Defer the libmount requirement when using --enable-dist to
generate the lustre-src.rpm.
This allows mock and/or yum build-deps to resolve
dependencies and pick up the libmount requirement without changing
the existing minimal build.
Lustre-change: https://review.whamcloud.com/48407
Lustre-commit:
819c8b169325045ae8bac9c4f38a58c75e22d099
Test-Parameters: trivial
HPE-bug-id: LUS-11091
Fixes:
f21b944127 ("LU-15940 build: add a required dependency for libmount")
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: I20a7a097f9b651b6ea5519f79efda6c96b6f2199
Reviewed-on: https://review.whamcloud.com/48448
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Sebastien Buisson [Fri, 12 Aug 2022 07:59:02 +0000 (09:59 +0200)]
LU-16085 llite: fix stat attributes_mask
Fix stat attributes_mask to return STATX_ATTR_ENCRYPTED whenever it is
possible. Also fix sanityn test_106c to expect at least the 0x30 flag
for attributes_mask.
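The "at least the 0x30 flag" check amounts to a bitmask test, sketched below in Python. The constants are the Linux statx attribute bits; the helper name has_all is hypothetical and this is not the code of sanityn test_106c.

```python
STATX_ATTR_IMMUTABLE = 0x10
STATX_ATTR_APPEND = 0x20
STATX_ATTR_ENCRYPTED = 0x800


def has_all(mask, required):
    """Return True if every bit in required is set in mask."""
    return (mask & required) == required


# 0x30 is the immutable and append-only bits together.
REQUIRED = STATX_ATTR_IMMUTABLE | STATX_ATTR_APPEND
```

Testing with a mask rather than equality lets the check pass whether or not STATX_ATTR_ENCRYPTED is also reported.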
Lustre-change: https://review.whamcloud.com/48208
Lustre-commit:
0e48653c27eacad29dbff1589da771ad4f5d1014
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
LU-16085 tests: fix sanityn test_106c
Fix sanityn test_106c after modification introduced when fixing
stat attributes_mask.
Lustre-change: https://review.whamcloud.com/48435
Lustre-commit:
b843e8f89fe9b697ceec4657dde445aa60c200d0
Test-Parameters: trivial testlist=sanityn env=ONLY=106c
Fixes:
0e48653c27 ("LU-16085 llite: fix stat attributes_mask")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: Icd16beff058c42d77e9b04ad1a287ec2ac04dfed
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48520
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Mikhail Pershin [Fri, 29 Jul 2022 08:24:15 +0000 (11:24 +0300)]
LU-16052 llog: handle -EBADR for catalog processing
Llog catalog processing might retry to get the last llog block
to check for new records if any. That might return -EBADR code
which should be considered as valid. Previously -EIO was
returned in all cases.
Run conf-sanity test_106 several times as a specific test.
Lustre-change: https://review.whamcloud.com/48070
Lustre-commit:
e260f751f2a21fa126eeb4bc9e94250ba3e815f1
Test-Parameters: testlist=conf-sanity env=ONLY=106,SLOW=yes,ONLY_REPEAT=10
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I30e04ba2c91c8bdce72c95675a1209639e9f0570
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Etienne AUJAMES <eaujames@ddn.com>
Reviewed-on: https://review.whamcloud.com/48540
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Andreas Dilger [Wed, 10 Aug 2022 18:27:56 +0000 (12:27 -0600)]
LU-16084 tests: fix lustre-patched filefrag check
Fix sanity test_130b thru test_130g to check for "filefrag -l"
instead of "filefrag -e", since the "-e" option has been in
upstream e2fsprogs since commit v1.42.6-50-g2508eaa7. The "-l"
option (logical extent ordering) is really what is needed to
handle Lustre-striped files anyway.
While there, fix the code style in these subtests:
- use "local" and lower-case names for local variables
- use $(...) for subshells
- use (( ... )) for numeric comparisons
- use preferred "check || action" style checks
- use "skip_env" for environment configuration checks (e2fsprogs)
- use "skip" for test-related checks that can't be "fixed"
- use pre-defined $ost1_FSTYPE for checking OST filesystem type
Lustre-change: https://review.whamcloud.com/48188
Lustre-commit:
fef1db004c4230e1051f9266f34a658501bf5d03
Test-Parameters: trivial
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I8eb7f17a9532796ab0274247194dd52cbc8a141c
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48555
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Andreas Dilger [Tue, 20 Sep 2022 18:58:35 +0000 (11:58 -0700)]
LU-16082 ldiskfs: old-style EA inode fix for el8.5/el8.6
Add the rhel8/ext4-old_ea_inodes_handling_fix.patch to the ldiskfs
series for el8.5 and el8.6 kernels.
Lustre-change: https://review.whamcloud.com/48496
Lustre-commit:
ba9845274c8ea5c55f57b7fa0e839f18d76031ea
Test-Parameters: trivial testlist=sanity clientdistro=el8.6 serverdistro=el8.6
Fixes:
76c3fa96dc30 ("LU-16082 ldiskfs: old-style EA inode handling fix")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ifb66a0b7d78e5153d7897bee45fbf1d0e58fbc5c
Reviewed-on: https://review.whamcloud.com/48612
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Jian Yu [Wed, 21 Sep 2022 20:28:43 +0000 (13:28 -0700)]
EX-5978 scripts: remove zfsobj2fid
The zfsobj2fid utility is not needed on EXA cluster.
Test-Parameters: trivial clientdistro=el9.0 \
env=SANITY_EXCEPT="101j 130 244a" testlist=sanity
Change-Id: I40993c7c4ddef3f389c002076f5c118a9f610758
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48621
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Gaurang Tapase <gtapase@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jian Yu [Wed, 21 Sep 2022 07:41:33 +0000 (00:41 -0700)]
EX-5975 build: check OS type before using dpkg
Bright cluster manager by default installs dpkg
on its centos/rhel installation - presumably to
allow provisioning debian nodes in the cluster,
so dpkg is in the path and can't be removed.
This patch fixes LB_USES_DPKG to check OS type
before checking if dpkg is installed.
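The idea of checking the OS type first can be sketched as follows. LB_USES_DPKG itself is an autoconf macro; this Python version only illustrates the logic, and the use of os-release fields and the function names are assumptions.

```python
def parse_os_release(text):
    """Parse KEY=value lines in the os-release format into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def uses_dpkg(os_release_text, dpkg_present):
    """Use dpkg only on Debian-like systems, even if dpkg is in PATH."""
    fields = parse_os_release(os_release_text)
    ids = {fields.get("ID", "")} | set(fields.get("ID_LIKE", "").split())
    is_debian_like = bool(ids & {"debian", "ubuntu"})
    return is_debian_like and dpkg_present
```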
Test-Parameters: trivial clientdistro=el8.6
Test-Parameters: trivial clientdistro=ubuntu2204 env=SANITY_EXCEPT="130 244a"
Change-Id: Idc9f6edc91f9c89b40f259421b088287e08bfe9c
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48616
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Gaurang Tapase <gtapase@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Shaun Tancheff [Wed, 14 Sep 2022 07:48:16 +0000 (00:48 -0700)]
LU-16090 build: Module.symvers lookup by flavor on SUSE
When multiple kernel flavors are found we need to select only
the Module.symvers for the flavor that is being built.
Lustre-change: https://review.whamcloud.com/48195
Lustre-commit:
f3a9921ae4f9c3e48328f2c682e0c7e61221e0d3
HPE-bug-id: LUS-11149
Test-Parameters: trivial
Fixes:
1f4aaefe1aae ("LU-15962 build: add in-kernel Module.symvers to symbol path")
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: I1c9af91108534d3a67f816077756fded4cd0b653
Reviewed-on: https://review.whamcloud.com/48329
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Shaun Tancheff [Mon, 19 Sep 2022 19:11:55 +0000 (12:11 -0700)]
LU-16059 build: Installation of dkms server builds
The linux-zfs-dkms package is passing the wrong paths
for zfs [and spl] causing the dkms build to fail.
ZFS_VERSION is not parsed correctly from 'dkms status'.
The splver and zfsver check can match against the wrong
package(s).
lustre-zfs-dkms provides: kmod-lustre-osd-zfs, and
lustre-osd-zfs-mount
lustre-ldiskfs-dkms provides: kmod-lustre-osd-ldiskfs and
lustre-osd-ldiskfs-mount
In the case of multiple zfs versions installed, build lustre
osd against the highest version number.
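Selecting the highest of several installed versions needs a numeric, not lexical, comparison. A minimal sketch in Python, assuming purely numeric dotted version strings; the function names are hypothetical and this is not the packaging script itself.

```python
def version_key(version):
    """Turn '2.1.10' into (2, 1, 10) so components compare as numbers."""
    return tuple(int(part) for part in version.split("."))


def highest_version(versions):
    """Return the highest version string from a non-empty list."""
    return max(versions, key=version_key)
```

A plain string comparison would rank "2.1.9" above "2.1.10", which is why the components are converted to integers.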
Lustre-change: https://review.whamcloud.com/48083
Lustre-commit:
c3dc67b2c5bf1974d792b3701d932bd04c756bd8
HPE-bug-id: LUS-11113
Test-Parameters: trivial
Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Change-Id: Ic154ca045427bf26cb7e6a44b8c467675e987aad
Reviewed-on: https://review.whamcloud.com/48594
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jian Yu [Mon, 22 Aug 2022 02:11:08 +0000 (19:11 -0700)]
LU-16089 kernel: kernel update RHEL 7.9 [3.10.0-1160.76.1.el7]
Update RHEL 7.9 kernel to 3.10.0-1160.76.1.el7.
Lustre-change: https://review.whamcloud.com/48202
Lustre-commit:
94955bbc6dc82b43fd77150b82834132bc56f565
Test-Parameters: trivial clientdistro=el7.9 serverdistro=el7.9
Change-Id: I97d087a5d5bb27996a5c0caf382c011928c651b4
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48277
Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Etienne AUJAMES [Wed, 14 Sep 2022 20:17:24 +0000 (13:17 -0700)]
LU-16000 utils: align updatelog parameters in llog_reader
Parameters in update log records are aligned on 64 bits. llog_reader
does not align these parameters: if a parameter's size is not a multiple
of 8, the next parameter's size will be read incorrectly.
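The alignment rule can be illustrated with a short sketch. It is a Python model, not the llog_reader code: it ignores any per-parameter header and only shows how padded sizes determine where the next parameter starts. The function names are hypothetical.

```python
def align8(size):
    """Round size up to the next multiple of 8 (64-bit alignment)."""
    return (size + 7) & ~7


def param_offsets(sizes):
    """Return the start offset of each parameter when padded to 8 bytes."""
    offsets = []
    pos = 0
    for size in sizes:
        offsets.append(pos)
        pos += align8(size)
    return offsets
```

Advancing by the raw size instead of align8(size) is what lands the reader in the padding of the previous parameter.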
Lustre-change: https://review.whamcloud.com/47913
Lustre-commit:
6d74b759634355e7f6647ccaefef519a1ff208e2
Test-Parameters: trivial
Fixes: 9962d6f ("LU-14617 utils: llog_reader updatelog support")
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Signed-off-by: Etienne AUJAMES <etienne.aujames@cea.fr>
Change-Id: I6871614ab4ea79d59c3c3b4644b377de395bad56
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48551
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Alexander Boyko [Wed, 14 Sep 2022 20:13:58 +0000 (13:13 -0700)]
LU-15724 tests: MDT failover hang reproducer
The patch adds recovery-small 144a test to reproduce
MDT failover hang when precreate threads are blocked on objects.
LustreError: 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID
namespace with 46 resources in use, (rc=-110)
Lustre-change: https://review.whamcloud.com/47006
Lustre-commit:
aa6250b7412e7baf6760fe4010a81f4f22187127
Test-Parameters: trivial testlist=recovery-small env=ONLY=144a
HPE-bug-id: LUS-10750
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: I2743a1b5c8911d6982b527f7e7b7bbbaf310cd04
Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Reviewed-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/48550
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Alexander Boyko [Wed, 14 Sep 2022 19:56:07 +0000 (12:56 -0700)]
LU-15724 osp: wakeup all precreate threads
A number of threads could sleep in osp_precreate_reserve() and
wait for objects from the OST. When the MDT stops, Lustre should wake up
all threads. When opd_pre_recovering is set, any wakeup of
opd_pre_user_waitq is useless. Failover of the MDT does not produce a
disconnect event, only inactive, so osp_precreate_cleanup_orphans()
cannot be awakened.
LustreError: 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID
namespace with 46 resources in use, (rc=-110)
schedule_timeout at
ffffffff8e551cd3
osp_precreate_reserve at
ffffffffc17d2d83 [osp]
osp_declare_create at
ffffffffc17c7eb9 [osp]
lod_sub_declare_create at
ffffffffc156415b [lod]
lod_qos_declare_object_on at
ffffffffc155bf42 [lod]
lod_ost_alloc_rr.constprop.23 at
ffffffffc155db2f [lod]
lod_qos_prep_create at
ffffffffc15630a6 [lod]
lod_declare_instantiate_components at
ffffffffc154b237 [lod]
Lustre-change: https://review.whamcloud.com/47005
Lustre-commit:
e55fc043679cdfadfff6874ef78e2e0128ec37ac
HPE-bug-id: LUS-10750
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Change-Id: If0164cfbecb1e358d9857421cb234559dc8cecbc
Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Reviewed-by: Sergey Cheremencev <sergey.cheremencev@hpe.com>
Reviewed-on: https://review.whamcloud.com/48546
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Andrew Perepechko [Wed, 14 Sep 2022 19:50:51 +0000 (12:50 -0700)]
LU-15555 ldiskfs: large directory causes htree corruption
When creating a lot of files in a single directory, it can
get corrupted because of a typo in ext4-kill-dx-root.patch.
Lustre-change: https://review.whamcloud.com/46526
Lustre-commit:
ea3ee9337f9bcd42360e4523f1e34bcd04d3bf41
Change-Id: Ia36278580741e1eb905e24a3a6231ba7daaa882a
Fixes: 20a6d32 ("LU-12637 kernel: RHEL 8.1 server support")
HPE-bug-id: LUS-10730
Signed-off-by: Andrew Perepechko <c17827@cray.com>
Signed-off-by: Alexander Zarochentsev <c17826@cray.com>
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-on: https://review.whamcloud.com/48545
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
John L. Hammond [Tue, 14 Jun 2022 13:46:45 +0000 (08:46 -0500)]
EX-5380 lipe: wait longer before restarting the access log reader
In lamigo_alr_data_collection_thread() if the access log reader exits
with status zero then it means that no OSTs are mounted on the
host. In this case we should wait longer before restarting the access
log reader.
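The restart policy can be sketched as below. This is a Python illustration, not the lamigo code; the function name and both delay values are hypothetical placeholders.

```python
def restart_delay(exit_status, short_delay=5, long_delay=60):
    """Choose how long to wait before restarting the access log reader.

    Exit status zero is taken to mean "no OSTs mounted on the host",
    so a longer wait is used. Both delays (in seconds) are made-up
    defaults for illustration only.
    """
    return long_delay if exit_status == 0 else short_delay
```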
Lustre-change: https://review.whamcloud.com/47627
Lustre-commit:
27c05f8cb39a8bf8d9e9386841fc7ecd700cf0fb
Test-Parameters: trivial testlist=hot-pools
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Change-Id: I282c6b8e251c432664bc3b4eb202351a5bd7fe5b
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48380
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Colin Faber <cfaber@ddn.com>
Artem Blagodarenko [Thu, 8 Sep 2022 03:13:07 +0000 (23:13 -0400)]
LU-14305 ldiskfs: add parameters for mb_c123_threshold
Add mount options for /sys/fs/ldiskfs/*/mb_c[123]_threshold values
so that they can be set persistently via mount options.
The /sys/fs/ldiskfs/*/mb_c[123]_threshold values are always shown
rounded down to the next lower percentage value due to integer
division, since internal values are stored as blocks for efficiency.
Round up the values shown to the next percent to match what was
used to originally set these parameters.
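The rounding effect can be shown with a small worked sketch. It is Python, not the ldiskfs patch; the function names and the example filesystem size are assumptions.

```python
def pct_to_blocks(total_blocks, pct):
    """Store a percentage threshold as a block count (integer division)."""
    return total_blocks * pct // 100


def blocks_to_pct_floor(total_blocks, blocks):
    """Convert back to percent, rounding down (the old display)."""
    return blocks * 100 // total_blocks


def blocks_to_pct_ceil(total_blocks, blocks):
    """Convert back to percent, rounding up to the next percent."""
    return (blocks * 100 + total_blocks - 1) // total_blocks
```

For example, with 1030 blocks a threshold of 24% is stored as 247 blocks; rounding down shows 23%, while rounding up shows the 24% that was originally set.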
Lustre-change: https://review.whamcloud.com/41193
Lustre-commit:
c2fd5297b46c4973aeda4d4d02cbc7ca2faa0d50
Fixes:
95f8ae567749 ("LU-12103 ldiskfs: don't search large block range if disk full")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Artem Blagodarenko <ablagodarenko@whamcloud.com>
Change-Id: Ie36a6667f8bca7481aa8179ab5b97c85d449d619
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/41955
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48499
Sebastien Buisson [Fri, 25 Mar 2022 08:24:32 +0000 (09:24 +0100)]
LU-15003 sec: use enc pool for bounce pages
Take pages from the enc pool so that they can be used for
encryption, instead of letting llcrypt allocate a bounce page
for every call to the encryption primitives.
Pages are taken from the enc pool a whole array at a time.
This requires modifying the llcrypt API, so that new functions
llcrypt_encrypt_page() and llcrypt_decrypt_page() are exported.
These functions take a destination page parameter.
Until this change is pushed in upstream fscrypt, this performance
optimization is not available when Lustre is built and run against
the in-kernel fscrypt lib.
Using enc pool for bounce pages is a worthwhile performance win. Here
are performance penalties incurred by encryption, without this patch,
and with this patch:
||===================|=====================|=====================||
||                   | Performance penalty | Performance penalty ||
||                   | without patch       | with patch          ||
||===================|=====================|=====================||
|| Bandwidth – write | 30%-35%             | 5%-10% large IOs    ||
||                   |                     | 15% small IOs       ||
||-------------------|---------------------|---------------------||
|| Bandwidth – read  | 20%                 | less than 10%       ||
||-------------------|---------------------|---------------------||
|| Metadata          | N/A                 | 5%                  ||
|| creat,stat,remove |                     |                     ||
||===================|=====================|=====================||
Lustre-change: https://review.whamcloud.com/47149
Lustre-commit:
f3fe144b8572e9e75bb55076e29057227476ebf5
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Change-Id: I3078d0a3349b3d24acc5e61ab53ac434b5f9d0e3
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/47513
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Lai Siyao [Fri, 1 Apr 2022 19:58:08 +0000 (15:58 -0400)]
LU-14719 osp: add inode watermark
* move block watermark from debugfs to sysfs.
* add inode watermark for OSP.
Lustre-change: https://review.whamcloud.com/47128
Lustre-commit:
336eb696299e1c9731bd1443f05e5d814314ed36
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Change-Id: I7c768fa2ebfb4b8c2f75255f9e9c061d4c15cf66
Reviewed-on: https://review.whamcloud.com/47866
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jian Yu [Fri, 16 Sep 2022 06:49:21 +0000 (23:49 -0700)]
LU-16161 kernel: kernel update RHEL8.6 [4.18.0-372.26.1.el8_6]
Update RHEL8.6 kernel to 4.18.0-372.26.1.el8_6.
Lustre-change: https://review.whamcloud.com/48564
Lustre-commit: TBD (from
66b1b4469d6e5e65b450702c6cb68ec14a51e9b0)
Test-Parameters: trivial fstype=ldiskfs \
clientdistro=el8.6 serverdistro=el8.6 testlist=sanity
Test-Parameters: trivial fstype=zfs \
clientdistro=el8.6 serverdistro=el8.6 testlist=sanity
Change-Id: I45bf6dbff5061407e1109732b6d466d0f7a8376c
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48575
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Andreas Dilger [Thu, 30 Jun 2022 23:36:07 +0000 (17:36 -0600)]
EX-4359 build: add bio-integrity patch to rhel8 series
Add bio-integrity-unbound-concurrency patch to the rhel8.5 and
rhel8.6 series to ensure balanced T10-PI core usage.
Test-Parameters: trivial serverdistro=el8.5 clientdistro=el8.5 testlist=sanity,conf-sanity
Test-Parameters: trivial serverdistro=el8.6 clientdistro=el8.6 testlist=sanity,conf-sanity
Fixes:
97fba9aa48ca ("DDN-2042 bio: allow BIO integrity to run on any core")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I31f9ced4eadad105466556183e2b9e9e0419164d
Reviewed-on: https://review.whamcloud.com/47848
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Minh Diep [Thu, 8 Sep 2022 19:54:56 +0000 (12:54 -0700)]
LU-15795 lbuild: enable KABI
Enable the KABI build and clean up the kmodtool patch.
Lustre-change: https://review.whamcloud.com/47507
Lustre-commit: TBD (from
03fc87a2ba08e5c4b8b8787f19b4e736d2752fae)
Test-Parameters: trivial fstype=ldiskfs clientdistro=el8.5 serverdistro=el8.5
Test-Parameters: trivial fstype=ldiskfs clientdistro=el8.6 serverdistro=el8.6
Change-Id: I16d54af0004c4ddc1cc5e6acca81e4aa89a1a1c1
Signed-off-by: Minh Diep <mdiep@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48486
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Bobi Jam [Wed, 13 Apr 2022 15:15:22 +0000 (23:15 +0800)]
LU-14642 flr: allow layout version update from client/MDS
Client write/punch request always carries its layout version so
that OFD can reject the request if the carried layout version
is a stale one.
This patch allows MDS as well as client to update new layout version
to OST objects. And during resync write, all OST objects will get
layout version updated.
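The stale-version check described above can be sketched as follows. This is a minimal illustration, assuming the layout version is a simple increasing counter; the names `ost_object`, `check_layout_version`, and `update_layout_version`, as well as the `-ESTALE` return value, are hypothetical and are not taken from the OFD code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-object state: the layout version the OST object knows. */
struct ost_object {
	uint32_t layout_version;
};

/* Reject a write/punch whose carried layout version is older than the
 * version recorded on the object; accept it otherwise. */
int check_layout_version(const struct ost_object *obj, uint32_t req_version)
{
	if (req_version < obj->layout_version)
		return -ESTALE;	/* assumed error code, for illustration only */
	return 0;
}

/* A client or the MDS may push a newer version; never move it backwards. */
void update_layout_version(struct ost_object *obj, uint32_t new_version)
{
	if (new_version > obj->layout_version)
		obj->layout_version = new_version;
}
```

In this sketch a resync write would call update_layout_version() on every OST object of the file before issuing further I/O.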
Lustre-change: https://review.whamcloud.com/45443
Lustre-commit:
fa6574150b6f745a668fe69b2d6d970068
Fixes:
7d97777a5d ("LU-14642 flr: abolish MDS transfer layout version to OST")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Change-Id: I9f27af354875d48adda3361f6c8ea5a5f6def73b
Reviewed-on: https://review.whamcloud.com/47097
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Jadhav Vikram [Tue, 25 Jul 2017 07:01:37 +0000 (12:31 +0530)]
LU-9699 osp: don't assert on OSP duplicating
Writeconf on an MDT with index > 0000 will cause
"add mdc" to be added to $FSNAME-client config
and "add osp" to be added to $FSNAME-MDTXXXX configs.
However, the configs may already contain these
directives. Duplicating the OSP device will
cause the assertion failure in osp_obd_connect():
ASSERTION( osp->opd_connects == 1 ) failed
Duplicating the MDC just returns -EEXIST in a similar
situation.
A possible solution is to check configs for duplicates
before writing to them. However, sometimes we
would like to change nids which are part of
"add mdc" and "add osp".
Another solution is to mark previous entries with
SKIP flags. This patch implements this approach.
Since after revoking the config lock, the clients
and the MDTs will receive the updated log and
apply its newer entries, we still have to handle
OSP duplication, but this is only an issue
immediately after writeconf processing.
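The SKIP-flag approach can be illustrated with a small sketch. It assumes a simplified in-memory view of the config log; `cfg_record` and `mark_previous_skipped` are hypothetical names and do not correspond to the actual llog record layout or MGS code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one config log record. */
struct cfg_record {
	const char *directive;	/* e.g. "add osp" */
	const char *target;	/* device the directive applies to */
	bool skip;		/* true once the record is superseded */
};

/* Mark every earlier, still-active record matching (directive, target)
 * as skipped, so only the record about to be appended stays active.
 * Returns the number of records newly marked. */
size_t mark_previous_skipped(struct cfg_record *recs, size_t count,
			     const char *directive, const char *target)
{
	size_t marked = 0;

	for (size_t i = 0; i < count; i++) {
		if (recs[i].skip)
			continue;
		if (strcmp(recs[i].directive, directive) == 0 &&
		    strcmp(recs[i].target, target) == 0) {
			recs[i].skip = true;
			marked++;
		}
	}
	return marked;
}
```

Marking rather than rejecting duplicates keeps it possible to append a new "add osp" or "add mdc" entry with changed nids.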
Lustre-change: https://review.whamcloud.com/27753
Lustre-commit:
98f107b53e4daa3bfaf026c379c0a9c41cb5f161
Seagate-bug-id: MRP-2634, MRP-3865
Change-Id: Idd7ad43c78d50e6bbe715850503aa0b01fcbf071
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48515
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Alexey Lyashkov [Fri, 16 Sep 2022 20:41:42 +0000 (13:41 -0700)]
LU-15262 osd: bio_integrity_prep_fn return value processing
osd_bio_integrity_handle() in lustre/osd-ldiskfs/osd_io.c checks the
return code of bio_integrity_prep_fn(). However, the kernel integrity
API changed between mainline Linux 4.12 and 4.13: in 4.13+ (as well as
in any RHEL8 kernel, including the first beta) bio_integrity_prep()
returns boolean true on success.
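The difference between the two return conventions can be sketched in plain C. This is a userspace illustration only, assuming the older convention was "0 on success"; `integrity_prep_ok` and its `returns_bool` parameter are hypothetical, and a real build would select the convention at configure time rather than at run time.

```c
#include <stdbool.h>

/* Translate the raw return value of the prep call into "did it succeed?".
 * returns_bool == true models 4.13+ kernels (true on success);
 * returns_bool == false models the assumed older convention (0 on success). */
bool integrity_prep_ok(int raw, bool returns_bool)
{
	return returns_bool ? raw != 0 : raw == 0;
}
```

Treating a boolean true as a non-zero error code is exactly the kind of mix-up such a wrapper is meant to rule out.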
Lustre-change: https://review.whamcloud.com/45646
Lustre-commit:
41c813d14ec9b353f9cf5ac82638996dcb5273d7
HPe-bug-id: LUS-10443
Signed-off-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Change-Id: I973aa8ccae024157ad863d26afc7b1264a5c7149
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com>
Reviewed-on: https://review.whamcloud.com/48582
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Andreas Dilger [Fri, 9 Sep 2022 01:46:21 +0000 (19:46 -0600)]
RM-620 build: New tag 2.14.0-ddn60
New tag 2.14.0-ddn60
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: Ib500a2a5f4677f496380750ff0ca3eee7eff1b57
Chris Horn [Thu, 12 May 2022 18:16:10 +0000 (13:16 -0500)]
LU-15860 socklnd: Duplicate ksock_conn_cb
If two threads enter ksocknal_add_peer(), the first one to acquire
the ksnd_global_lock will create a ksock_peer_ni and associate a
ksock_conn_cb with it.
When the second thread acquires the ksnd_global_lock it will find the
existing ksock_peer_ni, but it does not check for an existing
ksock_conn_cb. As a result, it overwrites the existing ksock_conn_cb
(ksock_peer_ni::ksnp_conn_cb) and the ksock_conn_cb from the first
thread becomes stranded.
Modify ksocknal_add_peer() to check whether the peer_ni has an
existing ksock_conn_cb associated with it.
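A minimal sketch of such a check, using stand-in types. `conn_cb`, `peer_ni`, and `peer_attach_conn_cb` are hypothetical names, the `-EEXIST` return value is an assumption, and the caller is assumed to hold the global lock; the actual patch may handle the duplicate differently.

```c
#include <errno.h>
#include <stddef.h>

struct conn_cb { int id; };			/* stand-in for ksock_conn_cb */
struct peer_ni { struct conn_cb *conn_cb; };	/* stand-in for ksock_peer_ni */

/* Attach cb to the peer only if the peer has none yet. Returns 0 on
 * attach, or -EEXIST when an earlier thread already attached one, in
 * which case the caller keeps ownership of (and frees) its own cb. */
int peer_attach_conn_cb(struct peer_ni *peer, struct conn_cb *cb)
{
	if (peer->conn_cb != NULL)
		return -EEXIST;
	peer->conn_cb = cb;
	return 0;
}
```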
Lustre-change: https://review.whamcloud.com/47361
Lustre-commit:
0c91d49a44e1214b5c65d4a557f6969b3d217881
Fixes:
7766f01e89 ("LU-13641 socklnd: replace route construct")
HPE-bug-id: LUS-10956
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I6c0190a0c1d3321ddd85c763b86ad1f0d32cf2b9
Reviewed-on: https://review.whamcloud.com/48259
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Mon, 29 Nov 2021 17:38:48 +0000 (11:38 -0600)]
LU-15234 lnet: Race on discovery queue
If the discovery thread clears the LNET_PEER_DISCOVERING bit then a
race window opens when the discovery thread drops the
lnet_peer.lp_lock spinlock and closes when the discovery thread
acquires the lnet_net_lock. If another thread queues the peer for
discovery during this window then the LNET_PEER_DISCOVERING bit is
added back to the peer state, but since the peer is already on the
lnet.ln_dc_working queue, it does not get added to the
lnet.ln_dc_request queue.
When the discovery thread acquires the lnet_net_lock/EX, it sees that
the LNET_PEER_DISCOVERING bit has not been cleared, so it does not
call lnet_peer_discovery_complete() which is responsible for sending
messages on the peer's discovery pending queue.
At this point, the peer is stuck on the lnet.ln_dc_working queue, and
messages may continue to accumulate on the peer's
lnet_peer.lp_dc_pendq.
Fix the issue by re-working the main discovery thread loop so that we
do not release the lnet_peer.lp_lock until after we've determined
whether we need to call lnet_peer_discovery_complete().
This ensures that the lnet_peer is correctly removed from the
discovery work queue and any messages on the peer's
lnet_peer.lp_dc_pendq are sent or finalized.
It is also possible for the lnet_peer.lp_dc_error to be cleared
during the aforementioned window, as well as during the time when
lnet_peer_discovery_complete() is processing the contents of the
lnet_peer.lp_dc_pendq. This could prevent messages on the
lnet_peer.lp_dc_pendq from being correctly finalized. To fix this
issue, the responsibilities of lnet_peer_discovery_error() were
incorporated into lnet_peer_discovery_complete().
Lustre-change: https://review.whamcloud.com/45670
Lustre-commit:
852a4b264a984979dcef1fbd4685cab1350010ca
Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-10615
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I3779a342de7108105c2fd2bc41373560e8e5ef14
Reviewed-on: https://review.whamcloud.com/48313
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Thu, 12 Aug 2021 21:16:05 +0000 (16:16 -0500)]
LU-14941 lnet: Fix source specified to routed destination
If a source NI is specified for a send then we should not modify the
destination NID that was passed to lnet_send().
Lustre-change: https://review.whamcloud.com/44730
Lustre-commit:
98da4ace43a6c4c59e7981bf0fb649005237d12f
Test-Parameters: trivial testlist=sanity-lnet
HPE-bug-id: LUS-10301
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ie47558d5bce97a0dca30ff7d072dcd39eb903324
Reviewed-on: https://review.whamcloud.com/48441
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Thu, 12 Aug 2021 21:08:44 +0000 (16:08 -0500)]
LU-14940 lnet: Fix source specified send to different net
The destination NI is fixed for all source-specified sends. Thus, in
order for a source-specified send to be considered "local", i.e. a
send that does not require a route, the destination NID must be on
the same net as the specified source.
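The "local" test described above amounts to comparing the net part of two NIDs. The sketch below assumes a 64-bit NID with the network number in the upper 32 bits; `nid_net` and `src_specified_send_is_local` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed NID layout for this sketch: network number in the upper
 * 32 bits, address in the lower 32 bits. */
uint32_t nid_net(uint64_t nid)
{
	return (uint32_t)(nid >> 32);
}

/* A source-specified send needs no route only if source and destination
 * are on the same net. */
bool src_specified_send_is_local(uint64_t src_nid, uint64_t dst_nid)
{
	return nid_net(src_nid) == nid_net(dst_nid);
}
```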
Lustre-change: https://review.whamcloud.com/44728
Lustre-commit:
3e3563f719ce89de28d276f3de1add064932506b
HPE-bug-id: LUS-10303
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I4847db1d393bbc36def65123f260b928ebbf944e
Reviewed-on: https://review.whamcloud.com/48440
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Fri, 29 Jan 2021 14:08:08 +0000 (17:08 +0300)]
LU-14660 lnet: Fix destination NID for discovery PUSH
If we're sending a discovery PUSH after receiving a discovery
REPLY then we want to send via the same NID that the reply was
sent to. This introduces a challenge in selecting an appropriate
destination NID for the PUSH because lnet_select_pathway() will not
run the MR selection algorithm for choosing a peer NI if the source
NI has been specified.
It is reasonable to assume that the NID used by the message
originator in sending the REPLY is a suitable destination for the
discovery PUSH. Thus, we record this NID in the same location we
currently record the lp_disc_src_nid, and use it when sending the
PUSH. With this change, the only other user of lnet_peer_select_nid()
is lnet_peer_send_ping(). In the ping case we do not set a source NID,
so lnet_select_pathway() is free to choose any peer NI. So this change
allows us to get rid of lnet_peer_select_nid() altogether.
Alternatively, we would need to reproduce a lot of the path selection
algorithm inside lnet_peer_select_nid() in order to avoid sending to
unhealthy NIDs. It seems undesirable and unnecessary to duplicate that
logic.
Lustre-change: https://review.whamcloud.com/43507
Lustre-commit:
dce2f7d1987711dfdced903b13e67091cffe9628
Test-Parameters: trivial
HPE-bug-id: LUS-9333
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I47ef856075f049d71c395565974204b8f6fa9003
Reviewed-on: https://review.whamcloud.com/48439
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Artem Blagodarenko [Tue, 25 Aug 2020 10:01:11 +0000 (06:01 -0400)]
LU-13950 lnet: do not crash if lnet_sock_getaddr returns error
Some network issues lead to a panic in ksocknal_accept():
rc = lnet_sock_getaddr(sock, true, &peer_ip, &peer_port);
LASSERT(rc == 0); /* we succeeded before */
Let's pass this error to the caller.
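Passing the error up instead of asserting might look like the sketch below. `fake_sock_getaddr` and `accept_conn` are hypothetical stand-ins written for userspace; they are not the socklnd functions, and the `-ENOTCONN` failure is only an example of an error the lookup could report.

```c
#include <errno.h>

/* Hypothetical stand-in for the address lookup; fails on request. */
int fake_sock_getaddr(int should_fail, unsigned int *ip, int *port)
{
	if (should_fail)
		return -ENOTCONN;
	*ip = 0x7f000001;	/* 127.0.0.1 */
	*port = 988;
	return 0;
}

/* Accept path: hand the lookup error back to the caller rather than
 * asserting that it cannot happen. */
int accept_conn(int should_fail)
{
	unsigned int peer_ip;
	int peer_port;
	int rc;

	rc = fake_sock_getaddr(should_fail, &peer_ip, &peer_port);
	if (rc != 0)
		return rc;

	/* ... continue with peer_ip / peer_port ... */
	(void)peer_ip;
	(void)peer_port;
	return 0;
}
```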
Lustre-change: https://review.whamcloud.com/39834
Lustre-commit:
48a9ea82eb30bbbf66cce527c1205d13fbd4eb58
Test-Parameters: trivial testlist=sanity-lnet
Change-Id: I34d43c19b4e75422db50e7abb02cac3510882b0d
hpe-bug-id: LUS-9256
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@hpe.com>
Reviewed-on: https://review.whamcloud.com/48443
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Wed, 9 Dec 2020 20:38:57 +0000 (14:38 -0600)]
LU-14206 lnet: Router ping timeout with discovery disabled
Discovery pings are used to determine the health of gateways and
associated routes. Ping replies from gateways with dynamic discovery
(DD) disabled (or if DD is disabled locally) are handled in
a special routine, lnet_router_discovery_ping_reply(), but this
function and related code doesn't handle the case where a discovery
ping hits the response tracker timeout and is unlinked by the
monitor thread. In this case, an UNLINK event is generated and we
do not call the lnet_router_discovery_ping_reply(). For gateways
with DD enabled (and DD enabled locally), we handle this case
in lnet_router_discovery_complete(). If discovery failed then
lp_dc_error is set and we mark all routes down for the gateway. We
can simply extend this logic to the case of gateways w/DD disabled
(or DD disabled locally).
Lustre-change: https://review.whamcloud.com/40923
Lustre-commit:
173d86c6e9a704a84de36ae57a337a3fdae7b1ed
Test-Parameters: trivial
Fixes:
9f337d94e7 ("LU-13029 lnet: fix asym routing with multi-hop")
HPE-bug-id: LUS-9612
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I009c69d4f8990b72d83d9426c782c0e55c1023a4
Reviewed-on: https://review.whamcloud.com/48382
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Tue, 30 Nov 2021 16:57:34 +0000 (10:57 -0600)]
LU-15275 lnet: Skip router discovery on send path
When the router checker is enabled, routes are regularly marked as out
of date w.r.t. discovery. This can cause upper level messages to be
delayed while the router undergoes discovery. We can avoid delaying
messages by relying on the router checker to initiate discovery of
routers. If we happen to send a message to a router before it has
been discovered then the worst case scenario is that the route is
actually down or we end up utilizing a subset of a multi-rail router's
interfaces. Both situations can be remedied by utilizing the
check_routers_before_use parameter.
Change the logic in lnet_handle_find_routed_path() so that we only
initiate discovery if the alive_router_check_interval is <= 0 (i.e.
router checker pings are disabled).
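The new condition can be written as a small predicate. This is a sketch only: `send_path_should_discover` and the `gateway_up_to_date` flag are hypothetical, and the real lnet_handle_find_routed_path() takes more state into account.

```c
#include <stdbool.h>

/* Start discovery from the send path only when router checker pings are
 * disabled (interval <= 0) and the gateway is not up to date. */
bool send_path_should_discover(int alive_router_check_interval,
			       bool gateway_up_to_date)
{
	if (alive_router_check_interval > 0)
		return false;	/* the router checker initiates discovery */
	return !gateway_up_to_date;
}
```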
Lustre-change: https://review.whamcloud.com/45684
Lustre-commit:
c8e74c395d5634dbb0d9d8a86605bb36ab2b8233
Test-Parameters: trivial testlist=sanity-lnet
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: If0332c21f6157117598b7b908fe17f2d2690fc1d
Reviewed-on: https://review.whamcloud.com/48383
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Sun, 12 Jul 2020 15:47:55 +0000 (10:47 -0500)]
LU-13781 lnet: Local NI must be on same net as next-hop
When sending to a remote peer we need to restrict our selection of a
local NI to those on the same peer net as the next-hop.
The code currently selects a local NI on the peer net specified by the
lr_lnet field of the lnet_route returned by lnet_find_route_locked().
However, lnet_find_route_locked() may select a next-hop peer NI on any
local peer net - not just lr_lnet.
A redundant assignment to sd->sd_msg->msg_src_nid_param is also
removed. That variable is always set appropriately in
lnet_select_pathway().
Lustre-change: https://review.whamcloud.com/39352
Lustre-commit:
031c087f3847777c0099cbfae13f0b6fee54452b
Test-Parameters: trivial
HPE-bug-id: LUS-9095
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: If1bec26d6646b9e66b99656d7db2dc538d631a34
Reviewed-on: https://review.whamcloud.com/48381
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Mon, 14 Feb 2022 20:37:05 +0000 (20:37 +0000)]
LU-13714 lnet: only update gateway NI status on discovery
Move the NI status from DOWN to UP only when receiving
a discovery PING. The discovery PING should be the only
message that updates the NI status, since it is used
as the gateway NI keep-alive mechanism.
This is done to avoid the following scenario:
The gateway itself can push its updates to peers which
have removed it from their routing tables. The peers would
respond to the PUSH with an ACK, the ACK will bring the
gateway's NI status to up. Therefore other peers which have
avoid_asym_router_failure=1 will have their route status
remain up even though the symmetrical route is gone.
Note: there is no way for the gateway to differentiate between
a keep alive discovery and a manually triggered discovery or ping.
However, this is a narrow case which will not be handled.
net_last_alive is converted to use ktime_get_seconds() instead of
ktime_get_real_seconds() since the NTP adjustment is not needed.
Lustre-change: https://review.whamcloud.com/39176
Lustre-commit:
3e3f70eb1ec95f32d9a97795d7fdf02cca82b5a0
Test-Parameters: trivial
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ifd5b06d4cf783b68b36413ada63f0a1d0095fb5b
Reviewed-on: https://review.whamcloud.com/48379
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Wed, 5 Aug 2020 16:19:35 +0000 (11:19 -0500)]
LU-15039 lnet: Fix reference leak in lnet_parse
We need to drop the reference taken by lnet_nid2peerni_locked() if we
determine that we need to drop the message because of asymmetric
route.
Lustre-change: https://review.whamcloud.com/45067
Lustre-commit:
e69eca08bce47bf85b3c011598e360a2468019b5
Test-Parameters: trivial
HPE-bug-id: LUS-9186
Fixes:
955080c3ae ("LU-13779 lnet: Correct asymmetric route detection")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I799c9522b1ce5f4caffc5848a829995e5b5484e7
Reviewed-on: https://review.whamcloud.com/48378
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Alexandre Ioffe <aioffe@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Serguei Smirnov [Mon, 16 Aug 2021 23:37:30 +0000 (16:37 -0700)]
LU-14945 lnet: don't use hops to determine the route state
NodeA <-tcp1-> GW1 <-tcp2-> GW2 <-tcp3-> NodeB
Assuming GW1 knows how to reach tcp3 network and GW2 knows
how to reach tcp1 network, it should be possible to add routes
without specifying hop=2 on nodes A and B to reach tcp3 and tcp1
respectively and then be able to lnetctl ping between them.
Changes introduced by LU-13785 interpret default hops to be
equivalent to hop=1 set explicitly for the purpose of determining
route aliveness, which results in the routes created as described
above to be considered "down".
Fix it so that default hop setting doesn't prevent
the multi-hop scenario from working.
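One possible reading of the fix is sketched below: an explicit hop count decides by itself, while a default (unset) hop count defers to what discovery detected. This policy is an assumption for illustration, not a description of the actual patch; `HOPS_UNDEFINED` and `route_checks_remote_nis` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define HOPS_UNDEFINED UINT32_MAX	/* hypothetical "hops not set" marker */

/* Decide whether the aliveness of the gateway's remote NIs should gate
 * the route state. */
bool route_checks_remote_nis(uint32_t hops, bool detected_single_hop)
{
	if (hops == HOPS_UNDEFINED)
		return detected_single_hop;	/* default: defer to discovery */
	return hops == 1;			/* explicit setting decides */
}
```

With this policy, the NodeA/NodeB routes above, added without a hop count, are not forced into the single-hop check.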
Lustre-change: https://review.whamcloud.com/44674
Lustre-commit:
3f2844dc9333c86452c37bd7b4519729b1351371
Test-Parameters: trivial
Fixes:
2e07619477 ("LU-13785 lnet: Use lr_hops for avoid_asym_router_failure")
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Change-Id: I341ccdfe156434b0cb306359acc91a9193b44f7b
Reviewed-on: https://review.whamcloud.com/48337
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Fri, 10 Jul 2020 18:52:01 +0000 (13:52 -0500)]
LU-13780 lnet: Leverage peer aliveness more efficiently
When an LNet router is revived after going down, remote peers may
discover it is alive before we do. Thus, remote peers may use it
as a next-hop, and we may start receiving messages from it while we
still consider it to be dead. We should mark router peers as alive
when we receive a message from them.
If an LNet router does not respond to a discovery ping, then we
currently mark all of its NIs as DOWN. This can actually slow down
the process of returning a route to service. If we receive a message
from a router, in the manner described above, then we can safely
return the router to service. We already set the status of the router
NI we received the message from to UP, but the remote NIs will still
be DOWN and thus the route will be considered down until we get a
reply to the next discovery ping.
When selecting a route, we only consider the aliveness of a gateway's
remote NIs if avoid_asym_router_failure is enabled and the route is
single-hop. In this case, as long as the gateway has at least one
alive NI on the remote network then the route is considered UP. In
the situation described above, we know the router has at least one
NI alive because it was used to forward a message from a remote peer.
Thus, when we receive a forwarded message from a router, we can
reasonably set the NI status of all of its NIs that are on the same
peer net as the message originator to UP. This does not impact the
route status of any multi-hop routes because we do not consider the
aliveness of remote NIs for multi-hop routes.
Similarly, we can set the cached lr_alive value to up for any routes
whose lr_net matches the net ID of the message originator NID. This
variable is converted to an atomic_t to get rid of the need for
global locking when updating it.
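The update performed on receipt of a forwarded message can be sketched as follows. The types `gw_ni` and `route` and the function `revive_on_forwarded_msg` are simplified, hypothetical stand-ins; C11 atomics are used here in place of the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct gw_ni {			/* one NI of a gateway */
	uint32_t net;		/* net this NI is on */
	bool up;
};

struct route {
	uint32_t net;		/* remote net reached via the gateway */
	atomic_int alive;	/* cached aliveness, no global lock needed */
};

/* On a forwarded message that originated on origin_net: mark the
 * gateway's NIs on that net UP and set the cached aliveness of the
 * routes to that net. */
void revive_on_forwarded_msg(struct gw_ni *nis, size_t n_nis,
			     struct route *routes, size_t n_routes,
			     uint32_t origin_net)
{
	for (size_t i = 0; i < n_nis; i++)
		if (nis[i].net == origin_net)
			nis[i].up = true;

	for (size_t i = 0; i < n_routes; i++)
		if (routes[i].net == origin_net)
			atomic_store(&routes[i].alive, 1);
}
```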
Lustre-change: https://review.whamcloud.com/39350
Lustre-commit:
886e34ce56c491e8844cf892f32b08807cdf2bff
Test-Parameters: trivial
HPE-bug-id: LUS-9088
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I0170762d78d80e4b70724799cd1ee1301118f25c
Reviewed-on: https://review.whamcloud.com/48335
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Tested-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Tue, 14 Jul 2020 04:08:28 +0000 (23:08 -0500)]
LU-13785 lnet: Use lr_hops for avoid_asym_router_failure
In order for the asymmetric route failure avoidance feature to work
properly it needs to know what the hop count of a route should be.
This information is defined by the lr_hops field of the lnet_route.
The lr_single_hop field records what discovery determined the hop
count to actually be (single or multi), based on the last ping reply.
If a remote interface on a router goes missing, the route may be
classified as multi-hop by discovery, but it should be considered
single-hop for the purposes of avoiding asymmetric route failure.
Lustre-change: https://review.whamcloud.com/39362
Lustre-commit:
2e07619477684f287a2399ccdbbde0a71289574b
HPE-bug-id: LUS-9099
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I9c255f9a2175d964661850277808dae96ff7735c
Reviewed-on: https://review.whamcloud.com/48336
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Fri, 10 Jul 2020 17:33:50 +0000 (12:33 -0500)]
LU-13779 lnet: Correct asymmetric route detection
Failure to lookup the remote net for LNET_NIDNET(src_nid) indicates an
asymmetric route, but we do not drop the message in this case. Another
problem with this code is that there is no guarantee that we'll have a
route->lr_lnet that matches the net of ni->ni_nid.
We can move the asymmetric route detection to after we have looked up
the lpni of from_nid. Then, we can look at just the routes associated
with the gateway that owns the lpni. If one of those routes has
lr_net == LNET_NIDNET(src_nid), then the route is symmetrical.
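The symmetry test over the gateway's routes can be sketched like this. `gw_route` and `route_is_symmetric` are hypothetical simplifications, and the NID layout (network number in the upper 32 bits) is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct gw_route {
	uint32_t lr_net;	/* remote net reachable through the gateway */
};

/* The route is symmetrical if one of the routes associated with the
 * forwarding gateway leads back to the net of the message source. */
bool route_is_symmetric(const struct gw_route *routes, size_t n,
			uint64_t src_nid)
{
	uint32_t src_net = (uint32_t)(src_nid >> 32);	/* assumed layout */

	for (size_t i = 0; i < n; i++)
		if (routes[i].lr_net == src_net)
			return true;
	return false;
}
```

A gateway with no routes at all yields false here, which matches treating a failed lookup as an asymmetric route.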
Lustre-change: https://review.whamcloud.com/39349
Lustre-commit:
955080c3ae3f33c98e068f52a096761ea28624b7
Fixes:
4932febc12 ("LU-11894 lnet: check for asymmetrical route messages")
Test-Parameters: trivial
HPE-bug-id: LUS-9087
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I8044d3f53e6f000c1e4d7c4e34b3b21afe0f9711
Reviewed-on: https://review.whamcloud.com/48334
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Chris Horn [Tue, 23 Jun 2020 18:02:51 +0000 (13:02 -0500)]
LU-13708 lnet: lnet_notify sets route aliveness incorrectly
lnet_notify() modifies route aliveness in two ways:
1. By setting lp_alive field of the lnet_peer struct.
2. By setting lr_alive field of the lnet_route struct (via call to
lnet_set_route_aliveness())
In both cases, the aliveness value assigned is determined by a call
to lnet_is_peer_ni_alive(), but that value only reflects the aliveness
of a particular peer NI. A gateway may have multiple peer NIs, so the
aliveness of a gateway peer (lp_alive) is not necessarily equivalent
to the aliveness of one of its NIs. Furthermore, the lr_alive field
is only used to determine route aliveness for path selection if
discovery is disabled locally or on the gateway (see
lnet_find_route_locked() and lnet_is_route_alive()).
In general, we should not set lp_alive based on an lnet_notify()
call, and we should only set lr_alive if discovery is disabled. For
lr_alive specifically, we should only set it for those routes that
have the peer NI as a next-hop.
An exception to the above exists when the reset argument to
lnet_notify() is set. The gnilnd uses this flag in its calls to
lnet_notify() because gnilnd receives out-of-band notifications of
node up and down events. Thus, when gnilnd calls lnet_notify() we
actually know whether the gateway peer is up or down and we can set
lp_alive appropriately.
net lock/EX is held by other callers of lnet_set_route_aliveness, so
we do the same in lnet_notify().
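The rules above can be condensed into a short sketch. The structures and the `route_notify` function are hypothetical simplifications (integer NI ids, no locking) and do not mirror the real lnet_notify() signature.

```c
#include <stdbool.h>
#include <stddef.h>

struct nroute {
	int next_hop_ni;	/* id of the peer NI used as next hop */
	bool lr_alive;
};

struct gateway {
	bool lp_alive;
	struct nroute *routes;
	size_t n_routes;
};

/* Apply a notification about one peer NI of a gateway. */
void route_notify(struct gateway *gw, int ni_id, bool alive, bool reset,
		  bool discovery_disabled)
{
	/* Only an out-of-band notification (reset set) tells us the state
	 * of the whole gateway peer. */
	if (reset)
		gw->lp_alive = alive;

	/* lr_alive only matters when discovery is disabled. */
	if (!discovery_disabled)
		return;

	/* Touch only the routes that use this peer NI as next hop. */
	for (size_t i = 0; i < gw->n_routes; i++)
		if (gw->routes[i].next_hop_ni == ni_id)
			gw->routes[i].lr_alive = alive;
}
```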
Lustre-change: https://review.whamcloud.com/39160
Lustre-commit:
e24471a722a6f23fb0051b4511f3fee2662d0e4e
Fixes:
e35be987da ("LU-12422 lnet: discovery off route state update")
Fixes:
ebc9835a97 ("LU-12941 lnet: Add peer level aliveness information")
Test-Parameters: trivial
HPE-bug-id: LUS-9034
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I2927e5f5ef849e45c233c92d2a6deca765e496eb
Reviewed-on: https://review.whamcloud.com/48290
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Sebastien Buisson [Fri, 2 Sep 2022 18:09:59 +0000 (11:09 -0700)]
LU-16012 sec: fix detection of SELinux enforcement
On newer distros (e.g. RHEL 9.0), on which selinux_is_enabled() does
not exist anymore, the only way to find out if SELinux is enforced
when initializing the security context is to fetch the length of the
security attribute name. If it is 0, we conclude SELinux is disabled.
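The length-based test can be sketched in a few lines. `selinux_seems_enabled` is a hypothetical helper that receives the attribute name as a string; how the name is fetched from the kernel is outside the scope of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Conclude that SELinux is disabled when the security attribute name is
 * missing or has length 0. */
bool selinux_seems_enabled(const char *secattr_name)
{
	return secattr_name != NULL && strlen(secattr_name) > 0;
}
```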
Lustre-change: https://review.whamcloud.com/48049
Lustre-commit:
155cbc22ba4f758cf9eec415f36f940ca2b23de9
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Change-Id: Ifcdcb8ffbb7f9ad50d16d7d3317e94d0d212fa42
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/48422
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Qian Yingjin <qian@ddn.com>
Lei Feng [Wed, 7 Sep 2022 07:26:13 +0000 (15:26 +0800)]
EX-5815 lipe: do not print in lpcc signal handler
Do not print in the lpcc signal handler.
It is invalid in a Python script.
Signed-off-by: Lei Feng <flei@whamcloud.com>
Test-Parameters: trivial testlist=sanity-pcc env=ONLY=210
Change-Id: I61eb80ff1d59453dc12855fd2f1ac4f1e6e40757
Reviewed-on: https://review.whamcloud.com/48449
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Alex Deiter <alex.deiter@gmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
James Nunez [Wed, 24 Aug 2022 07:40:23 +0000 (00:40 -0700)]
EX-3442 tests: use wait_file_resync in hot-pools test 15
This patch replaces "$LFS mirror resync" with
"wait_file_resync" in hot-pools test 15 to avoid
racing with lamigo's "$LFS mirror resync".
Test-Parameters: trivial testlist=hot-pools,hot-pools
Change-Id: I48ffb7d6a33b664359f227d1f693369feffa70b6
Signed-off-by: James Nunez <jnunez@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/47233
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>