LU-15033 llite: Increase default RA sizes The default max_readahead_mb is 1/32 of all cached pages, which doesn't make much sense but isn't usually a problem, since most real nodes have very high RAM or it is tuned to larger values. It is reduced further for the per-file limit, which is also reasonable. However, on test VMs with smaller RAM sizes, this results in hilariously tiny max_read_ahead_per_file_mb values, like 20. This is small enough that it causes extra misses because two RPCs cannot be reliably sent. This edge case isn't important for performance, but it makes small-scale testing of readahead nearly impossible. To avoid this, we add a minimum readahead requirement of 256 MiB, which is used unless it's > half of RAM. This should avoid this case on test VMs without changing the behavior for real clients unless they are extremely small. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Ie8aab6b04ad520e4633d634d846e7ef23cc91ced Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/46475 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
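The sizing rule above can be sketched in userspace C. Note the real tunable is derived from cached pages rather than total RAM, and every name here (ra_default_mb, RA_MIN_MB) is hypothetical, not a Lustre symbol:

```c
#include <assert.h>
#include <stdint.h>

#define RA_MIN_MB 256ULL	/* hypothetical name for the 256 MiB floor */

/*
 * Userspace sketch of the sizing rule: keep the historical 1/32
 * default, but raise it to a 256 MiB floor unless that floor would
 * exceed half of available memory (tiny test VMs).
 */
static uint64_t ra_default_mb(uint64_t ram_mb)
{
	uint64_t ra = ram_mb / 32;	/* historical 1/32 default */

	/* apply the floor unless it would be > half of RAM */
	if (RA_MIN_MB <= ram_mb / 2 && ra < RA_MIN_MB)
		ra = RA_MIN_MB;
	return ra;
}
```

On a 4 GiB test VM this yields 256 MB rather than 128 MB, while a 32 GiB client keeps its 1 GiB default unchanged.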
LU-15367 llite: add setattr to iotrace Add setattr messages to iotrace. Test-Parameters: trivial Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I10a51285d38e1684ce0ddcc7bb2a0cd90579c96c Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/52005 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
LU-13802 llite: add hybrid IO SBI flag Add an SBI flag so hybrid IO can be fully disabled. Test-Parameters: trivial Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: I2825b4cf261f98d71a18cd66d6fe3632dfabc37a Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/52592 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
LU-16314 llite: Migrate LASSERTF %p to %px This change covers lustre/ec through lustre/mgs and converts LASSERTF statements to explicitly use %px. Use %px to explicitly report the non-hashed pointer value in messages printed when a kernel panic is imminent. When analyzing a crash dump, the associated kernel address can be used to determine the system state that led to the system crash. As crash dumps can be, and are, provided by customers from production systems, use of the kernel command line parameter no_hash_pointers is not always possible. Ref: Documentation/core-api/printk-formats.rst Test-Parameters: trivial HPE-bug-id: LUS-10945 Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com> Change-Id: I708d9ef60c63f5b4006c7986599a2f39fc9e5fdf Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51213 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16637 llite: tolerate fresh page cache pages after truncate Truncate called by ll_layout_refresh() can race with a fast read or tiny write, which can add an uninitialized non-uptodate page into the page cache. We want to avoid expensive locking for this rare case, so if there is any leftover in the cache after truncate, just check that the pages are not uptodate, not dirty, and do not have any filesystem-specific information attached to them. Change-Id: I8cadc022a3d1822a585f32e1a765e59ad0ff434d Signed-off-by: Andrew Perepechko <andrew.perepechko@hpe.com> HPE-bug-id: LUS-11937 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53554 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Zhenyu Xu <bobijam@hotmail.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16823 lustre: test if large nid is supported Update all LNetGetId() calls to use large NIDs if the connect flags report large NID support. For the case of lmv_setup(), we update the setting of qos_rr_index using nidhash() to avoid a thundering herd. Change-Id: I80fda9454f154e27fbc75abb1899c0ccca03097b Signed-off-by: James Simmons <jsimmons@infradead.org> Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53398 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16771 llite: add statfs_project tunable Add a client tunable and mount option to turn off project-enabled statfs() if needed, for example to speed up statfs() execution by avoiding the project quota check. This new llite tunable statfs_project is set to 1 by default (feature enabled). To turn statfs_project off: lctl set_param llite.*.statfs_project=0 Additionally, statfs_project can be disabled at mount time with: mount -t lustre -o nostatfs_project ... Signed-off-by: Stephane Thiell <sthiell@stanford.edu> Change-Id: I1c3eb27e66b1d05a1c713732dfe0a4d8f7af769f Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/52872 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com> Reviewed-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14361 statahead: regularized fname statahead pattern Some applications do stat() calls under a directory in which all child files have regularized file names: - mdtest benchmark tool: mdtest.$rank.$i - ML/AI with ingested data that typically follows a filename format rule in the directory. The most common format for regularized file names is a number-indexed suffix. However, in the current statahead mechanism, statahead is populated in the order of the hash of the file name via readdir() calls, not in any sorted order. In this patch, we improve statahead to prefetch attributes for files with regularized, index-suffixed file names via asynchronous batched RPCs. This patch adds statahead support for these kinds of applications without requiring opendir()/close() to explicitly start/stop the statahead thread. Instead, the statahead thread stops and quits when it finds there has been no activity for more than a certain time period (i.e. 30 seconds). Test-Parameters: mdtcount=4 mdscount=2 testlist=sanity env=ONLY=27p,ONLY_REPEAT=5 Test-Parameters: mdtcount=4 mdscount=2 testlist=sanity env=ONLY=27p,ONLY_REPEAT=5 Test-Parameters: mdtcount=4 mdscount=2 testlist=sanity env=ONLY=123f,ONLY_REPEAT=10 Test-Parameters: mdtcount=4 mdscount=2 testlist=sanity env=ONLY=123f,ONLY_REPEAT=10 Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: Ide11ec5a651ae74884ddbe1cecede4f5c961e38d Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/41308 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14651 build: fix build for el7.9 kernels Handle extra setattr_prepare() argument added in Linux 5.12 kernels when building on older kernels. HPE-bug-id: LUS-12059 Signed-off-by: Andrew Perepechko <andrew.perepechko@hpe.com> Change-Id: Ie7fd1c4d51b7a9b086cfca0db941321cbcce7057 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53503 Reviewed-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: Sebastien Buisson <sbuisson@ddn.com> Tested-by: James Simmons <jsimmons@infradead.org>
LU-16695 llite: switch to ki_flags from f_flags There are possible races between IO checking f_flags and fcntl changing f_flags. The kernel fixed most of these by copying most of the file flags into the iocb. Let's follow on and use those copied flags. This also lets us change them if we want, since they're now local to the specific IO. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Signed-off-by: Guillaume Courrier <guillaume.courrier@cea.fr> Change-Id: Ib98cccec0e7888865ec10dc5f76f1d9917a1aef7 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50493 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com> Reviewed-by: Etienne AUJAMES <eaujames@ddn.com>
LU-13805 llite: Implement unaligned DIO connect flag Unupgraded ZFS servers may crash if they receive unaligned DIO, so we need a compat flag and a test to recognize those servers. This patch implements that logic. Fixes: 7194eb6431 ("LU-13805 clio: bounce buffer for unaligned DIO") Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I5d6ee3fa5dca989c671417f35a981767ee55d6e2 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51126 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
LU-13805 llite: add flag to disable unaligned DIO As with any new IO feature, it's a good idea to have the option to turn off unaligned DIO support if needed. It would be reasonable to merge this patch with the core patch implementing the feature; I have kept it separate for ease of review. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Ibc86d84704151a7f30afcc538d9c03e3fdf1c38a Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51125 Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
LU-8802 obd: remove MAX_OBD_DEVICES Remove this arbitrary limit by reimplementing the array as an Xarray. An Xarray can grow and shrink dynamically, hence saving memory and allowing for many more OBD devices. There is still technically a limit, OBD_MAX_INDEX, which is xa_limit_31b.max or around 2 billion. This is far more than is practically useful. This patch also adds various iterators for OBD devices, which are used to simplify code in various places. Remove class_obd_list() since it is unused. Rename class_dev_by_str() to class_str2obd() to keep the pattern. Several class_* functions have been refactored to improve locking. The larger issue of OBD device locking will be addressed separately. Update the OBD device lifecycle test to try loading more devices (about 24,000 for now). Currently, adding an additional OBD device is an O(n^2) operation due to the class_name2dev calls in class_register_device(). This will be addressed in a future patch adding a hash table for OBD device name lookups. Further, OBD life cycle management could likely be simplified by using Xarray marks. Right now, it is handled by a bit field in the obd_device struct. Since the scope of the changes needed to simplify this seems large, this will also be addressed separately. Test-Parameters: testlist=sanity env=ONLY=55,ONLY_REPEAT=10 Signed-off-by: Timothy Day <timday@amazon.com> Change-Id: Icb2cd94a5529e79f5d3ebd0de5e0f225cf212075 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51040 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
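The shape of the change can be illustrated with a userspace stand-in for the kernel's xa_alloc(): hand out the first free index and grow storage on demand, so there is no fixed MAX_OBD_DEVICES ceiling. All names below are illustrative, not the Lustre or kernel API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the Xarray-backed OBD device table. */
struct dev_table {
	void **slots;
	size_t cap;
};

/* Store @dev at the first free index, growing the table as needed. */
static int dev_table_alloc(struct dev_table *t, void *dev, size_t *index)
{
	for (size_t i = 0; i < t->cap; i++) {
		if (!t->slots[i]) {
			t->slots[i] = dev;
			*index = i;
			return 0;
		}
	}
	/* no free slot: double capacity, i.e. no hard device limit */
	size_t ncap = t->cap ? t->cap * 2 : 8;
	void **n = realloc(t->slots, ncap * sizeof(*n));
	if (!n)
		return -1;
	memset(n + t->cap, 0, (ncap - t->cap) * sizeof(*n));
	t->slots = n;
	t->slots[t->cap] = dev;
	*index = t->cap;
	t->cap = ncap;
	return 0;
}
```

The real patch gets this behavior, plus RCU-safe lookup and the xa_limit_31b index bound, for free from the kernel Xarray.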
LU-16954 llite: add SB_I_CGROUPWB on super block for cgroup Cgroup support can be enabled per super_block by setting SB_I_CGROUPWB in ->s_iflags. Cgroup writeback requires support from both the bdi and the filesystem. This patch adds the SB_I_CGROUPWB flag on the super block for Lustre. This is required by the subsequent patch series to support cgroups in Lustre. Adding this flag to the Lustre super block causes remount failures in Maloo testing on the Ubuntu 22.04 v5.15 kernel due to a duplicate sysfs filename for the bdi device. To avoid the remount failure, we explicitly unregister the sysfs entry for the @bdi. Test-Parameters: clientdistro=ubuntu2204 testlist=sanity-sec Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: I7fff4f26aa1bfdb0e5de0c4bdbff44ed74d18c2d Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51955 Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com> Reviewed-by: Li Dongyang <dongyangli@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Tested-by: Maloo <maloo@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com>
LU-16713 llite: writeback/commit pages under memory pressure Lustre buffered I/O does not work well with restrictive memcg control. This may result in OOM when the system is under memory pressure. Lustre has implemented unstable pages support similar to NFS, but it is disabled by default for performance reasons. In Lustre, a client pins the cache pages for writes until the write transaction is committed on the server (OST), even after these pinned pages have finished writeback. The server starts a transaction commit either because the commit interval (5 seconds by default) for the backend storage (i.e. OST/ldiskfs) has been reached or because there is not enough room in the journal for a particular handle to start. Until the write transaction has been committed and the client notified, these pages are pinned and not flushable in any way by the kernel. This means that when a client hits memory pressure there can be a large number of unfreeable (pinned and uncommitted) pages, so the application on the client ends up OOM-killed because it cannot free up memory when asked to. This is particularly common with cgroups, because when cgroups are in use the memory limit is generally much lower than the total system memory and the limit is reached more easily. The Linux kernel has a mature memory reclaim mechanism to avoid OOM even with cgroups. After a write dirties a page, the kernel calls @balance_dirty_pages(). If the dirtied and uncommitted pages are over the background threshold for the global memory limits or the memory cgroup limits, the writeback threads are woken to perform some writeout. When allocating a new page for I/O under memory pressure, the kernel will try direct reclaim before allocating. For cgroups, it will try to reclaim pages from memory cgroups over their soft limit. The slow page allocation path with direct reclaim calls @wakeup_flusher_threads() with WB_REASON_VMSCAN to start writing back dirty pages. 
Our solution uses the page reclaim mechanism in the kernel directly. On completion of page writeback (in @brw_interpret), call @__mark_inode_dirty() to add the dirty inode, which has pinned uncommitted pages, into the @bdi_writeback; each memory cgroup has its own @bdi_writeback to control the writeback of buffered writes within it. Thus under memory pressure the writeback threads are woken up and call @ll_writepages() to write out data. For background writeout (over the background dirty threshold) or writeback with WB_REASON_VMSCAN for direct reclaim, we first flush dirty pages to the OSTs, then sync them and force a commit of these pages to release them quickly. When a cgroup is under memory pressure, the kernel asks to do writeback and then an fsync is done to the OSTs. This commits uncommitted/unstable pages, which the kernel can then finally free. Below are some performance results. The client has 512G memory in total.
1. dd if=/dev/zero of=$test bs=1M count=$size
   I/O size         128G  256G  512G  1024G
   unpatched (GB/s)  2.2   2.2   2.1   2.0
   patched (GB/s)    2.2   2.2   2.1   2.0
There is no performance regression after enabling unstable page accounting with the patch.
2. One process under different memcg limits, with total I/O size varying from 2x the memory limit to 0.5x the memory limit: dd if=/dev/zero of=$file bs=1M count=$((memlimit_mb * time))
   memcg limit          1G   4G   16G  64G
   2x memlimit (GB/s)   1.7  1.6  1.8  1.7
   1x memlimit (GB/s)   1.9  1.9  2.2  2.2
   .5x memlimit (GB/s)  2.3  2.3  2.2  2.3
Without this patch, dd with I/O size > memcg limit would be OOM-killed.
3. Multiple cgroups: 8 cgroups in total, each with a memory limit of 8G. Run a dd write in each cgroup with an I/O size of 2x the memory limit (16G). 
   17179869184 bytes (17 GB, 16 GiB) copied, 12.7842 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.7889 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.9504 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.9577 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.4066 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.5397 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.5769 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.6605 s, 1.3 GB/s
4. Two dd writers: one (A) under memcg control and one (B) not. The total write data is 128G. The memcg limit varies from 1G to 128G. cmd: ./t2p.sh $memlimit_mb
   memlimit  dd writer (A)  dd writer (B)
   1G        1.3GB/s        2.2GB/s
   4G        1.3GB/s        2.2GB/s
   16G       1.4GB/s        2.2GB/s
   32G       1.5GB/s        2.2GB/s
   64G       1.8GB/s        2.2GB/s
   128G      2.1GB/s        2.1GB/s
The results demonstrate that a process with memcg limits has nearly no impact on the performance of a process without limits. Test-Parameters: clientdistro=el8.7 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10 Test-Parameters: clientdistro=el9.1 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10 Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: I7b548dcc214995c9f00d57817028ec64fd917eab Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50544 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Alex Deiter <alex.deiter@gmail.com>
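The flush-then-commit policy described above can be sketched as a pure decision function. The enum values and function name below are illustrative stand-ins, not the kernel's writeback API; the real decision lives inside @ll_writepages():

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's writeback reasons. */
enum wb_reason_sketch {
	WB_SKETCH_BACKGROUND,	/* over the background dirty threshold */
	WB_SKETCH_VMSCAN,	/* writeback driven by direct reclaim */
	WB_SKETCH_PERIODIC,	/* ordinary periodic flush */
};

/*
 * Sketch of the policy: for background writeout or direct-reclaim
 * writeback, flush dirty pages and then force a commit so pinned
 * (uncommitted, unstable) pages become freeable; otherwise a plain
 * writeout is enough.
 */
static int should_force_commit(enum wb_reason_sketch reason)
{
	return reason == WB_SKETCH_BACKGROUND ||
	       reason == WB_SKETCH_VMSCAN;
}
```

This captures why the patch helps under memcg limits: only the reclaim-driven paths pay the extra commit cost, so unconstrained streaming writes keep their usual performance.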
LU-13306 mgs: use large NIDS in the nid table on the MGS On the MGS the NIDs detected are handled using struct mgs_target_info, which currently only handles lnet_nid_t. This structure also limits the number of NIDs to 32 entries. Some sites have reported that 32 NIDs wasn't enough when they configured virtual LNet networks for isolation. Update mgs_target_info to use NID strings instead. This has the advantage of working even if struct lnet_nid expands in the future. We place this data at the end of mgs_target_info as a flexible array. This requires updating the ptlrpc packet handling to increase the size to some new value to contain all the NIDs registered. This also gives us the option to use hostnames in the future. This information is then fed into a struct mgs_nidtbl_entry which is sent to the mgc on all the remote nodes. With this patch, only large NIDs in a small address space are translated to the original lnet_nid_t format and sent to the various clients. All the server targets, which are clients of the MGS, use the large NID format. With this patch we don't have to patch old clients when the servers are using the larger NID format. Expand LNetGetId() to return large NID addresses as well. In the future we will use the ocd_connect_flags to determine if the MGS supports large NID addresses. Change-Id: I7083d6ecfc46cf0419a0d4a582e4bf5240f193cd Signed-off-by: James Simmons <jsimmons@infradead.org> Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50896 Tested-by: Maloo <maloo@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16770 llite: prune object without layout lock first lov_layout_change() calls cl_object_prune() before changing the layout. This may lead to eviction from the MDT in case of a slow response from an OST. To reduce the risk of possible eviction, call cl_object_prune() without the layout lock held before calling lov_layout_change(). vvp_prune() attempts to sync and truncate page cache pages. osc_page_delete() may encounter page cache pages in a non-clean state during truncate because there's a race window between sync and truncate. Writes may land in this window and generate dirty or writeback pages. This window is usually protected with a special truncate semaphore, e.g. when truncate is requested from the truncate syscall. Let's use this semaphore to avoid the write vs truncate race in vvp_prune(). Change-Id: Ie2ee29ea1e792e1b34b6de068ff2b84fd8f52f2a HPE-bug-id: LUS-9927, LUS-11612 Signed-off-by: Andriy Skulysh <andriy.skulysh@hpe.com> Reviewed-by: Vitaly Fertman <c17818@cray.com> Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com> Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50742 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16954 llite: do not set SB_I_CGROUPWB on super block On clients with a more recent kernel, e.g. ubuntu2204, this makes the mount fail sometimes with sysfs: cannot create duplicate filename '/devices/virtual/bdi/lustre-ffff8dd549f3d000' Change-Id: Ie15e41eb9d039829545e1d69f97ed9e13f89e53e Fixes: f5a75ea44d ("LU-16697 llite: Set BDI_CAP_* flags for lustre") Test-Parameters: clientdistro=ubuntu2204 testlist=sanity,conf-sanity Signed-off-by: Li Dongyang <dongyangli@ddn.com> Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51701 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Emoly Liu <emoly@whamcloud.com> Reviewed-by: Qian Yingjin <qian@ddn.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16958 llite: migrate vs regular ops deadlock When lov_conf_set() needs to lock the inode, it may already hold the inode's lli_layout_mutex; we need to unlock the layout mutex before taking the inode lock to keep the lock order. Fixes: 51d62f2122f ("LU-16637 llite: call truncate_inode_pages() in inode lock") Signed-off-by: Bobi Jam <bobijam@whamcloud.com> Change-Id: I7ee58039a6d31daefc625ac571a52baf112f8151 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51641 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15535 llite: deadlock on lli_lsm_sem It may happen that one process is doing a lookup and, after the reply, while holding the LDLM lock, is trying to update the LSM/default LSM under the write lli_lsm_sem for a dir. Another process has taken the read lli_lsm_sem (taken for all the MD ops in ll_prep_md_op_data()) and is waiting on the server for a conflicting PW LDLM lock for its modification of this dir. This may happen on restriping with an LSM or on changing the default LSM, but an even more common case is a racer run, even without striped dirs: - racer does lfs mkdir -i $i <subdir> per each MDS, which creates a default LSM on these subdirs, inherited endlessly, to keep the MDS index; - racer also does mkdir -p <path>, in which case we do: ll_new_node - create a parent dir, no RMF_DEFAULT_MDT_MD in reply; ll_lookup parent it=open - no RMF_DEFAULT_MDT_MD in reply; ll_new_node - create a child. The default LSM is inherited on the parent creation; however, as those RPCs carry no lookup LDLM lock and no data, the default layout is not set for the parent in the inode at the time of the child creation. Thus a parallel lookup which gets the LSM deadlocks with this ll_new_node(). At the same time, similar to CLIO, we do not need to hold a sem or an LDLM lock over the whole operation to avoid LSM modification on the server; we just need to take an up-to-date LSM (this is a subject for LU-16320) and guarantee this op will keep working on this LSM on the client for the whole operation. The solution is to let MD ops work on a copy of the LSM, therefore letting others modify the LSM attached to the inode in parallel if needed. 
HPE-bug-id: LUS-10725 Signed-off-by: Vitaly Fertman <vitaly.fertman@hpe.com> Change-Id: I3137300b5bcce2e890994ce8751cdf7fce2f3f54 Reviewed-on: https://es-gerrit.hpc.amslabs.hpecorp.net/161525 Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com> Reviewed-by: Andriy Skulysh <c17819@cray.com> Tested-by: Vitaly Fertman <c17818@cray.com> Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50489 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
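The copy-of-LSM approach above can be sketched with a refcounted snapshot: an MD op takes a reference to the current LSM and works on that copy, so another thread may swap in a new LSM concurrently without blocking or being blocked. All names here are illustrative stand-ins, not the Lustre API, and the single-threaded sketch omits the short-lived lock that would guard the pointer in real code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical simplified stand-in for the layout (LSM) object. */
struct lsm_sketch {
	int refcount;
	int stripe_count;
};

static struct lsm_sketch *current_lsm;

/* An MD op snapshots the LSM; under a short-lived lock in real code. */
static struct lsm_sketch *lsm_get(void)
{
	current_lsm->refcount++;
	return current_lsm;
}

static void lsm_put(struct lsm_sketch *lsm)
{
	if (--lsm->refcount == 0)
		free(lsm);
}

/* Replace the inode's LSM; in-flight ops keep their old copy alive. */
static void lsm_swap(struct lsm_sketch *nlsm)
{
	struct lsm_sketch *old = current_lsm;

	nlsm->refcount = 1;
	current_lsm = nlsm;
	lsm_put(old);
}
```

Because each op pins its own snapshot, the write lli_lsm_sem no longer needs to be held across a server round trip, which removes the lock-ordering cycle described above.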