LU-15069 llite: Add RAS_CDEBUG in needed spots Some of the basic readahead state controlling functions don't dump the readahead state. Fix that. Test-Parameters: trivial Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Ia36a8437d1877a31bfc18c1b6a4170f31383ae66 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/45656 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: Timothy Day <timday@amazon.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15069 llite: Rename 'skip' label There's a goto label in ras_update named just "skip". Skip what? This is extra confusing because the concept of "skip index" is used in neighboring code, and this is unrelated. Give it a more descriptive name. Test-Parameters: trivial Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I1e6ec7a75b6d9a296bfdea4c70a333497d804564 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/45653 Reviewed-by: Timothy Day <timday@amazon.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
LU-15069 llite: Remove ras_set_start ras_set_start is a one line function and serves only to obfuscate how simple "set_start" actually is. Test-Parameters: trivial Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I95d0b891ea2c88354dcb9e5b5a205cafa19380c7 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/45652 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Timothy Day <timday@amazon.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15274 llite: whole file read fixes

There are two significant issues with whole file read.

1. Whole file read does not interact correctly with fast reads - specifically, whole file read is not recognized by the fast read code, so files below the "max_read_ahead_whole_mb" limit will not use fast reads. This has a significant performance impact.

2. Whole file read does not start from the beginning of the file; it starts from the current IO index. This causes issues with unusual IO patterns, and can also confuse readahead more generally (I admit to not fully understanding what happens here, but the change is reasonable regardless). This is particularly important for cases where the read doesn't start at the beginning of the file but still reads the whole file (e.g., random or backwards reads).

Performance data: max_read_ahead_whole_mb defaults to 64 MiB, so a 64 MiB file is read with whole file readahead, and a 65 MiB file is not.

Without this fix:

rm -f file
truncate -s 64M file
dd if=file bs=4K of=/dev/null
  67108864 bytes (67 MB, 64 MiB) copied, 7.40127 s, 9.1 MB/s

rm -f file
truncate -s 65M file
dd if=file bs=4K of=/dev/null
  68157440 bytes (68 MB, 65 MiB) copied, 0.0932216 s, 630 MB/s

Whole file readahead:     9.1 MB/s
Non whole file readahead: 630 MB/s

With this fix (same test as above):

Whole file readahead:     994 MB/s
Non whole file readahead: 630 MB/s (unchanged)

Fixes: 7864a68 ("LU-12043 llite,readahead: don't always use max RPC size")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Change-Id: I72f0b58e289e83a2f2a3868ef0d433a50889d4c0
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/54011
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Tested-by: Shuichi Ihara <sihara@ddn.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
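The whole-file readahead decision described above can be sketched as a toy model. This is a Python simulation with hypothetical names (`whole_file_window`, `max_whole_mb`), not the actual Lustre C implementation: a file at or below the "max_read_ahead_whole_mb" limit gets a window covering the entire file starting at index 0 (the fix), while a larger file falls back to a normal window at the current IO index.

```python
PAGE_SIZE = 4096

def whole_file_window(file_size, io_index, max_whole_mb):
    """Toy model of the whole-file readahead decision (hypothetical
    names; a simplification, not the Lustre implementation)."""
    file_pages = (file_size + PAGE_SIZE - 1) // PAGE_SIZE
    whole_limit_pages = max_whole_mb * 1024 * 1024 // PAGE_SIZE
    if file_pages <= whole_limit_pages:
        # With the fix: start from the beginning of the file, not from
        # the current IO index, and cover the entire file.
        return (0, file_pages)
    # Otherwise, a normal readahead window starting at the IO index.
    return (io_index, None)

# A 64 MiB file is within the default 64 MiB limit -> whole-file window
assert whole_file_window(64 * 1024 * 1024, 5, 64) == (0, 16384)
# A 65 MiB file exceeds it -> normal readahead from the current index
assert whole_file_window(65 * 1024 * 1024, 5, 64) == (5, None)
```

This mirrors the dd test above: the 64 MiB file qualifies for whole-file readahead (and, with the fix, also for fast reads), while the 65 MiB file does not.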
LU-17469 llite: hold object reference in IO There could be a race between page write and inode free; hold a cl_object reference during the IO to avoid accessing a freed object. Signed-off-by: Bobi Jam <bobijam@whamcloud.com> Change-Id: Ic70cc27430e68265aba0662fc68e9bfe2f86cfe1 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53819 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13805 llite: make page_list_{add,del} symmetric An earlier patch created the slightly frightening situation where we use cl_page_list_del to remove references which were not taken by cl_page_list_add. This asymmetry is scary, so let's not do it. Instead, DIO now explicitly puts the only cl_page reference it takes. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I832d8ca7dc7f2f99dc30f972197bebc83b8b5977 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/52057 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
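The symmetry argument above can be illustrated with a toy reference-counting model. This is a Python sketch only loosely mirroring the cl_page_list API shape (the class and method names here are illustrative, not the Lustre interface): add takes its own reference, del drops exactly that reference, and the DIO caller puts its own reference separately, so no reference is leaked or dropped twice.

```python
class Page:
    """Toy stand-in for a cl_page with an explicit reference count."""
    def __init__(self):
        self.refcount = 0

class PageList:
    """Toy model of symmetric list add/del reference handling
    (illustrative only; not the actual Lustre cl_page_list API)."""
    def __init__(self):
        self.pages = []

    def add(self, page):
        page.refcount += 1        # add takes its own reference...
        self.pages.append(page)

    def delete(self, page):
        self.pages.remove(page)
        page.refcount -= 1        # ...and del drops only that reference

p = Page()
p.refcount = 1     # the caller's (DIO's) own reference
pl = PageList()
pl.add(p)          # list reference taken: refcount == 2
pl.delete(p)       # list reference dropped: refcount == 1
p.refcount -= 1    # DIO explicitly puts the reference it took
assert p.refcount == 0   # balanced: nothing leaked, nothing over-put
```

The pre-patch situation was the asymmetric version: delete dropped a reference that add never took, which only balanced out because the DIO path deliberately skipped its own put.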
LU-16314 llite: Migrate LASSERTF %p to %px This change covers lustre/ec through lustre/mgs and converts LASSERTF statements to explicitly use %px. Use %px to explicitly report the non-hashed pointer value in messages printed when a kernel panic is imminent. When analyzing a crash dump, the associated kernel address can be used to determine the system state that led to the system crash. As crash dumps can be and are provided by customers from production systems, use of the kernel command line parameter no_hash_pointers is not always possible. Ref: Documentation/core-api/printk-formats.rst Test-Parameters: trivial HPE-bug-id: LUS-10945 Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com> Change-Id: I708d9ef60c63f5b4006c7986599a2f39fc9e5fdf Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51213 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16695 llite: switch to ki_flags from f_flags There are possible races between IO checking f_flags and fcntl changing f_flags. The kernel fixed most of these by copying most of the file flags into the iocb. Let's follow on and use those copied flags. This also lets us change them if we want, since they're now local to the specific IO. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Signed-off-by: Guillaume Courrier <guillaume.courrier@cea.fr> Change-Id: Ib98cccec0e7888865ec10dc5f76f1d9917a1aef7 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50493 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com> Reviewed-by: Etienne AUJAMES <eaujames@ddn.com>
LU-16713 llite: writeback/commit pages under memory pressure

Lustre buffered I/O does not work well with restrictive memcg control, which may result in OOM when the system is under memory pressure. Lustre has implemented unstable pages support similar to NFS, but it is disabled by default for performance reasons. In Lustre, a client pins the cache pages for writes until the write transaction is committed on the server (OST), even after these pinned pages have finished writeback. The server starts a transaction commit either because the commit interval (5 seconds, by default) for the backend storage (i.e. OST/ldiskfs) has been reached, or because there is not enough room in the journal for a particular handle to start. Until the write transaction has been committed and the client notified, these pages are pinned and not flushable in any way by the kernel. This means that when a client hits memory pressure there can be a large number of unfreeable (pinned and uncommitted) pages, so the application on the client ends up OOM killed: when asked to free up memory, it cannot. This is particularly common with cgroups, because when cgroups are in use the memory limit is generally much lower than the total system memory and is more likely to be reached.

The Linux kernel has a mature memory reclaim mechanism to avoid OOM even with cgroups. After a write dirties a page, the kernel calls @balance_dirty_pages(). If the dirtied and uncommitted pages are over the background threshold for the global memory limits or memory cgroup limits, the writeback threads are woken to perform some writeout. When allocating a new page for I/O under memory pressure, the kernel will try direct reclaim before allocating. For cgroups, it will try to reclaim pages from the memory cgroup over its soft limit. The slow page allocation path with direct reclaim will call @wakeup_flusher_threads() with WB_REASON_VMSCAN to start writing back dirty pages.
Our solution uses the kernel's page reclaim mechanism directly. On completion of page writeback (in @brw_interpret), we call @__mark_inode_dirty() to add the dirty inode, which has pinned uncommitted pages, to the @bdi_writeback, where each memory cgroup has its own @bdi_writeback to control writeback for the buffered writes within it. Thus, under memory pressure, the writeback threads will be woken up and will call @ll_writepages() to write out data. For background writeout (over the background dirty threshold) or writeback with WB_REASON_VMSCAN for direct reclaim, we first flush dirtied pages to OSTs and then sync them to OSTs, forcing a commit of these pages to release them quickly. When a cgroup is under memory pressure, the kernel asks to do writeback and then an fsync is done to OSTs. This commits uncommitted/unstable pages, and then the kernel can finally free them.

Below are some performance results. The client has 512G of memory in total.

1. dd if=/dev/zero of=$test bs=1M count=$size

   I/O size         128G  256G  512G  1024G
   unpatched (GB/s)  2.2   2.2   2.1   2.0
   patched (GB/s)    2.2   2.2   2.1   2.0

   There is no performance regression after enabling unstable page accounting with the patch.

2. One process under different memcg limits, with total I/O size varying from 2X the memory limit to 0.5X the memory limit:

   dd if=/dev/zero of=$file bs=1M count=$((memlimit_mb * time))

   memcg limit          1G   4G   16G  64G
   2X memlimit (GB/s)   1.7  1.6  1.8  1.7
   1X memlimit (GB/s)   1.9  1.9  2.2  2.2
   .5X memlimit (GB/s)  2.3  2.3  2.2  2.3

   Without this patch, dd with I/O size > memcg limit would be OOM-killed.

3. Multiple cgroups: 8 cgroups in total, each with a memory limit of 8G. Run dd write in each cgroup with an I/O size of 2X the memory limit (16G).
   17179869184 bytes (17 GB, 16 GiB) copied, 12.7842 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.7889 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.9504 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 12.9577 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.4066 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.5397 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.5769 s, 1.3 GB/s
   17179869184 bytes (17 GB, 16 GiB) copied, 13.6605 s, 1.3 GB/s

4. Two dd writers, one (A) under memcg control and one (B) not. The total write data is 128G. The memcg limit varies from 1G to 128G. cmd: ./t2p.sh $memlimit_mb

   memlimit  dd writer (A)  dd writer (B)
   1G        1.3GB/s        2.2GB/s
   4G        1.3GB/s        2.2GB/s
   16G       1.4GB/s        2.2GB/s
   32G       1.5GB/s        2.2GB/s
   64G       1.8GB/s        2.2GB/s
   128G      2.1GB/s        2.1GB/s

   The results demonstrate that a process with memcg limits has almost no impact on the performance of a process without limits.

Test-Parameters: clientdistro=el8.7 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10
Test-Parameters: clientdistro=el9.1 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10
Signed-off-by: Qian Yingjin <qian@ddn.com>
Change-Id: I7b548dcc214995c9f00d57817028ec64fd917eab
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50544
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alex Deiter <alex.deiter@gmail.com>
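The pinned/unstable page lifecycle described in this commit can be modeled in a few lines. This is a Python toy simulation (the class and method names are hypothetical, not Lustre code): writeback alone moves pages from dirty to a pinned "unstable" state without freeing anything, and only a server commit, which the patch forces via fsync under memory pressure, makes them freeable.

```python
class Client:
    """Toy model of unstable-page accounting: written pages stay
    pinned on the client until the server commits the transaction
    (a hypothetical simplification of the mechanism above)."""
    def __init__(self):
        self.dirty = 0      # pages dirtied but not yet written back
        self.unstable = 0   # written back but uncommitted (still pinned)

    def writeback(self):
        # Writeback alone does not free memory: pages move from
        # dirty to unstable, where they remain pinned.
        self.unstable += self.dirty
        self.dirty = 0

    def commit(self):
        # Only a server commit (e.g. forced by fsync under memory
        # pressure) makes the pinned pages freeable.
        freed = self.unstable
        self.unstable = 0
        return freed

c = Client()
c.dirty = 1000
c.writeback()
assert c.unstable == 1000   # pinned: the kernel cannot reclaim these
assert c.commit() == 1000   # forcing a commit finally releases them
```

This is why the patch hooks @__mark_inode_dirty() into writeback completion: it keeps the inode visible to the flusher threads until the commit step has actually run, rather than letting the kernel believe the memory is already clean.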
LU-12518 llite: rename count and nob variables to bytes Rename "*count", "*nob", and "cnt" and similar variables to use "*bytes" to make it clear what the units are vs. number of pages. Test-Parameters: trivial Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: I195f2db4182e4b3099b3f4aa2e25b91f9f3ebbe5 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/38154 Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
LU-12645 llite: Move readahead debug before exit The core debug of ll_readahead() is before two return conditions, which makes it really tricky to debug those conditions. Let's fix that. Test-Parameters: trivial Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Ic3a3854527cad62c891c6a25029353a4742e555f Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51932 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-8191 llite: convert functions to static Static analysis shows that a number of functions could be made static. This patch declares several functions in llite static. Also, preserve more '*' in comments. Test-Parameters: trivial Signed-off-by: Timothy Day <timday@amazon.com> Change-Id: Iafa3bb84de158e31b27b7784243bc15e78187f10 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51441 Tested-by: Maloo <maloo@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: jsimmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16805 llite: improve readpage debug LU-16412 (which is a workaround for a kernel bug) added a debug message in ll_readpage(), but this message is printed every time rather than only when the kernel bug is hit. Let's fix this. Fixes: 209afbe28b ("LU-16412 llite: check truncated page in ->readpage()") Test-Parameters: trivial Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Ice02178eb9c07e03b58fb4e2d64ed3ea878cf137 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50892 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Timothy Day <timday@amazon.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
LU-12610 cfs: add unlikely to CFS_ macros Fix the (hopefully) last few OBD_ users to use CFS_ macros instead. Add an 'unlikely()' to CFS_ macros. Some of the OBD_ macros included this hint. Once those macros are removed, the hint will be lost. Add it to the CFS_ macros instead. The libcfs_fail.h only has a couple style issues left. Just fix them in this patch. Test-Parameters: trivial Signed-off-by: Timothy Day <timday@amazon.com> Change-Id: Ie06533b8b408cacf6f6fe2d29a1a8e727ca4280b Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51291 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-12610 llite: remove OBD_ -> CFS_ macros Remove OBD macros that are simply redefinitions of CFS macros. Signed-off-by: Timothy Day <timday@amazon.com> Signed-off-by: Ben Evans <beevans@whamcloud.com> Change-Id: I7bbcc3e1fda6418c258eb4d1c52b929a7cf72ed1 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50804 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13199 lustre: remove cl_{offset,index,page_size} helpers These helpers can be replaced directly with PAGE_SIZE and PAGE_SHIFT calculations, which avoids CPU overhead. Change-Id: I624136d4399a03e599f09f00a77b86de045f19e9 Signed-off-by: Wang Shilong <wshilong@ddn.com> Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/37426 Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
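The removed helpers amounted to simple shift arithmetic, which the patch open-codes at the call sites. A Python sketch of the equivalence (assuming 4 KiB pages; the function bodies here model the shifts, not the original C helpers verbatim):

```python
PAGE_SHIFT = 12              # assumes 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT

def cl_offset(index):
    """Page index -> byte offset, i.e. index << PAGE_SHIFT."""
    return index << PAGE_SHIFT

def cl_index(offset):
    """Byte offset -> page index, i.e. offset >> PAGE_SHIFT."""
    return offset >> PAGE_SHIFT

assert cl_offset(3) == 3 * PAGE_SIZE
assert cl_index(cl_offset(3)) == 3
# Any byte within a page maps back to that page's index:
assert cl_index(cl_offset(3) + PAGE_SIZE - 1) == 3
```

Open-coding the shifts removes a function call per conversion on hot I/O paths, which is the CPU overhead the commit refers to.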
LU-16713 llite: add __GFP_NORETRY for read-ahead page We need __GFP_NORETRY for read-ahead pages, otherwise the reading process may be OOM killed when the cgroup memory limit is reached. Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: If699429d5d5cd29bd895d8455296113aa67645fc Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50625 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16649 llite: EIO is possible on a race with page reclaim We must clear the 'uptodate' page flag when we delete a page from Lustre, or stale reads can occur. However, generic_file_buffered_read requires any pages returned from readpage() be uptodate. So, we must retry reading if page truncation happens in parallel with the read. This implements the same fix as: https://review.whamcloud.com/49647 b4da788a819f82d35b685d6ee7f02809c05ca005 did for the mmap path. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Iae0d1eb343f25a0176135347e54c309056c2613a Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50344 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com> Reviewed-by: Qian Yingjin <qian@ddn.com> Reviewed-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
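The retry-on-truncation pattern described above can be sketched as a toy loop. This is a Python simulation (the function name and return shape are hypothetical, not the Lustre code): a page truncated in parallel comes back not uptodate, and the fix is to retry the read rather than surface a spurious EIO.

```python
def read_with_retry(fetch, max_retries=3):
    """Toy model: fetch() returns (data, uptodate). If the page was
    truncated while we read (not uptodate), retry instead of failing.
    Illustrative only; not the actual Lustre read path."""
    for _ in range(max_retries):
        data, uptodate = fetch()
        if uptodate:
            return data
        # Page was deleted under us and its 'uptodate' flag cleared:
        # loop and read it again, as the mmap-path fix also does.
    raise IOError("EIO")  # only after repeated failures

# Simulate one race with truncation: the first read observes a
# truncated (not uptodate) page, the retry succeeds.
attempts = iter([(None, False), (b"data", True)])
assert read_with_retry(lambda: next(attempts)) == b"data"
```

The real code must retry because generic_file_buffered_read requires any page returned from readpage() to be uptodate, while correctness requires clearing 'uptodate' on deletion to avoid stale reads; the retry reconciles the two requirements.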
LU-16327 llite: read_folio, release_folio, filler_t Linux commit v5.18-rc5-221-gb7446e7cf15f fs: Remove aop flags parameter from grab_cache_page_write_begin() flags have been dropped from write_begin() and grab_cache_page_write_begin() Linux commit v5.18-rc5-241-g5efe7448a142 fs: Introduce aops->read_folio Provide a ll_read_folio handler around ll_readpage Linux commit v5.18-rc5-280-ge9b5b23e957e fs: Change the type of filler_t Affects read_cache_page, provides a wrapper for read_cache_page and wrappers for filler functions Linux commit v5.18-rc5-282-gfa29000b6b26 fs: Add aops->release_folio Provide an ll_release_folio function based on ll_releasepage Test-Parameters: trivial HPE-bug-id: LUS-11357 Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com> Change-Id: Ibd4ec1133c80cd0eb8400c4cd07b50e421dd35c5 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49199 Tested-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16579 llite: fix the wrong beyond read end calculation During testing, we found a dead loop in the read path which endlessly returns AOP_TRUNCATED_PAGE (0x8001). The reason is that the calculation of the ending beyond offset is wrong: (iter->count + iocb->ki_pos). The ending beyond offset is not supposed to change during the read I/O loop for each page in buffered I/O mode. However, @iter->count is decreased by the bytes read as the read of each page finishes: @iter->count -= read_bytes. In this patch, we store the ending beyond page index in @lcc->lcc_end_index before calling @generic_file_read_iter and entering the loop for each read page, which fixes this bug. Fixes: 2f8f38effa ("LU-16412 llite: check read page past requested") Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: I5bb7ab82e5e2de8b9bd911798fb8ae65fc7c91af Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50065 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
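The arithmetic behind this bug is easy to demonstrate. A Python toy model (hypothetical names; a simplification of the buffered-read loop, not the Lustre code): since @iter->count shrinks as each page's bytes are consumed, recomputing the end as ki_pos + iter->count mid-IO yields a value that moves backwards, whereas the fixed end page index, computed once up front and stored in @lcc->lcc_end_index, is stable.

```python
PAGE_SIZE = 4096

def end_offsets(ki_pos, count, pages_read):
    """Toy model of the bug: iter.count shrinks as pages are read, so
    recomputing the end as ki_pos + iter.count per page goes backwards.
    The fix computes the end index once, before the per-page loop."""
    fixed_end_index = (ki_pos + count - 1) // PAGE_SIZE  # computed once
    recomputed = []
    for _ in range(pages_read):
        recomputed.append(ki_pos + count)  # buggy: recomputed per page
        count -= PAGE_SIZE                 # iter.count -= read_bytes
    return fixed_end_index, recomputed

fixed, buggy = end_offsets(0, 3 * PAGE_SIZE, 3)
assert fixed == 2   # the last page index of the request never changes
# The buggy per-page recomputation shrinks with every page read, so the
# "past requested end" check can wrongly fire and loop forever:
assert buggy == [3 * PAGE_SIZE, 2 * PAGE_SIZE, PAGE_SIZE]
```

A shrinking end value makes pages that are legitimately part of the request look "past the end", triggering the endless AOP_TRUNCATED_PAGE retries this patch eliminates.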