LU-13805 llite: make page_list_{add,del} symmetric An earlier patch created the slightly frightening situation where we use cl_page_list_del to remove references which were not taken by cl_page_list_add. This asymmetry is scary, so let's not do it. Instead, DIO now explicitly puts the only cl_page reference it takes. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I832d8ca7dc7f2f99dc30f972197bebc83b8b5977 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/52057 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
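The symmetry that the LU-13805 change above asks for can be illustrated with a toy reference-counted list. This is only a sketch: demo_page, demo_page_list and the demo_* helpers are hypothetical stand-ins, not the real cl_page or cl_page_list API. The rule it shows is that "add" takes exactly one reference and "del" drops exactly that one, so any reference a caller took on its own must be put explicitly by that caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the Lustre structures. */
struct demo_page {
	int refcount;
	struct demo_page *next;
};

struct demo_page_list {
	struct demo_page *head;
	size_t nr;
};

static void demo_page_get(struct demo_page *pg)
{
	pg->refcount++;
}

static void demo_page_put(struct demo_page *pg)
{
	assert(pg->refcount > 0);
	pg->refcount--;
}

/* Adding to the list takes exactly one reference on behalf of the list. */
static void demo_list_add(struct demo_page_list *list, struct demo_page *pg)
{
	demo_page_get(pg);
	pg->next = list->head;
	list->head = pg;
	list->nr++;
}

/*
 * Removing the head drops only the reference taken by demo_list_add().
 * Any other reference stays with whoever took it.
 */
static struct demo_page *demo_list_del_head(struct demo_page_list *list)
{
	struct demo_page *pg = list->head;

	if (pg == NULL)
		return NULL;
	list->head = pg->next;
	pg->next = NULL;
	list->nr--;
	demo_page_put(pg);
	return pg;
}
```

A caller that holds its own reference (as DIO does in the commit description) would call demo_page_put() itself after the page has left the list, rather than relying on the list removal to drop it.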
LU-16314 obdclass: Migrate LASSERTF %p to %px This change covers lustre/obdclass through lustre/target and converts LASSERTF statements to explicitly use %px. Use %px to explicitly report the non-hashed pointer value in messages printed when a kernel panic is imminent. When analyzing a crash dump, the associated kernel address can be used to determine the system state that led to the system crash. As crash dumps can be and are provided by customers from production systems, the use of the kernel command line parameter no_hash_pointers is not always possible. Ref: Documentation/core-api/printk-formats.rst Test-Parameters: trivial HPE-bug-id: LUS-10945 Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com> Change-Id: Ia256dc1f74f976640ec82746a5d761ef662f45ae Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49405 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16837 lustre: avoid the same member name Several structures use the same member name, such as cl_ladvise_io::li_flags, layout_intent::li_flags and lfsck_instance::li_flags, which makes it hard to find where each one is used. This patch renames the member prefix of some structures to avoid the homonyms. Test-Parameters: trivial Signed-off-by: Bobi Jam <bobijam@whamcloud.com> Change-Id: Ie592afa06dd0abf0c1110843e5d8007a91c68145 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51766 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13805 llite: Implement unaligned DIO connect flag Unupgraded ZFS servers may crash if they receive unaligned DIO, so we need a compat flag and a test to recognize those servers. This patch implements that logic. Fixes: 7194eb6431 ("LU-13805 clio: bounce buffer for unaligned DIO") Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I5d6ee3fa5dca989c671417f35a981767ee55d6e2 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51126 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
LU-16713 llite: writeback/commit pages under memory pressure Lustre buffered I/O does not work well with restrictive memcg control. This may result in OOM when the system is under memory pressure. Lustre has implemented unstable pages support similar to NFS, but it is disabled by default for performance reasons. In Lustre, a client pins the cache pages for writes until the write transaction is committed on the server (OST), even if writeback of these pinned pages has already finished. The server starts a transaction commit either because the commit interval (5 seconds, by default) for the backend storage (i.e. OST/ldiskfs) has been reached or because there is not enough room in the journal for a particular handle to start. Until the write transaction has been committed and the client has been notified, these pages are pinned and not flushable in any way by the kernel. This means that when a client hits memory pressure there can be a large number of unfreeable (pinned and uncommitted) pages, so the application on the client will end up OOM-killed because it cannot free memory when asked to. This is particularly common with cgroups, because when cgroups are in use the memory limit is generally much lower than the total system memory limit and is more likely to be reached. The Linux kernel has a mature memory reclaim mechanism to avoid OOM even with cgroups. After a write dirties a page, the kernel calls @balance_dirty_pages(). If the dirtied and uncommitted pages are over the background threshold for the global memory limits or memory cgroup limits, the writeback threads are woken to perform some writeout. When allocating a new page for I/O under memory pressure, the kernel will try direct reclaim and then allocate. For a cgroup, it will try to reclaim pages from the memory cgroup over the soft limit. The slow page allocation path with direct reclaim will call @wakeup_flusher_threads() with WB_REASON_VMSCAN to start writeback of dirty pages.
Our solution uses the page reclaim mechanism in the kernel directly. On completion of page writeback (in @brw_interpret), call @__mark_inode_dirty() to add the dirty inode, which has pinned uncommitted pages, to the @bdi_writeback; each memory cgroup has its own @bdi_writeback to control the writeback for buffered writes within it. Thus, under memory pressure, the writeback threads will be woken up and will call @ll_writepages() to write out data. For background writeout (over the background dirty threshold) or writeback with WB_REASON_VMSCAN for direct reclaim, we first flush dirtied pages to OSTs, then sync them to OSTs and force a commit of these pages to release them quickly. When a cgroup is under memory pressure, the kernel asks to do writeback and then it does an fsync to OSTs. This commits uncommitted/unstable pages, and then the kernel can finally free them. Some performance results follow. The client has 512G memory in total.
1. dd if=/dev/zero of=$test bs=1M count=$size
I/O size            128G  256G  512G  1024G
unpatch (GB/s)      2.2   2.2   2.1   2.0
patched (GB/s)      2.2   2.2   2.1   2.0
There is no performance regression after enabling unstable page accounting with the patch.
2. One process under different memcg limits, with total I/O size varying from 2X memlimit to 0.5X memlimit: dd if=/dev/zero of=$file bs=1M count=$((memlimit_mb * time))
memcg limits        1G    4G    16G   64G
2X memlimit (GB/s)  1.7   1.6   1.8   1.7
1X memlimit (GB/s)  1.9   1.9   2.2   2.2
.5X memlimit (GB/s) 2.3   2.3   2.2   2.3
Without this patch, dd with I/O size > memcg limit will be OOM-killed.
3. Multiple cgroups testing: 8 cgroups in total, each with a memory limit of 8G. Run a dd write in each cgroup with an I/O size of 2X the memory limit (16G).
17179869184 bytes (17 GB, 16 GiB) copied, 12.7842 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.7889 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.9504 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 12.9577 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.4066 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.5397 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.5769 s, 1.3 GB/s
17179869184 bytes (17 GB, 16 GiB) copied, 13.6605 s, 1.3 GB/s
4. Two dd writers: one (A) is under memcg control and the other (B) is not. The total write data is 128G. Memcg limits vary from 1G to 128G. cmd: ./t2p.sh $memlimit_mb
memlimit  dd writer (A)  dd writer (B)
1G        1.3GB/s        2.2GB/s
4G        1.3GB/s        2.2GB/s
16G       1.4GB/s        2.2GB/s
32G       1.5GB/s        2.2GB/s
64G       1.8GB/s        2.2GB/s
128G      2.1GB/s        2.1GB/s
The results demonstrate that the process with memcg limits has nearly no impact on the performance of the process without limits.
Test-Parameters: clientdistro=el8.7 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10 Test-Parameters: clientdistro=el9.1 testlist=sanity env=ONLY=411b,ONLY_REPEAT=10 Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: I7b548dcc214995c9f00d57817028ec64fd917eab Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50544 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Alex Deiter <alex.deiter@gmail.com>
LU-12518 llite: rename count and nob variables to bytes Rename "*count", "*nob", and "cnt" and similar variables to use "*bytes" to make it clear what the units are vs. number of pages. Test-Parameters: trivial Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: I195f2db4182e4b3099b3f4aa2e25b91f9f3ebbe5 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/38154 Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
LU-16043 osc: allow error for write on CL_FSYNC_DISCARD In the case of CL_FSYNC_DISCARD, an error is allowed for a write of the osc object. Otherwise, the included test fails in rm with: (osc_page.c:174:osc_page_delete()) Trying to teardown failed: -16 (osc_page.c:175:osc_page_delete()) ASSERTION( 0 ) failed: (osc_page.c:175:osc_page_delete()) LBUG Test-Parameters: trivial testlist=sanity env=ONLY=907 HPE-bug-id: LUS-10410 Signed-off-by: Vladimir Saveliev <vladimir.saveliev@hpe.com> Change-Id: I0aae0dc470ba0371964e7643a6d84b19a1b4e106 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48032 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14301 client: use EOPNOTSUPP instead of ENOTSUPP Don't return NFS-specific error code ENOTSUPP back to userspace, instead use EOPNOTSUPP. ENOTSUPP does not print a useful error message from strerror() if it is hit by an application. Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Iabd07b31069737e8ee7ca2382fd8cff6143ebbe5 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51511 Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: jsimmons <jsimmons@infradead.org> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
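To make the LU-14301 rationale above concrete, here is a minimal userspace sketch. It assumes Linux, where the kernel-internal ENOTSUPP has the value 524 and is not defined in the userspace <errno.h>; DEMO_KERNEL_ENOTSUPP, demo_errno_to_user() and demo_describe() are hypothetical names used only for illustration. The patch itself replaces the error code at the source rather than translating it like this.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Value of the kernel-internal ENOTSUPP on Linux. Userspace <errno.h>
 * does not define it, so strerror() has no message for it.
 */
#define DEMO_KERNEL_ENOTSUPP 524

/* Hypothetical helper: map the kernel-internal code to EOPNOTSUPP. */
static int demo_errno_to_user(int err)
{
	return err == DEMO_KERNEL_ENOTSUPP ? EOPNOTSUPP : err;
}

/* Return the message an application would see after the mapping. */
static const char *demo_describe(int err)
{
	return strerror(demo_errno_to_user(err));
}
```

Calling strerror(524) directly typically yields a generic "unknown error" style string, whereas demo_describe(524) yields the regular message for EOPNOTSUPP.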
LU-8191 lustre: convert osp,osd,osc,ofd functions to static Static analysis shows that a number of functions could be made static. This patch declares several functions in osp, osd, osc, and ofd static. Also, fix a few minor style issues. Test-Parameters: trivial Signed-off-by: Timothy Day <timday@amazon.com> Change-Id: I3d7af7ec0fa2978bfdd0cb490f18f485a78f81f6 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51477 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: jsimmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-12610 osc: remove OBD_ -> CFS_ macros Remove OBD macros that are simply redefinitions of CFS macros. Also, convert some spaces to tabs. Test-Parameters: trivial Signed-off-by: Timothy Day <timday@amazon.com> Signed-off-by: Ben Evans <beevans@whamcloud.com> Change-Id: Icb4f25f51515d833fed2c05581288cde719c1d08 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51124 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
LU-13199 lustre: remove cl_{offset,index,page_size} helpers These helpers can be replaced with direct PAGE_SIZE and PAGE_SHIFT calculations, which avoids CPU overhead. Change-Id: I624136d4399a03e599f09f00a77b86de045f19e9 Signed-off-by: Wang Shilong <wshilong@ddn.com> Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/37426 Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
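The direct calculations that LU-13199 refers to amount to shifts and masks. The sketch below is a userspace illustration that assumes 4 KiB pages; DEMO_PAGE_SHIFT, DEMO_PAGE_SIZE and the demo_* functions are hypothetical names, standing in for the kernel's PAGE_SHIFT and PAGE_SIZE and for the open-coded expressions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for the example (4 KiB pages). */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Byte offset -> page index (what a cl_index-style helper computed). */
static inline uint64_t demo_index_of(uint64_t offset)
{
	return offset >> DEMO_PAGE_SHIFT;
}

/* Page index -> byte offset of the page start (cl_offset-style). */
static inline uint64_t demo_offset_of(uint64_t index)
{
	return index << DEMO_PAGE_SHIFT;
}

/* Byte offset within its page. */
static inline uint64_t demo_offset_in_page(uint64_t offset)
{
	return offset & (DEMO_PAGE_SIZE - 1);
}
```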
LU-16338 readahead: clip readahead with kms During I/O testing, it was found that the number of read-ahead pages reaches 255 for small files of only several KiB, so more than 1MiB of data is read. The reason is that the granted DLM extent lock is [0, EOF], which is larger than the requested extent. During readahead, the OSC layer also returns the [0, EOF] extent, which is clipped to the stripe size (1MiB) regardless of the actual object size. In this patch, the readahead range is clipped to the known minimum size (kms) at the OSC layer during readahead. This way, the read-ahead data will not go beyond the last page of the file. Add sanity/101m to verify it. This patch also fixes multiop to return successfully when reaching EOF instead of exiting with ENODATA during read. Test-Parameters: testlist=sanity env=ONLY=101k,ONLY_REPEAT=3 Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: I285e3e1d84ad06231039306106c74d775c1b0b50 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49226 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
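The clipping described in LU-16338 can be sketched as a small calculation. This is not the actual OSC code: demo_ra_pages() and the DEMO_* constants are hypothetical, and 4 KiB pages are assumed. The function returns how many pages of a requested readahead window lie at or before the last page covered by kms.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for the example (4 KiB pages). */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/*
 * Clip a readahead window of want_pages pages starting at page index
 * start_idx so that it does not extend past the last page covered by
 * kms (the known minimum size, in bytes).
 */
static uint64_t demo_ra_pages(uint64_t start_idx, uint64_t want_pages,
			      uint64_t kms)
{
	/* Number of pages needed to hold kms bytes, rounded up. */
	uint64_t file_pages = (kms + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
	uint64_t avail;

	if (start_idx >= file_pages)
		return 0;
	avail = file_pages - start_idx;
	return want_pages < avail ? want_pages : avail;
}
```

For a file of about 5000 bytes, a 256-page request starting at index 0 is clipped to 2 pages instead of covering a full 1MiB stripe.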
LU-15619 osc: pack osc_async_page better The oap_cmd field was used to store a number of other flags, but those were redundant with oap_brw_page.flag, and never used. That allows shrinking oap_cmd down to 2 bits. Modern GCC allows specifying a bitfield for an enum, so the size can be explicitly set. The oap_page_off always holds < PAGE_SIZE, so it can safely fit into PAGE_SHIFT bits, similar to ops_from. However, since this field is used in math operations and we don't need the space, always allocate it as an aligned 16-bit field. This allows packing oap_async_flags, oap_cmd, and oap_page_off into a 32-bit space. This avoids having holes in the struct. The explicit oap_padding fields are needed so that "packed" does not cause the fields to be misaligned, but still allows packing with the following 4-byte field in osc_page. Also move oap_brw_page to the end of the struct, since the bp_padding field therein is useless and can be removed. This allows better packing with the bitfields in struct osc_page. brw_page old size: 32, holes: 0, padding: 4 brw_page new size: 28, holes: 0, padding: 0 osc_async_page old size: 104, holes: 8, padding: 4 osc_async_page new size: 92, holes: 0, bit holes: 10 osc_page old size: 144, holes: 8, bit holes: 4 osc_page new size: 128, holes: 0, bit holes: 4 Together this saves 16 bytes *per page* in cache, and fits osc_page into a nicely sized allocation. That is 512MiB on a system with 128GiB of cache. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Ief6aa7664d7299dba02332bc9029e4e9219d0876 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/46721 Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
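The packing idea in LU-15619 and its per-page arithmetic can be sketched as follows. The structs are hypothetical and much smaller than osc_async_page; they only show how narrow bitfields plus an aligned 16-bit field fit into 32 bits. The sketch uses uint16_t bitfields, which GCC and Clang accept as an extension, rather than the enum bitfield mentioned in the commit. The savings calculation assumes 4 KiB pages, which the commit message does not state but which matches its 512MiB figure.

```c
#include <assert.h>
#include <stdint.h>

/* Loose layout: one full 32-bit word per field (12 bytes in total). */
struct demo_loose {
	uint32_t flags;
	uint32_t cmd;
	uint32_t page_off;
};

/*
 * Tight layout: small bitfields with explicit padding, followed by an
 * aligned 16-bit offset field, for 4 bytes in total.
 */
struct demo_tight {
	uint16_t flags:4;    /* a handful of async flags */
	uint16_t cmd:2;      /* read/write command fits in 2 bits */
	uint16_t padding:10; /* explicit padding keeps page_off aligned */
	uint16_t page_off;   /* always < PAGE_SIZE, so 16 bits suffice */
};

/* Memory saved across a page cache of cache_bytes bytes. */
static uint64_t demo_bytes_saved(uint64_t cache_bytes, uint64_t page_size,
				 uint64_t saved_per_page)
{
	return cache_bytes / page_size * saved_per_page;
}
```

With 128GiB of cache and 4 KiB pages there are 2^25 pages, so saving 16 bytes per page gives 2^29 bytes, which is the 512MiB quoted in the commit.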
LU-15619 osc: Remove oap lock The OAP lock is taken around setting the oap flags, but not any of the other fields in oap. As far as I can tell, this is just some cargo cult belief about locking - there's no reason for it. Remove it entirely. (From the code, a queued spin lock appears to be 12 bytes on x86_64.) Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: Ib61190d52c08d88c95a0c19b8ef7d114e26cfae2 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/46719 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-by: Zhenyu Xu <bobijam@hotmail.com>
LU-16160 osc: take ldlm lock when queue sync pages osc_queue_sync_pages() adds an osc_extent to the osc_object's IO extent list without taking ldlm locks, and then it calls osc_io_unplug_async() to queue the IO work for the client. This patch makes sync page queuing take the ldlm lock in the osc_extent. Signed-off-by: Bobi Jam <bobijam@whamcloud.com> Change-Id: Idefa2981e62a2a6e10d8b8a7692c0337b61b9052 Reviewed-on: https://review.whamcloud.com/48557 Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16057 obdclass: set OBD_MD_FLGROUP for ladvise RPC The ladvise RPC doesn't have OBD_MD_FLGROUP set, so when the RPC reaches the server, tgt_validate_obdo() will corrupt the FID if its seq is in the FID_SEQ_NORMAL range. Do not mess with seq in obdo_to_ioobj() and tgt_validate_obdo(), since all RPCs since 2.0 should have OBD_MD_FLGROUP set. Add OBD_MD_FLGROUP for the ladvise RPC to fix new clients talking to old servers. Change-Id: I373b7f32458b18e29d9bb716a912fe4a54eccac5 Signed-off-by: Li Dongyang <dongyangli@ddn.com> Reviewed-on: https://review.whamcloud.com/48080 Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15619 osc: Remove submit time The osc page submit time is an unused bit of debugging information, but it's allocated for every page. Let's just remove it to save memory. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> Change-Id: I160d38039332cb17e07735b60ce7979626ed43dc Reviewed-on: https://review.whamcloud.com/46712 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15519 quota: fallocate does not increase projectid usage fallocate() was not accounting for projectid quota usage. This was happening for two reasons: 1) the projectid was not properly passed to md_op_data in ll_set_project() and 2) the OBD_MD_FLPROJID flag was not set to receive the projectid. This patch addresses both. Test-case: sanity-quota/78a added Fixes: 48457868a02a ("LU-3606 fallocate: Implement fallocate preallocate operation") Signed-off-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Change-Id: I3ed44e7ef7ca8fe49a08133449c33b62b1eff500 Reviewed-on: https://review.whamcloud.com/46676 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15381 hsm: update size upon completion of data version During testing, we found that an HSM restore followed by an HSM release wrongly sets the file size to 0. The reason is that the file size and blocks information obtained via @ll_merge_attr() is incorrect. The data version operation flushes dirty pages from all clients, so the size and blocks information returned from the Lustre OST is correct. In this patch, we update the size and blocks attributes of a file upon completion of the data version operation accordingly. This way, HSM release will set the size and blocks information correctly after the data version ioctl operation. Add sanity-hsm test_261. Signed-off-by: Qian Yingjin <qian@ddn.com> Change-Id: Ifdbf6b58ecd00dc9677a2328438ef68529b72882 Reviewed-on: https://review.whamcloud.com/45935 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Bobi Jam <bobijam@hotmail.com> Reviewed-by: Artem Blagodarenko <artem.blagodarenko@hpe.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15167 quota: fallocate send UID/GID for quota Calling fallocate() on a newly created file did not account quota usage properly because the OST object did not have a UID/GID assigned yet. Update the fallocate code in the OSC to always send the file UID/GID/PROJID to the OST so that the object ownership can be updated before space is allocated. Test-case: sanity-quota/78 added Fixes: 48457868a02a ("LU-3606 fallocate: Implement fallocate preallocate operation") Signed-off-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Change-Id: I86d80a7f415a80100f7d2fb5f417cf47bf5b2900 Reviewed-on: https://review.whamcloud.com/45475 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Bobi Jam <bobijam@hotmail.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>