LU-10026 osd-ldiskfs: use preallocation for dense writes use inode's preallocation chunks as per-inode group preallocation: just grab the very first available blocks from the window. Test-Parameters: env=ONLY=1000,ONLY_REPEAT=11 testlist=sanity-compr Test-Parameters: env=ONLY=fsx,ONLY_REPEAT=11 testlist=sanity-compr Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Change-Id: I9d36701f569f4c6305bc46f3373bfc054fcd61a9 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50171 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-17471 osd: add symlink for brw_stats Add a symlink at /proc/fs/lustre/osd-*/*/brw_stats pointing to /sys/kernel/debug/lustre/osd-*/*/brw_stats to fix compatibility with older utilities that still use the old proc entry. Test-Parameters: testlist=sanity env=ONLY=0f serverversion=2.15.4 Fixes: 8a84c7f9c7d6 ("LU-14927 osd: share brw_stats code between OSD back ends.") Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com> Change-Id: Ie86b2b384e3b91f98ead00b6325ddeb020e47aa5 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53829 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Timothy Day <timday@amazon.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16032 osd: move unlink of large objects to separate thread Final unlink and freeing of blocks for large objects can lead to a hung thread with this call stack:
Net: Service thread pid 1739 was inactive for 200.16s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
  __wait_on_buffer+0x2a/0x30
  ldiskfs_wait_block_bitmap+0xe0/0xf0 [ldiskfs]
  ldiskfs_read_block_bitmap+0x31/0x60 [ldiskfs]
  ldiskfs_free_blocks+0x329/0xbb0 [ldiskfs]
  ldiskfs_ext_remove_space+0x8a9/0x1150 [ldiskfs]
  ldiskfs_ext_truncate+0xb0/0xe0 [ldiskfs]
  ldiskfs_truncate+0x3b7/0x3f0 [ldiskfs]
  ldiskfs_evict_inode+0x58a/0x630 [ldiskfs]
  evict+0xb4/0x180
  iput+0xfc/0x190
  osd_object_delete+0x1f8/0x370 [osd_ldiskfs]
  lu_object_free.isra.30+0x68/0x170 [obdclass]
  lu_object_put+0xc5/0x3e0 [obdclass]
  ofd_destroy_by_fid+0x20e/0x500 [ofd]
  ofd_destroy_hdl+0x267/0x9f0 [ofd]
  tgt_request_handle+0xaee/0x15f0 [ptlrpc]
  ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
  ptlrpc_main+0xb34/0x1470 [ptlrpc]
  kthread+0xd1/0xe0
Move the final unlink to a workqueue if the inode size is > 1GB. The size threshold can be configured by setting the minimum async truncate size with the "osd-ldiskfs.*.delay_unlink_mb" parameter. Writes to the "osd-ldiskfs.*.force_sync" parameter will flush pending delayed unlinks so that space can be reclaimed as needed. Change-Id: Id535ae4c58732769effabee42835bc2da8cb5cc1 Signed-off-by: Artem Blagodarenko <ablagodarenko@whamcloud.com> DDN-bug-id: DDN-3144 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/47995 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
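The deferral policy described in LU-16032 can be modelled roughly as follows. This is only an illustrative sketch, not the osd-ldiskfs implementation: the class `DelayedUnlinkQueue` and its methods are hypothetical names, a plain queue stands in for the kernel workqueue, and the 1024 MB default is assumed from the "> 1GB" wording of the commit message.

```python
from collections import deque

MIB = 1 << 20


class DelayedUnlinkQueue:
    """Toy model: large objects are queued for later freeing, small ones freed inline."""

    def __init__(self, delay_unlink_mb=1024):
        # Assumed default: 1024 MB, matching the "> 1GB" threshold in the commit text.
        self.threshold_bytes = delay_unlink_mb * MIB
        self.pending = deque()  # stands in for the workqueue
        self.freed = []         # objects whose blocks have been released

    def unlink(self, name, size_bytes):
        """Free small objects immediately; defer objects above the threshold."""
        if size_bytes > self.threshold_bytes:
            self.pending.append(name)
            return "deferred"
        self.freed.append(name)
        return "inline"

    def force_sync(self):
        """Flush all pending delayed unlinks, as a write to force_sync would."""
        flushed = 0
        while self.pending:
            self.freed.append(self.pending.popleft())
            flushed += 1
        return flushed
```

The point of the design is that the service thread returns as soon as the large object is queued, while the slow block freeing happens elsewhere.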
LU-16847 ldiskfs: refactor code brw_stats code. Counting the number of disk or logical extents doesn't need a loop. All the required information is available around ldiskfs_map_blocks. HPe-bug-id: LUS-11645 Signed-off-by: Alexey Lyashkov <alexey.lyashkov@hpe.com> Change-Id: I77f3707b88e9bdf6ea06acc950af2a41f056f5d0 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51391 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com> Reviewed-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14918 osd: don't declare similar ldiskfs writes twice In some cases (like overstriping) the same operations can be declared multiple times (new llog records), which leads to a huge number of credits and performance degradation. We can avoid this by checking for duplicate declarations. As every declaration would need an allocation, limit the scope of these checks to transactions likely to be large. Percentage of "large" transactions in sanity-benchmark, depending on threshold:
  creates < 5 && writes < 5: 0.58% (mds1) and 2.97% (mds2)
  creates < 7 && writes < 7: 0.58% and 2.4%
  creates < 9 && writes < 9: 0.6% and 1.85%
  creates < 10 && writes < 10: 0.0004% and 0.000001%
Thus 10 creates or writes is selected as the threshold to enable this logic. Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Change-Id: I7c893fe3b95646b4b813b999bc832659dfcf03ad Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/45765 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Li Dongyang <dongyangli@ddn.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com>
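A minimal sketch of the duplicate-declaration check in LU-14918, assuming a write declaration can be identified by an (object, offset, length) key. The class and method names are hypothetical and do not mirror the osd-ldiskfs data structures; only the threshold of 10 comes from the commit message.

```python
class TransactionDeclarations:
    """Toy model of credit accounting with duplicate detection for large transactions."""

    DEDUP_THRESHOLD = 10  # from the commit message: 10 creates or writes

    def __init__(self):
        self.creates = 0
        self.writes = 0
        self.credits = 0
        self._seen = set()  # only populated once the transaction is "large"

    def declare_create(self, credits):
        self.creates += 1
        self.credits += credits

    def declare_write(self, obj_id, offset, length, credits):
        """Reserve credits for a write; return False if it was a known duplicate."""
        self.writes += 1
        large = (self.creates >= self.DEDUP_THRESHOLD
                 or self.writes >= self.DEDUP_THRESHOLD)
        if large:
            key = (obj_id, offset, length)
            if key in self._seen:
                return False  # duplicate: no extra credits reserved
            self._seen.add(key)
        self.credits += credits
        return True
```

Small transactions never touch the set, which reflects the stated trade-off: tracking costs an allocation per declaration, so it is only paid once a transaction is already large. In this sketch, declarations made before the threshold is reached are not remembered.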
LU-16120 build: Add support for kobj_type default_groups Linux commit v5.1-rc3-29-gaa30f47cf666 kobject: Add support for default attribute groups to kobj_type Linux commit v5.18-rc1-2-gcdb4f26a63c3 kobject: kobj_type: remove default_attrs Switch to using kobj_type default_groups when it is available. Provide support for default_attrs for older kernels. HPE-bug-id: LUS-11196 Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com> Change-Id: I43b03c67c22307293a2abc444aa1a73889ca09ee Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48365 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Jian Yu <yujian@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-16231 misc: rename lprocfs_stats functions Rename lprocfs_{alloc,register,clear,free}_stats() to be lprocfs_stats_*() so these functions can be found more easily in relation to struct lprocfs_stats. Test-Parameters: trivial Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: I671284a86ee2a1fd3c58da75923f9467e72540e5 Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48847 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Ellis Wilson <elliswilson@microsoft.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15642 obdclass: use consistent stats units Use consistent stats units, since some were "usec" and others "usecs". Most stats already use LPROCFS_TYPE_* to encode the stats type, so use this to provide units for those stats, and only explicitly provide strings for the few stats that don't match the commonly-used units. This also reduces the number of repeated static strings in the modules. Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: I25f31478f238072ddbf9a3918cd43bb08c3ebbe5 Reviewed-on: https://review.whamcloud.com/46833 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Jian Yu <yujian@whamcloud.com> Reviewed-by: Ben Evans <beevans@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-15548 osd-ldiskfs: hide virtual projid xattr Add tunable enable_projid_xattr to hide the virtual project ID xattr by default. Change-Id: I21263d91599f9e2d5850cb9d94a8b6df90c8443c Test-Parameters: trivial testlist=conf-sanity env=ONLY=131 Test-Parameters: testlist=sanity env=ONLY=904 Signed-off-by: Li Dongyang <dongyangli@ddn.com> Reviewed-on: https://review.whamcloud.com/46900 Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Li Xi <lixi@ddn.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13309 osd: use per-cpu counters for brw_stats Based on perf reports, oh_lock is highly contended when running IOR with NVMe storage, so we need to move to per-cpu counters. struct brw_stats becomes larger: from 3872 to 18208 bytes. Also, 4 bytes are allocated per each cpu for every counter. With an 8-cpu system and 32 4-byte per-cpu counters, there are 448 per-cpu counters or 1792 bytes per-cpu. These counters will either reuse already allocated per-cpu pages or allocate a new page on each cpu (8 pages total). Change-Id: I24536a0138067fb868aaf962d9321dea7566d13f Signed-off-by: Andrew Perepechko <andrew.perepechko@hpe.com> HPE-bug-id: LUS-8007, LUS-8185 Reviewed-on: https://review.whamcloud.com/37915 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
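The per-cpu footprint quoted in LU-13309 can be checked with a little arithmetic. The sketch below takes the 448 counters, 4 bytes per counter, and 8 cpus from the commit message; the 4096-byte page size and the function name are assumptions made for illustration.

```python
def per_cpu_footprint(counters, counter_size, cpus, page_size=4096):
    """Return (bytes needed on each cpu, worst-case new pages across all cpus)."""
    per_cpu_bytes = counters * counter_size
    # Ceiling division: pages needed on one cpu if nothing can be reused.
    pages_per_cpu = -(-per_cpu_bytes // page_size)
    return per_cpu_bytes, pages_per_cpu * cpus


# 448 counters * 4 bytes = 1792 bytes per cpu, at most one page on each of 8 cpus.
print(per_cpu_footprint(counters=448, counter_size=4, cpus=8))
```

Since 1792 bytes is less than half of an assumed 4 KiB page, the counters may fit into already allocated per-cpu pages, which is the "reuse" case the commit message mentions.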
LU-14927 osd: share brw_stats code between OSD back ends. Both the ldiskfs and ZFS OSD backends handle brw_stats. With the stricter GPL requirement, ZFS can no longer carry the brw_stats code. So move the common code to lprocfs_status_server.c and move brw_stats to debugfs as well. Change-Id: I294e5df3557552266dd3a02d3bc9844c42c01f60 Signed-off-by: James Simmons <jsimmons@infradead.org> Reviewed-on: https://review.whamcloud.com/44690 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Aurelien Degremont <degremoa@amazon.com> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-11407 obdclass: add start time to stats files When the stats files are initialized or reset, store the current timestamp with the stats. That allows computing average IO and RPC rates over the accumulated stats lifetime, in addition to the normal incremental operation rates found by comparing successive values read from the stats file with the read interval. Any stats that currently print the "snapshot_time:" header will now also print "start_time:" and "elapsed_time:" fields. Consolidate this printing into a helper function instead of duplicating very similar code in many different functions. Output can't be exactly the same for all callers, because these fields are embedded into different types of output files, but it is very close. Change struct rename_stats and brw_stats to use a common name prefix. Change the obd_job_stats timestamps to ktime_t so that we can use the common helper function for printing the header. It is easier to store ojs_cleanup_interval internally as 1/2 of the maximum stats age, since division is more easily done when the value is initially set as seconds compared to when it is ktime_t. This may also be a tiny bit more efficient since we don't do a divide/shift on each access. Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Iacefa17def455ef53a28fd14b6d4c670463ebbe5 Reviewed-on: https://review.whamcloud.com/33201 Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Ben Evans <beevans@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
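The two kinds of rate that LU-11407 distinguishes can be sketched as below. The function names are hypothetical, and the inputs are assumed to be counter values and timestamps (in seconds) already extracted from a stats file; no particular file format is implied.

```python
def average_rate(count, elapsed_seconds):
    """Lifetime average: operations per second since the stats were started or reset."""
    if elapsed_seconds <= 0:
        return 0.0
    return count / elapsed_seconds


def incremental_rate(prev_count, prev_snapshot, cur_count, cur_snapshot):
    """Interval rate: operations per second between two successive reads."""
    interval = cur_snapshot - prev_snapshot
    if interval <= 0:
        return 0.0
    return (cur_count - prev_count) / interval
```

The lifetime average needs only a single read once "elapsed_time:" is available, whereas the incremental rate always needs two samples and the interval between them.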
LU-14927 scrub: create shared scrub_needs_check() function. The current osd_consistency_check() functions in both ldiskfs and ZFS use ktime_* functions, which are exported only for pure GPL modules. This is not the case for ZFS. We can refactor the code to create a new common function scrub_needs_check() that can be used alongside osd_consistency_check(). Fix a few cases where the error code is not checked for ZFS. Change-Id: I0cc6cd84a35ecc10b511096f4e749a2961da3bbf Signed-off-by: James Simmons <jsimmons@infradead.org> Reviewed-on: https://review.whamcloud.com/44689 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Aurelien Degremont <degremoa@amazon.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14641 osd-ldiskfs: write commit declaring improvement This patch tries to: 1) Fix a case where the extent bytes estimate could miss an increase of less than 1M: compare the new value against the current value instead, and decay it on every allocation. 2) As filesystem space usage grows, the mballoc code will not try as hard to scan block groups for the best-aligned free extent. The extent bytes per extent could then decay to a very small value, which could make us reserve too many credits. We can be more optimistic in the credit reservations: even when the filesystem is nearly full, it is extremely unlikely that the worst case would ever be hit. 3) Add extent bytes stats and debugging support to analyze over-reservation problems. Signed-off-by: Wang Shilong <wshilong@ddn.com> Change-Id: I357c4a855147ba26a9e9bbe9ab1269bcfd44e5f3 Reviewed-on: https://review.whamcloud.com/43446 Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-14487 modules: remove references to Sun Trademark. "lustre" is no longer a Trademark of Sun Microsystems. There is no need to acknowledge the trademark in every file, so just remove all these claims. Test-Parameters: trivial Signed-off-by: Mr NeilBrown <neilb@suse.de> Change-Id: I66941494eabc54bedf85079c5b85701187f2a8f1 Reviewed-on: https://review.whamcloud.com/42139 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Aurelien Degremont <degremoa@amazon.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
LU-14286 osd-ldiskfs: fallocate with unwritten extents The osd_fallocate() code should typically be allocating unwritten extents with LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT instead of actually zeroing the blocks on disk with LDISKFS_GET_BLOCKS_CREATE_ZERO. Writing zeroes during fallocate() is typically slower initially, and is causing timeouts in sanity test_150e, which is trying to fill up all OSTs to 90%. In some cases, zeroing the underlying blocks can use the underlying storage support for efficient zeroing (WRITE_SAME), so it may be faster for later use than uninitialized extents that have to be converted to initialized extents by (possibly) splitting them into smaller extents and/or zero filling them when they are partially overwritten. Add a tunable parameter osd-ldiskfs.*.fallocate_zero_blocks to allow selecting this behavior at runtime. The default is -1, to disable fallocate completely (return -EOPNOTSUPP) due to current bugs. Test-Parameters: testlist=sanityn env=ONLY=16,ONLY_REPEAT=10 Fixes: 72617588ac8c ("LU-14286 osd-ldiskfs: fallocate() should zero new blocks") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Ida3692c487fdc8918863fc5c99459caaba17d92e Reviewed-on: https://review.whamcloud.com/41204 Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Arshad Hussain <arshad.hussain@aeoncomputing.com> Reviewed-by: John L. Hammond <jhammond@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
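How the fallocate_zero_blocks tunable from LU-14286 could select the allocation mode is sketched below. Only the -1 case (fallocate disabled, -EOPNOTSUPP) is stated in the commit message; the mapping of 0 to unwritten extents and 1 to zeroed blocks is an assumption, and the function name is hypothetical. The flag names are returned as plain strings purely for illustration.

```python
import errno


def fallocate_mode(fallocate_zero_blocks):
    """Map the tunable value to the allocation flag that would be used."""
    if fallocate_zero_blocks < 0:
        # Stated in the commit message: -1 disables fallocate entirely.
        raise OSError(errno.EOPNOTSUPP, "fallocate is disabled")
    if fallocate_zero_blocks:
        # Assumed: a non-zero value selects zeroing of the new blocks.
        return "LDISKFS_GET_BLOCKS_CREATE_ZERO"
    # Assumed: 0 selects unwritten extents.
    return "LDISKFS_GET_BLOCKS_CREATE_UNWRIT_EXT"
```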
LU-8066 osd-ldiskfs: quiet debug mount message We don't need a message printed to the console for every mount reporting that tunable parameters were configured. Test-Parameters: trivial Fixes: 493cd8088388 ("LU-8066 osd: migrate from proc to sysfs") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: I12fb89f8f15a86657fe5c1f46359f184ce3ebbe5 Reviewed-on: https://review.whamcloud.com/40968 Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: James Simmons <jsimmons@infradead.org> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-13344 all: Separate debugfs and procfs handling Linux 5.6 introduces proc_ops with v5.5-8862-gd56c0d45f0e2 proc: decouple proc from VFS with "struct proc_ops" Separate debugfs usage and procfs usage to prepare for the divergence of debugfs using file_operations and procfs using proc_ops HPE-bug-id: LUS-8589 Signed-off-by: Shaun Tancheff <shaun.tancheff@hpe.com> Change-Id: I1746e563b55a9e89f90ac01843c304fe6b690d8b Reviewed-on: https://review.whamcloud.com/37834 Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: James Simmons <jsimmons@infradead.org> Tested-by: jenkins <devops@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
LU-9091 sysfs: use string helper like functions for sysfs For a very long time the Linux kernel has supported the function memparse(), which allows memory sizes to be passed in with the suffix set of K, M, G, T, P, E. Lustre adopted this approach with its proc / sysfs implementation. The difference is that Lustre expanded this functionality to allow sizes with a fractional component, such as 1.5G. The code used to parse the numerical value is heavily tied to the debugfs seq_file handling and stomps on the passed-in buffer, which you can't do with sysfs files. Functionality similar to what Lustre does today exists in newer Linux kernels in the form of string helpers. Currently the string helpers only convert a numerical value to a human-readable format. A new function, string_to_size(), was created that takes a string and turns it into a numerical value. This enables the use of string helper suffixes, i.e. MiB, kB, etc., with the Lustre tunables, and we can now support base-10 suffixes, i.e. MB, kB, as well. String helper suffixes are already used for debugfs files, so I expect this to be adopted over time, and using string_to_size() should be encouraged for newer Lustre sysfs files. At the same time we want to preserve the original behavior of using the suffix set of K, M, G, T, P, E. To do this we create the function sysfs_memparse(), which supports the new string helper suffixes as well as the older set of suffixes. This new code is also much simpler than the current code. Change-Id: Ia437db44f2a987aa11ab4ff3e9df23e9aeba04d7 Signed-off-by: James Simmons <jsimmons@infradead.org> Reviewed-on: https://review.whamcloud.com/35658 Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Shaun Tancheff <stancheff@cray.com> Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: Maloo <maloo@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com>
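The suffix handling described in LU-9091 can be illustrated with a small parser. This is a sketch under stated assumptions, not the kernel's sysfs_memparse() or string_to_size(): the function `parse_size` is hypothetical, bare K/M/G/T/P/E letters are treated as powers of 1024 (as memparse() does), "KiB"-style suffixes as powers of 1024, and "kB"/"MB"-style suffixes as powers of 1000.

```python
import re

_UNITS = "KMGTPE"
_PATTERN = re.compile(r"(\d+)(?:\.(\d+))?\s*([A-Za-z]*)")


def _multiplier(suffix):
    """Return the byte multiplier for a legacy or string-helper style suffix."""
    if suffix in ("", "B"):
        return 1
    letter, rest = suffix[0].upper(), suffix[1:]
    if letter not in _UNITS or rest not in ("", "iB", "B"):
        raise ValueError(f"unknown size suffix: {suffix!r}")
    power = _UNITS.index(letter) + 1
    base = 1000 if rest == "B" else 1024  # "MB" is decimal, "M" and "MiB" are binary
    return base ** power


def parse_size(text):
    """Parse strings such as '1.5G', '8MiB' or '10MB' into a byte count."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    whole, frac, suffix = match.groups()
    multiplier = _multiplier(suffix)
    value = int(whole) * multiplier
    if frac:
        # Integer arithmetic avoids floating-point rounding of the fraction.
        value += int(frac) * multiplier // 10 ** len(frac)
    return value
```

The fractional part is scaled with integer arithmetic, so a value like 1.5G comes out as an exact byte count. The input string is never modified, which corresponds to the constraint that sysfs buffers must not be overwritten.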
LU-12071 osd-ldiskfs: bypass pagecache if requested In a few cases (non-rotational drive, by request, or file size) osd-ldiskfs may want to skip caching. If so, bypass the page cache instead of invalidating the cache later, as cache invalidation can be quite expensive. To set the maximum cached read/write IO size, use:
  lctl set_param osd-ldiskfs.*.readcache_max_io_mb=N
  lctl set_param osd-ldiskfs.*.writethrough_max_io_mb=N
The default maximum cached IO size is 8MiB. ladvise() forces IO to go through the cache, and all subsequent reads will consult the cache. Change-Id: I37403ced7ad9553128ba168fa36315d6aa1aaf2d Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-on: https://review.whamcloud.com/34422 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Tested-by: jenkins <devops@whamcloud.com> Reviewed-by: Wang Shilong <wshilong@ddn.com> Tested-by: Maloo <maloo@whamcloud.com>
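The bypass decision in LU-12071 might be modelled as below. The function and parameter names are hypothetical, and the order in which the conditions are checked is an assumption; the commit message only lists the cases and the 8MiB default for the maximum cached IO size.

```python
MIB = 1 << 20


def should_bypass_cache(io_bytes, max_io_mb=8, nonrotational=False,
                        cache_requested=False):
    """Decide whether an IO should skip the page cache."""
    if cache_requested:
        # An explicit request (e.g. via ladvise) keeps the IO in the cache.
        return False
    if nonrotational:
        # Assumed: caching is skipped for non-rotational devices.
        return True
    # IOs larger than the configured maximum are not cached.
    return io_bytes > max_io_mb * MIB
```

Deciding up front avoids populating the cache only to invalidate it afterwards, which the commit message identifies as the expensive path.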