LU-16935 llite: avoid hopeless i/o repeats 05/51505/10
author Vladimir Saveliev <vladimir.saveliev@hpe.com>
Wed, 23 Aug 2023 14:19:38 +0000 (17:19 +0300)
committer Oleg Drokin <green@whamcloud.com>
Sat, 23 Mar 2024 05:51:46 +0000 (05:51 +0000)
commit f5564c35ede12659acedd14845cb36e70563233a
tree aa4623aabeaadfda25a156464c1d712ce0b00c5f
parent 3edc71803af3b4dc672313cd1ba395de724fbc59
LU-16935 llite: avoid hopeless i/o repeats

On SLES12SP5 kernels (4.12.14_122.147, 4.12.14-122.162) a race between
ll_filemap_fault and ll_imp_inval may lead to a livelock:

  - ll_filemap_fault loops endlessly because filemap_fault()->readpage()
    returns VM_FAULT_SIGBUS (it cannot send a read RPC while the import
    is invalid) and because ll_page_inv_lock is incremented within
    cl_page_discard()->..->vvp_page_delete(), which is called after the
    readpage failure.

  - ll_imp_inval gets stuck in
    obd_import_event(IMP_EVENT_INVALIDATE)->..->osc_object_invalidate
    (before recovery), waiting for completion of the I/O that
    ll_filemap_fault cannot complete.

@ll_page_inv_lock is used to detect a page that the kernel reads
after it has been deleted from Lustre, which avoids potential
stale data reads. This seqlock lets us see that a page was
potentially deleted, catch that case, and repeat the I/O in
ll_filemap_fault() or vvp_io_read_start().
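
The retry pattern described above can be modelled in user space as
follows. This is only a sketch, not the llite code: struct inv_counter,
struct racy_read, read_once() and read_with_retry() are hypothetical
names, the seqlock is reduced to a bare counter, and max_attempts is an
artificial bound added so that the model always terminates. The two
marked lines play the roles that read_seqbegin() and read_seqretry()
play for a kernel seqlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-inode seqlock; only the counter is modelled. */
struct inv_counter {
	unsigned int seq;
};

/* Hypothetical read state: how many upcoming reads race with a page deletion. */
struct racy_read {
	int deletions_left;
};

/* One read attempt; a racing page deletion bumps the counter. */
static void read_once(struct inv_counter *c, struct racy_read *r)
{
	if (r->deletions_left > 0) {
		r->deletions_left--;
		c->seq++;
	}
}

/* Repeat the read while a deletion was observed; returns the number of attempts. */
static int read_with_retry(struct inv_counter *c, struct racy_read *r,
			   int max_attempts)
{
	int attempts = 0;
	unsigned int start;

	assert(max_attempts > 0);
	do {
		start = c->seq;		/* sample the counter before reading */
		read_once(c, r);
		attempts++;
	} while (c->seq != start &&	/* counter moved: the page may be stale */
		 attempts < max_attempts);

	return attempts;
}
```

In this model, a read that races with two deletions takes three
attempts, and a read with no racing deletion takes one.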

To avoid repeating the I/O endlessly by mistake, this patch only
increments @ll_page_inv_lock in vvp_page_delete() when the page being
deleted is in the PageUptodate state. A page that is not in the
PageUptodate state is usually deleted because of an error that does
not require a retry.
This way, ll_filemap_fault() and vvp_io_read_start() do not loop
endlessly on errors that do not need the I/O to be repeated, because
the seqlock @ll_page_inv_lock is left unchanged.
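
The effect of the change can be illustrated with a small user-space
model. It is a sketch under stated assumptions, not the patch itself:
struct model_inode, struct model_page, model_page_delete() and
model_fault() are hypothetical names, the seqlock is reduced to a
counter, every read is assumed to fail and leave the page not uptodate,
and max_attempts is an artificial bound that stands in for the endless
loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; the seqlock is reduced to a plain counter. */
struct model_inode {
	unsigned int inv_seq;
};

struct model_page {
	bool uptodate;
};

/*
 * Model of page deletion. With only_if_uptodate == false the counter is
 * bumped for every deleted page (behaviour before the change); with true
 * it is bumped only for pages that were uptodate (behaviour after it).
 */
static void model_page_delete(struct model_inode *inode,
			      const struct model_page *page,
			      bool only_if_uptodate)
{
	if (!only_if_uptodate || page->uptodate)
		inode->inv_seq++;
}

/* Model of a fault whose read always fails; returns the number of attempts. */
static int model_fault(struct model_inode *inode, bool only_if_uptodate,
		       int max_attempts)
{
	int attempts = 0;
	unsigned int start;

	assert(max_attempts > 0);
	do {
		/* The failed read never makes the page uptodate. */
		struct model_page page = { .uptodate = false };

		start = inode->inv_seq;
		/* The error path discards the page. */
		model_page_delete(inode, &page, only_if_uptodate);
		attempts++;
	} while (inode->inv_seq != start && attempts < max_attempts);

	return attempts;
}
```

With only_if_uptodate set to false the model retries until it hits
max_attempts; with it set to true the failed read is attempted once and
the counter stays unchanged.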

A test to illustrate the issue is added.

The sanity.sh tests are for testing I/O error handling.

cl_io_loop(): avoid restart if ci_tried_all_mirrors flag is set.

HPE-bug-id: LUS-11686
Signed-off-by: Vladimir Saveliev <vladimir.saveliev@hpe.com>
Signed-off-by: Qian Yingjin <qian@ddn.com>
Change-Id: I3b62bc95db01bf11f6098011bf29e4064c7e201e
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/51505
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Andrew Perepechko <andrew.perepechko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lustre/include/obd_support.h
lustre/llite/vvp_io.c
lustre/llite/vvp_page.c
lustre/obdclass/cl_io.c
lustre/tests/recovery-small.sh
lustre/tests/sanity.sh