LU-17634 hsm: serialize HSM restore for a file on a client 66/54366/6
author Qian Yingjin <qian@ddn.com>
Wed, 13 Mar 2024 01:33:19 +0000 (21:33 -0400)
committer Oleg Drokin <green@whamcloud.com>
Tue, 2 Apr 2024 21:03:56 +0000 (21:03 +0000)
commit a6b3faffeaea7abbef389ad5296880a522a13460
tree db96cac0bca8c826aacebeb0e4baf4d50f6f4b14
parent fa2cfb49decf3d897f63023c998a23fd98c5c3ea
LU-17634 hsm: serialize HSM restore for a file on a client

When tens of processes on one client read, in parallel, a file whose
HSM state is released, exists, and archived, one of the readers may
fail with a "No data available" (ENODATA) error.

Analysis of the error revealed the following race in the HSM code
when a released file is read on a client that has already been
granted the LAYOUT lock:
P1:
->vvp_io_init()
->lov_io_init_released(): io->ci_restore_needed = 1;
->vvp_io_fini()
  ->ll_layout_restore()
    ->mdc_ioc_hsm_request()
      ->mdc_hsm_request_lock_to_cancel()
        ->ldlm_cancel_resource_local()
          move the LAYOUT lock from the resource to the cancel list
          the LAYOUT lock is NOT yet canceled on the client via ELC...

P2:
->vvp_io_init()
->lov_io_init_released(): io->ci_restore_needed = 1;
->vvp_io_fini()
  ->ll_layout_restore()
    ->mdc_ioc_hsm_request()
      ->mdc_hsm_request_lock_to_cancel()
      SKIP: no conflicting LAYOUT lock is left on the resource lock
      list, as P1 has already moved it (if any) to its cancel list
    ->mdt_hsm_request()
      ->cdt_restore_handle_add()
        ->cdt_restore_handle_find()
        ->list_add_tail(): add @crh to restore handle list
        the EX LAYOUT lock that cancels cached LAYOUT locks on the
        client side is NOT yet obtained...

P3:
->ll_file_read_iter()
->ll_do_fast_read(): => return -ENODATA;
->vvp_io_init()
->lov_io_init_released(): io->ci_restore_needed = 1;
->vvp_io_fini()
  ->ll_layout_restore()
    ->mdc_ioc_hsm_request()
      ->mdc_hsm_request_lock_to_cancel()
      SKIP, as P1 has already moved the conflicting LAYOUT lock
      (if any) to its cancel list
    ->mdt_hsm_request()
      ->cdt_restore_handle_add()
        ->cdt_restore_handle_find()
        SKIP, as a restore handle with the same FID, added by P2,
        is already in the restore handle list.
  ->ll_layout_refresh()
  ->io->ci_need_restart = vio->vui_layout_gen != gen;
  ->the LAYOUT gen is unchanged because the LAYOUT lock on the
    client has not been revoked yet, so the I/O is not restarted...
->return -ENODATA; =>from fast read
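
The last steps of P3 can be illustrated with a minimal, standalone
sketch. It is not Lustre code: toy_io_finish(), TOY_IO_RESTART and the
parameter names are hypothetical, and the function only models the
generation comparison shown in the trace. If the layout generation is
unchanged after the refresh, no restart is requested and the error
recorded earlier by the fast read reaches the caller.

```c
#include <assert.h>	/* only for callers that check results with assert() */
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome code used by this sketch only. */
enum toy_io_outcome {
	TOY_IO_RESTART = 1,	/* layout changed: the I/O would be retried */
};

/*
 * gen_at_start: layout generation recorded when the I/O was set up
 * gen_now:      layout generation seen after the layout refresh
 * pending_rc:   result already recorded for the I/O (e.g. by fast read)
 */
int toy_io_finish(unsigned int gen_at_start, unsigned int gen_now,
		  int pending_rc)
{
	/* Same shape as: ci_need_restart = vui_layout_gen != gen */
	bool need_restart = gen_at_start != gen_now;

	if (need_restart)
		return TOY_IO_RESTART;

	/* No restart: the stale error is returned to the reader. */
	return pending_rc;
}
```

With equal generations and a pending -ENODATA, the sketch returns
-ENODATA, which mirrors what P3 observes above.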

Fix this bug by serializing the HSM restore operation for a file on
a client with @lli->lli_layout_mutex.
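
The idea of serializing restore per file can be sketched in userspace
with POSIX threads. This is a rough illustration, not the patch
itself: struct toy_inode, toy_layout_restore() and toy_run_readers()
are hypothetical names, the pthread mutex merely stands in for
lli_layout_mutex, and re-checking the released state under the mutex
is an assumption of the sketch, not something this message states
about the real change.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_READERS 64

/* Hypothetical stand-in for per-inode client state. */
struct toy_inode {
	pthread_mutex_t layout_mutex;	/* plays the role of lli_layout_mutex */
	bool released;			/* data currently lives only in the archive */
	unsigned int layout_gen;	/* bumped whenever the layout changes */
	int restore_requests;		/* restore requests actually issued */
};

/* Serialized restore: readers enter one at a time. */
static void toy_layout_restore(struct toy_inode *ino)
{
	pthread_mutex_lock(&ino->layout_mutex);
	if (ino->released) {
		ino->restore_requests++;	/* stands in for the restore RPC */
		ino->released = false;
		ino->layout_gen++;		/* later readers see a new layout */
	}
	pthread_mutex_unlock(&ino->layout_mutex);
}

static void *toy_reader(void *arg)
{
	toy_layout_restore(arg);
	return NULL;
}

/* Run nthreads parallel readers on one released file.
 * Returns the number of restore requests that were issued. */
int toy_run_readers(int nthreads)
{
	struct toy_inode ino = { .released = true, .layout_gen = 1 };
	pthread_t tids[TOY_MAX_READERS];
	int i, started = 0;

	if (nthreads > TOY_MAX_READERS)
		nthreads = TOY_MAX_READERS;

	pthread_mutex_init(&ino.layout_mutex, NULL);
	for (i = 0; i < nthreads; i++)
		if (pthread_create(&tids[started], NULL, toy_reader, &ino) == 0)
			started++;
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	pthread_mutex_destroy(&ino.layout_mutex);

	return ino.restore_requests;
}
```

In this model, however many readers race, exactly one of them issues
the restore and the rest find the file already restored once they get
the mutex, so no reader is left holding an unchanged layout generation
and a stale error.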

Add sanity-hsm/test_12{t, u} to verify the fix.

Signed-off-by: Qian Yingjin <qian@ddn.com>
Change-Id: Idc2a8c1818386c64798d7e28500c20c80ff369f1
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/54366
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Etienne AUJAMES <eaujames@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lustre/llite/file.c
lustre/tests/sanity-hsm.sh