Whamcloud - gitweb
LU-15938 lod: prevent endless retry in recovery thread 98/47698/3
authorMikhail Pershin <mpershin@whamcloud.com>
Wed, 22 Jun 2022 10:27:48 +0000 (13:27 +0300)
committerOleg Drokin <green@whamcloud.com>
Mon, 18 Jul 2022 20:25:18 +0000 (20:25 +0000)
commit1a24dcdce121787428ea820561cfa16ae24bdf82
tree6df1b11844ba12b6448c5d8ddfea92c7c71b7d66
parent5038bf8db83d4cb409b7563f028f48cca0385c19
LU-15938 lod: prevent endless retry in recovery thread

- abort lod_sub_recovery_thread() by obd_abort_recov_mdt in
  addition to obd_abort_recovery
- handle 'short llog' situation gracefully, when remote llog
  is shorter than local copy header expects, trust remote llog
  data and consider llog processing as finished
- on other errors during remote llog read, set obd_abort_recov_mdt
  but not obd_abort_recovery in attempt to skip MDT-MDT recovery
  only and continue with client recovery while possible
- fix parsing problem with 'abort_recov' and 'abort_recov_mdt' in
  lmd_parse() causing no MDT recovery abort but client recovery
  abort always. Allow also 'abort_recovery_mdt' mount option name

The original case with endless retry is caused by such de-sync
between local llog structures and remote llog. The local llog
header says there is record with some ID, so recovery thread
is trying to get that record from remote llog. Meanwhile there
is no such record on remote server, so it reads whole llog and
return it back properly but llog processing consider that as
incomplete llog due to network issues and retry endlessly.

Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: Ib127fd0d1abd5289d90c7b4b3ca74ab6fc78bc71
Reviewed-on: https://review.whamcloud.com/47698
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lustre/lod/lod_dev.c
lustre/mdt/mdt_handler.c
lustre/obdclass/llog_osd.c
lustre/obdclass/obd_mount.c