EX-4141 lipe: lamigo should detect dead OST and restart ALR
author Alexandre Ioffe <aioffe@ddn.com>
Tue, 29 Mar 2022 07:48:35 +0000 (00:48 -0700)
committer Andreas Dilger <adilger@whamcloud.com>
Mon, 31 Oct 2022 04:09:27 +0000 (04:09 +0000)
commit 51ec673af118019c69ea63196301ae5bf3126149
tree 3382399efbb322c08fd7137bca46de2e666cffa9
parent 4e43fe8d0c72dac812db00b3eedde4a93aa0f522

Use the '# keepalive' message and an ssh read with a timeout
to detect that an OST is down and restart ALR.
Add stats for the last message seen from ALR.

To keep lamigo compatible with older versions of
ofd_access_log_reader, lamigo can work in two modes:
1. lamigo does not expect the '# keepalive' message.
In this case, after the timeout it restarts
ofd_access_log_reader silently.
2. lamigo expects a periodic '# keepalive'
message. If lamigo receives neither a keepalive message
nor any other message from ofd_access_log_reader
within the timeout, it reports an error and
restarts ofd_access_log_reader.
lamigo switches from mode 1 to mode 2 once it receives
a '# keepalive' message.
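
The two modes described above can be sketched roughly as follows. This is an
illustration only, not the code from lamigo_alr.c: the names alr_watch,
alr_read_event and alr_handle_event are hypothetical, a plain file descriptor
stands in for the ssh channel, and poll() stands in for the ssh read with
timeout. Treating a closed stream like a timeout, and staying in mode 2 after
a restart, are also assumptions of this sketch.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* What one timed read from the reader's output produced. */
enum alr_event { ALR_DATA, ALR_KEEPALIVE, ALR_TIMEOUT, ALR_CLOSED };

/* Hypothetical per-reader state; not the struct used by lamigo. */
struct alr_watch {
	bool expect_keepalive;	/* false = mode 1, true = mode 2 */
	int restarts;		/* restarts requested so far */
	int errors_reported;	/* errors reported (mode 2 only) */
};

/* Wait up to timeout_ms for output on fd and classify what arrived. */
static enum alr_event alr_read_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char buf[256];
	ssize_t n;
	int rc;

	rc = poll(&pfd, 1, timeout_ms);
	if (rc == 0)
		return ALR_TIMEOUT;	/* nothing arrived in time */
	if (rc < 0)
		return ALR_CLOSED;	/* sketch: any poll error ends the stream */

	n = read(fd, buf, sizeof(buf) - 1);
	if (n <= 0)
		return ALR_CLOSED;	/* EOF or read error */
	buf[n] = '\0';

	if (strncmp(buf, "# keepalive", strlen("# keepalive")) == 0)
		return ALR_KEEPALIVE;
	return ALR_DATA;
}

/* Apply the mode rules; return true if the reader should be restarted. */
static bool alr_handle_event(struct alr_watch *w, enum alr_event ev)
{
	assert(w != NULL);

	switch (ev) {
	case ALR_KEEPALIVE:
		/* First keepalive switches from mode 1 to mode 2. */
		w->expect_keepalive = true;
		return false;
	case ALR_DATA:
		/* Any other message also counts as a sign of life. */
		return false;
	case ALR_TIMEOUT:
	case ALR_CLOSED:	/* assumption: handled like a timeout */
		if (w->expect_keepalive)
			w->errors_reported++;	/* mode 2: report, then restart */
		w->restarts++;			/* mode 1: restart silently */
		return true;
	}
	return false;
}
```

A caller would loop on alr_read_event() and restart the reader whenever
alr_handle_event() returns true. The mode flag is kept across restarts here
so that a reader known to send keepalives keeps being treated as one.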

Signed-off-by: Alexandre Ioffe <aioffe@ddn.com>
Test-Parameters: trivial testlist=hot-pools
Change-Id: I55bc92b03ef5b45b72ff59ffd4b450cd1927cdb0
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/48647
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
lipe/.gitignore
lipe/src/lamigo.c
lipe/src/lamigo.h
lipe/src/lamigo_alr.c