From 574af56b4f22b9864a9dfeec67120dc36ace553d Mon Sep 17 00:00:00 2001
From: Andreas Dilger
Date: Wed, 28 Aug 2019 14:32:35 -0600
Subject: [PATCH] LUDOC-11 debug: add went_back_in_time label

Add a section label for the "went back in time" error message, so
that it can more easily be referenced by an error message in the
code (formerly a link to a bugzilla ticket). Improve the description
of the problem and reformat the text to fit within 80 columns.

Signed-off-by: Andreas Dilger
Change-Id: I612cbd090cd695272009754364be53f65c3ebbe5
Reviewed-on: https://review.whamcloud.com/35954
Tested-by: jenkins
Reviewed-by: Joseph Gmitter
---
 LustreTroubleshooting.xml | 54 +++++++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 18 deletions(-)

diff --git a/LustreTroubleshooting.xml b/LustreTroubleshooting.xml
index fa3799e..5e23fce 100644
--- a/LustreTroubleshooting.xml
+++ b/LustreTroubleshooting.xml
@@ -655,31 +655,49 @@ ptlrpc_main+0x42e/0x7c0 [ptlrpc]
- Handling/Debugging "LustreError: xxx went back in time"
- Each time the Lustre software changes the state of the disk file system, it records a
- unique transaction number. Occasionally, when committing these transactions to the disk, the
- last committed transaction number displays to other nodes in the cluster to assist the
- recovery. Therefore, the promised transactions remain absolutely safe on the disappeared
- disk.
+ Handling/Debugging "LustreError: xxx went back in time"
+ Each time the MDS or OSS modifies the state of the MDT or OST disk
+ filesystem for a client, it records a per-target increasing transaction
+ number for the operation and returns it to the client along with the
+ reply to that operation. Periodically, when the server commits these
+ transactions to disk, the last_committed transaction
+ number is returned to the client to allow it to discard pending operations
+ from memory, as they will no longer be needed for recovery in case of
+ server failure.
+ In some cases error messages similar to the following have
+ been observed after a server was restarted or failed over:
+
+LustreError: 3769:0:(import.c:517:ptlrpc_connect_interpret())
+testfs-ost12_UUID went back in time (transno 831 was previously committed,
+server now claims 791)!
+
 This situation arises when:
- You are using a disk device that claims to have data written to disk before it
- actually does, as in case of a device with a large cache. If that disk device crashes or
- loses power in a way that causes the loss of the cache, there can be a loss of
- transactions that you believe are committed. This is a very serious event, and you
- should run e2fsck against that storage before restarting the Lustre file system.
+ You are using a disk device that claims to have data written
+ to disk before it actually does, as in case of a device with a large
+ cache. If that disk device crashes or loses power in a way that
+ causes the loss of the cache, there can be a loss of transactions
+ that you believe are committed. This is a very serious event, and
+ you should run e2fsck against that storage before restarting the
+ Lustre file system.
- As required by the Lustre software, the shared storage used for failover is
- completely cache-coherent. This ensures that if one server takes over for another, it
- sees the most up-to-date and accurate copy of the data. In case of the failover of the
- server, if the shared storage does not provide cache coherency between all of its ports,
- then the Lustre software can produce an error.
+ As required by the Lustre software, the shared storage used
+ for failover is completely cache-coherent. This ensures that if one
+ server takes over for another, it sees the most up-to-date and
+ accurate copy of the data. In case of the failover of the server,
+ if the shared storage does not provide cache coherency between all
+ of its ports, then the Lustre software can produce an error.
- If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor.
- If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded, then lose the Data Device corruption or Disk Errors.
+ If you know the exact reason for the error, then it is safe to
+ proceed with no further action. If you do not know the reason, then this
+ is a serious issue and you should explore it with your disk vendor.
+ If the error occurs during failover, examine your disk cache
+ settings. If it occurs after a restart without failover, try to
+ determine how the disk can report that a write succeeded, then lose the
+ Data Device corruption or Disk Errors.
 Lustre Error: "<literal>Slow Start_Page_Write</literal>"
--
1.8.3.1