Prevent references to invalid relation pages after fresh promotion

If a standby crashes after promotion but before completing its first
post-recovery checkpoint, the minimum recovery point, which marks the
LSN position where the cluster is able to reach consistency, may be set
to a position older than the first end-of-recovery checkpoint, even
though all the available WAL should be replayed.  The instance then
believes that it contains inconsistent pages, which causes a PANIC and
a hard instance crash for certain sets of replayed records, even though
not all the available WAL has been replayed yet.  During crash
recovery, minRecoveryPoint is expected to always be InvalidXLogRecPtr,
which forces recovery to replay all the available WAL.  This commit
therefore makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays unchanged while crash
recovery is performed.  Once recovery switches to archive recovery, or
once crash recovery finishes, the local copy of minRecoveryPoint can be
safely updated.

Pavan Deolasee reported and diagnosed the failure in the first place.
The base idea for the fix, relying on the local copy of
minRecoveryPoint, comes from Kyotaro Horiguchi, and I expanded it into
a full-fledged patch.  The test included in this commit was written by
Álvaro Herrera and Pavan Deolasee; I modified it to make it faster and
more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, that
is, down to 9.3, where the end-of-recovery checkpoint stopped being run
by the startup process.  The test is easily supported only down to 10,
though it has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
Michael Paquier 2018-07-05 10:47:19 +09:00
parent 2adadf018b
commit 8d68ee6f31

@@ -794,8 +794,14 @@ static XLogSource XLogReceiptSource = 0; /* XLOG_FROM_* code */
 static XLogRecPtr ReadRecPtr; /* start of last record read */
 static XLogRecPtr EndRecPtr; /* end+1 of last record read */
-static XLogRecPtr minRecoveryPoint; /* local copy of
-                                     * ControlFile->minRecoveryPoint */
+/*
+ * Local copies of equivalent fields in the control file. When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
+static XLogRecPtr minRecoveryPoint;
 static TimeLineID minRecoveryPointTLI;
 static bool updateMinRecoveryPoint = true;
@@ -2532,20 +2538,26 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
         return;
+    /*
+     * An invalid minRecoveryPoint means that we need to recover all the WAL,
+     * i.e., we're doing crash recovery. We never modify the control file's
+     * value in that case, so we can short-circuit future checks here too. The
+     * local values of minRecoveryPoint and minRecoveryPointTLI should not be
+     * updated until crash recovery finishes.
+     */
+    if (XLogRecPtrIsInvalid(minRecoveryPoint))
+    {
+        updateMinRecoveryPoint = false;
+        return;
+    }
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
     /* update local copy */
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
-    /*
-     * An invalid minRecoveryPoint means that we need to recover all the WAL,
-     * i.e., we're doing crash recovery. We never modify the control file's
-     * value in that case, so we can short-circuit future checks here too.
-     */
-    if (minRecoveryPoint == 0)
-        updateMinRecoveryPoint = false;
-    else if (force || minRecoveryPoint < lsn)
+    if (force || minRecoveryPoint < lsn)
     {
         XLogRecPtr newMinRecoveryPoint;
         TimeLineID newMinRecoveryPointTLI;
@@ -2930,7 +2942,16 @@ XLogNeedsFlush(XLogRecPtr record)
      */
     if (RecoveryInProgress())
     {
-        /* Quick exit if already known updated */
+        /*
+         * An invalid minRecoveryPoint means that we need to recover all the
+         * WAL, i.e., we're doing crash recovery. We never modify the control
+         * file's value in that case, so we can short-circuit future checks
+         * here too.
+         */
+        if (XLogRecPtrIsInvalid(minRecoveryPoint))
+            updateMinRecoveryPoint = false;
+        /* Quick exit if already known to be updated or cannot be updated */
         if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
             return false;
@@ -2944,20 +2965,8 @@ XLogNeedsFlush(XLogRecPtr record)
         minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         LWLockRelease(ControlFileLock);
-        /*
-         * An invalid minRecoveryPoint means that we need to recover all the
-         * WAL, i.e., we're doing crash recovery. We never modify the control
-         * file's value in that case, so we can short-circuit future checks
-         * here too.
-         */
-        if (minRecoveryPoint == 0)
-            updateMinRecoveryPoint = false;
         /* check again */
-        if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
-            return false;
-        else
-            return true;
+        return record > minRecoveryPoint;
     }
     /* Quick exit if already known flushed */
@@ -4099,6 +4108,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
                 minRecoveryPoint = ControlFile->minRecoveryPoint;
                 minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+                /*
+                 * The startup process can update its local copy of
+                 * minRecoveryPoint from this point.
+                 */
+                updateMinRecoveryPoint = true;
                 UpdateControlFile();
                 LWLockRelease(ControlFileLock);
@@ -6578,9 +6593,26 @@ StartupXLOG(void)
         /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();
-        /* initialize our local copy of minRecoveryPoint */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        /*
+         * Initialize our local copy of minRecoveryPoint. When doing crash
+         * recovery we want to replay up to the end of WAL. Particularly, in
+         * the case of a promoted standby minRecoveryPoint value in the
+         * control file is only updated after the first checkpoint. However,
+         * if the instance crashes before the first post-recovery checkpoint
+         * is completed then recovery will use a stale location causing the
+         * startup process to think that there are still invalid page
+         * references when checking for data consistency.
+         */
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        else
+        {
+            minRecoveryPoint = InvalidXLogRecPtr;
+            minRecoveryPointTLI = 0;
+        }
         /*
          * Reset pgstat data, because it may be invalid after recovery.
@@ -7520,6 +7552,8 @@ CheckRecoveryConsistency(void)
     if (XLogRecPtrIsInvalid(minRecoveryPoint))
         return;
+    Assert(InArchiveRecovery);
     /*
      * assume that we are called in the startup process, and hence don't need
      * a lock to read lastReplayedEndRecPtr
@@ -9582,11 +9616,16 @@ xlog_redo(XLogReaderState *record)
          * Update minRecoveryPoint to ensure that if recovery is aborted, we
          * recover back up to this point before allowing hot standby again.
          * This is important if the max_* settings are decreased, to ensure
-         * you don't run queries against the WAL preceding the change.
+         * you don't run queries against the WAL preceding the change. The
+         * local copies cannot be updated as long as crash recovery is
+         * happening and we expect all the WAL to be replayed.
          */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
-        if (minRecoveryPoint != 0 && minRecoveryPoint < lsn)
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        if (minRecoveryPoint != InvalidXLogRecPtr && minRecoveryPoint < lsn)
         {
             ControlFile->minRecoveryPoint = lsn;
             ControlFile->minRecoveryPointTLI = ThisTimeLineID;
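
For readers following the XLogNeedsFlush() hunks, the reordered checks
in the recovery branch can be modeled in isolation.  The sketch below
is an assumption-laden simplification, not the function itself:
RecoveryState and needs_flush_in_recovery are hypothetical names, the
control file value is passed in as a plain argument, and all locking is
omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types (assumption). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical model of the backend-local recovery state. */
typedef struct
{
    XLogRecPtr  local_min;   /* local copy of minRecoveryPoint */
    bool        can_update;  /* models updateMinRecoveryPoint */
} RecoveryState;

/*
 * Models the recovery branch after the patch: the invalid-pointer
 * check now happens before the quick exit, so the local copy is never
 * refreshed from the control file while crash recovery is running.
 */
static bool
needs_flush_in_recovery(RecoveryState *st, XLogRecPtr control_file_min,
                        XLogRecPtr record)
{
    /* Crash recovery: disable updates before anything else. */
    if (st->local_min == InvalidXLogRecPtr)
        st->can_update = false;

    /* Quick exit if already known to be updated or cannot be updated. */
    if (record <= st->local_min || !st->can_update)
        return false;

    /* Refresh the local copy (the real code does this under a lock). */
    st->local_min = control_file_min;

    /* Check again against the refreshed value. */
    return record > st->local_min;
}
```

With local_min left invalid, the function returns false for any record
and never reads the control file value, which matches the intent stated
in the commit message.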