Reconsider when to wait for WAL flushes/syncrep during commit.

Up to now RecordTransactionCommit() waited for WAL to be flushed (if
synchronous_commit != off) and to be synchronously replicated (if
enabled), even if a transaction did not have an xid assigned. The primary
reason for that is that a sequence's nextval() did not assign an xid, yet
its WAL records are worth waiting for on commit.

This can be problematic because read-only transactions sometimes do
write WAL, e.g. HOT page prune records. That can then force a read-only
transaction to wait during commit, which is not something people expect
of a read-only transaction.

This led to strange symptoms such as backends seemingly being stuck
during connection establishment when all synchronous replicas are
down. It is especially annoying when the stuck connection is the standby
trying to reconnect to allow syncrep again.

This behavior is also involved in a rather complicated <= 9.4 bug where
the transaction started by catchup interrupt processing waited for
syncrep using latches, but didn't get the wakeup because it was already
running inside the same overloaded signal handler. The fix here doesn't
properly solve that issue; it merely papers over the problems. In 9.5,
catchup interrupts aren't processed in signal handlers anymore.

To fix all this, make nextval() acquire a top-level xid, and only wait
at transaction commit if a transaction both acquired an xid and emitted
WAL records. If only an xid has been assigned, we don't want to wait
uselessly just because of writes to temporary/unlogged tables; if only
WAL has been written, we don't want to wait just because of HOT prunes.
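The old and new wait rules can be compared side by side. The following is a minimal sketch, not PostgreSQL code: CommitState, old_waits() and new_waits() are hypothetical names, "wait" means taking the synchronous-commit branch (immediate XLOG flush) instead of deferring the flush, and wrote_wal means WAL written before the commit record, like wrote_xlog in the xact.c diff.

```c
#include <stdbool.h>

/* Hypothetical summary of what RecordTransactionCommit() knows at commit. */
typedef struct CommitState
{
    bool has_xid;       /* transaction was assigned an xid */
    bool wrote_wal;     /* WAL was written before the commit record */
    bool sync_commit;   /* synchronous_commit != off */
    bool force_sync;    /* a command requested a forced sync commit */
    bool have_non_temp; /* non-temp relation files are to be deleted */
} CommitState;

/* Old rule: any xid OR any earlier WAL was enough to wait. */
static bool old_waits(CommitState s)
{
    if (!s.has_xid && !s.wrote_wal)
        return false;           /* nothing to flush: the "goto cleanup" path */
    return s.sync_commit || s.force_sync || s.have_non_temp;
}

/* New rule: synchronous_commit only matters with an xid AND earlier WAL. */
static bool new_waits(CommitState s)
{
    if (!s.has_xid && !s.wrote_wal)
        return false;           /* same early exit as before */
    return (s.wrote_wal && s.has_xid && s.sync_commit)
        || s.force_sync || s.have_non_temp;
}
```

For example, a read-only transaction that only emitted a HOT prune record satisfies old_waits() but not new_waits(), while nextval() keeps waiting under the new rule because it now assigns an xid.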

The xid assignment in nextval() is unlikely to cause overhead in
real-world workloads. For one, it only happens once every SEQ_LOG_VALS
(32) values anyway; for another, only calls to nextval() whose result is
not used in an INSERT or similar are affected, since such statements
assign an xid anyway.
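To illustrate why only a fraction of nextval() calls are affected, here is a deliberately simplified model in which one WAL record pre-logs SEQ_LOG_VALS values ahead. SeqModel, model_nextval() and count_logged_calls() are hypothetical; this is not PostgreSQL's nextval_internal() logic, which also handles CACHE, increments other than 1 and is_called. In this model a fresh sequence logs on calls 1, 33, 65 and so on.

```c
#include <stdbool.h>

/* PostgreSQL's sequence.c defines SEQ_LOG_VALS as 32. */
#define SEQ_LOG_VALS 32

/* Hypothetical, simplified sequence state. */
typedef struct SeqModel
{
    long last_value;
    int  log_cnt;   /* values still covered by the last WAL record */
} SeqModel;

/*
 * Simplified nextval(): returns the next value and sets *logged when this
 * call would emit a WAL record (and therefore, after this commit, make
 * sure the transaction has an xid).
 */
static long model_nextval(SeqModel *seq, bool *logged)
{
    *logged = false;
    if (seq->log_cnt == 0)
    {
        /* One record covers the next SEQ_LOG_VALS values. */
        *logged = true;
        seq->log_cnt = SEQ_LOG_VALS;
    }
    seq->log_cnt--;
    return ++seq->last_value;
}

/* Count how many of n consecutive calls on a fresh sequence emit WAL. */
static int count_logged_calls(int n)
{
    SeqModel seq = {0, 0};
    int logged_calls = 0;

    for (int i = 0; i < n; i++)
    {
        bool logged;

        model_nextval(&seq, &logged);
        if (logged)
            logged_calls++;
    }
    return logged_calls;
}
```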

Discussion: 20150223165359.GF30784@awork2.anarazel.de,
    369698E947874884A77849D8FE3680C2@maumau,
    5CF4ABBA67674088B3941894E22A0D25@maumau

Per complaint from maumau and Thom Brown

Backpatch all the way back; 9.0 doesn't have syncrep, but it seems
better to have consistent behavior across all maintained branches.
Andres Freund 2015-02-26 12:50:08 +01:00
parent 40a9a16760
commit 82e0d6eb51
2 changed files with 51 additions and 14 deletions

src/backend/access/transam/xact.c

@@ -918,6 +918,7 @@ RecordTransactionCommit(void)
     int nmsgs = 0;
     SharedInvalidationMessage *invalMessages = NULL;
     bool RelcacheInitFileInval = false;
+    bool wrote_xlog = XactLastRecEnd.xrecoff != 0;

     /* Get data needed for commit record */
     nrels = smgrGetPendingDeletes(true, &rels, &haveNonTemp);
@@ -946,12 +947,11 @@ RecordTransactionCommit(void)

         /*
          * If we didn't create XLOG entries, we're done here; otherwise we
-         * should flush those entries the same as a commit record. (An
-         * example of a possible record that wouldn't cause an XID to be
-         * assigned is a sequence advance record due to nextval() --- we want
-         * to flush that to disk before reporting commit.)
+         * should trigger flushing those entries the same as a commit record
+         * would. This will primarily happen for HOT pruning and the like; we
+         * want these to be flushed to disk in due time.
          */
-        if (XactLastRecEnd.xrecoff == 0)
+        if (!wrote_xlog)
             goto cleanup;
     }
     else
@@ -1039,16 +1039,31 @@ RecordTransactionCommit(void)
     }

     /*
-     * Check if we want to commit asynchronously. If the user has set
-     * synchronous_commit = off, and we're not doing cleanup of any non-temp
-     * rels nor committing any command that wanted to force sync commit, then
-     * we can defer flushing XLOG. (We must not allow asynchronous commit if
-     * there are any non-temp tables to be deleted, because we might delete
-     * the files before the COMMIT record is flushed to disk. We do allow
-     * asynchronous commit if all to-be-deleted tables are temporary though,
-     * since they are lost anyway if we crash.)
+     * Check if we want to commit asynchronously. We can allow the XLOG flush
+     * to happen asynchronously if synchronous_commit=off, or if the current
+     * transaction has not performed any WAL-logged operation or didn't assign
+     * a xid. The transaction can end up not writing any WAL, even if it has
+     * a xid, if it only wrote to temporary tables. It can end up having
+     * written WAL without an xid if it did HOT pruning. In case of a crash,
+     * the loss of such a transaction will be irrelevant; temp tables will be
+     * lost anyway and HOT pruning will be done again later. (Given the
+     * foregoing, you might think that it would be unnecessary to emit the
+     * XLOG record at all in this case, but we don't currently try to do that.
+     * It would certainly cause problems at least in Hot Standby mode, where
+     * the KnownAssignedXids machinery requires tracking every XID assignment.
+     * It might be OK to skip it only when wal_level < hot_standby, but for
+     * now we don't.)
+     *
+     * However, if we're doing cleanup of any non-temp rels or committing any
+     * command that wanted to force sync commit, then we must flush XLOG
+     * immediately. (We must not allow asynchronous commit if there are any
+     * non-temp tables to be deleted, because we might delete the files before
+     * the COMMIT record is flushed to disk. We do allow asynchronous commit
+     * if all to-be-deleted tables are temporary though, since they are lost
+     * anyway if we crash.)
      */
-    if (XactSyncCommit || forceSyncCommit || haveNonTemp)
+    if ((wrote_xlog && markXidCommitted && XactSyncCommit)
+        || forceSyncCommit || haveNonTemp)
     {
         /*
          * Synchronous commit case:

src/backend/commands/sequence.c

@@ -247,6 +247,10 @@ DefineSequence(CreateSeqStmt *seq)
      */
     LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

+    /* check the comment above nextval_internal()'s equivalent call. */
+    if (!rel->rd_istemp)
+        GetTopTransactionId();
+
     START_CRIT_SECTION();

     {
@@ -363,6 +367,10 @@ AlterSequenceInternal(Oid relid, List *options)
     /* Note that we do not change the currval() state */
     elm->cached = elm->last;

+    /* check the comment above nextval_internal()'s equivalent call. */
+    if (!seqrel->rd_istemp)
+        GetTopTransactionId();
+
     /* Now okay to update the on-disk tuple */
     START_CRIT_SECTION();

@@ -596,6 +604,16 @@ nextval_internal(Oid relid)

     last_used_seq = elm;

+    /*
+     * If something needs to be WAL logged, acquire an xid, so this
+     * transaction's commit will trigger a WAL flush and wait for
+     * syncrep. It's sufficient to ensure the toplevel transaction has a xid,
+     * no need to assign xids subxacts, that'll already trigger a appropriate
+     * wait. (Have to do that here, so we're outside the critical section)
+     */
+    if (logit && !seqrel->rd_istemp)
+        GetTopTransactionId();
+
     /* ready to change the on-disk (or really, in-buffer) tuple */
     START_CRIT_SECTION();

@@ -790,6 +808,10 @@ do_setval(Oid relid, int64 next, bool iscalled)
     /* In any case, forget any future cached numbers */
     elm->cached = elm->last;

+    /* check the comment above nextval_internal()'s equivalent call. */
+    if (!seqrel->rd_istemp)
+        GetTopTransactionId();
+
     /* ready to change the on-disk (or really, in-buffer) tuple */
     START_CRIT_SECTION();
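The nextval_internal() comment requires the xid to be acquired before START_CRIT_SECTION(). The sketch below shows that ordering using hypothetical stand-ins (model_get_top_xid(), model_start_crit_section(), crit_section_depth), not the real backend functions. The assert records the rule that assignment must not happen inside a critical section, where an ERROR (for example from xid wraparound protection) would be promoted to PANIC. Because repeated calls return the same xid, a transaction calling nextval() many times gets at most one top-level xid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionIdModel;

/* Hypothetical per-backend state standing in for the real backend globals. */
static int crit_section_depth = 0;
static TransactionIdModel top_xid = 0;    /* 0 = no xid assigned yet */
static TransactionIdModel next_xid = 100; /* arbitrary starting counter */

static void model_start_crit_section(void) { crit_section_depth++; }
static void model_end_crit_section(void) { crit_section_depth--; }

/*
 * Stand-in for GetTopTransactionId(): assigns an xid on first use and
 * returns the same one afterwards. Must not run inside a critical section.
 */
static TransactionIdModel model_get_top_xid(void)
{
    if (top_xid == 0)
    {
        assert(crit_section_depth == 0);
        top_xid = next_xid++;
    }
    return top_xid;
}

/* Shape of the patched sequence code: acquire the xid, then enter the crit section. */
static void model_update_sequence(bool logit, bool is_temp)
{
    if (logit && !is_temp)
        (void) model_get_top_xid();

    model_start_crit_section();
    /* ... modify the buffer and WAL-log the sequence change ... */
    model_end_crit_section();
}
```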