Otherwise, the compiler might decide to move modifications to data
within this structure outside the enclosing SpinLockAcquire /
SpinLockRelease pair, leading to shared memory corruption.
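A minimal illustration of the coding rule (the struct and field names below
are made up for the example, not the committed change): access the shared
data through a volatile-qualified pointer so the compiler cannot move the
protected stores outside the lock section.

    volatile SharedCounters *vshared = shared;  /* hypothetical shared struct */

    SpinLockAcquire(&vshared->mutex);
    vshared->numDirtied++;      /* stays between acquire and release */
    SpinLockRelease(&vshared->mutex);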
This may or may not explain a recent lmgr-related buildfarm failure
on prairiedog, but it needs to be fixed either way.
Previously, such buffers weren't counted, with the possible result that
EXPLAIN (BUFFERS) and pg_stat_statements would understate the true
number of blocks dirtied by an SQL statement.
Back-patch to 9.2, where this counter was introduced.
Amit Kapila
It is possible for a view or materialized view to depend on a table's
primary key, if the view query relies on functional dependency to
abbreviate a GROUP BY list. This is problematic for pg_dump since we
ordinarily want to dump view definitions in the pre-data section but
indexes in post-data. pg_dump knows how to deal with this situation for
regular views, by breaking the view's ON SELECT rule apart from the view
proper. But it had not been taught what to do about materialized views,
and in fact mistakenly dumped them as regular views in such cases, as
seen in bug #9616 from Jesse Denardo.
If we had CREATE OR REPLACE MATERIALIZED VIEW, we could fix this in a
manner analogous to what's done for regular views; but we don't yet,
and we'd not back-patch such a thing into 9.3 anyway. As a hopefully-
temporary workaround, break the circularity by postponing the matview
into post-data altogether when this case occurs.
Any OS user able to access the socket can connect as the bootstrap
superuser and in turn execute arbitrary code as the OS user running the
test. Protect against that by placing the socket in the temporary data
directory, which has mode 0700 thanks to initdb. Back-patch to 8.4 (all
supported versions). The hazard remains wherever the temporary cluster
accepts TCP connections, notably on Windows.
Attempts to run "make check" from a directory with a long name will now
fail. An alternative not sharing that problem was to place the socket
in a subdirectory of /tmp, but that is only secure if /tmp is sticky.
The PG_REGRESS_SOCK_DIR environment variable is available as a
workaround when testing from long directory paths.
As a convenient side effect, this lets testing proceed smoothly in
builds that override DEFAULT_PGSOCKET_DIR. Popular non-default values
like /var/run/postgresql are often unwritable to the build user.
Security: CVE-2014-0067
This has been true for some time, but we were leaving users to discover it
the hard way.
Back-patch to 9.2. It might've been true even earlier, but we were claiming
Python 2.2 compatibility before that, so I won't guess at the exact
requirements back then.
We must increment the refcount on "plntup" as soon as we have the
reference, not sometime later. Otherwise, if an error is thrown in
between, the Py_XDECREF(plntup) call in the PG_CATCH block removes a
refcount we didn't add, allowing the object to be freed even though
it's still part of the plpython function's parsetree.
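A minimal sketch of the corrected ordering (simplified, not the exact
plpython source; pltd stands for the TD dictionary): PyDict_GetItemString
returns a borrowed reference, so we must claim our own reference before
anything that can throw.

    plntup = PyDict_GetItemString(pltd, "new");
    if (plntup == NULL)
        elog(ERROR, "TD[\"new\"] deleted, cannot modify row");
    Py_INCREF(plntup);          /* take the refcount immediately */
    if (!PyDict_Check(plntup))
        elog(ERROR, "TD[\"new\"] is not a dictionary");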
This appears to be the cause of crashes seen on buildfarm member
prairiedog. It's a bit surprising that we've not seen it fail repeatably
before, considering that the regression tests have been exercising the
faulty code path since 2009.
The real-world impact is probably minimal, since it's unlikely anyone would
be provoking the "TD["new"] is not a dictionary" error in production, and
that's the only case that is actually wrong. Still, it's a bug affecting
the regression tests, so patch all supported branches.
In passing, remove dead variable "plstr", and demote "platt" to a local
variable inside the PG_TRY block, since we don't need to clean it up
in the PG_CATCH path.
equalTupleDescs() neglected both of these ConstrCheck fields (ccvalid and
ccnoinherit), and CreateTupleDescCopyConstr() neglected ccnoinherit.  At this
time, the
only known behavior defect resulting from these omissions is constraint
exclusion disregarding a CHECK constraint validated by an ALTER TABLE
VALIDATE CONSTRAINT statement issued earlier in the same transaction.
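A minimal sketch of the comparison that equalTupleDescs() was missing
(simplified, not the committed diff):

    if (check1->ccvalid != check2->ccvalid ||
        check1->ccnoinherit != check2->ccnoinherit)
        return false;           /* constraint entries differ */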
Back-patch to 9.2, where these fields were introduced.
Clear errno before calling readdir(), and handle an old MinGW errno bug,
while adding full test coverage for readdir/closedir failures.
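A minimal sketch of the idiom (not the committed diff; dir is an open DIR *):
readdir() returns NULL both at end-of-directory and on error, so errno is the
only way to tell the two cases apart.

    struct dirent *de;

    errno = 0;
    while ((de = readdir(dir)) != NULL)
    {
        /* ... process de->d_name ... */
        errno = 0;              /* clear again before the next readdir() */
    }
    if (errno != 0)
        elog(ERROR, "could not read directory: %m");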
Backpatch through 8.4.
For a regex containing backrefs, pg_regexec() might fail to free all the
sub-DFAs that were created during execution, resulting in a permanent
(session lifespan) memory leak. Problem was introduced by me in commit
587359479a. Per report from Sandro Santilli;
diagnosis by Greg Stark.
The recently-fixed bug in WAL replay could result in not finding a parent
tuple for a heap-only tuple. The existing code would either Assert or
generate an invalid index entry, neither of which is desirable. Throw a
regular error instead.
On clean shutdown, walsender waits for all WAL to be replicated to a standby,
and exits. It determined whether that replication had completed by checking
whether its sent location was equal to the standby's flush location.
Unfortunately this condition never becomes true when the connected standby,
such as pg_receivexlog, always reports an invalid flush location; in that
case walsender waits forever.
This commit changes walsender so that it checks the standby's write location
instead when the flush location is invalid.
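A minimal sketch of that fallback (variable names are illustrative, not the
committed diff):

    replicatedPtr = XLogRecPtrIsInvalid(flushPtr) ? writePtr : flushPtr;

    if (!pq_is_send_pending() && sentPtr == replicatedPtr)
        done = true;        /* everything the standby will get has been sent */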
Back-patch to 9.1 where enough infrastructure for this exists.
I discovered the hard way that on some old shells, the locution
FOO="" unset FOO
does not behave the same as
FOO=""; unset FOO
and in fact leaves FOO set to an empty string. test.sh was inconsistently
spelling it different ways on adjacent lines.
This got broken relatively recently, in commit c737a2e56, so the lack of
field reports to date doesn't represent a lot of evidence that the problem
is rare.
"8" was correct back when "disable" was the longest allowed value, but
since "verify-full" was added, it should be "12". Given the lack of
complaints, I wouldn't be surprised if nobody is actually using these
values ... but still, if they're in the API, they should be right.
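For reference, assuming the advertised size counts the terminating NUL (which
matches "disable" -> 8), the arithmetic is easy to check in a standalone
program:

    #include <stdio.h>

    int main(void)
    {
        /* sizeof a string literal includes the terminating NUL byte */
        printf("disable     -> %zu\n", sizeof("disable"));      /* 8 */
        printf("verify-full -> %zu\n", sizeof("verify-full"));  /* 12 */
        return 0;
    }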
Noticed while pursuing a different problem. It's been wrong for quite
a long time, so back-patch to all supported branches.
This should eliminate the risk of recursive entry to syslog(3), which
appears to be the cause of the hang reported in bug #9551 from James
Morton.
Arguably, the real problem here is auth.c's willingness to turn on
ImmediateInterruptOK while executing fairly wide swaths of backend code.
We may well need to work at narrowing the code ranges in which the
authentication_timeout interrupt is enabled. For the moment, though,
this is a cheap and reasonably noninvasive fix for a field-reported
failure; the other approach would be complex and not necessarily
bug-free itself.
Back-patch to all supported branches.
Use TransactionIdIsInProgress, then TransactionIdDidCommit, to distinguish
whether a NOTIFY message's originating transaction is in progress,
committed, or aborted. The previous coding could accept a message from a
transaction that was still in-progress according to the PGPROC array;
if the client were fast enough at starting a new transaction, it might fail
to see table rows added/updated by the message-sending transaction. Which
of course would usually be the point of receiving the message. We noted
this type of race condition long ago in tqual.c, but async.c overlooked it.
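A minimal sketch of the check order (simplified; not the exact async.c code):

    if (TransactionIdIsCurrentTransactionId(xid))
        return true;            /* our own transaction: OK to process */
    else if (TransactionIdIsInProgress(xid))
        return false;           /* sender still committing: not safe yet */
    else if (TransactionIdDidCommit(xid))
        return true;            /* safe to deliver the notification */
    else
        return false;           /* sender aborted or crashed: discard it */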
The race condition probably cannot occur unless there are multiple NOTIFY
senders in action, since an individual backend doesn't send NOTIFY signals
until well after it's done committing. But if two senders commit in close
succession, it's certainly possible that we could see the second sender's
message within the race condition window while responding to the signal
from the first one.
Per bug #9557 from Marko Tiikkaja. This patch is slightly more invasive
than what he proposed, since it removes the now-redundant
TransactionIdDidAbort call.
Back-patch to 9.0, where the current NOTIFY implementation was introduced.
We don't take a full-page image of the GIN metapage; instead, the WAL record
contains all the information required to reconstruct it from scratch. But
to avoid torn page hazards, we must re-initialize it from the WAL record
every time, even if it already has a greater LSN, similar to how normal full
page images are restored.
This was highly unlikely to cause any problems in practice, because the GIN
metapage is small. We rely on an update smaller than a 512 byte disk sector
to be atomic elsewhere, at least in pg_control. But better safe than sorry,
and this would be easy to overlook if more fields are added to the metapage
so that it's no longer small.
Reported by Noah Misch. Backpatch to all supported versions.
Commit 08146775ac changed do_copy() to
temporarily scribble on pset.cur_cmd_source. That was a mighty ugly bit of
code in any case, but in particular it broke handleCopyIn's ability to tell
whether it was reading from the current script source file (in which case
pset.lineno should be incremented for each line of COPY data), or from
someplace else (in which case it shouldn't). The former case still worked,
the latter not so much. The visible effect was that line numbers reported
for errors in a script file would be wrong if there were an earlier \copy
that was reading anything other than inline-in-the-script-file data.
To fix, introduce another pset field that holds the file do_copy wants the
COPY code to use. This is a little bit ugly, but less so than passing the
file down explicitly through several layers that aren't COPY-specific.
Extracted from a larger patch by Kumar Rajeev Rastogi; that patch also
changes printing of COPY command tags, which is not a bug fix and shouldn't
get back-patched. This particular idea was from a suggestion by Amit
Khandekar, if I'm reading the thread correctly.
Back-patch to 9.2 where the faulty code was introduced.
The previous coding supposed that it could consider just a single join
condition in any one parameterized path for the foreign table. But in
reality, the parameterized-path machinery forces all join clauses that are
"movable to" the foreign table to be evaluated at that node; including
clauses that we might not consider safe to send across. Such cases would
result in an Assert failure in an assert-enabled build, and otherwise in
sending an unsafe clause to the foreign server, which might result in
errors or silently-wrong answers. A lesser problem was that the
cost/rowcount estimates generated for the parameterized path failed to
account for any additional join quals that get assigned to the scan.
To fix, rewrite postgresGetForeignPaths so that it correctly collects all
the movable quals for any one outer relation when generating parameterized
paths; we'll now generate just one path per outer relation, not one per join
qual. Also fix bogus assumptions in postgresGetForeignPlan and
estimate_path_cost_size that only safe-to-send join quals will be
presented.
Based on complaint from Etsuro Fujita that the path costs were being
miscalculated, though this is significantly different from his proposed
patch.
A fake relcache entry can "own" a SmgrRelation object, like a regular
relcache entry. But when it was free'd, the owner field in SmgrRelation
was not cleared, so it was left pointing to free'd memory.
Amazingly this apparently hasn't caused crashes in practice, or we would've
heard about it earlier. Andres found this with Valgrind.
Report and fix by Andres Freund, with minor modifications by me. Backpatch
to all supported versions.
In make_ruledef and get_query_def, we have long used AcquireRewriteLocks
to ensure that the querytree we are about to deparse is up-to-date and
the schemas of the underlying relations aren't changing. However, that
function thinks the query is about to be executed, so it acquires locks
that are stronger than necessary for the purpose of deparsing. Thus for
example, if pg_dump asks to deparse a rule that includes "INSERT INTO t",
we'd acquire RowExclusiveLock on t. That results in interference with
concurrent transactions that might for example ask for ShareLock on t.
Since pg_dump is documented as being purely read-only, this is unexpected.
(Worse, it used to actually be read-only; this behavior dates back only
to 8.1, cf commit ba4200246.)
Fix this by adding a parameter to AcquireRewriteLocks to tell it whether
we want the "real" execution locks or only AccessShareLock.
Report, diagnosis, and patch by Dean Rasheed. Back-patch to all supported
branches.
If walsender doesn't hear from the client for the time specified by
wal_sender_timeout, it will conclude the connection or client is dead, and
disconnect. When half of wal_sender_timeout has elapsed, it sends a ping
to the client, leaving it the remaining half of wal_sender_timeout to
respond. However, it only checked if half of wal_sender_timeout had elapsed
when it was about to sleep, so if it was busy sending WAL to the client for
long enough, it would not send the ping request in time. Then the client
would not know it needs to send a reply, and the walsender will disconnect
even though the client is still alive. Fix that.
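A minimal sketch of the timing rule (variable names here are illustrative,
not the committed diff): check the deadline inside the send loop as well,
not only before sleeping.

    TimestampTz ping_time;

    ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                            wal_sender_timeout / 2);
    if (wal_sender_timeout > 0 && !waiting_for_ping_response &&
        GetCurrentTimestamp() >= ping_time)
    {
        WalSndKeepalive(true);              /* true = request a reply */
        waiting_for_ping_response = true;
    }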
Andres Freund, reviewed by Robert Haas, and some further changes by me.
Backpatch to 9.3. Earlier versions relied on the client to send the
keepalives on its own, and hence didn't have this problem.
We should allow this so that matviews can be referenced in UPDATE/DELETE
statements in READ COMMITTED isolation level. The requirement for that
is that a re-fetch by TID will see the same row version the query saw
earlier, which is true of matviews, so there's no reason for the
restriction. Per bug #9398.
Michael Paquier, after a suggestion by me
We were unlinking the permanent file, not the non-permanent one. But
since the stat collector already unlinks all permanent files on startup,
there was nothing for it to unlink. The non-permanent file remained in
place, and was copied to the permanent directory on shutdown, so in
effect no file was ever dropped.
Backpatch to 9.3, where the issue was introduced by commit 187492b6c2.
Before that, there were no per-database files and thus no file to drop
on DROP DATABASE.
Per report from Thom Brown.
Author: Tomáš Vondra
CheckRequiredParameterValues() should perform the checks if archive recovery
was requested, even if we are going to perform crash recovery first.
Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive
recovery mode.
When entering crash recovery followed by archive recovery, and the latest
checkpoint is a shutdown checkpoint, and there are no more WAL records to
replay before transitioning from crash to archive recovery, we would not
immediately allow read-only connections in hot standby mode even if we
could. That's because when starting from a shutdown checkpoint, we set
lastReplayedEndRecPtr incorrectly to the record before the checkpoint
record, instead of the checkpoint record itself. We don't run the redo
routine of the shutdown checkpoint record, but starting recovery from it
goes through the same motions, so it should be considered as replayed.
Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected,
so backpatch to 9.0.
I changed the loop in 9.3 to use "goto send_failure" instead of "break" on
errors, but I missed this one case. It was a relatively harmless bug: if
the flush fails once it will most likely fail again as soon as we try to
flush the output again. But it's a bug nevertheless.
Report and fix by Andres Freund.
The regex code didn't have any provision for query cancel; which is
unsurprising given its non-Postgres origin, but still problematic since
some operations can take a long time. Introduce a callback function to
check for a pending query cancel or session termination request, and
call it in a couple of strategic spots where we can make the regex code
exit with an error indicator.
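A minimal sketch of such a callback (the function name is illustrative; the
flags are the backend's usual interrupt flags):

    static int
    rcancelrequested(void)
    {
        return InterruptPending && (QueryCancelPending || ProcDiePending);
    }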
If we ever actually split out the regex code as a standalone library,
some additional work will be needed to let the cancel callback function
be specified externally to the library. But that's straightforward
(certainly so by comparison to putting the locale-dependent character
classification logic on a similar arms-length basis), and there seems
no need to do it right now.
A bigger issue is that there may be more places than these two where
we need to check for cancels. We can always add more checks later,
now that the infrastructure is in place.
Since there are known examples of not-terribly-long regexes that can
lock up a backend for a long time, back-patch to all supported branches.
I have hopes of fixing the known performance problems later, but adding
query cancel ability seems like a good idea even if they were all fixed.
Commit abf5c5c9a4 added a bogus while-
statement after the for(;;)-loop. It went unnoticed in testing, because
it was dead code.
Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced
this was also applied to 9.2, but not the bogus while-loop part, because
the code in 9.2 looks quite different.
We were resetting the tuple's HEAP_HOT_UPDATED flag as well as t_ctid on
WAL replay of a tuple-lock operation, which is incorrect when the tuple
is already updated.
Back-patch to 9.3. The clearing of both header elements was there
previously, but since no update could be present on a tuple that was
being locked, it was harmless.
Bug reported by Peter Geoghegan and Greg Stark in
CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com and
CAM-w4HPTOeMT4KP0OJK+mGgzgcTOtLRTvFZyvD0O4aH-7dxo3Q@mail.gmail.com
respectively; diagnosis by Andres Freund.
If there are lots of uncommitted tuples at the end of the index range,
get_actual_variable_range() ends up fetching each one and doing an MVCC
visibility check on it, until it finally hits a visible tuple. This is
bad enough in isolation, considering that we don't need an exact answer,
only an approximate one. But because the tuples are not yet committed,
each visibility check does a TransactionIdIsInProgress() test, which
involves scanning the ProcArray. When multiple sessions do this
concurrently, the ensuing contention results in horrid performance loss.
20X overall throughput loss on not-too-complicated queries is easy to
demonstrate in the back branches (though someone's made it noticeably
less bad in HEAD).
We can dodge the problem fairly effectively by using SnapshotDirty rather
than a normal MVCC snapshot. This will cause the index probe to take
uncommitted tuples as good, so that we incur only one tuple fetch and test
even if there are many such tuples. The extent to which this degrades the
estimate is debatable: it's possible the result is actually a more accurate
prediction than before, if the endmost tuple has become committed by the
time we actually execute the query being planned. In any case, it's not
very likely that it makes the estimate a lot worse.
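A minimal sketch of the approach (relation and scan variable names are
assumptions, not the committed diff):

    SnapshotData SnapshotDirty;
    IndexScanDesc index_scan;

    InitDirtySnapshot(SnapshotDirty);
    index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, 1, 0);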
SnapshotDirty will still reject tuples that are known committed dead, so
we won't give bogus answers if an invalid outlier has been deleted but not
yet vacuumed from the index. (Because btrees know how to mark such tuples
dead in the index, we shouldn't have a big performance problem in the case
that there are many of them at the end of the range.) This consideration
motivates not using SnapshotAny, which was also considered as a fix.
Note: the back branches were using SnapshotNow instead of an MVCC snapshot,
but the problem and solution are the same.
Per performance complaints from Bartlomiej Romanski, Josh Berkus, and
others. Back-patch to 9.0, where the issue was introduced (by commit
40608e7f94).
Most estimation functions apply estimate_expression_value to see if they
can reduce an expression to a constant; the key difference is that it
allows evaluation of stable as well as immutable functions in hopes of
ending up with a simple Const node. scalararraysel didn't get the memo
though, and neither did gincost_opexpr/gincost_scalararrayopexpr. Fix
that, and remove a now-unnecessary estimate_expression_value step in the
subsidiary function scalararraysel_containment.
Per complaint from Alexey Klyukin. Back-patch to 9.3. The problem
goes back further, but I'm hesitant to change estimation behavior in
long-stable release branches.