When LockBufferForCleanup() has to wait for getting a cleanup lock on a
buffer it does so by setting a flag in the buffer header and then wait
for other backends to signal it using ProcWaitForSignal().
Unfortunately LockBufferForCleanup() missed that ProcWaitForSignal() can
return for other reasons than the signal it is hoping for. If such a
spurious signal arrives the wait flags on the buffer header will still
be set. That then triggers "ERROR: multiple backends attempting to wait
for pincount 1".
The fix is simple, unset the flag if still set when retrying. That
implies an additional spinlock acquisition/release, but that's unlikely
to matter given the cost of waiting for a cleanup lock. Alternatively
it'd have been possible to move responsibility for maintaining the
relevant flag to the waiter all together, but that might have had
negative consequences due to possible floods of signals. Besides being
more invasive.
This looks to be a very longstanding bug. The relevant code in
LockBufferForCleanup() hasn't changed materially since its introduction
and ProcWaitForSignal() was documented to return for unrelated reasons
since 8.2. The master only patch series removing ImmediateInterruptOK
made it much easier to hit though, as ProcSendSignal/ProcWaitForSignal
now uses a latch shared with other tasks.
Per discussion with Kevin Grittner, Tom Lane and me.
Backpatch to all supported branches.
Discussion: 11553.1423805224@sss.pgh.pa.us
Previously when the standby server failed to retrieve WAL files from any sources
(i.e., streaming replication, local pg_xlog directory or WAL archive), it always
waited for five seconds (hard-coded) before the next attempt. For example,
this is problematic in warm-standby because restore_command can fail
every five seconds even while new WAL file is expected to be unavailable for
a long time and flood the log files with its error messages.
This commit adds new parameter, wal_retrieve_retry_interval, to control that
wait time.
Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.
If libpq output buffer is full, pqSendSome() function tries to drain any
incoming data. This avoids deadlock, if the server e.g. sends a lot of
NOTICE messages, and blocks until we read them. However, pqSendSome() only
did that in blocking mode. In non-blocking mode, the deadlock could still
happen.
To fix, take a two-pronged approach:
1. Change the documentation to instruct that when PQflush() returns 1, you
should wait for both read- and write-ready, and call PQconsumeInput() if it
becomes read-ready. That fixes the deadlock, but applications are not going
to change overnight.
2. In pqSendSome(), drain the input buffer before returning 1. This
alleviates the problem for applications that only wait for write-ready. In
particular, a slow but steady stream of NOTICE messages during COPY FROM
STDIN will no longer cause a deadlock. The risk remains that the server
attempts to send a large burst of data and fills its output buffer, and at
the same time the client also sends enough data to fill its output buffer.
The application will deadlock if it goes to sleep, waiting for the socket
to become write-ready, before the server's data arrives. In practice,
NOTICE messages and such that the server might be sending are usually
short, so it's highly unlikely that the server would fill its output buffer
so quickly.
Backpatch to all supported versions.
We did not need a location tag on NullTest or BooleanTest before, because
no error messages referred directly to their locations. That's planned
to change though, so add these fields in a separate housekeeping commit.
Catversion bump because stored rules may change.
transformExpr() has for many years had provisions to do nothing when
applied to an already-transformed expression tree. However, this was
always ugly and of dubious reliability, so we'd be much better off without
it. The primary historical reason for it was that gram.y sometimes
returned multiple links to the same subexpression, which is no longer true
as of my BETWEEN fixes. We'd also grown some lazy hacks in CREATE TABLE
LIKE (failing to distinguish between raw and already-transformed index
specifications) and one or two other places.
This patch removes the need for and support for re-transforming already
transformed expressions. The index case is dealt with by adding a flag
to struct IndexStmt to indicate that it's already been transformed;
which has some benefit anyway in that tablecmds.c can now Assert that
transformation has happened rather than just assuming. The other main
reason was some rather sloppy code for array type coercion, which can
be fixed (and its performance improved too) by refactoring.
I did leave transformJoinUsingClause() still constructing expressions
containing untransformed operator nodes being applied to Vars, so that
transformExpr() still has to allow Var inputs. But that's a much narrower,
and safer, special case than before, since Vars will never appear in a raw
parse tree, and they don't have any substructure to worry about.
In passing fix some oversights in the patch that added CREATE INDEX
IF NOT EXISTS (missing processing of IndexStmt.if_not_exists). These
appear relatively harmless, but still sloppy coding practice.
Previously, gram.y itself converted BETWEEN into AND (or AND/OR) nests of
expression comparisons. This was always as bogus as could be, but fixing
it hasn't risen to the top of the to-do list. The present patch invents an
A_Expr representation for BETWEEN expressions, and does the expansion to
comparison trees in parse_expr.c which is at least a slightly saner place
to be doing semantic conversions. There should be no change in the post-
parse-analysis results.
This does nothing for the semantic issues with BETWEEN (dubious connection
to btree-opclass semantics, and multiple evaluation of possibly volatile
subexpressions) ... but it's a necessary preliminary step before we could
fix any of that. The main immediate benefit is that preserving BETWEEN as
an identifiable raw-parse-tree construct will enable better error messages.
While at it, fix the code so that multiply-referenced subexpressions are
physically duplicated before being passed through transformExpr(). This
gets rid of one of the principal reasons why transformExpr() has
historically had to allow already-processed input.
Everywhere else in the file, "context" is of type MemoryContext and
"set" is of type AllocSet. AllocSetContextCreate uses a variable of
type AllocSet, so rename it from "context" to "set".
Previously, each new array created a new memory context that started
out at 8kB. This is incredibly wasteful when there are lots of small
groups of just a few elements each.
Change initArrayResult() and friends to accept a "subcontext" argument
to indicate whether the caller wants the ArrayBuildState allocated in
a new subcontext or not. If not, it can no longer be released
separately from the rest of the memory context.
Fixes bug report by Frank van Vugt on 2013-10-19.
Tomas Vondra. Reviewed by Ali Akbar, Tom Lane, and me.
In a manual pass over the catalog declaration I found a number of
columns which the boostrap automatism didn't mark NOT NULL even though
they actually were. Add BKI_FORCE_NOT_NULL markings to them.
It's usually not critical if a system table column is falsely determined
to be nullable as the code should always catch relevant cases. But it's
good to have a extra layer in place.
Discussion: 20150215170014.GE15326@awork2.anarazel.de
Bootstrap determines whether a column is null based on simple builtin
rules. Those work surprisingly well, but nonetheless a few existing
columns aren't set correctly. Additionally there is at least one patch
sent to hackers where forcing the nullness of a column would be helpful.
The boostrap format has gained FORCE [NOT] NULL for this, which will be
emitted by genbki.pl when BKI_FORCE_(NOT_)?NULL is specified for a
column in a catalog header.
This patch doesn't change the marking of any existing columns.
Discussion: 20150215170014.GE15326@awork2.anarazel.de
This requires changing quite a few places that were depending on
sizeof(HeapTupleHeaderData), but it seems for the best.
Michael Paquier, some adjustments by me
After finding an "=" character, the pointer was advanced twice when it
should only advance once. This is harmless as long as the value after "="
has at least one character; but if it doesn't, we'd miss the terminator
character and include too much in the value.
In principle this could lead to reading off the end of memory. It does not
seem worth treating as a security issue though, because it would happen on
client side, and besides client logic that's taking conninfo strings from
untrusted sources has much worse security problems than this.
Report and patch received off-list from Thomas Fanghaenel.
Back-patch to 9.2 where the faulty code was introduced.
Commit 8001fe67a3 introduced this
requirement, but per discussion, we want to avoid requirements of
this type to make things easier on the calling code. An especially
important consideration is that this may be used in frontend code,
not just the backend.
Asif Naeem, reviewed by Michael Paquier
clang complains about this, not unreasonably, so define another struct
that's explicitly for a WordEntryPos with exactly one element.
While at it, get rid of pretty dubious use of a static variable for
more than one purpose --- if it were being treated as const maybe
I'd be okay with this, but it isn't.
This works by keeping a per-subtransaction record of the ins/upd/del
counters before the truncate, and then resetting them; this record is
useful to return to the previous state in case the truncate is rolled
back, either in a subtransaction or whole transaction. The state is
propagated upwards as subtransactions commit.
When the per-table data is sent to the stats collector, a flag indicates
to reset the live/dead counters to zero as well.
Catalog version bumped due to the change in pgstat format.
Author: Alexander Shulgin
Discussion: 1007.1207238291@sss.pgh.pa.us
Discussion: 548F7D38.2000401@BlueTreble.com
Reviewed-by: Álvaro Herrera, Jim Nasby
This gives a stronger guarantee than a mere comment against accessing these
fields as simple struct members. Since rolpassword is in fact varlena,
it's not clear why these didn't get marked from the beginning, but let's
do it now.
Michael Paquier
Replace some bogus "x[1]" declarations with "x[FLEXIBLE_ARRAY_MEMBER]".
Aside from being more self-documenting, this should help prevent bogus
warnings from static code analyzers and perhaps compiler misoptimizations.
This patch is just a down payment on eliminating the whole problem, but
it gets rid of a lot of easy-to-fix cases.
Note that the main problem with doing this is that one must no longer rely
on computing sizeof(the containing struct), since the result would be
compiler-dependent. Instead use offsetof(struct, lastfield). Autoconf
also warns against spelling that offsetof(struct, lastfield[0]).
Michael Paquier, review and additional fixes by me.
Per discussion, this could be useful for purposes such as programmatically
detecting a nonresponding stats collector. We already have the timestamp
anyway, it's just a matter of providing a SQL-accessible function to fetch
it.
Matt Kelly, reviewed by Jim Nasby
While working on documentation for expanded arrays, I noticed a number of
details in the TOAST-related documentation that were already inaccurate or
obsolete. This should be fixed independently of whether expanded arrays
get in or not. One issue is that the already existing indirect-pointer
facility was not documented at all. Also, the documentation says that you
only need to use VARSIZE/SET_VARSIZE if you've made your variable-length
type TOAST-aware, but actually we've forced that business on all varlena
types even if they've opted out of TOAST by setting storage = plain.
Wordsmith a few other things too, like an amusingly archaic claim that
there are few 64-bit machines.
I thought about back-patching this, but since all this doco is oriented
to hackers and C-coded extension authors, fixing it in HEAD is probably
good enough.
There wasn't any good reason for a single C function to implement both
these SQL functions: it saved very little code overall, and it required
significant pushups to re-determine at runtime which case applied. Redoing
it as two functions ends up with just slightly more lines of code, but it's
simpler to understand, and faster too because we need not repeat syscache
lookups on every call.
An important side benefit is that this eliminates the only case in which
different aliases of the same C function had both anyarray and anyelement
arguments at the same position, which would almost always be a mistake.
The opr_sanity regression test will now notice such mistakes since there's
no longer a valid case where it happens.
Code like
open(P, "cl /? 2>&1 |") || die "cl command not found";
does not actually catch any errors, because the exit status of the
command before the pipe is ignored. The fix is to look at $?.
This also gave the opportunity to clean up the logic of this code a bit.
The original representation uses "opcname for amname", which is good
enough; but if we replace "for" with "using", we can apply the returned
identity directly in a DROP command, as in
DROP OPERATOR CLASS opcname USING amname
This slightly simplifies code using object identities to programatically
execute commands on these kinds of objects.
Note backwards-incompatible change:
The previous representation dates back to 9.3 when object identities
were introduced by commit f8348ea3, but we don't want to change the
behavior on released branches unnecessarily and so this is not
backpatched.
Somebody apparently threw darts at the code to decide where to insert
these. They certainly didn't proceed by adding them where other similar
SETs were handled. This at least broke pg_restore, and perhaps other
use-cases too.
cfopen() and cfopen_write() failed to pass the compression level through
to zlib, so that you always got the default compression level if you got
any at all.
In passing, also fix these and related functions so that the correct errno
is reliably returned on failure; the original coding supposes that free()
cannot change errno, which is untrue on at least some platforms.
Per bug #12779 from Christoph Berg. Back-patch to 9.1 where the faulty
code was introduced.
Michael Paquier
The previous coding in EXPLAIN always labeled a ModifyTable node with the
name of the target table affected by its first child plan. When originally
written, this was necessarily the parent table of the inheritance tree,
so everything was unconfusing. But when we added NO INHERIT constraints,
it became possible for the parent table to be deleted from the plan by
constraint exclusion while still leaving child tables present. This led to
the ModifyTable plan node being labeled with the first surviving child,
which was deemed confusing. Fix it by retaining the parent table's RT
index in a new field in ModifyTable.
Etsuro Fujita, reviewed by Ashutosh Bapat and myself
After removal, the next_sibling pointer of a node was sometimes incorrectly
left to point to another node in the heap, which meant that a node was
sometimes linked twice into the heap. Surprisingly that didn't cause any
crashes in my testing, but it was clearly wrong and could easily segfault
in other scenarios.
Also always keep the prev_or_parent pointer as NULL on the root node. That
was not a correctness issue AFAICS, but let's be tidy.
Add a debugging function, to dump the contents of a pairing heap as a
string. It's #ifdef'd out, as it's not used for anything in any normal
code, but it was highly useful in debugging this. Let's keep it handy for
further reference.
The part of the comparison function that was supposed to keep heap tuples
ahead of index items was backwards. It would not lead to incorrect results,
but it is more efficient to return heap tuples first, before scanning more
index pages, when both have the same distance.
Alexander Korotkov
In investigating yesterday's crash report from Hugo Osvaldo Barrera, I only
looked back as far as commit f3aec2c7f5 where the breakage occurred
(which is why I thought the IPv4-in-IPv6 business was undocumented). But
actually the logic dates back to commit 3c9bb8886d and was simply
broken by erroneous refactoring in the later commit. A bit of archives
excavation shows that we added the whole business in response to a report
that some 2003-era Linux kernels would report IPv4 connections as having
IPv4-in-IPv6 addresses. The fact that we've had no complaints since 9.0
seems to be sufficient confirmation that no modern kernels do that, so
let's just rip it all out rather than trying to fix it.
Do this in the back branches too, thus essentially deciding that our
effective behavior since 9.0 is correct. If there are any platforms on
which the kernel reports IPv4-in-IPv6 addresses as such, yesterday's fix
would have made for a subtle and potentially security-sensitive change in
the effective meaning of IPv4 pg_hba.conf entries, which does not seem like
a good thing to do in minor releases. So let's let the post-9.0 behavior
stand, and change the documentation to match it.
In passing, I failed to resist the temptation to wordsmith the description
of pg_hba.conf IPv4 and IPv6 address entries a bit. A lot of this text
hasn't been touched since we were IPv4-only.
Avoid losing errno if readdir() fails and closedir() works. Consistently
return 4 rather than 3 if both a lost+found directory and other files are
found, rather than returning one value or the other depending on the
order of the directory listing. Update comments to match the actual
behavior.
These oversights date to commits 6f03927fce
and 17f1523932.
Marco Nenciarini
The previous coding copied garbage into a local variable, pretty much
ensuring that the intended test of an IPv6 connection address against a
promoted IPv4 address from pg_hba.conf would never match. The lack of
field complaints likely indicates that nobody realized this was supposed
to work, which is unsurprising considering that no user-facing docs suggest
it should work.
In principle this could have led to a SIGSEGV due to reading off the end of
memory, but since the source address would have pointed to somewhere in the
function's stack frame, that's quite unlikely. What led to discovery of
the bug is Hugo Osvaldo Barrera's report of a crash after an OS upgrade,
which is probably because he is now running a system in which memcpy raises
abort() upon detecting overlapping source and destination areas. (You'd
have to additionally suppose some things about the stack frame layout to
arrive at this conclusion, but it seems plausible.)
This has been broken since the code was added, in commit f3aec2c7f5,
so back-patch to all supported branches.
This reverts the removal of the call in commit (272923a0). It turns out it
wasn't superfluous after all: without it, renegotiation fails if a client
certificate was used. The rest of the changes in that commit are still OK
and not reverted.
Per investigation of bug #12769 by Arne Scheffer, although this doesn't fix
the reported bug yet.
exec_stmt_return() and exec_stmt_return_next() have fast-path code for
handling a simple variable reference (i.e. "return var") without going
through the full expression evaluation machinery. For some reason,
pl_gram.y was under the impression that this fast path only applied for
record/row variables; but in reality code for handling regular scalar
variables has been there all along. Adjusting the logic to allow that
code to be used actually results in a net savings of code in pl_gram.y
(by eliminating some redundancy), and it buys a measurable though not
very impressive amount of speedup.
Noted while fooling with my expanded-array patch, wherein this makes a much
bigger difference because it enables returning an expanded array variable
without an extra flattening step. But AFAICS this is a win regardless,
so commit it separately.
All the other certificates were created to be valid for 10000 days, because
we don't want to have to recreate them. But I missed the root CA cert, and
the pre-created certificates included in the repository expired in January.
Fix, and re-create all the certificates.