Previously, ginInsertCleanup could exit early if it detects that someone else
is cleaning up the pending list, without waiting for that someone else to
finish the job. But in this case vacuum could miss tuples to be deleted.
Cleanup process now locks metapage with a help of heavyweight
LockPage(ExclusiveLock), and it guarantees that there is no another cleanup
process at the same time. Lock is taken differently depending on caller of
cleanup process: any vacuums and gin_clean_pending_list() will be blocked
until lock becomes available, ordinary insert uses conditional lock to
prevent indefinite waiting on lock.
Insert into pending list doesn't use this lock, so insertion isn't blocked.
Also, patch adds stopping of cleanup process when at-start-cleanup-tail is
reached in order to prevent infinite cleanup in case of massive insertion. But
it will stop only for automatic maintenance tasks like autovacuum.
Patch introduces choice of limit of memory to use: autovacuum_work_mem,
maintenance_work_mem or work_mem depending on call path.
Patch for previous releases should be reworked due to changes between 9.6 and
previous ones in this area.
Discover and diagnostics by Jeff Janes and Tomas Vondra
Patch by me with some ideas of Jeff Janes
The previous coding here was formally undefined, though it seems to
accidentally work on most platforms in the buildfarm. Caught by some
OpenBSD platforms in which libc contains an assertion check for
overlapping areas passed to memcpy().
Thomas Munro
Commit 989be0810d added a flex/bison lexer/parser to interpret
synchronous_standby_names. It was done in a pretty crufty way, though,
making assorted end-use sites responsible for calling the parser at the
right times. That was not only vulnerable to errors of omission, but made
it possible that lexer/parser errors occur at very undesirable times,
and created memory leakages even if there was no error.
Instead, perform the parsing once during check_synchronous_standby_names
and let guc.c manage the resulting data. To do that, we have to flatten
the parsed representation into a single hunk of malloc'd memory, but that
is not very hard.
While at it, work a little harder on making useful error reports for
parsing problems; the previous code felt that "synchronous_standby_names
parser returned 1" was an appropriate user-facing error message. (To
be fair, it did also log a syntax error message, but separately from the
GUC problem report, which is at best confusing.) It had some outright
bugs in the face of invalid input, too.
I (tgl) also concluded that we need to restrict unquoted names in
synchronous_standby_names to be just SQL identifiers. The previous coding
would accept darn near anything, which (1) makes the quoting convention
both nearly-unnecessary and formally ambiguous, (2) makes it very hard to
understand what is a syntax error and what is a creative interpretation of
the input as a standby name, and (3) makes it impossible to further extend
the syntax in future without a compatibility break. I presume that we're
intending future extensions of the syntax, else this parsing infrastructure
is massive overkill, so (3) is an important objection. Since we've taken
a compatibility hit for non-identifier names with this change anyway, we
might as well lock things down now and insist that users use double quotes
for standby names that aren't identifiers.
Kyotaro Horiguchi and Tom Lane
In addition to adding new typedefs, I also re-sorted the file so that
various entries add piecemeal, mostly or entirely by me, were alphabetized
the same way as other entries in the file.
In nodeFuncs.c, pgindent wants to introduce spurious indentation into
the definitions of planstate_tree_walker and planstate_walk_subplans.
Fix that by spreading the definition out across several lines, similar
to what is already done for other walker functions in that file.
In execParallel.c, in the definition of SharedExecutorInstrumentation,
pgindent wants to insert more whitespace between the type name and the
member name. That causes it to mangle comments later on the line. Fix
by moving the comments out of line. Now that we have a bit more room,
add some more details that may be useful to the next person reading
this code.
The way that PartialAggregate and FinalizeAggregate plan nodes were
displaying output columns before was bogus. Now, FinalizeAggregate
produces the same outputs as an Aggregate would have produced, while
PartialAggregate produces each of those outputs prefixed by the word
PARTIAL.
Discussion: 12585.1460737650@sss.pgh.pa.us
Patch by me, reviewed by David Rowley.
Before this commit _mdfd_getseg(), in contrast to mdnblocks(), did not
verify whether all segments leading up to the to-be-opened one, were
RELSEG_SIZE sized. That is e.g. not the case after truncating a
relation, because later segments just get truncated to zero length, not
removed.
Once a "non-existent" segment has been opened in a session, mdnblocks()
will return wrong results, causing errors like "could not read block %u
in file" when accessing blocks. Closing the session, or the later
arrival of relevant invalidation messages, would "fix" the problem.
That, so far, was mostly harmless, because most segment accesses are
only done after an mdnblocks() call. But since 428b1d6b29 we try to
open segments that might have been deleted, to trigger kernel writeback
from a backend's queue of recent writes.
To fix check segment sizes in _mdfd_getseg() when opening previously
unopened segments. In practice this shouldn't imply a lot of additional
lseek() calls, because mdnblocks() will most of the time already have
opened all relevant segments.
This commit also fixes a second problem, namely that _mdfd_getseg(
EXTENSION_RETURN_NULL) extends files during recovery, which is not
desirable for the mdwriteback() case. Add EXTENSION_REALLY_RETURN_NULL,
which does not behave that way, and use it.
Reported-By: Thom Brown
Author: Andres Freund, Abhijit Menon-Sen
Reviewd-By: Robert Haas, Fabien Coehlo
Discussion: CAA-aLv6Dp_ZsV-44QA-2zgkqWKQq=GedBX2dRSrWpxqovXK=Pg@mail.gmail.com
Fixes: 428b1d6b29
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
XXX:
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.comCAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
pg_atomic_compare_exchange_*_impl() were providing only the semantics of
an acquire barrier. Buildfarm members hornet and mandrill revealed this
deficit beginning with commit 008608b9d5.
While we have no report of symptoms in 9.5, we can't rule out the
possibility of certain compilers, hardware, or extension code relying on
these functions' specified barrier semantics. Back-patch to 9.5, where
commit b64d92f1a5 introduced atomics.
Reviewed by Andres Freund.
In commit c0b050192, Andres introduced the idea of including one-line
commit references in our major release notes. Teach git_changelog to
emit a (lightly adapted) version of that format, so that we don't
have to laboriously add it to the notes after the fact. The default
output isn't changed, since I anticipate still using that for minor
release notes.
Previously, database clusters created by a TAP test were shut down by
DESTROY methods attached to the PostgresNode objects representing them.
The trouble with that is that if the objects survive into the final global
destruction phase (which they do), Perl executes the DESTROY methods in an
unspecified order. Thus, the order of shutdown of multiple clusters was
indeterminate, which might lead to not-very-reproducible errors getting
logged (eg from a slave whose master might or might not get killed first).
Worse, the File::Temp objects representing the temporary PGDATA directories
might get destroyed before the PostgresNode objects, resulting in attempts
to delete PGDATA directories that still have live servers in them. On
Windows, this would lead to directory deletion failures; on Unix, it
usually had no effects worse than erratic "could not open temporary
statistics file "pg_stat/global.tmp": No such file or directory" log
messages.
While none of this would affect the reported result of the TAP test, which
is already determined, it could be very confusing when one is trying to
understand from the logs what went wrong with a failed test.
To fix, do the postmaster shutdowns in an END block rather than at object
destruction time. The END block will execute at a well-defined (and
reasonable) time during script termination, and it will stop the
postmasters in order of PostgresNode object creation. (Perhaps we should
change that to be reverse order of creation, but the main point here is
that we now have control which we did not before.) Use "pg_ctl stop", not
an asynchronous kill(SIGQUIT), so that we wait for the postmasters to shut
down before proceeding with directory deletion.
Deletion of temporary directories still happens in an unspecified order
during global destruction, but I can see no reason to care about that
once the postmasters are stopped.
The true explanation for Peter Eisentraut's report of inexact asind results
seems to be that (a) he's compiling into x87 instruction set, which uses
wider-than-double float registers, plus (b) the library function asin() on
his platform returns a result that is wider than double and is not rounded
to double width. To fix, we have to force the function's result to be
rounded comparably to what happened to the scaling constant asin_0_5.
Experimentation suggests that storing it into a volatile local variable is
the least ugly way of making that happen. Although only asin() is known to
exhibit an observable inexact result, we'd better do this in all the places
where we're hoping to get an exact result by scaling.
Change max_parallel_degree default from 0 to 2. It is possible that
this is not a good idea, or that we should go with 1 worker rather
than 2, but we won't find out without trying it. Along the way,
reword the documentation for max_parallel_degree a little bit to
hopefully make it more clear.
Discussion: 20160420174631.3qjjhpwsvvx5bau5@alap3.anarazel.de
Commit 65abaab547 tried to prevent the scaling constants used in
the degree-based trig functions from being precomputed at compile time,
because some compilers do that with functions that don't yield results
identical-to-the-last-bit to what you get at runtime. A report from
Peter Eisentraut suggests that some recent compilers are smart enough
to see through that trick, though. Instead, let's put the inputs to
these calculations into non-const global variables, which should be a
more reliable way of convincing the compiler that it can't assume that
they are compile-time constants. (If we really get desperate, we could
mark these variables "volatile", but I do not believe we should have to.)
Commit fab84c7787 tried to get away without doing an actual bind(),
but buildfarm results show that that doesn't get the job done. So we must
really bind to the target port --- and at least on my Linux box, we need a
listen() as well, or conflicts won't be detected. We rely on SO_REUSEADDR
to prevent problems from starting a postmaster on the socket immediately
after we've bound to it in the test code. (There may be platforms where
that doesn't work too well. But fortunately, we only really care whether
this works on Windows, and there the default behavior should be OK.)
Buildfarm members bowerbird and jacana have shown intermittent "could not
bind IPv4 socket" failures in the BinInstallCheck stage since mid-December,
shortly after commits 1caef31d9e and 9821492ee4 changed the
logic for selecting which port to use in temporary installations. One
plausible explanation is that we are randomly selecting ports that are
already in use for some non-Postgres purpose. Although the code tried
to defend against already-in-use ports, it used pg_isready to probe
the port which is quite unhelpful: if some non-Postgres server responds
at the given address, pg_isready will generally say "no response",
leading to exactly the wrong conclusion about whether the port is free.
Instead, let's use a simple TCP connect() call to see if anything answers
without making assumptions about what it is. Note that this means there's
no direct check for a conflicting Unix socket, but that should be okay
because there should be no other Unix sockets in use in the temporary
socket directory created for a test run.
This is only a partial solution for the TCP case, since if the port number
is in use for an outgoing connection rather than a listening socket, we'll
fail to detect that. We could try to bind() to the proposed port as a
means of detecting that case, but that would introduce its own failure
modes, since the system might consider the address to remain reserved for
some period of time after we drop the bound socket. Close study of the
errors returned by bowerbird and jacana suggests that what we're seeing
there may be conflicts with listening not outgoing sockets, so let's try
this and see if it improves matters. It's certainly better than what's
there now, in any case.
Michael Paquier, adjusted by me to work on non-Windows as well as Windows
Several issues:
1) checkpoint_flush_after doc and code disagreed about the default
2) new GUCs were missing from postgresql.conf.sample
3) Outdated source-code comment about bgwriter_flush_after's default
4) Sub-optimal categories assigned to new GUCs
5) Docs suggested backend_flush_after is PGC_SIGHUP, but it's PGC_USERSET.
6) Spell out int as integer in the docs, as done elsewhere
Reported-By: Magnus Hagander, Fujii Masao
Discussion: CAHGQGwETyTG5VYQQ5C_srwxWX7RXvFcD3dKROhvAWWhoSBdmZw@mail.gmail.com
NetBSD has seen fit to invent a libc function named strtoi(), which
conflicts with the long-established static functions of the same name in
datetime.c and ecpg's interval.c. While muttering darkly about intrusions
on application namespace, we'll rename our functions to avoid the conflict.
Back-patch to all supported branches, since this would affect attempts
to build any of them on recent NetBSD.
Thomas Munro
It appears that we can no longer get away with using V0 call convention
for bool-returning functions in newer versions of MSVC. The compiler
seems to generate code that doesn't clear the higher-order bits of the
result register, causing the bool result Datum to often read as "true"
when "false" was intended. This is not very surprising, since the
function thinks it's returning a bool-width result but fmgr_oldstyle
assumes that V0 functions return "char *"; what's surprising is that
that hack worked for so long on so many platforms.
The only functions of this description in core+contrib are in contrib/seg,
which we'd intentionally left mostly in V0 style to serve as a warning
canary if V0 call convention breaks. We could imagine hacking things
so that they're still V0 (we'd have to redeclare the bool-returning
functions as returning some suitably wide integer type, like size_t,
at the C level). But on the whole it seems better to convert 'em to V1.
We can still leave the pointer- and int-returning functions in V0 style,
so that the test coverage isn't gone entirely.
Back-patch to 9.5, since our intention is to support VS2015 in 9.5
and later. There's no SQL-level change in the functions' behavior
so back-patching should be safe enough.
Discussion: <22094.1461273324@sss.pgh.pa.us>
Michael Paquier, adjusted some by me
The implementation of that feature involves injecting nodes into the
raw parsetree where explicit parentheses appear. Various places in
parse_expr.c that test to see "is this child node of type Foo" need to
look through such nodes, else we'll get different behavior when
operator_precedence_warning is on than when it is off. Note that we only
need to handle this when testing untransformed child nodes, since the
AEXPR_PAREN nodes will be gone anyway after transformExprRecurse.
Per report from Scott Ribe and additional code-reading. Back-patch
to 9.5 where this feature was added.
Report: <ED37E303-1B0A-4CD8-8E1E-B9C4C2DD9A17@elevated-dev.com>
Given a left join containing a full join in its righthand side, with
the left join's joinclause referencing only one side of the full join
(in a non-strict fashion, so that the full join doesn't get simplified),
the planner could fail with "failed to build any N-way joins" or related
errors. This happened because the full join was seen as overlapping the
left join's RHS, and then recent changes within join_is_legal() caused
that function to conclude that the full join couldn't validly be formed.
Rather than try to rejigger join_is_legal() yet more to allow this,
I think it's better to fix initsplan.c so that the required join order
is explicit in the SpecialJoinInfo data structure. The previous coding
there essentially ignored full joins, relying on the fact that we don't
flatten them in the joinlist data structure to preserve their ordering.
That's sufficient to prevent a wrong plan from being formed, but as this
example shows, it's not sufficient to ensure that the right plan will
be formed. We need to work a bit harder to ensure that the right plan
looks sane according to the SpecialJoinInfos.
Per bug #14105 from Vojtech Rylko. This was apparently induced by
commit 8703059c6 (though now that I've seen it, I wonder whether there
are related cases that could have failed before that); so back-patch
to all active branches. Unfortunately, that patch also went into 9.0,
so this bug is a regression that won't be fixed in that branch.
The coverage was rather lean for cases that bind() or listen() might
return. Add entries for everything that there's a direct equivalent
for in the set of Unix errnos that elog.c has heard of.
There's no longer a need for the MSVC-version-specific code stanza that
forcibly redefines errno code symbols, because since commit 73838b52 we're
unconditionally redefining them in the stanza before this one anyway.
Now it's merely confusing and ugly, so get rid of it; and improve the
comment that explains what's going on here.
Although this is just cosmetic, back-patch anyway since I'm intending
to back-patch some less-cosmetic changes in this same hunk of code.
Revert commit 7cb1db1d95, which represented
a misunderstanding of the problem (if snapmgr.h weren't already included
in bufmgr.h, things wouldn't compile anywhere). Instead install what
I think is the real fix.
When we shoehorned "x op ANY (array)" into the SQL syntax, we created a
fundamental ambiguity as to the proper treatment of a sub-SELECT on the
righthand side: perhaps what's meant is to compare x against each row of
the sub-SELECT's result, or perhaps the sub-SELECT is meant as a scalar
sub-SELECT that delivers a single array value whose members should be
compared against x. The grammar resolves it as the former case whenever
the RHS is a select_with_parens, making the latter case hard to reach ---
but you can get at it, with tricks such as attaching a no-op cast to the
sub-SELECT. Parse analysis would throw away the no-op cast, leaving a
parsetree with an EXPR_SUBLINK SubLink directly under a ScalarArrayOpExpr.
ruleutils.c was not clued in on this fine point, and would naively emit
"x op ANY ((SELECT ...))", which would be parsed as the first alternative,
typically leading to errors like "operator does not exist: text = text[]"
during dump/reload of a view or rule containing such a construct. To fix,
emit a no-op cast when dumping such a parsetree. This might well be
exactly what the user wrote to get the construct accepted in the first
place; and even if she got there with some other dodge, it is a valid
representation of the parsetree.
Per report from Karl Czajkowski. He mentioned only a case involving
RLS policies, but actually the problem is very old, so back-patch to
all supported branches.
Report: <20160421001832.GB7976@moraine.isi.edu>
Also, avoid reading PGPROC's wait_event field twice, once for the wait
event and again for the wait_event_type, because the value might change
in the middle.
Petr Jelinek and Robert Haas
That commit increased all shared memory allocations to the next higher
multiple of PG_CACHE_LINE_SIZE, but it didn't ensure that allocation
started on a cache line boundary. It also failed to remove a couple
other pieces of now-useless code.
BUFFERALIGN() is perhaps obsolete at this point, and likely should be
removed at some point, too, but that seems like it can be left to a
future cleanup.
Mistakes all pointed out by Andres Freund. The patch is mine, with
a few extra assertions which I adopted from his version of this fix.
This fixes a problem which is not new, but with the advent of direct
foreign table modification in 0bf3ae88af,
it's somewhat more likely to be annoying than previously. So,
arrange for a local query cancelation to propagate to the remote side.
Michael Paquier, reviewed by Etsuro Fujita. Original report by
Thom Brown.
Even with old_snapshot_threshold = -1 (which disables the "snapshot
too old" feature), performance regressions were seen at moderate to
high concurrency. For example, a one-socket, four-core system
running 200 connections at saturation could see up to a 2.3%
regression, with larger regressions possible on NUMA machines.
By inlining the early (smaller, faster) tests in the
TestForOldSnapshot() function, the i7 case dropped to a 0.2%
regression, which could easily just be noise, and is clearly an
improvement. Further testing will show whether more is needed.
If there's a filter condition on either side of a full outer join,
it is neither correct to attach it to the join's ON clause nor to
throw it into the toplevel WHERE clause. Just don't push down the
join in that case.
To maximize the number of cases where we can still push down full
joins, push inner join conditions into the ON clause at the first
opportunity rather than postponing them to the top-level WHERE
clause. This produces nicer SQL, anyway.
This bug was introduced in e4106b2528.
Ashutosh Bapat, per report from Rajkumar Raghuwanshi.
In commit 2ffa869620 we made pg_ctl recognize an environment variable
PGCTLTIMEOUT to set the default timeout for starting and stopping the
postmaster. However, pg_regress uses pg_ctl only for the "stop" end of
that; it has bespoke code for starting the postmaster, and that code has
historically had a hard-wired 60-second timeout. Further buildfarm
experience says it'd be a good idea if that timeout were also controlled
by PGCTLTIMEOUT, so let's make it so. Like the previous patch, back-patch
to all active branches.
Discussion: <13969.1461191936@sss.pgh.pa.us>