This mirrors the changes done earlier to the server in standby mode. When
pg_receivexlog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_REPLICATION command.
When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment of the old timeline. This makes it possible to tell apart a
partial segment left in the directory because of a timeline switch from a
completed segment. If you just follow a single server, it won't make a
difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.
This includes two small changes to the streaming replication protocol:
First, when the end of a timeline is reached while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when the
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding to those positions.
On top of the previous support in pg_dump, add support to specify
multiple tables (by using the -t option multiple times) to
pg_restore, clusterdb, reindexdb and vacuumdb.
Josh Kupershmidt, reviewed by Karl O. Pinc
Add a new "-q" logging option (quiet mode) to pgbench's initialization mode
(-i), producing only one progress message per 5 seconds along with
elapsed time and estimated remaining time. Also add elapsed time and
estimated remaining time to the default logging (which prints one message
per 100000 rows).
Patch contributed by Tomas Vondra, reviewed by Jeevan Chalke and
Tatsuo Ishii.
Add a command-line option -R to pg_basebackup that creates a recovery.conf
enabling standby mode, using the same parameters that pg_basebackup used to
connect to the master, and writes it into the output directory (or injects it
into the tar file when tar format is used).
Zoltan Boszormenyi, modified by Magnus Hagander, reviewed by Amit Kapila & Fujii Masao
SPI_execute() and related functions create a CachedPlan, execute it once,
and immediately discard it, so that the functionality offered by
plancache.c is of no value in this code path. And performance measurements
show that the extra data copying and invalidation checking done by
plancache.c slows down simple queries by 10% or more compared to 9.1.
However, enough of the SPI code is shared with functions that do need plan
caching that it seems impractical to bypass plancache.c altogether.
Instead, let's invent a variant version of cached plans that preserves
99% of the API but doesn't offer any of the actual functionality, nor the
overhead. This puts SPI_execute() performance back on par with, or maybe even
slightly better than, what it was before. This change should resolve recent
complaints of performance degradation from Dong Ye, Pavel Stehule, and
others.
By avoiding data copying, this change also reduces the amount of memory
needed to execute many-statement SPI_execute() strings, as for instance in
a recent complaint from Tomas Vondra.
An additional benefit of this change is that multi-statement SPI_execute()
query strings are now processed fully serially, that is we complete
execution of earlier statements before running parse analysis and planning
on following ones. This eliminates a long-standing POLA violation, in that
DDL that affects the behavior of a later statement will now behave as
expected.
Back-patch to 9.2, since this was a performance regression compared to 9.1.
(In 9.2, place the added struct fields so as to avoid changing the offsets
of existing fields.)
Heikki Linnakangas and Tom Lane
Code review for commit 2f582f76b1: don't use
a static variable for what ought to be a deparse_context field, fix
non-multibyte-safe test for spaces, avoid useless and potentially O(N^2)
(though admittedly with a very small constant) calculations of wrap
positions when we aren't going to wrap.
If pg_extension_config_dump() is executed again for a table already listed
in the extension's extconfig, the code was blindly making a new array entry.
This does not seem useful. Fix it to replace the existing array entry
instead, so that it's possible for extension update scripts to alter the
filter conditions for configuration tables.
In addition, teach ALTER EXTENSION DROP TABLE to check for an extconfig
entry for the target table, and remove it if present. This is not a 100%
solution because it's allowed for an extension update script to just
summarily DROP a member table, and that code path doesn't go through
ExecAlterExtensionContentsStmt. We could probably make that case clean
things up if we had to, but it would involve sticking a very ugly wart
somewhere in the guts of dependency.c. Since on the whole it seems quite
unlikely that extension updates would want to remove pre-existing
configuration tables, making the case possible with an explicit command
seems sufficient.
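For illustration only (the extension and table names here are invented, not
taken from the report), an extension update script could now do something like:
    -- inside the update script of a hypothetical extension "myext";
    -- pg_extension_config_dump() can only be called while an extension
    -- script is running
    SELECT pg_extension_config_dump('myext_config', 'WHERE NOT standard_entry');
    -- calling it again for the same table now replaces the filter condition
    -- instead of appending a duplicate extconfig entry
    SELECT pg_extension_config_dump('myext_config', '');
    -- dropping a member table via ALTER EXTENSION now also removes its
    -- extconfig entry, if any
    ALTER EXTENSION myext DROP TABLE myext_config;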
Per bug #7756 from Regina Obe. Back-patch to 9.1 where extensions were
introduced.
Before this patch, streaming replication would refuse to start replicating
if the timeline in the primary didn't exactly match the timeline in the
standby. The typical situation where they don't match is when you have a
master and two standbys, and you promote one of the standbys to become the
new master. Promoting bumps up the timeline ID, and after that bump, the
other standby would refuse to continue.
There's significantly more timeline related logic in streaming replication
now. First of all, when a standby connects to primary, it will ask the
primary for any timeline history files that are missing from the standby.
The missing files are sent using a new replication command TIMELINE_HISTORY,
and stored in standby's pg_xlog directory. Using the timeline history files,
the standby can follow the latest timeline present in the primary
(recovery_target_timeline='latest'), just as it can follow new timelines
appearing in an archive directory.
START_REPLICATION now takes a TIMELINE parameter, to specify exactly which
timeline to stream WAL from. This allows the standby to request the primary
to send over WAL that precedes the promotion. The replication protocol is
changed slightly (in a backwards-compatible way although there's little hope
of streaming replication working across major versions anyway), to allow
replication to stop when the end of a timeline is reached, putting the
walsender back into a state where it accepts further replication commands.
Many thanks to Amit Kapila for testing and reviewing various versions of
this patch.
Commit 729205571e added privileges on data
types, but there were a number of oversights. The implementation of
default privileges for types missed a few places, and pg_dump was
utterly innocent of the whole concept. Per bug #7741 from Nathan Alden,
and subsequent wider investigation.
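The kinds of commands now handled correctly by pg_dump and by default
privileges look roughly like this (type and role names are illustrative):
    CREATE TYPE complex AS (r float8, i float8);
    REVOKE USAGE ON TYPE complex FROM PUBLIC;
    GRANT USAGE ON TYPE complex TO app_user;   -- app_user is a hypothetical role
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT USAGE ON TYPES TO app_user;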
This patch makes "simple" views automatically updatable, without the need
to create either INSTEAD OF triggers or INSTEAD rules. "Simple" views
are those classified as updatable according to SQL-92 rules. The rewriter
transforms INSERT/UPDATE/DELETE commands on such views directly into an
equivalent command on the underlying table, which will generally have
noticeably better performance than is possible with either triggers or
user-written rules. A view that has INSTEAD OF triggers or INSTEAD rules
continues to operate the same as before.
For the moment, security_barrier views are not considered simple.
Also, we do not support WITH CHECK OPTION. These features may be
added in future.
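A minimal sketch of the behavior, with invented table and view names:
    CREATE TABLE accounts (id int PRIMARY KEY, owner text, balance numeric);
    CREATE VIEW open_accounts AS
        SELECT id, owner, balance FROM accounts WHERE balance >= 0;
    -- rewritten directly into DML on "accounts"; no trigger or rule needed
    INSERT INTO open_accounts VALUES (1, 'alice', 100);
    UPDATE open_accounts SET balance = balance + 50 WHERE id = 1;
    DELETE FROM open_accounts WHERE id = 1;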
Dean Rasheed, reviewed by Amit Kapila
For some reason lost in the mists of prehistory, RETURN was only coded to
allow a simple reference to a composite variable when the function's return
type is composite. Allow an expression instead, while preserving the
efficiency of the original code path in the case where the expression is
indeed just a composite variable's name. Likewise for RETURN NEXT.
As is true in various other places, the supplied expression must yield
exactly the number and data types of the required columns. There was some
discussion of relaxing that for pl/pgsql, but no consensus yet, so this
patch doesn't address that.
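A minimal sketch of what is now accepted, assuming the expression yields
exactly the columns of the composite return type (names are invented):
    CREATE TABLE points (x int, y int);
    CREATE FUNCTION make_point(a int, b int) RETURNS points AS $$
    BEGIN
        -- previously only a composite variable name was allowed here
        RETURN ROW(a, b);
    END;
    $$ LANGUAGE plpgsql;
    SELECT * FROM make_point(1, 2);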
Asif Rehman, reviewed by Pavel Stehule
Background workers are postmaster subprocesses that run arbitrary
user-specified code. They can request shared memory access as well as
backend database connections; or they can just use plain libpq frontend
database connections.
Modules listed in shared_preload_libraries can register background
workers in their _PG_init() function; this is early enough that it's not
necessary to provide an extra GUC option, because the necessary extra
resources can be allocated early on. Modules can install more than one
bgworker, if necessary.
Care is taken that these extra processes do not interfere with other
postmaster tasks: only one such process is started on each ServerLoop
iteration. This means that even if a large number of them are waiting to be
started up, the postmaster is still able to quickly service external
connection requests. Also, the shutdown sequence should not be impacted by
a worker process that is reasonably well behaved (i.e. promptly responds
to termination signals).
The current implementation lets worker processes specify their start
time, i.e. at what point in the server startup process they are to be
started: right after postmaster start (in which case they mustn't ask
for shared memory access), when consistent state has been reached
(useful during recovery in a HOT standby server), or when recovery has
terminated (i.e. when normal backends are allowed).
In case of a bgworker crash, the action to take depends on registration
data: if shared memory was requested, then all other connections are
taken down (as well as other bgworkers), just as if it were a regular
backend crashing. The bgworker itself is restarted, too, within a
configurable timeframe (which can be configured to be never).
More features to add to this framework can be imagined without much
effort, and have been discussed, but this seems good enough as a useful
unit already.
An elementary sample module is supplied.
Author: Álvaro Herrera
This patch is loosely based on prior patches submitted by KaiGai Kohei,
and unsubmitted code by Simon Riggs.
Reviewed by: KaiGai Kohei, Markus Wanner, Andres Freund,
Heikki Linnakangas, Simon Riggs, Amit Kapila
Add a new initdb option, --sync-only, that syncs the data directory to durable
storage.
Have pg_upgrade use it, and enable server options fsync=off and
full_page_writes=off.
Document that users turning fsync from off to on should run initdb
--sync-only.
[ Previous commit was incorrectly applied as a git merge. ]
We've generally recommended use of INSTEAD triggers over rules since that
feature was added; but this old text in the CREATE VIEW reference page
didn't get the memo. Noted by Thomas Kellerer.
When a relfilenode is created in this subtransaction or a committed child
transaction and it cannot otherwise be seen by our own process, mark tuples
committed ahead of transaction commit for all COPY commands in the same
transaction. If FREEZE is specified on COPY and the pre-conditions are met,
rows will also be frozen.
Both options are designed to avoid revisiting rows after commit, increasing
performance of subsequent commands after data load and upgrade. pg_restore
changes will follow later.
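For illustration, with an invented table and file path (the FREEZE
pre-conditions require the table's relfilenode to be new in the same
transaction, and COPY FROM a server file requires superuser):
    BEGIN;
    CREATE TABLE bulk_data (id int, payload text);
    COPY bulk_data FROM '/tmp/bulk_data.csv' WITH (FORMAT csv, FREEZE);
    COMMIT;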
Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design
input from Tom Lane, Robert Haas and Kevin Grittner
This allows a caller to get back the exact conninfo array that was
used to create a connection, including parameters read from the
environment.
In doing this, restructure how options are copied from the conninfo
to the actual connection.
Zoltan Boszormenyi and Magnus Hagander
Commit 8cb53654db, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
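For example, indexes in any of the transient or broken states can be spotted
with a catalog query along these lines:
    SELECT indexrelid::regclass, indisvalid, indisready, indislive
    FROM pg_index
    WHERE NOT (indisvalid AND indisready AND indislive);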
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
This reverts commit d573e239f0, "Take fewer
snapshots". While that seemed like a good idea at the time, it caused
execution to use a snapshot that had been acquired before locking any of
the tables mentioned in the query. This created user-visible anomalies
that were not present in any prior release of Postgres, as reported by
Tomas Vondra. While this whole area could do with a redesign (since there
are related cases that have anomalies anyway), it doesn't seem likely that
any future patch would be reasonably back-patchable; and we don't want 9.2
to exhibit a behavior that's subtly unlike either past or future releases.
Hence, revert to prior code while we rethink the problem.
This way it works more like the DSSSL build, and dependencies are
tracked better by make.
Also copy the CSS stylesheet to the html directory. This was forgotten
when the output directory was changed.
Some versions of the XSLT stylesheets don't handle the missing slash
correctly (they concatenate directory and file name without the slash).
This might never have worked correctly.
Without this, the connection will be killed after the timeout if
wal_sender_timeout is set on the server.
Original patch by Amit Kapila, modified by me to fit recent changes in the
code.
We used to send structs wrapped in CopyData messages, which works as long as
the client and server agree on things like endianness, timestamp format and
alignment. That's good enough for running a standby server, which has to run
on the same platform anyway, but it would be useful for tools like
pg_receivexlog to work across platforms.
This breaks protocol compatibility of streaming replication, but we never
promised it would be compatible across versions anyway.
Represent a sequence's current value as a separate TableDataInfo dumpable
object, so that it can be dumped within the data section of the archive
rather than in pre-data. This fixes an undesirable inconsistency between
the meanings of "--data-only" and "--section=data", and also fixes dumping
of sequences that are marked as extension configuration tables, as per a
report from Marko Kreen back in July. The main cost is that we do one more
SQL query per sequence, but that's probably not very meaningful in most
databases.
Back-patch to 9.1, since it has the extension configuration issue even
though not the --section switch.
... and have sepgsql use it to determine whether to check permissions
during certain operations. Indexes that are being created as a result
of REINDEX, for instance, do not need to have their permissions checked;
they were already checked when the index was created.
Author: KaiGai Kohei, slightly revised by me
In commit 4317e0246c, I accidentally broke
this behavior while rearranging code to ensure that --create wouldn't
affect whether a DATABASE entry gets put into archive-format output.
Thus, 9.2 would issue a DROP DATABASE command in --clean mode, which is
either useless or dangerous depending on the usage scenario.
It should not do that, and no longer does.
A bright spot is that this refactoring makes it easy to allow the
combination of --clean and --create to work sensibly, ie, emit DROP
DATABASE then CREATE DATABASE before reconnecting. Ordinarily we'd
consider that a feature addition and not back-patch it, but it seems
silly to not include the extra couple of lines required in the 9.2
version of the code.
Per report from Guillaume Lelarge, though this is slightly more extensive
than his proposed patch.
Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.
It is no longer necessary to manually set wal_receiver_status_interval to
a value smaller than the timeout. Both the WAL sender and receiver now
automatically send a "ping" message if more than half of the configured
timeout has elapsed and no message has been received from the other end.
Amit Kapila, heavily edited by me.
dblink now has its own validator function dblink_fdw_validator(), which is
better than the core function postgresql_fdw_validator() because it gets
the list of legal options from libpq instead of having a hard-wired list.
Make the dblink extension module provide a standard foreign data wrapper
dblink_fdw that encapsulates use of this validator, and recommend use of
that wrapper instead of making up wrappers on the fly.
Unfortunately, because ad-hoc wrappers *were* recommended practice
previously, it's not clear when we can get rid of postgresql_fdw_validator
without causing upgrade problems. But this is a step in the right
direction.
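Recommended usage now looks roughly like this (server name, user mapping and
query are invented):
    CREATE EXTENSION dblink;
    CREATE SERVER remote_srv FOREIGN DATA WRAPPER dblink_fdw
        OPTIONS (host 'db.example.com', dbname 'sales');
    CREATE USER MAPPING FOR CURRENT_USER SERVER remote_srv
        OPTIONS (user 'report', password 'secret');
    SELECT * FROM dblink('remote_srv', 'SELECT count(*) FROM orders')
        AS t(order_count bigint);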
Shigeru Hanada, reviewed by KaiGai Kohei
libpq defines these functions as accepting "size_t" lengths ... but the
underlying backend functions expect signed int32 length parameters, and so
will misinterpret any value exceeding INT_MAX. Fix the libpq side to throw
error rather than possibly doing something unexpected.
This is a bug of long standing, but I doubt it's worth back-patching. The
problem is really pretty academic anyway with lo_read/lo_write, since any
caller expecting sane behavior would have to have provided a multi-gigabyte
buffer. It's slightly more pressing with lo_truncate, but still we haven't
supported large objects over 2GB until now.
Copy-editing for previous patch, plus fixing some longstanding markup
issues and oversights (like not mentioning that failures will set the
PQerrorMessage string).
Allow large objects of up to 4TB (standard 8KB BLCKSZ case). For this purpose
new libpq API functions lo_lseek64, lo_tell64 and lo_truncate64 are added.
Also, corresponding new backend functions lo_lseek64, lo_tell64 and
lo_truncate64 are added. inv_api.c is changed to handle 64-bit
offsets.
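The new backend functions are also callable from SQL; a rough sketch (the 3GB
offset is only there to show a value beyond the old 2GB limit):
    DO $$
    DECLARE
        loid oid;
        fd   integer;
    BEGIN
        loid := lo_creat(-1);
        fd := lo_open(loid, 393216);                   -- INV_READ | INV_WRITE
        PERFORM lo_lseek64(fd, 3::bigint * 1024 * 1024 * 1024, 0);  -- SEEK_SET
        RAISE NOTICE 'position: %', lo_tell64(fd);
        PERFORM lo_truncate64(fd, 0);
        PERFORM lo_close(fd);
        PERFORM lo_unlink(loid);
    END $$;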
Patch contributed by Nozomi Anzai (backend side) and Yugo Nagata
(frontend side, docs, regression tests and example program). Reviewed
by Kohei Kaigai. Committed by Tatsuo Ishii with minor edits.
Use the terms "simple bind" and "search+bind" consistently to
distinguish the two modes (better than "first mode" and "second mode" in
any case). They were already used in some places; now it's just more
prominent.
Split up the list of options into one for common options and one for
each mode, for clarity.
Add configuration examples for either mode.
These reference pages still claimed that you have to be superuser to create
a database or schema owned by a different role. That was true before 8.1,
but it was changed in commits aa1110624c and
f91370cd2f to allow assignment of ownership
to any role you are a member of. However, at the time we were thinking of
that primarily as a change to the ALTER OWNER rules, so the need to touch
these two CREATE ref pages got missed.
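That is, the following now works for a non-superuser, provided the current
role is a member of the target role and has the usual CREATE/CREATEDB rights
(names invented):
    CREATE SCHEMA app_schema AUTHORIZATION app_owner;
    CREATE DATABASE app_db OWNER app_owner;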
Per discussion, schema-element subcommands are not allowed together with
this option, since it's not very obvious what should happen to the element
objects.
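For example:
    CREATE SCHEMA IF NOT EXISTS reporting;
    CREATE SCHEMA IF NOT EXISTS reporting;   -- second call is a no-op with a notice
    -- not allowed: schema-element subcommands combined with IF NOT EXISTS
    -- CREATE SCHEMA IF NOT EXISTS reporting CREATE TABLE t (i int);  -- error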
Fabrízio de Royes Mello
This allows logging only some fraction of transactions, greatly reducing
the amount of log generated.
Tomas Vondra, reviewed by Robert Haas and Jeff Janes.
You can now get the number of rows processed by a COPY statement in a
PL/pgSQL function with "GET DIAGNOSTICS x = ROW_COUNT".
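A minimal sketch (the table and file path are invented; COPY FROM a server
file also requires superuser):
    CREATE FUNCTION load_measurements() RETURNS bigint AS $$
    DECLARE
        n bigint;
    BEGIN
        COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);
        GET DIAGNOSTICS n = ROW_COUNT;
        RETURN n;
    END;
    $$ LANGUAGE plpgsql;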
Pavel Stehule, reviewed by Amit Kapila, with some editing by me.
Both programs got the "magic" string wrong, causing standard-conforming tar
implementations to believe the output was just legacy tar format without
any POSIX extensions. This doesn't actually matter that much, especially
since pg_dump failed to fill the POSIX fields anyway, but still there is
little point in emitting tar format if we can't be compliant with the
standard. In addition, pg_dump failed to write the EOF marker correctly
(there should be 2 blocks of zeroes not just one), pg_basebackup put the
numeric group ID in the wrong place, and both programs had a pretty
brain-dead idea of how to compute the checksum. Fix all that and improve
the comments a bit.
pg_restore is modified to accept either the correct POSIX-compliant "magic"
string or the previous value. This part of the change will need to be
back-patched to avoid an unnecessary compatibility break when a previous
version tries to read tar-format output from 9.3 pg_dump.
Brian Weaver and Tom Lane
The syntax "su -c 'command' username" is not accepted by all versions of
su, for example not OpenBSD's. More portable is "su username -c
'command'". So change runtime.sgml to recommend that syntax. Also,
add a -D switch to the OpenBSD example script, for consistency with other
examples. Per Denis Lapshin and Gábor Hidvégi.
This allows easily splitting configuration into many files, deployed in a
directory.
Magnus Hagander, Greg Smith, Selena Deckelmann, reviewed by Noah Misch.
Produce a NOTICE when the label already exists, for consistency with other
CREATE IF NOT EXISTS commands. Also, fix the code so it produces something
more user-friendly than an index violation when the label already exists.
This not incidentally enables making a regression test that the previous
patch didn't make for fear of exposing an unpredictable OID in the results.
Also some wordsmithing on the documentation.
If the label is already in the enum the statement becomes a no-op.
This will reduce the pain that comes from our not allowing this
operation inside a transaction block.
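For example:
    CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
    ALTER TYPE mood ADD VALUE IF NOT EXISTS 'happy';     -- no-op, emits a NOTICE
    ALTER TYPE mood ADD VALUE IF NOT EXISTS 'ecstatic';  -- value is added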
Andrew Dunstan, reviewed by Tom Lane and Magnus Hagander.
Somewhere along the line, somebody decided to remove all trace of this
notation from the documentation text. It was still in the command syntax
synopses, or at least some of them, but with no indication what it meant.
This will not do, as evidenced by the confusion apparent in bug #7543;
even if the notation is now unnecessary, people will find it in legacy
SQL code and need to know what it does.
The documentation mentioned setting autovacuum_freeze_max_age to
"its maximum allowed value of a little less than two billion".
This led to a post asking about the exact maximum allowed value,
which is precisely two billion, not "a little less".
Based on question by Radovan Jablonovsky. Backpatch to 8.3.
When starting either an old or new postmaster, force it to place its Unix
socket in the current directory. This makes it even harder for accidental
connections to occur during pg_upgrade, and also works around some
scenarios where the default socket location isn't usable. (For example,
if the default location is something other than "/tmp", it might not exist
during "make check".)
When checking an already-running old postmaster, find out its actual socket
directory location from postmaster.pid, if possible. This dodges problems
with an old postmaster having a configured location different from the
default built into pg_upgrade's libpq. We can't find that out if the old
postmaster is pre-9.1, so also document how to cope with such scenarios
manually.
In support of this, centralize handling of the connection-related command
line options passed to pg_upgrade's subsidiary programs, such as pg_dump.
This should make future changes easier.
Bruce Momjian and Tom Lane