Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
As already explained in configure.in, using the OpenSSL version number
to detect presence of functions doesn't work, because LibreSSL reports
incompatible version numbers. Fortunately, the functions we need here
are actually macros, so we can just test for them directly.
The documentation claimed that an enum type requires "one or more"
labels, but since 1fd9883ff4, zero labels are also allowed.
Reported-by: Lukas Eder <lukas.eder@gmail.com>
Bug: #15356
Two issues have been spotted and get fixed here:
- When checking for corrupted files, make sure that pg_verify_checksums
complains about the correct file. In order to make the logic more
robust, all files created are immediately removed once checks on them
are done. The error message generated by pg_verify_checksums also now
includes the file name it sees as corrupted.
- Before running corruption-related tests, empty files are generated
which used names mapping with the corrupted files, potentially leading
to conflicts. So use different set of names for both.
Author: Michael Banck
Discussion: https://postgr.es/m/20181119181119.GC23740@nighthawk.caipicrew.dd-dns.de
Per POSIX, WIFSIGNALED and related macros are provided by <sys/wait.h>.
Apparently on Linux they're also pulled in by some other inclusions,
but BSD-ish systems are pickier. Fixes portability issue in ffa4cbd62.
Per buildfarm.
Previously, any program launched by COPY TO/FROM PROGRAM inherited the
server's setting of SIGPIPE handling, i.e. SIG_IGN. Hence, if we were
doing COPY FROM PROGRAM and closed the pipe early, the child process
would see EPIPE on its output file and typically would treat that as
a fatal error, in turn causing the COPY to report error. Similarly,
one could get a failure report from a query that didn't read all of
the output from a contrib/file_fdw foreign table that uses file_fdw's
PROGRAM option.
To fix, ensure that child programs inherit SIG_DFL not SIG_IGN processing
of SIGPIPE. This seems like an all-around better situation since if
the called program wants some non-default treatment of SIGPIPE, it would
expect to have to set that up for itself. Then in COPY, if it's COPY
FROM PROGRAM and we stop reading short of detecting EOF, treat a SIGPIPE
exit from the called program as a non-error condition. This still allows
us to report an error for any case where the called program gets SIGPIPE
on some other file descriptor.
As coded, we won't report a SIGPIPE if we stop reading as a result of
seeing an in-band EOF marker (e.g. COPY BINARY EOF marker). It's
somewhat debatable whether we should complain if the called program
continues to transmit data after an EOF marker. However, it seems like
we should avoid throwing error in any questionable cases, especially in a
back-patched fix, and anyway it would take additional code to make such
an error get reported consistently.
Back-patch to v10. We could go further back, since COPY FROM PROGRAM
has been around awhile, but AFAICS the only way to reach this situation
using core or contrib is via file_fdw, which has only supported PROGRAM
sources since v10. The COPY statement per se has no feature whereby
it'd stop reading without having hit EOF or an error already. Therefore,
I don't see any upside to back-patching further that'd outweigh the
risk of complaints about behavioral change.
Per bug #15449 from Eric Cyr.
Patch by me, review by Etsuro Fujita and Kyotaro Horiguchi
Discussion: https://postgr.es/m/15449-1cf737dd5929450e@postgresql.org
In \d and \z, instead of conflating partitioned tables and indexes with
plain ones, set the "type" column and table title differently to make
the distinction obvious. A simple ease-of-use improvement.
Author: Pavel Stehule, Michaël Paquier, Álvaro Herrera
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/CAFj8pRDMWPgijpt_vPj1t702PgLG4Ls2NCf+rEcb+qGPpossmg@mail.gmail.com
This change doesn't fix any bugs that we've heard about, but it seems
like a good idea on general principles to track upstream occasionally.
Discussion: https://postgr.es/m/3320.1542647565@sss.pgh.pa.us
Calling AC_CHECK_DECLS before we've finished setting up the compiler's
CFLAGS seems like a pretty risky proposition, especially now that the
first use of that macro will result in a test to see whether the compiler
gives warning or error for undeclared built-in functions. That answer
could very easily get changed later than where PGAC_LLVM_SUPPORT is
called; furthermore, it's hardly unlikely that flags such as -D_GNU_SOURCE
could change visibility of declarations. Hence, be a little less cavalier
about where to do LLVM-related tests. This results in v11 and HEAD doing
the warning-or-error check at the same place in the script as older
branches are doing it, which seems like a good thing.
Per further thought about commits 0b59b0e8b and 16fbac39f.
When hostaddr is given, the actual IP address that psql is connected to
can be totally unexpected for the given host. The more verbose output
we now generate makes things clearer. Since the "host" and "hostaddr"
parts of the conninfo could come from different sources (say, one of
them is in the service specification or a URI-style conninfo and the
other is not), this is not as silly as it may first appear. This is
also definitely useful if the hostname resolves to multiple addresses.
Author: Fabien Coelho
Reviewed-by: Pavel Stehule, Arthur Zakirov
Discussion: https://postgr.es/m/alpine.DEB.2.21.1810261532380.27686@lancrehttps://postgr.es/m/alpine.DEB.2.21.1808201323020.13832@lancre
The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'. There's no
point to duplicating the list, so avoid that. Instead, construct
the array representation directly from the original 'inhoids' list.
Also, use an array rather than a list for 'boundspecs'. We know
exactly how many items we need to store, so there's really no
reason to use a list. Using an array instead reduces the number
of memory allocations we perform.
Patch by me, reviewed by Michael Paquier and Amit Langote, the
latter of whom also helped with rebasing.
The test case that Autoconf uses to discover whether a function has
been declared doesn't work reliably with clang, because clang reports
a warning not an error if the name is a known built-in function.
On some platforms, this results in a lot of compile-time warnings about
strlcpy and related functions not having been declared.
There is a fix for this (by Noah Misch) in the upstream Autoconf sources,
but since they've not made a release in years and show no indication of
doing so anytime soon, let's just absorb their fix directly. We can
revert this when and if we update to a newer Autoconf release.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/26819.1542515567@sss.pgh.pa.us
This didn't actually work: COPY would fail to flush the right files, and
instead would try to flush a non-existing file, causing the whole
transaction to fail.
Cope by raising an error as soon as the command is sent instead, to
avoid a nasty later surprise. Of course, it would be much better to
make it work, but we don't have a patch for that yet, and we don't know
if we'll want to backpatch one when we do.
Reported-by: Tomas Vondra
Author: David Rowley
Reviewed-by: Amit Langote, Steve Singer, Tomas Vondra
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure. In that case the only remaining copy of the
data is in the WAL. A subsequent fsync() could appear to succeed,
but not have flushed the data. That means that a future checkpoint
could apparently complete successfully but have lost data.
Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure. Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.
Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses. If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail. The GUC defaults
to off, meaning that we panic.
Back-patch to all supported releases.
There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error. A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.
Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
If fsync() fails, md.c must keep the request in its bitmap, so that
future attempts will try again.
Back-patch to all supported releases.
Author: Thomas Munro
Reviewed-by: Amit Kapila
Reported-by: Andrew Gierth
Discussion: https://postgr.es/m/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
When reading a WAL record, its contents are copied into an intermediate
buffer. However, doing so is not necessary if the record fits fully
into the current page, saving one memcpy for each such record. The
allocation handling of the intermediate buffer is also now done only
when a record crosses a page boundary, shaving some extra cycles when
reading a WAL record.
Author: Andrey Lepikhov
Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas
Discussion: https://postgr.es/m/c2ea54dd-a1d3-80eb-ddbf-7e6f258e615e@postgrespro.ru
Any Autoconf macro that uses AC_REQUIRES -- directly or indirectly --
must not be inside a plain shell "if" test; if it is, whatever code
gets pulled in by the AC_REQUIRES will also be inside that "if".
Instead of "if" we can use AS_IF, which knows how to get this right
(cf commit 01051a987).
The only immediate problem from getting this wrong was that AC_PROG_AWK
had to be run twice, once inside the "if llvm" block and once in the
main line. However, it broke a different patch I'm about to submit
more thoroughly.
wcsrtombs (called through wchar2char from common functions like lower,
upper, etc.) uses various optimizations that may look like access to
uninitialized data, triggering valgrind reports.
For example AVX2 instructions load data in 256-bit chunks, and gconv
does something similar with 32-bit chunks. This is faster than accessing
the bytes one by one, and the uninitialized part of the buffer is not
actually used. So suppress the bogus reports.
The exact stack depends on possible optimizations - it might be AVX, SSE
(as in the report by Aleksander Alekseev) or something else. Hence the
last frame is wildcarded, to deal with this.
Backpatch all the way back to 9.4.
Author: Tomas Vondra
Discussion: https://www.postgresql.org/message-id/flat/90ac0452-e907-e7a4-b3c8-15bd33780e62%402ndquadrant.com
Discussion: https://www.postgresql.org/message-id/20180220150838.GD18315@e733.localdomain
Setting them to SIG_IGN seems unlikely to have any beneficial effect
on that platform, and given the signal numbering collision with SIGABRT,
it could easily have bad effects.
Given the lack of field complaints that can be traced to this, I don't
presently feel a need to back-patch.
Discussion: https://postgr.es/m/5627.1542477392@sss.pgh.pa.us
For reasons lost in the mists of time, most postmaster child processes
reset SIGTTIN/SIGTTOU signal handling to SIG_DFL, with the major exception
that backend sessions do not. It seems like a pretty bad idea for any
postmaster children to do that: if stderr is connected to the terminal,
and the user has put the postmaster in background, any log output would
result in the child process freezing up. Hence, switch them all to
doing what backends do, ie, nothing. This allows them to inherit the
postmaster's SIG_IGN setting. On the other hand, manually-launched
processes such as standalone backends will have default processing,
which seems fine.
In passing, also remove useless resets of SIGCONT and SIGWINCH signal
processing. Perhaps the postmaster once changed those to something
besides SIG_DFL, but it doesn't now, so these are just wasted (and
confusing) syscalls.
Basically, this propagates the changes made in commit 8e2998d8a from
backends to other postmaster children. Probably the only reason these
calls now exist elsewhere is that I missed changing pgstat.c along with
postgres.c at the time.
Given the lack of field complaints that can be traced to this, I don't
presently feel a need to back-patch.
Discussion: https://postgr.es/m/5627.1542477392@sss.pgh.pa.us
This commit completes the work prepared in 1a0586de36, splitting the
old TupleTableSlot implementation (which could store buffer, heap,
minimal and virtual slots) into four different slot types. As
described in the aforementioned commit, this is done with the goal of
making tuple table slots extensible, to allow for pluggable table
access methods.
To achieve runtime extensibility for TupleTableSlots, operations on
slots that can differ between types of slots are performed using the
TupleTableSlotOps struct provided at slot creation time. That
includes information from the size of TupleTableSlot struct to be
allocated, initialization, deforming etc. See the struct's definition
for more detailed information about callbacks TupleTableSlotOps.
I decided to rename TTSOpsBufferTuple to TTSOpsBufferHeapTuple and
ExecCopySlotTuple to ExecCopySlotHeapTuple, as that seems more
consistent with other naming introduced in recent patches.
There's plenty optimization potential in the slot implementation, but
according to benchmarking the state after this commit has similar
performance characteristics to before this set of changes, which seems
sufficient.
There's a few changes in execReplication.c that currently need to poke
through the slot abstraction, that'll be repaired once the pluggable
storage patchset provides the necessary infrastructure.
Author: Andres Freund and Ashutosh Bapat, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
This becomes useful when used to retry a transaction after a
serialization error or deadlock abort. (We don't yet have that feature,
but this is preparation for it.)
While at it, use separate random state for thread administratrivia such
as deciding which script to run, how long to delay for throttling, or
whether to log a message when sampling; this not only makes these tasks
independent of each other, but makes the actual thread run
deterministic.
Author: Marina Polyakova
Reviewed-by: Fabien Coelho
Discussion: https://postgr.es/m/72a0d590d6ba06f242d75c2e641820ec@postgrespro.ru
This speeds up write operations (INSERT, UPDATE, DELETE, COPY, as well
as the future MERGE) on partitioned tables.
This changes the setup for tuple routing so that it does far less work
during the initial setup and pushes more work out to when partitions
receive tuples. PartitionDispatchData structs for sub-partitioned
tables are only created when a tuple gets routed through it. The
possibly large arrays in the PartitionTupleRouting struct have largely
been removed. The partitions[] array remains but now never contains any
NULL gaps. Previously the NULLs had to be skipped during
ExecCleanupTupleRouting(), which could add a large overhead to the
cleanup when the number of partitions was large. The partitions[] array
is allocated small to start with and only enlarged when we route tuples
to enough partitions that it runs out of space. This allows us to keep
simple single-row partition INSERTs running quickly. Redesign
The arrays in PartitionTupleRouting which stored the tuple translation maps
have now been removed. These have been moved out into a
PartitionRoutingInfo struct which is an additional field in ResultRelInfo.
The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This commit just removes the other slow
parts.
In passing also rename the tuple translation maps from being ParentToChild
and ChildToParent to being RootToPartition and PartitionToRoot. The old
names mislead you into thinking that a partition of some sub-partitioned
table would translate to the rowtype of the sub-partitioned table rather
than the root partitioned table.
Authors: David Rowley and Amit Langote, heavily revised by Álvaro Herrera
Testing help from Jesper Pedersen and Kato Sho.
Discussion: https://postgr.es/m/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
The assumption made in 1a0586de36 was wrong, as evidenced by
buildfarm failure on locust, which runs with
force_parallel_mode=regress. The tuples accessed in either nodes are
in the outer slot, and we can't trivially rely on the slot type being
known because the leader might execute the subsidiary node directly,
or via the tuple queue on a worker. In the latter case the tuple will
always be a heaptuple slot, but in the former, it'll be whatever the
subsidiary node returns.
Virtual tuple table slots never need tuple deforming. Therefore, if we
know at expression compilation time, that a certain slot will always
be virtual, there's no need to create a tuple deforming routine for
it.
Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
This is important so JIT compilation knows what kind of tuple slot the
deforming routine can expect. There's also optimization potential for
expression initialization without JIT compilation. It e.g. seems
plausible to elide EEOP_*_FETCHSOME ops entirely when dealing with
virtual slots.
Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
Previously this information was computed when JIT compiling an
expression. But the information is useful for assertions in the
non-JIT case too (for assertions), therefore it makes sense to move
it.
This will, in a followup commit, allow to treat different slot types
differently. E.g. for virtual slots there's no need to generate a JIT
function to deform the slot.
Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.
Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots. Thus different types of
slots are needed, which requires adapting creators of slots.
The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.
Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.
But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.
Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().
The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.
This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit. The different types of slots introduced will, for now, still
use the same backing implementation.
While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
Previously materializing a slot always returned a HeapTuple. As
current work aims to reduce the reliance on HeapTuples (so other
storage systems can work efficiently), that needs to change. Thus
split the tasks of materializing a slot (i.e. making it independent
from the underlying storage / other memory contexts) from fetching a
HeapTuple from the slot. For brevity, allow to fetch a HeapTuple from
a slot and materializing the slot at the same time, controlled by a
parameter.
For now some callers of ExecFetchSlotHeapTuple, with materialize =
true, expect that changes to the heap tuple will be reflected in the
underlying slot. Those places will be adapted in due course, so while
not pretty, that's OK for now.
Also rename ExecFetchSlotTuple to ExecFetchSlotHeapTupleDatum and
ExecFetchSlotTupleDatum to ExecFetchSlotHeapTupleDatum, as it's likely
that future storage methods will need similar methods. There already
is ExecFetchSlotMinimalTuple, so the new names make the naming scheme
more coherent.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
It was not really that obvious that there's meant to be exactly 1 item
in the partexprs List for each zero-valued partattrs element. Some
incorrect code using these fields was the cause of CVE-2018-1052, so
it's worthwhile to mention how they should be used in the comments.
Author: David Rowley <david.rowley@2ndquadrant.com>
The comments above the PartitionedRelPruneInfo struct incorrectly
document how subplan_map and subpart_map are indexed. This seems to
have snuck in on 4e23236403.
Author: David Rowley <david.rowley@2ndquadrant.com>
With run-time partition pruning, there is no longer necessarily an
executor node for each corresponding plan node.
Author: David Rowley <david.rowley@2ndquadrant.com>
The current pattern of reseting expressions both in
ExecProcessReturning() and ExecOnConflictUpdate() makes it harder than
necessary to reason about memory lifetimes. It also requires
materializing slots unnecessarily, although this patch doesn't take
advantage of the fact that that's not necessary anymore.
Instead reset the expression context once for each input tuple.
Author: Ashutosh Bapat
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de