61148 Commits

Author SHA1 Message Date
Nathan Bossart
b0a4c3e88b Prevent 006_transfer_modes.pl from leaving files behind.
This test was leaving files like delete_old_cluster.{sh,bat} in the
source directory for VPATH and meson builds.  To fix, change the
directory to tmp_check before running the test, as was done in
commits 15b6d21553, 8af917be6b, and c462b054ba.

Oversight in commit af0d4901c1.

Reported-by: Andrew Dunstan <andrew@dunslane.net> (on Discord)
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/Z_RHkG770w3SE0yU%40nathan
2025-04-08 10:57:31 -05:00
Daniel Gustafsson
88edd661c8 ci: Add MBUILD_TARGET for NetBSD and OpenBSD
Commit b2bdb972c0 added MBUILD_TARGET to ensure that meson builds
the tests before running them, this adds MBUILD_TARGET to OpenBSD
and NetBSD builds as well where it was missing.

No backpatching since OpenBSD and NetBSD support does not exist
in the backbranch CI.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAN55FZ2LNnRrtL+cpSdEg44fQcLPq_GjJjfNa0vz+xqEdq=ZHw@mail.gmail.com
2025-04-08 15:28:29 +02:00
Tomas Vondra
91f1fe90c7 pg_buffercache: Change page_num type to bigint
The page_num was defined as integer, which should be sufficient for the
near future (with 4K pages it's 8TB). But it's virtually free to return
bigint, and get a wider range. This was agreed on the thread, but I
forgot to tweak this in ba2a3c2302f1.

While at it, make the data types in CREATE VIEW a bit more consistent.

Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.co
2025-04-08 12:38:42 +02:00
Tomas Vondra
b8a6078ca8 doc: Correct pg_shmem_allocations_numa.size data type
The code in pg_get_shmem_allocations_numa() returned 'size' as int64,
but the docs said int32.

Report and fix by Noriyoshi Shinoda.

Reported-by: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com>
Discussion: https://postgr.es/m/DM4PR84MB1734308EB741A6ECFF040C27EEAA2@DM4PR84MB1734.NAMPRD84.PROD.OUTLOOK.COM
2025-04-08 12:36:36 +02:00
Amit Kapila
12eece5fd5 Fix uninitialized index information access during apply.
The issue happens when building conflict information during apply of
INSERT or UPDATE operations that violate unique constraints on leaf
partitions.

The problem was introduced in commit 9ff68679b5, which removed the
redundant calls to ExecOpenIndices/ExecCloseIndices. The previous code was
relying on the redundant ExecOpenIndices call in
apply_handle_tuple_routing() to build the index information required for
unique key conflict detection.

The fix is to delay building the index information until a conflict is
detected instead of relying on ExecOpenIndices to do the same. The
additional benefit of this approach is that it avoids building index
information when there is no conflict.

Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by:Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB57244ADA33DDA57119B9D26494A62@TYAPR01MB5724.jpnprd01.prod.outlook.com
2025-04-08 15:35:42 +05:30
Thomas Munro
7ea21f4ee2 Fix typo in docs.
Typo in previous commit.
2025-04-08 22:02:45 +12:00
Thomas Munro
f78ca6f3eb Introduce file_copy_method setting.
It can be set to either COPY (the default) or CLONE if the system
supports it.  CLONE causes callers of copydir(), currently CREATE
DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE =
..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy
files instead of a read-write loop over the contents.

CLONE gives the kernel the opportunity to share block ranges on
copy-on-write file systems and push copying down to storage on others,
depending on configuration.  On some systems CLONE can be used to clone
large databases quickly with CREATE DATABASE ... TEMPLATE=source
STRATEGY=FILE_COPY.

Other operating systems could be supported; patches welcome.

Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
2025-04-08 21:35:38 +12:00
Daniel Gustafsson
042a66291b Add function to get memory context stats for processes
This adds a function for retrieving memory context statistics
and information from backends as well as auxiliary processes.
The intended usecase is cluster debugging when under memory
pressure or unanticipated memory usage characteristics.

When calling the function it sends a signal to the specified
process to submit statistics regarding its memory contexts
into dynamic shared memory.  Each memory context is returned
in detail, followed by a cumulative total in case the number
of contexts exceed the max allocated amount of shared memory.
Each process is limited to use at most 1Mb memory for this.

A summary can also be explicitly requested by the user, this
will return the TopMemoryContext and a cumulative total of
all lower contexts.

In order to not block on busy processes the caller specifies
the number of seconds during which to retry before timing out.
In the case where no statistics are published within the set
timeout,  the last known statistics are returned, or NULL if
no previously published statistics exist.  This allows dash-
board type queries to continually publish even if the target
process is temporarily congested.  Context records contain a
timestamp to indicate when they were submitted.

Author: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
2025-04-08 11:06:56 +02:00
Andres Freund
15f0cb26b5 Increase BAS_BULKREAD based on effective_io_concurrency
Before, BAS_BULKREAD was always of size 256kB. With the default
io_combine_limit of 16, that only allowed 1-2 IOs to be in flight -
insufficient even on very low latency storage.

We don't just want to increase the size to a much larger hardcoded value, as
very large rings (10s of MBs of of buffers), appear to have negative
performance effects when reading in data that the OS has cached (but not when
actually needing to do IO).

To address this, increase the size of BAS_BULKREAD to allow for
io_combine_limit * effective_io_concurrency buffers getting read in. To
prevent the ring being much larger than useful, limit the increased size with
GetPinLimit().

The formula outlined above keeps the ring size to sizes for which we have not
observed performance regressions, unless very large effective_io_concurrency
values are used together with large shared_buffers setting.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
2025-04-08 02:41:03 -04:00
Andres Freund
dcf7e1697b Add pg_buffercache_evict_{relation,all} functions
In addition to the added functions, the pg_buffercache_evict() function now
shows whether the buffer was flushed.

pg_buffercache_evict_relation(): Evicts all shared buffers in a
relation at once.
pg_buffercache_evict_all(): Evicts all shared buffers at once.

Both functions provide mechanism to evict multiple shared buffers at
once. They are designed to address the inefficiency of repeatedly calling
pg_buffercache_evict() for each individual buffer, which can be time-consuming
when dealing with large shared buffer pools. (e.g., ~477ms vs. ~2576ms for
16GB of fully populated shared buffers).

These functions are intended for developer testing and debugging
purposes and are available to superusers only.

Minimal tests for the new functions are included. Also, there was no test for
pg_buffercache_evict(), test for this added too.

No new extension version is needed, as it was already increased this release
by ba2a3c2302f.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
2025-04-08 02:19:32 -04:00
David Rowley
d69d45a5a9 Speedup child EquivalenceMember lookup in planner
When planning queries to partitioned tables, we clone all
EquivalenceMembers belonging to the partitioned table into em_is_child
EquivalenceMembers for each non-pruned partition.  For partitioned tables
with large numbers of partitions, this meant the ec_members list could
become large and code searching that list would become slow.  Effectively,
the more partitions which were present, the more searches needed to be
performed for operations such as find_ec_member_matching_expr() during
create_plan() and the more partitions present, the longer these searches
would take, i.e., a quadratic slowdown.

To fix this, here we adjust how we store EquivalenceMembers for
em_is_child members.  Instead of storing these directly in ec_members,
these are now stored in a new array of Lists in the EquivalenceClass,
which is indexed by the relid.  When we want to find EquivalenceMembers
belonging to a certain child relation, we can narrow the search to the
array element for that relation.

To make EquivalenceMember lookup easier and to reduce the amount of code
change, this commit provides a pair of functions to allow iteration over
the EquivalenceMembers of an EC which also handles finding the child
members, if required.  Callers that never need to look at child members
can remain using the foreach loop over ec_members, which will now often
be faster due to only parent-level members being stored there.

The actual performance increases here are highly dependent on the number
of partitions and the query being planned.  Performance increases can be
visible with as few as 8 partitions, but the speedup is marginal for
such low numbers of partitions.  The speedups become much more visible
with a few dozen to hundreds of partitions.  With some tested queries
using 56 partitions, the planner was around 3x faster than before.  For
use cases with thousands of partitions, these are likely to become
significantly faster.  Some testing has shown planner speedups of 60x or
more with 8192 partitions.

Author: Yuya Watari <watari.yuya@gmail.com>
Co-authored-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Tested-by: Thom Brown <thom@linux.com>
Tested-by: newtglobal postgresql_contributors <postgresql_contributors@newtglobalcorp.com>
Discussion: https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com
2025-04-08 18:09:57 +12:00
Amit Kapila
105b2cb336 Stabilize 035_standby_logical_decoding.pl.
Some tests try to invalidate logical slots on the standby server by
running VACUUM on the primary. The problem is that xl_running_xacts was
getting generated and replayed before the VACUUM command, leading to the
advancement of the active slot's catalog_xmin. Due to this, active slots
were not getting invalidated, leading to test failures.

We fix it by skipping the generation of xl_running_xacts for the required
tests with the help of injection points. As the required interface for
injection points was not present in back branches, we fixed the failing
tests in them by disallowing the slot to become active for the required
cases (where rows_removed conflict could be generated).

Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16, where it was introduced
Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal
2025-04-08 09:38:02 +05:30
Bruce Momjian
46b4ba533c Fix PG 17 [NOT] NULL optimization bug for domains
A PG 17 optimization allowed columns with NOT NULL constraints to skip
table scans for IS NULL queries, and to skip IS NOT NULL checks for IS
NOT NULL queries.  This didn't work for domain types, since domain types
don't follow the IS NULL/IS NOT NULL constraint logic.  To fix, disable
this optimization for domains for PG 17+.

Reported-by: Jan Behrens

Diagnosed-by: Tom Lane

Discussion: https://postgr.es/m/Z37p0paENWWUarj-@momjian.us

Backpatch-through: 17
2025-04-07 21:33:42 -04:00
Michael Paquier
039549d70f Flush the IO statistics of active WAL senders more frequently
WAL senders do not flush their statistics until they exit, limiting the
monitoring possible for live processes.  This is penalizing when WAL
senders are running for a long time, like in streaming or logical
replication setups, because it is not possible to know the amount of IO
they generate while running.

This commit makes WAL senders more aggressive with their statistics
flush, using an internal of 1 second, with the flush timing calculated
based on the existing GetCurrentTimestamp() done before the sleeps done
to wait for some activity.  Note that the sleep done for logical and
physical WAL senders happens in two different code paths, so the stats
flushes need to happen in these two places.

One test is added for the physical WAL sender case, and one for the
logical WAL sender case.  This can be done in a stable fashion by
relying on the WAL generated by the TAP tests in combination with a
stats reset while a server is running, but only on HEAD as WAL data has
been added to pg_stat_io in a051e71e28a1.

This issue exists since a9c70b46dbe and the introduction of pg_stat_io,
so backpatch down to v16.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/Z73IsKBceoVd4t55@ip-10-97-1-34.eu-west-3.compute.internal
Backpatch-through: 16
2025-04-08 07:57:19 +09:00
Tomas Vondra
ba2a3c2302 Add pg_buffercache_numa view with NUMA node info
Introduces a new view pg_buffercache_numa, showing NUMA memory nodes
for individual buffers. For each buffer the view returns an entry for
each memory page, with the associated NUMA node.

The database blocks and OS memory pages may have different size - the
default block size is 8KB, while the memory page is 4K (on x86). But
other combinations are possible, depending on configure parameters,
platform, etc. This means buffers may overlap with multiple memory
pages, each associated with a different NUMA node.

To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.

The view may be relatively expensive, especially when accessed for the
first time in a backend, as it touches all memory pages to get reliable
information about the NUMA node. This may also force allocation of the
shared memory.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
2025-04-07 23:08:17 +02:00
Tomas Vondra
8cc139bec3 Introduce pg_shmem_allocations_numa view
Introduce new pg_shmem_alloctions_numa view with information about how
shared memory is distributed across NUMA nodes. For each shared memory
segment, the view returns one row for each NUMA node backing it, with
the total amount of memory allocated from that node.

The view may be relatively expensive, especially when executed for the
first time in a backend, as it has to touch all memory pages to get
reliable information about the NUMA node. This may also force allocation
of the shared memory.

Unlike pg_shmem_allocations, the view does not show anonymous shared
memory allocations. It also does not show memory allocated using the
dynamic shared memory infrastructure.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
2025-04-07 23:08:17 +02:00
Tomas Vondra
65c298f61f Add support for basic NUMA awareness
Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.

A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with the NUMA library.

The main function introduced is pg_numa_query_pages(), which allows
determining the NUMA node for individual memory pages. Internally the
function uses move_pages(2) syscall, as it allows batching, and is more
efficient than get_mempolicy(2).

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
2025-04-07 23:08:17 +02:00
Álvaro Herrera
17bcf4f545
Use specific collation where needed in new test
Oversight in commit a379061a22a8.

Per Czech buildfarm members jay and hippopotamus.
2025-04-07 21:58:06 +02:00
Tom Lane
8cfbdf8f4d Fix some issues in contrib/spi/refint.c.
check_foreign_key incorrectly used a single cache entry for its saved
plans for a 'c' (cascade) trigger, although there are two different
queries to execute depending on whether it fires for an update or a
delete.  This caused the wrong things to be done if both types of
event occur in one session.  (This was indeed visible in the triggers
regression test, but apparently nobody ever questioned it.)  To fix,
add the operation type to the cache key.

Its debug log output failed to distinguish update from delete
events, too.

Also, change the intended trigger usage from BEFORE ROW to AFTER ROW,
and add checks insisting on that usage.  BEFORE is really rather
unsafe, since if there are other BEFORE triggers they might change or
cancel the operation we are trying to check.  AFTER triggers are the
standard way to propagate changes to other rows, so we should follow
that way here.

In passing, remove a useless duplicate lookup of the cache entry.

This code is mostly intended as a documentation example, so we
won't consider a back-patch.

Author: Dmitrii Bondar <d.bondar@postgrespro.ru>
Reviewed-by: Paul Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Lilian Ontowhee <ontowhee@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/79755a2b18ed4fe5e29da6a87a1e00d1@postgrespro.ru
2025-04-07 15:54:16 -04:00
Andres Freund
8e293e689b aio: Make AIO more compatible with valgrind
In some edge cases valgrind flags issues with the memory referenced by
IOs. All of the cases addressed in this change are false positives.

Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer
data as inaccessible. This happens even though the AIO subsystem still holds a
pin. That's good, there shouldn't be accesses to the buffer outside of AIO
related code until it is pinned by "user" code again. But it requires some
explicit work - if the buffer is not pinned by the current backend, we need to
explicitly mark the buffer data accessible/inaccessible while executing
completion callbacks.

That however causes a cascading issue in IO workers: After the completion
callbacks for a buffer is executed, the page is marked as inaccessible. If
subsequently the same worker is executing IO targeting the same buffer, we
would get an error, as the memory is still marked inaccessible. To avoid that,
we need to explicitly mark the memory as accessible in IO workers.

Another issue is that IO executed in workers or via io_uring will not mark
memory as DEFINED. In the case of workers that is because valgrind does not
track memory definedness across processes. For io_uring that is because
valgrind does not understand io_uring, and therefore its IOs never mark memory
as defined, whether the completions are processed in the defining process or
in another context.  It's not entirely clear how to best solve that. The
current user of AIO is not affected, as it explicitly marks buffers as DEFINED
& NOACCESS anyway.  Defer solving this issue until we have a user with
different needs.

Per buildfarm animal skink.

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
2025-04-07 15:20:30 -04:00
Andres Freund
8ab4241b9f localbuf: Add Valgrind buffer access instrumentation
This mirrors 1e0dfd166b3 (+ 46ef520b9566), for temporary table buffers. This
is mainly interesting right now because the AIO work currently triggers
spurious valgrind errors, and the fix for that is cleaner if temp buffers
behave the same as shared buffers.

This requires one change beyond the annotations themselves, namely to pin
local buffers while writing them out in FlushRelationBuffers().

Reviewed-by: Noah Misch <noah@leadboat.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6
2025-04-07 15:20:30 -04:00
Masahiko Sawada
a13d49014d doc: Fix a typo in pg_recvlogical documentation.
Oversight in cf2655a9029a.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/OS3PR01MB5718DD1466E2B9043448AE5094AA2@OS3PR01MB5718.jpnprd01.prod.outlook.com
2025-04-07 12:13:08 -07:00
Tom Lane
969ab9d4f5 Follow-up fixes for SHA-2 patch (commit 749a9e20c).
This changes the check for valid characters in the salt string to
only allow plain ASCII letters and digits.  The previous coding was
locale-dependent which doesn't really seem like a great idea here;
moreover it could not work correctly in multibyte encodings.

This fixes a careless pointer-use-after-pfree, too.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Andres Freund <andres@anarazel.de>
Author: Bernd Helmle <mailings@oopsware.de>
Discussion: https://postgr.es/m/6fab35422df6b6b9727fdcc243c5fa1c667dd3b5.camel@oopsware.de
2025-04-07 14:14:28 -04:00
Tom Lane
b73e6d71a8 Fix erroneous construction of functions' dependencies on transforms.
The list of transform objects that a function should use is specified
in CREATE FUNCTION's TRANSFORM clause, and then represented indirectly
in pg_proc.protrftypes.  However, ProcedureCreate completely ignored
that for purposes of constructing pg_depend entries, and instead made
the function depend on any transforms that exist for its parameter or
return data types.  This is bad in both directions: the function could
be made dependent on a transform it does not actually use, or it
could try to use a transform that's since been dropped.  (The latter
scenario would require use of a transform that's not for any of the
parameter or return types, but that seems legit for cases where the
function performs SQL operations internally.)

To fix, pass in the list of transform objects that CreateFunction
identified, and build pg_depend entries from that not from the
parameter/return types.  This results in changes in the expected
test outputs in contrib/bool_plperl, which I guess are due to
different ordering of pg_depend entries -- that test case is
surely not exercising either of the problem scenarios.

This fix is not back-patchable as-is: changing the signature of
ProcedureCreate seems too risky in stable branches.  We could
do something like making ProcedureCreate a wrapper around
ProcedureCreateExt or so.  However, I'm more inclined to do
nothing in the back branches.  We had no field complaints up to
now, so the hazards don't seem to be a big issue in practice.
And we couldn't do anything about existing pg_depend entries,
so a back-patched fix would result in a mishmash of dependencies
created according to different rules.  That cure could be worse
than the disease, perhaps.

I bumped catversion just to lay down a marker that the expected
contents of pg_depend are a bit different than before.

Reported-by: Chapman Flack <jcflack@acm.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3112950.1743984111@sss.pgh.pa.us
2025-04-07 13:31:37 -04:00
Álvaro Herrera
a379061a22
Allow NOT NULL constraints to be added as NOT VALID
This allows them to be added without scanning the table, and validating
them afterwards without holding access exclusive lock on the table after
any violating rows have been deleted or fixed.

Doing ALTER TABLE ... SET NOT NULL for a column that has an invalid
not-null constraint validates that constraint.  ALTER TABLE .. VALIDATE
CONSTRAINT is also supported.  There are various checks on whether an
invalid constraint is allowed in a child table when the parent table has
a valid constraint; this should match what we do for enforced/not
enforced constraints.

pg_attribute.attnotnull is now only an indicator for whether a not-null
constraint exists for the column; whether it's valid or invalid must be
queried in pg_constraint.  Applications can continue to query
pg_attribute.attnotnull as before, but now it's possible that NULL rows
are present in the column even when that's set to true.

For backend internal purposes, we cache the nullability status in
CompactAttribute->attnullability that each tuple descriptor carries
(replacing CompactAttribute.attnotnull, which was a mirror of
Form_pg_attribute.attnotnull).  During the initial tuple descriptor
creation, based on the pg_attribute scan, we set this to UNRESTRICTED if
pg_attribute.attnotnull is false, or to UNKNOWN if it's true; then we
update the latter to VALID or INVALID depending on the pg_constraint
scan.  This flag is also copied when tupledescs are copied.

Comparing tuple descs for equality must also compare the
CompactAttribute.attnullability flag and return false in case of a
mismatch.

pg_dump deals with these constraints by storing the OIDs of invalid
not-null constraints in a separate array, and running a query to obtain
their properties.  The regular table creation SQL omits them entirely.
They are then dealt with in the same way as "separate" CHECK
constraints, and dumped after the data has been loaded.  Because no
additional pg_dump infrastructure was required, we don't bump its
version number.

I decided not to bump catversion either, because the old catalog state
works perfectly in the new world.  (Trying to run with new catalog state
and the old server version would likely run into issues, however.)

System catalogs do not support invalid not-null constraints (because
commit 14e87ffa5c54 didn't allow them to have pg_constraint rows
anyway.)

Author: Rushabh Lathia <rushabh.lathia@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Tested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAGPqQf0KitkNack4F5CFkFi-9Dqvp29Ro=EpcWt=4_hs-Rt+bQ@mail.gmail.com
2025-04-07 19:19:50 +02:00
Andrew Dunstan
b52a4a5f28 Clean up error messages from 1495eff7bdb
Quote file names, and mostly avoid hard coded file names. Along the way
make a few other minor improvements.

Discussion: https://postgr.es/m/20250407.152721.1397761902317499205.horikyota.ntt@gmail.com
2025-04-07 12:22:41 -04:00
Tom Lane
3516ea768c Add local-address escape "%L" to log_line_prefix.
This escape shows the numeric server IP address that the client
has connected to.  Unix-socket connections will show "[local]".
Non-client processes (e.g. background processes) will show "[none]".

We expect that this option will be of interest to only a fairly
small number of users.  Therefore the implementation is optimized
for the case where it's not used (that is, we don't do the string
conversion until we have to), and we've not added the field to
csvlog or jsonlog formats.

Author: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: David Steele <david@pgmasters.net>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAKAnmmK-U+UicE-qbNU23K--Q5XTLdM6bj+gbkZBZkjyjrd3Ow@mail.gmail.com
2025-04-07 11:06:05 -04:00
Andrew Dunstan
8f5e419484 Revert "Use workaround of __builtin_setjmp only on MINGW on MSVCRT"
This reverts commit c313fa4602defe1be947370ab5b217ca163a1e3c.

This is found to cause issues on x86_64 Windows even when using UCRT.

Discussion: https://postgr.es/m/3312149.1744001936@sss.pgh.pa.us
2025-04-07 11:01:15 -04:00
Andres Freund
8ce79483dc read_stream: Fix overflow hazard with large shared buffers
If the limit returned by GetAdditionalPinLimit() is large, the buffer_limit
variable in read_stream_start_pending_read() can overflow. While the code is
careful to limit buffer_limit PG_INT16_MAX, we subsequently add the number of
forwarded buffers.

The overflow can lead to assertion failures, crashes or wrong query results
when using large shared buffers.

It seems easier to avoid this if we make the buffer_limit variable an int,
instead of an int16.  Do so, and clamp buffer_limit after adding the number of
forwarded buffers.

It's possible we might want to address this and related issues more widely by
changing to int instead of int16 more widely, but since the consequences of
this bug can be confusing, it seems better to fix it now.

This bug was introduced in ed0b87caaca.

Discussion: https://postgr.es/m/ewvz3cbtlhrwqk7h6ca6cctiqh7r64ol3pzb3iyjycn2r5nxk5@tnhw3a5zatlr
2025-04-07 09:45:00 -04:00
Alexander Korotkov
717d0e8dd9 Remove GUC_NOT_IN_SAMPLE from enable_self_join_elimination
fc069a3a6319 implements Self-Join Elimination (SJE) and provides a new GUC
variable: enable_self_join_elimination.  This new GUC variable was marked
as GUC_NOT_IN_SAMPLE.  However, enable_self_join_elimination is documented
and is not different from any other enable_* GUCs.  Thus, remove
GUC_NOT_IN_SAMPLE from it and add it to the postgresql.conf.sample.

Discussion: https://postgr.es/m/CAPpHfdsqMTEsmxk3aQwt6xPz%2BKpUELO%3D6fzmER9ZRGrbs4uMfA%40mail.gmail.com
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
2025-04-07 16:28:54 +03:00
Daniel Gustafsson
ae60947643 psql: Clarify help message for WATCH_INTERVAL
The help message for WATCH_INTERVAL was hard to interpret and didn't
follow the style of other messages, this updates it to nake it fit in
better and be easier to interpret.

Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Discussion: https://postgr.es/m/20250326.120732.1167093737847500721.horikyota.ntt@gmail.com
2025-04-07 13:44:58 +02:00
Michael Paquier
d6f118444d Fix grammar in log message of pg_restore.c
Introduced by 1495eff7bdb0.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20250407.151359.72428746612514925.horikyota.ntt@gmail.com
2025-04-07 15:37:34 +09:00
Michael Paquier
2c7bd2ba50 libpq: Fix some issues in TAP tests for service files
The valid service file was not correctly shaped, as append_to_file() was
called with an array as input.  This is changed so as the parameter and
value pairs from the valid connection string are appended to the valid
service file one by one.

Even with the first issue fixed, the tests should fail.  However, they
have been passing because all the connection attempts relied on the
default values given to PGPORT and PGHOST from the node when using
Cluster.pm's connect_ok() and connect_fails(), rather than the data in
the service file.  The test is updated to use an interesting trick: a
dummy node is initialized but not started, and all the connection
attempts are done through it.  This ensures that the data inside the
service file is used for all the connection tests.  Note that breaking
the contents of the valid service file on purpose makes all the tests
that rely on it fail.

Issues introduced by 72c2f36d5727.

Author: Andrew Jackson <andrewjackson947@gmail.com>
Discussion: https://postgr.es/m/CAKK5BkG_6_YSaebM6gG=8EuKaY7_VX1RFgYeySuwFPh8FZY73g@mail.gmail.com
2025-04-07 12:55:09 +09:00
Michael Paquier
c36eda2591 Clarify comment for worst-case allocation in quote_literal_cstr()
palloc() is invoked with a specific formula for its allocation size in
quote_literal_cstr().  This wastes some memory, but the size is large
enough to cover even the worst-case scenarios.

No explanations were given about the reasons behind these numbers.  This
commit adds more documentation about all that.

Author: Steve Chavez <steve@supabase.io>
Discussion: https://postgr.es/m/CAGRrpzZ9bToRWS+fAnjxDJrxwZN1QcJ-y1Pn2yg=Hst6rydLtw@mail.gmail.com
2025-04-07 10:02:12 +09:00
Michael Paquier
3191a593d6 Fix use-after-free in pgstat_fetch_stat_backend_by_pid()
stats_fetch_consistency set to "snapshot" causes the backend entry
"beentry" retrieved by pgstat_get_beentry_by_proc_number() to be reset
at the beginning of pgstat_fetch_stat_backend() when fetching the
backend pgstats entry.  As coded, "beentry" was being accessed after
being freed.  This commit moves all the accesses to "beentry" to happen
before calling pgstat_fetch_stat_backend(), fixing the problem.

This problem could be reached by calling the SQL functions
pg_stat_get_backend_io() or pg_stat_get_backend_wal().

Issue caught by valgrind.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/f1788cc0-253a-4a3a-aee0-1b8ab9538736@gmail.com
2025-04-07 09:51:40 +09:00
Fujii Masao
173c97812f Use XLOG_CONTROL_FILE macro consistently for control file name.
The XLOG_CONTROL_FILE macro (defined in access/xlog_internal.h)
represents the control file name. While some parts of the codebase already
use this macro, others previously hardcoded the file name as a string.

This commit replaces those hardcoded strings with the macro,
ensuring consistent usage throughout the code. This makes future
maintenance easier and improves searchability, for example when
grepping for control file usage.

Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Masao Fujii <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/0841ec77-47e5-452a-adb4-c6fa55d605fc@postgrespro.ru
2025-04-07 09:27:33 +09:00
Daniel Gustafsson
a233a603ba doc: Clarify project naming
Clarify the project naming in the history section of the docs
to match the recent license preamble changes.

Backpatch to all supported versions.

Author: Dave Page <dpage@pgadmin.org>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CA+OCxozLzK2+Jc14XZyWXSp6L9Ot+3efwXUE35FJG=fsbib2EA@mail.gmail.com
Backpatch-through: 13
2025-04-07 00:03:18 +02:00
Andrew Dunstan
643a1a6198 Clean up checking for pg_dumpall output directory
Coverity objected to the original code, and in any case this is much
cleaner, using the existing routine pg_check_dir() instead of rolling
its own test.

Per suggestion from Tom Lane.
2025-04-06 17:04:58 -04:00
Tom Lane
218ab68275 Doc: fix PDF "contents ... exceed the available area" warnings.
Tweak column widths in a new table, similarly to some previous
fixes such as b62381d9a.

Per buildfarm.
2025-04-06 16:27:39 -04:00
Nathan Bossart
de48056ec7 pg_upgrade: Fix memory leak in check_for_unicode_update().
This function was initializing the "task" variable before a couple
of early returns.  To fix, postpone the initialization until just
before it's needed.

Per Coverity.

Discussion: https://postgr.es/m/Z_KMsUH2-FEbiNjC%40nathan
2025-04-06 15:11:41 -05:00
Andres Freund
57dec20fd4 aio: Avoid spurious coverity warning
PgAioResult.result is never accessed in the relevant path, but coverity
complains about an uninitialized access anyway. So just zero-initialize the
whole thing.  While at it, reduce the scope of the variable.

Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/CAEudQApsKqd-s+fsUQ0OmxJAMHmBSXxrAz3dCs+uvqb3iRtjSw@mail.gmail.com
2025-04-06 12:07:02 -04:00
Tom Lane
8ab6ef2bb8 Fix memory leaks in px_crypt_shacrypt().
Per Coverity.  I don't think these are of any actual significance
since the function ought to be invoked in a short-lived context.
Still, if it's trying to be neat it should get it right.

Also const-ify a constant and fix up typedef formatting.
2025-04-06 11:57:22 -04:00
Tom Lane
2e4ccf1b45 Use "(void)" to mark pgstat_lock_entry(..., false) calls.
This should silence Coverity's complaints about the result being
sometimes ignored.

I'm inclined to think that these routines are simply misdesigned,
because sometimes it's okay to ignore the result and sometimes it
isn't, and we have no way to enforce the latter.  But for now
I just added a comment.
2025-04-06 11:37:09 -04:00
Andrew Dunstan
5e19154390 Avoid unnecessary copying of a string in pg_restore.c
Coverity complained about a possible overrun in the copy, but there is
no actual need to copy the string at all.
2025-04-06 09:21:09 -04:00
Andrew Dunstan
6d5417e634 Fix a couple of memory leaks in pg_restore.c
per complaint from Coverity.
2025-04-06 09:09:25 -04:00
Peter Eisentraut
a8025f5448 Relax ordering-related hardcoded btree requirements in planning
There were several places in ordering-related planning where a
requirement for btree was hardcoded but an amcanorder index could
suffice.  This fixes that.  We just need to do the necessary mapping
between strategy numbers and compare types and adjust some related
APIs so that this works independent of btree strategy numbers.  For
instance, non-btree amcanorder indexes can now be used to support
sorting and merge joins.  Also, predtest.c works independent of btree
strategy numbers now.

To avoid performance regressions, some details on btree and other
built-in index types are still hardcoded as shortcuts, but other index
types now have access to the same features by providing the required
flags and callbacks.

Author: Mark Dilger <mark.dilger@enterprisedb.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
2025-04-06 14:43:51 +02:00
Alexander Korotkov
3a1a7c5a70 Revert "Put enable_self_join_elimination into postgresql.conf.sample"
This reverts commit c2d329260cd8.

Reported-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/D292EB44-806E-439A-82A4-491A1BA59E7A%40yesql.se
2025-04-06 14:30:20 +03:00
Alexander Korotkov
c2d329260c Put enable_self_join_elimination into postgresql.conf.sample
fc069a3a6319 implements Self-Join Elimination (SJE) and provides a new
GUC variable: enable_self_join_elimination.  This commit adds
enable_self_join_elimination to the postgresql.conf.sample, as it was
forgotten in the original commit.

Discussion: https://postgr.es/m/CAHewXN%3D%2Bghd6O6im46q7j2u6c3H6vkXtXmF%3D_v4CfGSnjje8PA%40mail.gmail.com
Author: Tender Wang <tndrwang@gmail.com>
2025-04-06 13:24:16 +03:00
John Naylor
3c6e8c1238 Compute CRC32C using AVX-512 instructions where available
The previous implementation of CRC32C on x86 relied on the native
CRC32 instruction from the SSE 4.2 extension, which operates on
up to 8 bytes at a time. We can get a substantial speedup by using
carryless multiplication on SIMD registers, processing 64 bytes per
loop iteration. Shorter inputs fall back to ordinary CRC instructions.
On Intel Tiger Lake hardware (2020), CRC is now 50% faster for inputs
between 64 and 112 bytes, and 3x faster for 256 bytes.

The VPCLMULQDQ instruction on 512-bit registers has been available
on Intel hardware since 2019 and AMD since 2022. There is an older
variant for 128-bit registers, but at least on Zen 2 it performs worse
than normal CRC instructions for short inputs.

We must now do a runtime check, even for builds that target SSE
4.2. This doesn't matter in practice for WAL (arguably the most
critical case), because since commit e2809e3a1 the final computation
with the 20-byte WAL header is inlined and unrolled when targeting
that extension. Compared with two direct function calls, testing
showed equal or slightly faster performance in performing an indirect
function call on several dozen bytes followed by inlined instructions
on constant input of 20 bytes.

The MIT-licensed implementation was generated with the "generate"
program from

https://github.com/corsix/fast-crc32/

Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
Instruction" V. Gopal, E. Ozturk, et al., 2009

Co-authored-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Co-authored-by: Paul Amonson <paul.d.amonson@intel.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Matthew Sterrett <matthewsterrett2@gmail.com> (earlier version)
Tested-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Tested-by: David Rowley <<dgrowleyml@gmail.com>> (earlier version)
Discussion: https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192@BL1PR11MB5304.namprd11.prod.outlook.com
Discussion: https://postgr.es/m/PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com
2025-04-06 14:04:30 +07:00
Daniel Gustafsson
683df3f4de Quote filename in error message
Project standard is to quote filenames in error and log messages, which
commit 2da74d8d640 missed in two error messages.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/20250404.120328.103562371975971823.horikyota.ntt@gmail.com
2025-04-05 22:10:28 +02:00