An invalid toast index is skipped in reindex_relation(). These would be
remnants of a failed REINDEX CONCURRENTLY and they should never been
rebuilt as there can only be one valid toast index at a time.
REINDEX_REL_SUPPRESS_INDEX_USE, used by CLUSTER and VACUUM FULL, needs
to maintain a list of the indexes being processed. The list of indexes
is retrieved from the relation cache, and includes invalid indexes. The
code has missed that invalid toast indexes are ignored in
reindex_relation() as this leads to a hard failure in reindex_index(),
and they were left in the reindex pending list, making the list
inconsistent when rechecked. The incorrect memory access was happening
when scanning pg_class for the refresh of pg_database.datfrozenxid, when
doing a scan of pg_class.
This issue exists since REINDEX CONCURRENTLY exists, where invalid toast
indexes can exist, so backpatch all the way down.
Reported-by: Alexander Lakhin
Author: Tender Wang
Discussion: https://postgr.es/m/18630-9aed99c38830657d@postgresql.org
Backpatch-through: 12
This commit improves the catalog data in pg_proc for the three functions
for has_largeobject_privilege(), introduced in 4eada203a5:
- Fix their descriptions (typos and consistency).
- Reallocate OIDs to be within the 8000-9999 range as required by
a6417078c4.
Bump catalog version.
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/ZvUYR0V0dzWaLnsV@paquier.xyz
Creating, reindexing, and dropping an index concurrently could
entail accessing pg_index's TOAST table, which was recently added
in commit b52c4fc3c0. These code paths start and commit their own
transactions, but they do not always set an active snapshot. This
rightfully leads to assertion failures and ERRORs when trying to
access pg_index's TOAST table, such as the following:
ERROR: cannot fetch toast data without an active snapshot
To fix, push an active snapshot just before each section of code
that might require accessing pg_index's TOAST table, and pop it
shortly afterwards.
Reported-by: Alexander Lakhin
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/a97d7401-e7c9-f771-6a00-037379f0a8bb%40gmail.com
I wanted to avoid adjusting this code too much when converting
these tasks to use the new parallelization framework (see commit
40e2e5e92b), which is why this is being done as a follow-up commit.
These stylistic adjustments result in fewer lines of code and fewer
levels of indentation in some places.
While at it, add names to the UpgradeTaskSlotState enum and the
UpgradeTaskSlot struct. I'm not aware of any established project
policy in this area, but let's at least be consistent within the
same file.
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/ZunW7XHLd2uTts4f%40nathan
int_to_roman() only accepts plain "int" input, which is fine since
we're going to produce '###############' for any value above 3999
anyway. However, the numeric and int8 variants of to_char() would
throw an error if the given input exceeded the integer range, while
the float-input variants invoked undefined-per-C-standard behavior.
Fix things so that you uniformly get '###############' for out of
range input.
Also add test cases covering this code, plus the equally-untested
EEEE, V, and PL format codes.
Discussion: https://postgr.es/m/2956175.1725831136@sss.pgh.pa.us
There's no need to add another level of indentation to this status
message. pg_log() will put it in the right place.
Oversight in commit 347758b120.
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/ZunW7XHLd2uTts4f%40nathan
Backpatch-through: 17
The unwritten assumption of this code is that both events->head and
events->tail are NULL together (an empty list) or they aren't. So the
code was testing events->head for nullness and using that as a cue to
deference events->tail, which annoys the Svace static code analyzer.
We can silence it by testing events->tail member instead, and add an
assertion about events->head to ensure it's all consistent.
This code is very old and as far as we know, there's never been a bug
report related to this, so there's no need to backpatch.
This was found by the ALT Linux Team using Svace.
Author: Alexander Kuznetsov <kuznetsovam@altlinux.org>
Discussion: https://postgr.es/m/6d0323c3-3f5d-4137-af73-98a5ab90e77c@altlinux.org
Running vacuumdb with a non-superuser while another user has created a
temporary table would lead to a mid-flight permission failure,
interrupting the operation. vacuum_rel() skips temporary relations of
other backends, and it makes no sense for vacuumdb to know about these
relations, so let's switch it to ignore temporary relations entirely.
Adding a qual in the query based on relpersistence simplifies the
generation of its WHERE clause in vacuum_one_database(), per se the
removal of "has_where".
Author: VaibhaveS, Michael Paquier
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/CAM_eQjwfAR=y3G1fGyS1U9FTmc+FyJm9amNfY2QCZBnDDbNPZg@mail.gmail.com
Backpatch-through: 12
There are many grammar flavors that depend on the parse node
VariableSetStmt. This closes the gap in pg_stat_statements by providing
test coverage for what should be a large majority of them, improving more
the work begun in de2aca2885. This will be used to ease the
evaluation of a path towards more normalization of SET queries with
query jumbling.
Note that SET NAMES (grammar from the standard, synonym of SET
client_encoding) is omitted on purpose, this could use UTF8 with a
conditional script where UTF8 is supported, but that does not seem worth
the maintenance cost for the sake of these tests.
The author has submitted most of these in a TAP test (filled in any
holes I could spot), still queries in a SQL file of pg_stat_statements
is able to achieve the same goal while being easier to look at when
testing normalization patterns.
Author: Greg Sabino Mullane, Michael Paquier
Discussion: https://postgr.es/m/CAKAnmmJtJY2jzQN91=2QAD2eAJAA-Per61eyO48-TyxEg-q0Rg@mail.gmail.com
The previous commit fixed some ways of losing an inplace update. It
remained possible to lose one when a backend working toward a
heap_update() copied a tuple into memory just before inplace update of
that tuple. In catalogs eligible for inplace update, use LOCKTAG_TUPLE
to govern admission to the steps of copying an old tuple, modifying it,
and issuing heap_update(). This includes MERGE commands. To avoid
changing most of the pg_class DDL, don't require LOCKTAG_TUPLE when
holding a relation lock sufficient to exclude inplace updaters.
Back-patch to v12 (all supported versions). In v13 and v12, "UPDATE
pg_class" or "UPDATE pg_database" can still lose an inplace update. The
v14+ UPDATE fix needs commit 86dc90056d,
and it wasn't worth reimplementing that fix without such infrastructure.
Reviewed by Nitin Motiani and (in earlier versions) Heikki Linnakangas.
Discussion: https://postgr.es/m/20231027214946.79.nmisch@google.com
As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog. It could lose
the update. Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running. The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.
For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock. Back-patch to v12 (all supported
versions). In back branches, retain a deprecated heap_inplace_update(),
for extensions.
Reported by Smolkin Grigory. Reviewed by Nitin Motiani, (in earlier
versions) Heikki Linnakangas, and (in earlier versions) Alexander
Lakhin.
Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com
The current use always releases this locktag. A planned use will
continue that intent. It will involve more areas of code, making unlock
omissions easier. Warn under debug_assertions, like we do for various
resource leaks. Back-patch to v12 (all supported versions), the plan
for the commit of the new use.
Reviewed by Heikki Linnakangas.
Discussion: https://postgr.es/m/20240512232923.aa.nmisch@google.com
Commit d1379ebf4 carelessly broke printACLColumn for pre-9.4 servers,
by using the cardinality() function which we introduced in 9.4.
We expect psql's describe-related commands to work back to 9.2, so
this is bad. Use the longstanding array_length() function instead.
Per report from Christoph Berg. Back-patch to v17.
Discussion: https://postgr.es/m/ZvLXYglRS6hMMhtr@msg.df7cb.de
Ensure that error paths within these functions do not leak a collator,
and return the result rather than using an out parameter. (Error paths
in the caller may still result in a leaked collator, which will be
addressed separately.)
In make_libc_collator(), if the first newlocale() succeeds and the
second one fails, close the first locale_t object.
The function make_icu_collator() doesn't have any external callers, so
change it to be static.
Discussion: https://postgr.es/m/54d20e812bd6c3e44c10eddcd757ec494ebf1803.camel@j-davis.com
When our XML-handling modules were first written, the SQL standard
lacked any error codes that were particularly intended for XML
error conditions. Unsurprisingly, this led to some rather random
choices of errcodes in those modules. Now the standard has a whole
SQLSTATE class, "Class 10 - XQuery Error", with a reasonably large
selection of relevant-looking errcodes.
In this patch I've chosen one fairly generic code defined by the
standard, 10608 = invalid_argument_for_xquery, and used it where
it seemed appropriate. I've also made an effort to replace
ERRCODE_INTERNAL_ERROR everywhere it was not clearly reporting
a coding problem; in particular, many of the existing uses look
like they can fairly be reported as ERRCODE_OUT_OF_MEMORY.
It might be interesting to try to map libxml2's error codes into
the standard's new collection, but I've not undertaken that here.
Discussion: https://postgr.es/m/417250.1726341268@sss.pgh.pa.us
The array->scan_key references fixed up at the end of preprocessing
start out as offsets into the arrayKeyData[] array (the array returned
by _bt_preprocess_array_keys at the start of preprocessing that involves
array scan keys). Offsets into the arrayKeyData[] array are no longer
guaranteed to be valid offsets into our original scan->keyData[] input
scan key array, but comments describing the array->scan_key references
still talked about scan->keyData[]. Update those comments.
Oversight in commit b5249741.
Since autovacuum does not trigger an ANALYZE for partitioned tables,
users must perform these manually. However, performing a manual ANALYZE
on a partitioned table would always result in recursively analyzing each
partition and that could be undesirable as autovacuum takes care of that.
For partitioned tables that contain a large number of partitions, having
to analyze each partition could take an unreasonably long time, especially
so for tables with a large number of columns.
Here we allow the ONLY keyword to prefix the name of the table to allow
users to have ANALYZE skip processing partitions. This option can also
be used with VACUUM, but there is no work to do if VACUUM ONLY is used on
a partitioned table.
This commit also changes the behavior of VACUUM and ANALYZE for
inheritance parents. Previously inheritance child tables would not be
processed when operating on the parent. Now, by default we *do* operate
on the child tables. ONLY can be used to obtain the old behavior.
The release notes should note this as an incompatibility. The default
behavior has not changed for partitioned tables as these always
recursively processed the partitions.
Author: Michael Harris <harmic@gmail.com>
Discussion: https://postgr.es/m/CADofcAWATx_haD=QkSxHbnTsAe6+e0Aw8Eh4H8cXyogGvn_kOg@mail.gmail.com
Discussion: https://postgr.es/m/CADofcAXVbD0yGp_EaC9chmzsOoSai3jcfBCnyva3j0RRdRvMVA@mail.gmail.com
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Attempting these commands for a non-partitioned table would result in a
failure when creating the relation in transformPartitionCmd(). This
gives the possibility to throw an error earlier with a much better error
message, thanks to d69a3f4d70.
The extra test cases are from me. Note that FINALIZE uses a different
subcommand and it had no coverage for its failure path with
non-partitioned tables.
Author: Álvaro Herrera, Michael Paquier
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/202409190803.tnis52adt2n5@alvherre.pgsql
Coverity pointed out that inc_lex_level() would leak memory
(not to mention corrupt the pstack data structure) if some
but not all of its three REALLOC's failed. To fix, store
successfully-updated pointers back into the pstack struct
immediately.
Oversight in 0785d1b8b, so no need for back-patch.
Commit c4d5cb71d2 introduced a couple asserts in the fast-path locking
code, upsetting Coverity.
The assert in InitProcGlobal() is clearly wrong, as it assigns instead
of checking the value. This is harmless, but doesn't check anything.
The asserts in FAST_PATH_ macros are written as if for signed values,
but the macros are only called for unsigned ones. That makes the check
for (val >= 0) useless. Checks written as ((uint32) x < max) work for
both signed and unsigned values. Negative values should wrap to values
greater than INT32_MAX.
Per Coverity, report by Tom Lane.
Reported-by: Tom Lane
Discussion: https://postgr.es/m/2891628.1727019959@sss.pgh.pa.us
This commit is similar to 95d6e9af07, expanding the idea to CTE scan,
table function scan and recursive union scan nodes so that the maximum
tuplestore memory or disk usage is shown with EXPLAIN ANALYZE command.
Also adjust show_storage_info() so that it accepts storage type and
storage size arguments instead of Tuplestorestate. This allows the
node types to share the formatting code using show_storage_info(). Due
to this show_material_info() and show_windowagg_info() are also
modified.
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/20240918.211246.1127161704188186085.ishii%40postgresql.org
pg_authid's only varlena column is rolpassword, which unfortunately
cannot be de-TOASTed during authentication because we haven't
selected a database yet and cannot read pg_class. By removing
pg_authid's TOAST table, attempts to set password hashes that
require out-of-line storage will fail with a "row is too big"
error instead. We may want to provide a more user-friendly error
in the future, but for now let's just remove the useless TOAST
table.
Bumps catversion.
Reported-by: Alexander Lakhin
Reviewed-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/89e8649c-eb74-db25-7945-6d6b23992394%40gmail.com
Replace the fixed-size array of fast-path locks with arrays, sized on
startup based on max_locks_per_transaction. This allows using fast-path
locking for workloads that need more locks.
The fast-path locking introduced in 9.2 allowed each backend to acquire
a small number (16) of weak relation locks cheaply. If a backend needs
to hold more locks, it has to insert them into the shared lock table.
This is considerably more expensive, and may be subject to contention
(especially on many-core systems).
The limit of 16 fast-path locks was always rather low, because we have
to lock all relations - not just tables, but also indexes, views, etc.
For planning we need to lock all relations that might be used in the
plan, not just those that actually get used in the final plan. So even
with rather simple queries and schemas, we often need significantly more
than 16 locks.
As partitioning gets used more widely, and the number of partitions
increases, this limit is trivial to hit. Complex queries may easily use
hundreds or even thousands of locks. For workloads doing a lot of I/O
this is not noticeable, but for workloads accessing only data in RAM,
the access to the shared lock table may be a serious issue.
This commit removes the hard-coded limit of the number of fast-path
locks. Instead, the size of the fast-path arrays is calculated at
startup, and can be set much higher than the original 16-lock limit.
The overall fast-path locking protocol remains unchanged.
The variable-sized fast-path arrays can no longer be part of PGPROC, but
are allocated as a separate chunk of shared memory and then references
from the PGPROC entries.
The fast-path slots are organized as a 16-way set associative cache. You
can imagine it as a hash table of 16-slot "groups". Each relation is
mapped to exactly one group using hash(relid), and the group is then
processed using linear search, just like the original fast-path cache.
With only 16 entries this is cheap, with good locality.
Treating this as a simple hash table with open addressing would not be
efficient, especially once the hash table gets almost full. The usual
remedy is to grow the table, but we can't do that here easily. The
access would also be more random, with worse locality.
The fast-path arrays are sized using the max_locks_per_transaction GUC.
We try to have enough capacity for the number of locks specified in the
GUC, using the traditional 2^n formula, with an upper limit of 1024 lock
groups (i.e. 16k locks). The default value of max_locks_per_transaction
is 64, which means those instances will have 64 fast-path slots.
The main purpose of the max_locks_per_transaction GUC is to size the
shared lock table. It is often set to the "average" number of locks
needed by backends, with some backends using significantly more locks.
This should not be a major issue, however. Some backens may have to
insert locks into the shared lock table, but there can't be too many of
them, limiting the contention.
The only solution is to increase the GUC, even if the shared lock table
already has sufficient capacity. That is not free, especially in terms
of memory usage (the shared lock table entries are fairly large). It
should only happen on machines with plenty of memory, though.
In the future we may consider a separate GUC for the number of fast-path
slots, but let's try without one first.
Reviewed-by: Robert Haas, Jakub Wartak
Discussion: https://postgr.es/m/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
Teach _bt_preprocess_array_keys to eliminate redundant array equality
scan keys directly, rather than just marking them as redundant. Its
_bt_preprocess_keys caller is no longer required to ignore input scan
keys that were marked redundant in this way. Oversights like the one
fixed by commit f22e17f7 are no longer possible.
The new scheme also makes it easier for _bt_preprocess_keys to output a
so.keyData[] scan key array with _more_ scan keys than it was passed in
its scan.keyData[] input scan key array. An upcoming patch that adds
skip scan optimizations to nbtree will take advantage of this.
In passing, remove and rename certain _bt_preprocess_keys variables to
make the difference between our input scan key array and our output scan
key array clearer.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
table_beginscan_parallel and index_beginscan_parallel contain
Asserts checking that the relation a worker will use in
a parallel scan is the same one the leader intended. However,
they were checking for relation OID match, which was not strong
enough to detect the mismatch problem fixed in 126ec0bc7.
What would be strong enough is to compare relfilenodes instead.
Arguably, that's a saner definition anyway, since a scan surely
operates on a physical relation not a logical one. Hence,
store and compare RelFileLocators not relation OIDs. Also
ensure that index_beginscan_parallel checks the index identity
not just the table identity.
Discussion: https://postgr.es/m/2127254.1726789524@sss.pgh.pa.us
Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made
parallel index scans work with the new design for arrays via explicit
scheduling of primitive index scans. Under this scheme a parallel index
scan with array keys will perform the same number of index descents as
an equivalent serial index scan (barring corner cases where an
individual parallel worker discovers that it can advance the scan's
array keys without anybody needing to perform another descent of the
index to get to the relevant page on the leaf level).
Despite all this, the pgstats accounting wasn't updated; it continued to
increment the total number of index scans for the rel once per _bt_first
call, no matter the details. As a result, the number of (primitive)
index scans could be over-counted during parallel scans.
To fix, delay incrementing the count of index scans until after we've
established that another descent of the index (using either _bt_search
or _bt_endpoint) is required. That way pg_stat_user_tables.idx_scan
always advances in the same way, regardless of whether or not the scan
makes use of parallelism.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=E7XrkvscBN0U6V81NK3Q-dQOmivvbEsjG-zwEfDdFpg@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
We need to do RestoreRelationMap before loading catalog-derived
state, else the worker may end up with catalog relcache entries
containing stale relfilenode data. Move up RestoreReindexState
too, on the principle that that should also happen before we
do much of any catalog access.
I think ideally these things would happen even before InitPostgres,
but there are various problems standing in the way of that, notably
that the relmapper thinks "active" mappings should be discarded at
transaction end. The implication of this is that InitPostgres and
RestoreLibraryState will see the same catalog state as an independent
backend would see, which is probably fine; at least, it's been like
that all along.
Per report from Justin Pryzby. There is a case to be made that
this should be back-patched. But given the lack of complaints
before 6e086fa2e and the short amount of time remaining before
17.0 wraps, I'll just put it in HEAD for now.
Discussion: https://postgr.es/m/ZuoU_8EbSTE14o1U@pryzbyj2023
The implementation assumes that on multiple calls of these meta-commands
the last one wins. Multiple \g calls in-between mean multiple
executions.
There were no tests to check these properties, hence let's add
something.
Author: Jelte Fennema-Nio, Michael Paquier
Discussion: https://postgr.es/m/CAGECzQSTE7CoM=Gst56Xj8pOvjaPr09+7jjtWqTC40pGETyAuA@mail.gmail.com
FYI, during PDF builds, this link type generates a "Unresolved ID
reference found" warning because it is suppressed from the PDF output.
Backpatch-through: 12
This commit moves pg_wal_replay_wait() procedure to be a neighbor of
WAL-related functions in xlogfuncs.c. The implementation of LSN waiting
continues to reside in the same place.
By proposal from Michael Paquier.
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/18c0fa64-0475-415e-a1bd-665d922c5201%40eisentraut.org
This fixes a couple of issues with the psql meta-commands mentioned
above when called repeatedly:
- The statement name is reset for each call. If a command errors out,
its send_mode would still be set, causing an incorrect path to be taken
when processing a query. For \bind_named, this could trigger an
assertion failure as a statement name is always expected for this
meta-command. This issue has been introduced by d55322b0da.
- The memory allocated for bind parameters can be leaked. This is a bug
enlarged by d55322b0da that exists since 5b66de3433, as it is also
possible to leak memory with \bind in v16 and v17. This requires a fix
that will be done on the affected branches separately. This issue is
taken care of here for HEAD.
This patch tightens the cleanup of the state used for the extended
protocol meta-commands (bind parameters, send mode, statement name) by
doing it before running each meta-command on top of doing it once a
query has been processed, avoiding any leaks and the inconsistencies
when mixing calls, by refactoring the cleanup in a single routine used
in all the code paths where this step is required.
Reported-by: Alexander Lakhin
Author: Anthonin Bonnefoy
Discussion: https://postgr.es/m/2e5b89af-a351-ff0a-000c-037ac28314ab@gmail.com
Partitioned tables and normal tables have been relying on ATT_TABLE in
ATSimplePermissions() to produce error messages that depend on the
relation's relkind, because both relkinds currently support the same set
of ALTER TABLE subcommands.
A patch to restrict SET LOGGED/UNLOGGED for partitioned tables is under
discussion, and introducing ATT_PARTITIONED_TABLE makes subcommand
restrictions for partitioned tables easier to deal with, so let's add
one. There is no functional change.
Author: Michael Paquier
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/Zt6cDnwSvnuLLnak@paquier.xyz
nodeRecursiveunion.c makes use of two tuplestores and, until now, would
delete and recreate one of these tuplestores after every recursive
iteration.
Here we adjust that behavior and instead reuse one of the existing
tuplestores and just empty it of all tuples using tuplestore_clear().
This saves some free/malloc roundtrips and has shown a 25-30% performance
improvement for queries that perform very little work between recursive
iterations.
This also paves the way to add some EXPLAIN ANALYZE telemetry output for
recursive common table expressions, similar to what was done in 1eff8279d
and 95d6e9af0. Previously calling tuplestore_end() would have caused
the maximum storage space used to be lost.
Reviewed-by: Tatsuo Ishii
Discussion: https://postgr.es/m/CAApHDvr9yW0YRiK8A2J7nvyT8g17YzbSfOviEWrghazKZbHbig@mail.gmail.com