need be returned. We keep a heap of the current best N tuples and sift-up
new tuples into it as we scan the input. For M input tuples this means
only about M*log(N) comparisons instead of M*log(M), not to mention a lot
less workspace when N is small --- avoiding spill-to-disk for large M
is actually the most attractive thing about it. Patch includes planner
and executor support for invoking this facility in ORDER BY ... LIMIT
queries. Greg Stark, with some editorialization by moi.
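A minimal sketch of the bounded-heap idea, using plain int keys instead of tuples (illustrative only, not the tuplesort.c code; TopN and topn_insert are made up for the example): keep the best N keys seen so far in a heap ordered worst-first, and when a new key beats the current worst, replace the root and restore the heap in O(log N).

    /* Illustrative bounded top-N selection (smallest N keys), not tuplesort.c.
     * The heap is a max-heap, so heap[0] is the worst of the current best N. */
    typedef struct TopN
    {
        int    *heap;
        int     nslots;     /* N */
        int     count;      /* entries currently held */
    } TopN;

    static void
    swap_ints(int *a, int *b)
    {
        int     t = *a;

        *a = *b;
        *b = t;
    }

    /* Offer one input key; costs O(log N) comparisons at worst. */
    static void
    topn_insert(TopN *t, int key)
    {
        int     i;

        if (t->count < t->nslots)
        {
            /* heap not full yet: append and sift up */
            i = t->count++;
            t->heap[i] = key;
            while (i > 0 && t->heap[(i - 1) / 2] < t->heap[i])
            {
                swap_ints(&t->heap[i], &t->heap[(i - 1) / 2]);
                i = (i - 1) / 2;
            }
        }
        else if (key < t->heap[0])
        {
            /* beats the current worst: replace the root and sift down */
            t->heap[0] = key;
            i = 0;
            for (;;)
            {
                int     child = 2 * i + 1;

                if (child >= t->count)
                    break;
                if (child + 1 < t->count && t->heap[child + 1] > t->heap[child])
                    child++;
                if (t->heap[i] >= t->heap[child])
                    break;
                swap_ints(&t->heap[i], &t->heap[child]);
                i = child;
            }
        }
    }
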
pages it intends to zero immediately. Just to show there is some use for that
function besides WAL recovery :-).
Along the way, fold _hash_checkpage and _hash_pageinit calls into _hash_getbuf
and friends, instead of expecting callers to do that separately.
for local symbols that shouldn't be exported. This patch excludes them,
cutting the number of exported symbols by about 10,000 and decreasing the
binary size by 20%.
ReadOrZeroBuffer to fetch pages from beyond physical EOF. This would
usually work, but would cause problems for md.c if writes occurred
beyond a segment boundary when the previous segment file hadn't been
fully extended.
from the WAL data, don't bother to physically read it; just have bufmgr.c
return a zeroed-out buffer instead. This speeds recovery significantly,
and also avoids unnecessary failures when a page-to-be-overwritten has corrupt
page headers on disk. This replaces a former kluge that accomplished the
latter by pretending zero_damaged_pages was always ON during WAL recovery;
which was OK when the kluge was put in, but is unsafe when restoring a WAL
log that was written with full_page_writes off.
Heikki Linnakangas
true at the very end of its processing, the update is broadcast via a
shared-cache-inval message for the index; without this, existing backends that
already have relcache entries for the index might never see it become valid.
Also, force a relcache inval on the index's parent table at the same time,
so that any cached plans for that table are re-planned; this ensures that
the newly valid index will be used if appropriate. Aside from making
C.I.C. behave more reasonably, this is necessary infrastructure for some
aspects of the HOT patch. Pavan Deolasee, with a little further stuff from
me.
and TimestampDifference, to make the coding clearer. I think this should also
fix the failure to start workers on platforms with low-resolution timers, as
reported by Itagaki Takahiro.
isn't any place to throw the error to. If so, we should treat the error
as FATAL, just as we would have if it'd been thrown outside the PG_TRY
block to begin with.
Although this is clearly a *potential* source of bugs, it is not clear
at the moment whether it is an *actual* source of bugs; there may not
presently be any PG_TRY blocks in code that can be reached with no outer
longjmp catcher. So for the moment I'm going to be conservative and not
back-patch this. The change breaks ABI for users of PG_RE_THROW and hence
might create compatibility problems for loadable modules, so we should not
put it into released branches without proof that it's needed.
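For context, the pattern in question looks roughly like this (illustrative backend-style code; risky_operation and cleanup_resources are hypothetical):

    #include "postgres.h"

    extern void risky_operation(void);
    extern void cleanup_resources(void);

    static void
    guarded_call(void)
    {
        PG_TRY();
        {
            risky_operation();
        }
        PG_CATCH();
        {
            /* partial cleanup, then re-raise for an outer handler */
            cleanup_resources();
            PG_RE_THROW();      /* with this change, re-throwing when there is
                                 * no outer setjmp point ends in FATAL instead
                                 * of jumping through a NULL PG_exception_stack */
        }
        PG_END_TRY();
    }
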
wrong thing when inlining polymorphic SQL functions, because it was using the
function's declared return type where it should have used the actual result
type of the current call. In 8.1 and 8.2 this causes obvious failures even if
you don't have assertions turned on; in 8.0 and 7.4 it would only be a problem
if the inlined expression were used as an input to a function that did
run-time type determination on its inputs. Add a regression test, since this
is evidently an under-tested area.
from time_t to TimestampTz representation. This provides full gettimeofday()
resolution of the timestamps, which might be useful when attempting to
do point-in-time recovery --- previously it was not possible to specify
the stop point with sub-second resolution. But mostly this is to get
rid of TimestampTz-to-time_t conversion overhead during commit. Per my
proposal of a day or two back.
messages to the stats collector. This avoids the problem that enabling
stats_row_level for autovacuum has a significant overhead for short
read-only transactions, as noted by Arjen van der Meijden. We can avoid
an extra gettimeofday call by piggybacking on the one done for WAL-logging
xact commit or abort (although that doesn't help read-only transactions,
since they don't WAL-log anything).
In my proposal for this, I noted that we could change the WAL log entries
for commit/abort to record full TimestampTz precision, instead of only
time_t as at present. That's not done in this patch, but will be committed
separately.
We can just palloc, instead of using makeNode, when we are going to
overwrite the whole node anyway in the FLATCOPY macro. Also, use
FLATCOPY instead of copyObject for common node types Var and Const.
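For reference, FLATCOPY is essentially a palloc plus memcpy of the node struct, so a separate makeNode (which also zeroes and tags the node) buys nothing when every field is about to be overwritten; approximately:

    /* Approximate form of the FLATCOPY macro: a flat copy of a parse node,
     * allocated with plain palloc since memcpy overwrites every field anyway. */
    #define FLATCOPY(newnode, node, nodetype)  \
        ( (newnode) = (nodetype *) palloc(sizeof(nodetype)), \
          memcpy((newnode), (node), sizeof(nodetype)) )
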
look through a freelist for a chunk of adequate size. For a long time
now, all elements of a given freelist have been exactly the same
allocated size, so we don't need a loop. Since the loop never iterated
more than once, you'd think this wouldn't matter much, but it makes a
noticeable savings in a simple test --- perhaps because the compiler
isn't optimizing on a mistaken assumption that the loop would repeat.
AllocSetAlloc is called often enough that saving even a couple of
instructions is worthwhile.
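In outline (illustrative code, not the actual aset.c): since every chunk on freelist k has exactly the size that maps to k, the allocator can simply pop the list head instead of walking the list looking for a fit.

    #include <stddef.h>
    #include <stdlib.h>

    /* Illustrative size-class freelists with one fixed chunk size per list;
     * not the actual AllocSetAlloc code (oversize requests are ignored here). */
    #define NUM_FREELISTS  11
    #define MIN_CHUNK_LOG  3            /* smallest class is 8 bytes */

    typedef struct FreeChunk
    {
        struct FreeChunk *next;
    } FreeChunk;

    static FreeChunk *freelist[NUM_FREELISTS];

    /* Map a request size to its class: the smallest power of two that fits. */
    static int
    size_class(size_t size)
    {
        int     idx = 0;

        while (((size_t) 1 << (MIN_CHUNK_LOG + idx)) < size)
            idx++;
        return idx;
    }

    static void *
    alloc_chunk(size_t size)
    {
        int         idx = size_class(size);
        FreeChunk  *chunk = freelist[idx];

        if (chunk != NULL)
        {
            /* Every chunk on this list is the right size, so no search loop. */
            freelist[idx] = chunk->next;
            return (void *) chunk;
        }
        /* No free chunk of this class: carve a new one of the fixed size. */
        return malloc((size_t) 1 << (MIN_CHUNK_LOG + idx));
    }
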
types of unspecified parameters when submitted via extended query protocol.
This worked in 8.2 but I had broken it during plancache changes. DECLARE
CURSOR is now treated almost exactly like a plain SELECT through parse
analysis, rewrite, and planning; only just before sending to the executor
do we divert it away to ProcessUtility. This requires a special-case check
in a number of places, but practically all of them were already special-casing
SELECT INTO, so it's not too ugly. (Maybe it would be a good idea to merge
the two by treating IntoClause as a form of utility statement? Not going to
worry about that now, though.) That approach doesn't work for EXPLAIN,
however, so for that I punted and used a klugy solution of running parse
analysis an extra time if under extended query protocol.
is in progress on the same hashtable. This seems the least invasive way to
fix the recently-recognized problem that a split could cause the scan to
visit entries twice or (with much lower probability) miss them entirely.
The only field-reported problem caused by this is the "failed to re-find
shared lock object" PANIC in COMMIT PREPARED reported by Michel Dorochevsky,
which was caused by multiply visited entries. However, it seems certain
that mdsync() is vulnerable to missing required fsync's due to missed
entries, and I am fearful that RelationCacheInitializePhase2() might be at
risk as well. Because of that and the generalized hazard presented by this
bug, back-patch all the supported branches.
Along the way, fix pg_prepared_statement() and pg_cursor() to not assume
that the hashtables they are examining will stay static between calls.
This is risky regardless of the newly noted dynahash problem, because
hash_seq_search() has never promised to cope with deletion of table entries
other than the just-returned one. There may be no bug here because the only
supported way to call these functions is via ExecMakeTableFunctionResult()
which will cycle them to completion before doing anything very interesting,
but it seems best to get rid of the assumption. This affects 8.2 and HEAD
only, since those functions weren't there earlier.
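Schematically, the split-prevention part amounts to counting active sequential scans on the table and deferring bucket splits while any are open; a simplified sketch, not the actual dynahash code (maybe_expand and expand_table are hypothetical names):

    /* Simplified "no splits while a seq scan is active" logic. */
    typedef struct HashTable
    {
        int     active_scans;   /* open hash_seq_search scans on this table */
        long    nentries;
        long    max_fill;       /* entries threshold for splitting a bucket */
        /* ... bucket directory, etc. ... */
    } HashTable;

    extern void expand_table(HashTable *tab);   /* splits a bucket */

    static void
    maybe_expand(HashTable *tab)
    {
        /*
         * Splitting while a sequential scan is walking the buckets could make
         * the scan visit a relocated entry twice, or miss it entirely, so put
         * off expansion until every scan has been closed.
         */
        if (tab->active_scans > 0)
            return;
        if (tab->nentries > tab->max_fill)
            expand_table(tab);
    }
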
we can complete "TABLE". The previous coding only looked for "CREATE TEMP".
Note that I didn't add TEMPORARY to the list of suggested completions
after we've seen "CREATE", since TEMP is equivalent and more concise. But
if the user has already manually typed TEMPORARY, we may as well
complete TABLE for them.
RESET SESSION, RESET PLANS, and RESET TEMP are now DISCARD ALL,
DISCARD PLANS, and DISCARD TEMP, respectively. This is to avoid
confusion with the pre-existing RESET variants: the DISCARD
commands are not actually similar to RESET. Patch from Marko
Kreen, with some minor editorialization.
(it's so nice to have a buildfarm member that actively rejects naked
uses of strcasecmp). This coding is still pretty awful, though, since
it's going to be O(N^2) in the number of guc variables. May I direct
your attention to bsearch?
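Concretely, the bsearch idea: keep the names in a sorted array and probe it with a binary search per lookup instead of a linear scan. An illustrative sketch (the table contents are made up, and backend code would use pg_strcasecmp rather than naked strcasecmp):

    #include <stddef.h>
    #include <stdlib.h>
    #include <strings.h>        /* strcasecmp; pg_strcasecmp in backend code */

    /* A sorted, case-insensitive name table probed with bsearch. */
    static const char *guc_names[] = {
        "datestyle",
        "search_path",
        "work_mem"
        /* ... kept sorted in case-insensitive order ... */
    };

    static int
    name_cmp(const void *key, const void *member)
    {
        return strcasecmp((const char *) key, *(const char *const *) member);
    }

    /* Return the index of the variable, or -1; O(log N) instead of O(N). */
    static int
    lookup_guc(const char *name)
    {
        const char **hit = bsearch(name, guc_names,
                                   sizeof(guc_names) / sizeof(guc_names[0]),
                                   sizeof(guc_names[0]),
                                   name_cmp);

        return hit ? (int) (hit - guc_names) : -1;
    }
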
are mostly excluded by constraints: do the CE test a bit earlier to save
some adjust_appendrel_attrs() work on excluded children, and arrange to
use array indexing rather than rt_fetch() to fetch RTEs in the main body
of the planner. The latter is something I'd wanted to do for a while anyway,
but seeing list_nth_cell() show up as 35% of the runtime gets one's attention.
child attnums are the same, before it grovels through each and every child
column looking for a name match. Saves some time in large inheritance trees,
per example from Greg.
values: don't throw away perfectly good hash bits, and increase the shift
distances so as to provide more separation in the common case where some of
the key values are small integers (and so their hashes are too, because
hashfunc.c doesn't try all that hard). This reduces the runtime of
SearchCatCache by a factor of 4 in an example provided by Greg Stark,
in which the planner spends a whole lot of time searching the two-key
STATRELATT cache. It seems unlikely to hurt in other cases, but maybe
we could do even better?
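In outline, the change is about how the per-key hashes get folded into one value: rotate each component by a different, larger distance before XORing, so no bits fall off the end and the small hashes produced for small integer keys don't pile up in the low bits. An illustrative sketch, not the exact catcache.c code:

    #include <stdint.h>

    /* Rotate left without discarding bits (the "& 31" keeps r == 0 defined). */
    static inline uint32_t
    rotl32(uint32_t x, int r)
    {
        return (x << r) | (x >> ((32 - r) & 31));
    }

    /* Fold up to four per-key hashes into one value.  Rotating by 0/8/16/24
     * keeps every original bit and spreads small hashes across the word. */
    static uint32_t
    combine_hashes(const uint32_t *keyhash, int nkeys)
    {
        uint32_t    h = 0;
        int         i;

        for (i = 0; i < nkeys; i++)
            h ^= rotl32(keyhash[i], 8 * i);
        return h;
    }
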
when a relation is opened multiple times in the same transaction. This is
particularly useful for system catalogs, which we may heap_open or index_open
many times in a transaction, and it doesn't really cost anything extra even
if the rel is touched but once. Motivated by study of an example from Greg
Stark, in which pgstat_initstats() accounted for an unreasonably large
fraction of the runtime.
This is needed to allow a security-definer function to set a truly secure
value of search_path. Without it, a malicious user can use temporary objects
to execute code with the privileges of the security-definer function. Even
pushing the temp schema to the back of the search path is not quite good
enough, because a function or operator at the back of the path might still
capture control from one nearer the front due to having a more exact datatype
match. Hence, disable searching the temp schema altogether for functions and
operators.
Security: CVE-2007-2138
failed (due to lock conflicts or out-of-space). We might have already
extended the index's filesystem EOF before failing, causing the EOF to be
beyond what the metapage says is the last used page. Hence the invariant
maintained by the code needs to be "EOF is at or beyond last used page",
not "EOF is exactly the last used page". Problem was created by my patch
of 2006-11-19 that attempted to repair bug #2737. Since that was
back-patched to 7.4, this needs to be as well. Per report and test case
from Vlastimil Krejcir.
pass-by-reference data type and the RETURN statement is within an EXCEPTION
block. Bug introduced by my fix of 2007-01-28 to use per-subtransaction
ExprContexts/EStates; since that wasn't back-patched into older branches,
only 8.2 and HEAD are affected. Per report from Gary Winslow.
competing alternatives for indexes to use in a bitmap scan. The former
coding took estimated selectivity as an overriding factor, causing it to
sometimes choose indexes that were much slower to scan than ones with a
slightly worse selectivity. It was also too narrow-minded about which
combinations of indexes to consider ANDing. The rewrite makes it pay more
attention to index scan cost than selectivity; this seems sane since it's
impossible to have very bad selectivity with low cost, whereas the reverse
isn't true. Also, we now consider each index alone, as well as adding
each index to an AND-group led by each prior index, for a total of about
O(N^2) rather than O(N) combinations considered. This makes the results
much less dependent on the exact order in which the indexes are
considered. It's still a lot cheaper than an O(2^N) exhaustive search.
A prefilter step eliminates all but the cheapest of those indexes using
the same set of WHERE conditions, to keep the effective value of N down in
scenarios where the DBA has created lots of partially-redundant indexes.
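The shape of that search, schematically (illustrative C, not the planner's actual code; cost_of and and_paths stand in for the real cost and path-combination machinery): each index seeds a candidate, every other index is greedily ANDed on if it lowers the estimated cost, and the cheapest result over all seeds wins.

    /* Illustrative O(N^2) greedy AND-group construction. */
    typedef struct IndexCandidate IndexCandidate;

    extern double cost_of(const IndexCandidate *path);
    extern const IndexCandidate *and_paths(const IndexCandidate *a,
                                           const IndexCandidate *b);

    static const IndexCandidate *
    choose_and_group(const IndexCandidate *const *paths, int npaths)
    {
        const IndexCandidate *best = NULL;
        int         i,
                    j;

        for (i = 0; i < npaths; i++)
        {
            /* Each index alone starts an AND-group ... */
            const IndexCandidate *cur = paths[i];

            /* ... then try ANDing every other index onto it, keeping any
             * addition that reduces the estimated total cost. */
            for (j = 0; j < npaths; j++)
            {
                const IndexCandidate *combined;

                if (j == i)
                    continue;
                combined = and_paths(cur, paths[j]);
                if (cost_of(combined) < cost_of(cur))
                    cur = combined;
            }
            if (best == NULL || cost_of(cur) < cost_of(best))
                best = cur;
        }
        return best;
    }
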
sequence for dumping without also selecting its owning table. Make it not try
to emit ALTER SEQUENCE OWNED BY in this situation.
Per report from Michael Nolan.
processes to be running simultaneously. Also, now autovacuum processes do not
count towards the max_connections limit; they are counted separately from
regular processes, and are limited by the new GUC variable
autovacuum_max_workers.
The launcher now has intelligence to launch workers on each database every
autovacuum_naptime seconds, limited only by the number of worker slots
available.
Also, the global worker I/O utilization is limited by the vacuum cost-based
delay feature. Workers are "balanced" so that the total I/O consumption does
not exceed the established limit. This part of the patch was contributed by
ITAGAKI Takahiro.
Per discussion.
a replan. I had originally thought this was not necessary, but the new
SPI facilities create a path whereby queries planned with non-default
options can get into the cache, so it is necessary.
access to the planner's cursor-related planning options, and provide new
FETCH/MOVE routines that allow access to the full power of those commands.
Small refactoring of planner(), pg_plan_query(), and pg_plan_queries()
APIs to make it convenient to pass the planning options down from SPI.
This is the core-code portion of Pavel Stehule's patch for scrollable
cursor support in plpgsql; I'll review and apply the plpgsql changes
separately.
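A sketch of how the new SPI entry points fit together from C code (the table name and surrounding connection handling are illustrative):

    #include "postgres.h"
    #include "executor/spi.h"
    #include "nodes/parsenodes.h"   /* CURSOR_OPT_SCROLL, FETCH_* */

    /* Illustrative: plan a query as a scrollable cursor and fetch backwards.
     * "my_table" is a placeholder. */
    static void
    scan_backwards(void)
    {
        SPIPlanPtr  plan;
        Portal      portal;

        SPI_connect();

        /* Pass an explicit cursor option at plan time. */
        plan = SPI_prepare_cursor("SELECT * FROM my_table ORDER BY id",
                                  0, NULL, CURSOR_OPT_SCROLL);
        portal = SPI_cursor_open(NULL, plan, NULL, NULL, true);

        /* Full MOVE/FETCH power: jump to the last row, then read backwards. */
        SPI_scroll_cursor_move(portal, FETCH_ABSOLUTE, -1);
        SPI_scroll_cursor_fetch(portal, FETCH_BACKWARD, 10);
        /* fetched rows are in SPI_tuptable; row count in SPI_processed */

        SPI_cursor_close(portal);
        SPI_finish();
    }
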
possibly be any useful pathkeys --- to wit, queries with neither any
join clauses nor any ORDER BY request. It's nearly free to check for
this case and it saves a useful fraction of the planning time for simple
queries.
If this breaks things due to missing libxslt, then I'll have to
revert it, but let's see if it breaks the buildfarm.
Workarounds in case libxslt is missing include:
. don't configure with libxml, or
. don't build contrib modules from the contrib Makefile (use the individual module Makefiles instead), or
. change the xml2 Makefile
fast flow of new fsync requests can prevent mdsync() from ever completing.
This was an unforeseen consequence of a patch added in Mar 2006 to prevent
the fsync request queue from overflowing. Problem identified by Heikki
Linnakangas and independently by ITAGAKI Takahiro; fix based on ideas from
Takahiro-san, Heikki, and Tom.
Back-patch as far as 8.1 because a previous back-patch introduced the problem
into 8.1 ...
reviewed by Neil Conway. This patch adds the following DDL command
variants: RESET SESSION, RESET TEMP, RESET PLANS, CLOSE ALL, and
DEALLOCATE ALL. RESET SESSION is intended for use by connection
pool software and the like, in order to reset a client session
to something close to its initial state.
Note that while most of these command variants can be executed
inside a transaction block (but are not transaction-aware!),
RESET SESSION cannot. While this is inconsistent, it is intended
to catch programmer mistakes: RESET SESSION in an open transaction
block is probably unintended.
(original code *always* created a full-page image for the left page, thus
leaving the intended savings unrealized), avoid risk of not having enough room
on the page during xlog restore, squeeze out another couple bytes in the xlog
record, clean up neglected comments.
index types can be reliably distinguished by examining the special space
on an index page. Per my earlier proposal, plus the realization that
there's no need for btree's vacuum cycle ID to cycle through every possible
16-bit value. Restricting its range a little costs nearly nothing and
eliminates the possibility of collisions.
Memo to self: remember to make bitmap indexes play along with this scheme,
assuming that patch ever gets accepted.
right, there seems precious little reason to have a pile of hand-maintained
endianness definitions in src/include/port/*.h. Get rid of those, and make
the couple of places that used them depend on WORDS_BIGENDIAN instead.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields. While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.
Greg Stark with some help from Tom Lane
It removes the last remaining casts inside struct definitions.
Such usage is bad practice, as it hides problems from the compiler.
The reason for the casts is a practice, popular in some circles,
of defining functions as foo(MyObj *) instead of foo(PyObject *),
thus avoiding a local variable inside the function and making
direct calls easier. As pl/python does not use that style,
the casts were unnecessary from the start.
Marko Kreen
are in their commit critical sections via flags in the ProcArray. Checkpoint
can watch the ProcArray to determine when it's safe to proceed. This is
a considerably better solution to the original problem of race conditions
between checkpoint and transaction commit: it speeds up commit, since there's
one less lock to fool with, and it prevents the problem of checkpoint being
delayed indefinitely when there's a constant flow of commits. Heikki, with
some kibitzing from Tom.
Add the latter to the values checked in pg_control, since it can't be changed
without invalidating toast table content. This commit in itself shouldn't
change any behavior, but it lays some necessary groundwork for experimentation
with these toast-control numbers.
Note: while TOAST_TUPLE_THRESHOLD can now be changed without initdb, some
thought still needs to be given to needs_toast_table() in toasting.c before
unleashing random changes.
return void ends with a SELECT, if that SELECT has a single result that is
also of type void. Without this, it's hard to write a void function that
calls another void function. Per gripe from Peter.
Back-patch as far as 8.0.
will be released by transaction abort before _bt_end_vacuum gets called.
If either of these "can't happen" errors actually happened, we'd freeze up
trying to acquire an already-held lock. Latest word is that this does
not explain Martin Pitt's trouble report, but it still looks like a bug.
ecpglib supports it.
Change configure (patch from Bruce) and msvc build system to no longer require
pthreads on win32, since all parts of postgresql can be thread-safe using the
native platform functions.
its table list and then rechecks pgstat before vacuuming each table to
verify that no one has vacuumed the table in the meantime.
In the current autovacuum world this only means that a worker will not
vacuum a table that a user has vacuumed manually after the worker started.
When support for multiple autovacuum workers is introduced, this will reduce
the probability of simultaneous workers on the same database doing redundant
work.
seen by code inspecting the expression. The best way to do this seems
to be to drop the original representation as a function invocation, and
instead make a special expression node type that represents applying
the element-type coercion function to each array element. In this way
the element function is exposed and will be checked for volatility.
Per report from Guillaume Smet.
A DBA is allowed to create a language in his database if it's marked
"tmpldbacreate" in pg_pltemplate. The factory default is that this is set
for all standard trusted languages, but of course a superuser may adjust
the settings. In service of this, add the long-foreseen owner column to
pg_language; renaming, dropping, and altering owner of a PL now follow
normal ownership rules instead of being superuser-only.
Jeremy Drake, with some editorialization by Tom Lane.
reset event, namely invalidate everything. This oversight probably
explains the rare failures that some buildfarm machines have been
showing for the plancache regression test.
if possible. I had left this undone in the first pass at the API change
for ProcessUtility, but forgot to revisit it after the plancache changes
made it possible to do it.
Vadim had included this restriction in the original design of the SPI code,
but I'm darned if I can see a reason for it.
I left the macro definition of SPI_ERROR_CURSOR in place, so as not to
needlessly break any SPI callers that are checking for it, but that code
will never actually be returned anymore.
pointer" in every Snapshot struct. This allows removal of the case-by-case
tests in HeapTupleSatisfiesVisibility, which should make it a bit faster
(I didn't try any performance tests though). More importantly, we are no
longer violating portable C practices by assuming that small integers are
distinct from all pointer values, and HeapTupleSatisfiesDirty no longer
has a non-reentrant API involving side-effects on a global variable.
There were a couple of places calling HeapTupleSatisfiesXXX routines
directly rather than through the HeapTupleSatisfiesVisibility macro.
Since these places had to be changed anyway, I chose to make them go
through the macro for uniformity.
Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC
to emphasize that it's only used with MVCC-type snapshots. I was sorely
tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot,
but forebore for the moment to avoid confusion and reduce the likelihood that
this patch breaks some of the pending patches. Might want to reconsider
doing that later.
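The shape of the change, in simplified form (not the exact declarations in the snapshot and tqual headers):

    #include "postgres.h"
    #include "access/htup.h"        /* HeapTupleHeader */
    #include "storage/buf.h"        /* Buffer */

    /* Simplified: visibility checks dispatch through a function pointer that
     * each snapshot carries, instead of case-by-case tests on the snapshot. */
    typedef struct SnapshotData *Snapshot;

    typedef bool (*SnapshotSatisfiesFunc) (HeapTupleHeader tuple,
                                           Snapshot snapshot,
                                           Buffer buffer);

    typedef struct SnapshotData
    {
        SnapshotSatisfiesFunc satisfies;    /* e.g. HeapTupleSatisfiesMVCC */
        TransactionId xmin;
        TransactionId xmax;
        /* ... remaining snapshot fields ... */
    } SnapshotData;

    /* All callers funnel through one macro; no special cases needed. */
    #define HeapTupleSatisfiesVisibility(tuple, snapshot, buffer) \
        ((*(snapshot)->satisfies) ((tuple), (snapshot), (buffer)))
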
to the left of the actual bool value. While in most cases there won't be
any, our support for old-style user-defined functions violates the C spec
to the extent of calling functions that might return char or short through
a function pointer declared to return "char *", which we then coerce to
Datum. It is not surprising that the result might contain garbage
high-order bits ... what is surprising is that we didn't see such cases
long ago. Per report from Magnus.
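The essence of the fix, approximately (not necessarily the exact macro text): only the low-order byte of the Datum decides the boolean, so stray high-order bits coming back through a "char *"-declared function pointer cannot change the answer.

    /* Approximate shape of the hardened conversion: mask to the low byte
     * before testing, ignoring any garbage in the higher-order bits. */
    #define DatumGetBool(X)  ((bool) (((Datum) (X) & 0x000000ff) != 0))
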
search_path that was active when the plan was first made. To do this,
improve namespace.c to support a stack of "override" search path settings
(we must have a stack since nested replan events are entirely possible).
This facility replaces the "special namespace" hack formerly used by
CREATE SCHEMA, and should be able to support per-function search path
settings as well.