mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-12-15 08:20:16 +08:00
Fix "failed to re-find parent key" btree VACUUM failure by revising page deletion code to avoid the case where an upper-level btree page remains "half dead" for a significant period of time, and to block insertions into a key range that is in process of being re-assigned to the right sibling of the deleted page's parent. This prevents the scenario reported by Ed L. wherein index keys could become out-of-order in the grandparent index level. Since this is a moderately invasive fix, I'm applying it only to HEAD. The bug exists back to 7.4, but the back branches will get a different patch.
parent 19d0c46def
commit 70ce5c9082
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.13 2006/07/25 19:13:00 tgl Exp $
+$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.14 2006/11/01 19:43:17 tgl Exp $
 
 This directory contains a correct implementation of Lehman and Yao's
 high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
@@ -201,26 +201,25 @@ When we delete the last remaining child of a parent page, we mark the
 parent page "half-dead" as part of the atomic update that deletes the
 child page. This implicitly transfers the parent's key space to its right
 sibling (which it must have, since we never delete the overall-rightmost
-page of a level). No future insertions into the parent level are allowed
-to insert keys into the half-dead page --- they must move right to its
-sibling, instead. The parent remains empty and can be deleted in a
-separate atomic action. (However, if it's the rightmost child of its own
-parent, it might have to stay half-dead for awhile, until it's also the
-only child.)
-
-Note that an empty leaf page is a valid tree state, but an empty interior
-page is not legal (an interior page must have children to delegate its
-key space to). So an interior page *must* be marked half-dead as soon
-as its last child is deleted.
+page of a level). Searches ignore the half-dead page and immediately move
+right. We need not worry about insertions into a half-dead page --- insertions
+into upper tree levels happen only as a result of splits of child pages, and
+the half-dead page no longer has any children that could split. Therefore
+the page stays empty even when we don't have lock on it, and we can complete
+its deletion in a second atomic action.
 
 The notion of a half-dead page means that the key space relationship between
 the half-dead page's level and its parent's level may be a little out of
 whack: key space that appears to belong to the half-dead page's parent on the
-parent level may really belong to its right sibling. We can tolerate this,
-however, because insertions and deletions on upper tree levels are always
-done by reference to child page numbers, not keys. The only cost is that
-searches may sometimes descend to the half-dead page and then have to move
-right, rather than going directly to the sibling page.
+parent level may really belong to its right sibling. To prevent any possible
+problems, we hold lock on the deleted child page until we have finished
+deleting any now-half-dead parent page(s). This prevents any insertions into
+the transferred keyspace until the operation is complete. The reason for
+doing this is that a sufficiently large number of insertions into the
+transferred keyspace, resulting in multiple page splits, could propagate keys
+from that keyspace into the parent level, resulting in transiently
+out-of-order keys in that level. It is thought that that wouldn't cause any
+serious problem, but it seems too risky to allow.
 
 A deleted page cannot be reclaimed immediately, since there may be other
 processes waiting to reference it (ie, search processes that just left the
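The README text in this hunk says that searches ignore a half-dead page and move right to its sibling. A minimal sketch of that rule follows, assuming a toy in-memory page array in place of PostgreSQL's buffer manager. The names `ToyPage`, `TOY_DELETED`, `TOY_HALF_DEAD`, `toy_ignore`, and `toy_move_right` are hypothetical, and the flag bit values are arbitrary rather than the real `BTP_*` values.

```c
#include <assert.h>

/* Hypothetical flag bits for this sketch; the values are arbitrary. */
#define TOY_DELETED   (1u << 0)
#define TOY_HALF_DEAD (1u << 1)
#define TOY_NONE      (-1)      /* "no right sibling" marker */

typedef struct ToyPage
{
    unsigned flags;             /* TOY_* bits */
    int      right;             /* index of right sibling, or TOY_NONE */
} ToyPage;

/* Same idea as P_IGNORE(): deleted and half-dead pages are both skipped. */
static int
toy_ignore(const ToyPage *p)
{
    return (p->flags & (TOY_DELETED | TOY_HALF_DEAD)) != 0;
}

/*
 * Starting at page "start", follow right-links until reaching a page that a
 * search may use.  Returns TOY_NONE if the chain ends first; the README rules
 * that out for a real tree, because the rightmost page of a level is never
 * deleted.
 */
static int
toy_move_right(const ToyPage *pages, int start)
{
    int cur = start;

    while (cur != TOY_NONE && toy_ignore(&pages[cur]))
        cur = pages[cur].right;
    return cur;
}

/* Example chain: page 0 is half-dead, page 1 is deleted, page 2 is live. */
static int
toy_example(void)
{
    ToyPage pages[3] = {
        {TOY_HALF_DEAD, 1},
        {TOY_DELETED, 2},
        {0, TOY_NONE}
    };

    return toy_move_right(pages, 0);    /* lands on page 2 */
}
```

The sketch covers only the search side. The locking rule added by this commit (holding the deleted child's lock until the half-dead parent is gone) is not modeled here.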
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.144 2006/10/04 00:29:48 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.145 2006/11/01 19:43:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -1337,8 +1337,8 @@ _bt_insert_parent(Relation rel,
 
         /* Check for error only after writing children */
         if (pbuf == InvalidBuffer)
-            elog(ERROR, "failed to re-find parent key in \"%s\"",
-                 RelationGetRelationName(rel));
+            elog(ERROR, "failed to re-find parent key in \"%s\" for split pages %u/%u",
+                 RelationGetRelationName(rel), bknum, rbknum);
 
         /* Recursively update the parent */
         _bt_insertonpg(rel, pbuf, stack->bts_parent,
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.100 2006/10/04 00:29:49 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.101 2006/11/01 19:43:17 tgl Exp $
  *
  * NOTES
  * Postgres btree pages look like ordinary relation pages. The opaque
@@ -723,7 +723,93 @@ _bt_delitems(Relation rel, Buffer buf,
 }
 
 /*
- * _bt_pagedel() -- Delete a page from the b-tree.
+ * Subroutine to pre-check whether a page deletion is safe, that is, its
+ * parent page would be left in a valid or deletable state.
+ *
+ * "target" is the page we wish to delete, and "stack" is a search stack
+ * leading to it (approximately). Note that we will update the stack
+ * entry(s) to reflect current downlink positions --- this is harmless and
+ * indeed saves later search effort in _bt_pagedel.
+ *
+ * Note: it's OK to release page locks after checking, because a safe
+ * deletion can't become unsafe due to concurrent activity. A non-rightmost
+ * page cannot become rightmost unless there's a concurrent page deletion,
+ * but only VACUUM does page deletion and we only allow one VACUUM on an index
+ * at a time. An only child could acquire a sibling (of the same parent) only
+ * by being split ... but that would make it a non-rightmost child so the
+ * deletion is still safe.
+ */
+static bool
+_bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
+{
+    BlockNumber parent;
+    OffsetNumber poffset,
+                maxoff;
+    Buffer pbuf;
+    Page page;
+    BTPageOpaque opaque;
+
+    /*
+     * In recovery mode, assume the deletion being replayed is valid. We
+     * can't always check it because we won't have a full search stack,
+     * and we should complain if there's a problem, anyway.
+     */
+    if (InRecovery)
+        return true;
+
+    /* Locate the parent's downlink (updating the stack entry if needed) */
+    ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
+    pbuf = _bt_getstackbuf(rel, stack, BT_READ);
+    if (pbuf == InvalidBuffer)
+        elog(ERROR, "failed to re-find parent key in \"%s\" for deletion target page %u",
+             RelationGetRelationName(rel), target);
+    parent = stack->bts_blkno;
+    poffset = stack->bts_offset;
+
+    page = BufferGetPage(pbuf);
+    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+    maxoff = PageGetMaxOffsetNumber(page);
+
+    /*
+     * If the target is the rightmost child of its parent, then we can't
+     * delete, unless it's also the only child.
+     */
+    if (poffset >= maxoff)
+    {
+        /* It's rightmost child... */
+        if (poffset == P_FIRSTDATAKEY(opaque))
+        {
+            /*
+             * It's only child, so safe if parent would itself be removable.
+             * We have to check the parent itself, and then recurse to
+             * test the conditions at the parent's parent.
+             */
+            if (P_RIGHTMOST(opaque) || P_ISROOT(opaque))
+            {
+                _bt_relbuf(rel, pbuf);
+                return false;
+            }
+
+            _bt_relbuf(rel, pbuf);
+            return _bt_parent_deletion_safe(rel, parent, stack->bts_parent);
+        }
+        else
+        {
+            /* Unsafe to delete */
+            _bt_relbuf(rel, pbuf);
+            return false;
+        }
+    }
+    else
+    {
+        /* Not rightmost child, so safe to delete */
+        _bt_relbuf(rel, pbuf);
+        return true;
+    }
+}
+
+/*
+ * _bt_pagedel() -- Delete a page from the b-tree, if legal to do so.
  *
  * This action unlinks the page from the b-tree structure, removing all
  * pointers leading to it --- but not touching its own left and right links.
@@ -731,19 +817,25 @@ _bt_delitems(Relation rel, Buffer buf,
  * may currently be trying to follow links leading to the page; they have to
  * be allowed to use its right-link to recover. See nbtree/README.
  *
- * On entry, the target buffer must be pinned and read-locked. This lock and
- * pin will be dropped before exiting.
+ * On entry, the target buffer must be pinned and locked (either read or write
+ * lock is OK). This lock and pin will be dropped before exiting.
  *
- * Returns the number of pages successfully deleted (zero on failure; could
- * be more than one if parent blocks were deleted).
+ * The "stack" argument can be a search stack leading (approximately) to the
+ * target page, or NULL --- outside callers typically pass NULL since they
+ * have not done such a search, but internal recursion cases pass the stack
+ * to avoid duplicated search effort.
+ *
+ * Returns the number of pages successfully deleted (zero if page cannot
+ * be deleted now; could be more than one if parent pages were deleted too).
  *
  * NOTE: this leaks memory. Rather than trying to clean up everything
  * carefully, it's better to run it in a temp context that can be reset
  * frequently.
  */
 int
-_bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
+_bt_pagedel(Relation rel, Buffer buf, BTStack stack, bool vacuum_full)
 {
+    int result;
     BlockNumber target,
                 leftsib,
                 rightsib,
@@ -756,7 +848,6 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
     IndexTuple targetkey,
                 itup;
     ScanKey itup_scankey;
-    BTStack stack;
     Buffer lbuf,
                 rbuf,
                 pbuf;
@@ -778,6 +869,9 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
     if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
         P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
     {
+        /* Should never fail to delete a half-dead page */
+        Assert(!P_ISHALFDEAD(opaque));
+
         _bt_relbuf(rel, buf);
         return 0;
     }
@@ -793,36 +887,79 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
     targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
 
     /*
-     * We need to get an approximate pointer to the page's parent page. Use
-     * the standard search mechanism to search for the page's high key; this
-     * will give us a link to either the current parent or someplace to its
-     * left (if there are multiple equal high keys). To avoid deadlocks, we'd
-     * better drop the target page lock first.
+     * To avoid deadlocks, we'd better drop the target page lock before
+     * going further.
      */
     _bt_relbuf(rel, buf);
-    /* we need an insertion scan key to do our search, so build one */
-    itup_scankey = _bt_mkscankey(rel, targetkey);
-    /* find the leftmost leaf page containing this key */
-    stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
-                       &lbuf, BT_READ);
-    /* don't need a pin on that either */
-    _bt_relbuf(rel, lbuf);
 
     /*
-     * If we are trying to delete an interior page, _bt_search did more than
-     * we needed. Locate the stack item pointing to our parent level.
+     * We need an approximate pointer to the page's parent page. We use
+     * the standard search mechanism to search for the page's high key; this
+     * will give us a link to either the current parent or someplace to its
+     * left (if there are multiple equal high keys). In recursion cases,
+     * the caller already generated a search stack and we can just re-use
+     * that work.
      */
-    ilevel = 0;
-    for (;;)
+    if (stack == NULL)
     {
-        if (stack == NULL)
-            elog(ERROR, "not enough stack items");
-        if (ilevel == targetlevel)
-            break;
-        stack = stack->bts_parent;
-        ilevel++;
+        if (!InRecovery)
+        {
+            /* we need an insertion scan key to do our search, so build one */
+            itup_scankey = _bt_mkscankey(rel, targetkey);
+            /* find the leftmost leaf page containing this key */
+            stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
+                               &lbuf, BT_READ);
+            /* don't need a pin on that either */
+            _bt_relbuf(rel, lbuf);
+
+            /*
+             * If we are trying to delete an interior page, _bt_search did
+             * more than we needed. Locate the stack item pointing to our
+             * parent level.
+             */
+            ilevel = 0;
+            for (;;)
+            {
+                if (stack == NULL)
+                    elog(ERROR, "not enough stack items");
+                if (ilevel == targetlevel)
+                    break;
+                stack = stack->bts_parent;
+                ilevel++;
+            }
+        }
+        else
+        {
+            /*
+             * During WAL recovery, we can't use _bt_search (for one reason,
+             * it might invoke user-defined comparison functions that expect
+             * facilities not available in recovery mode). Instead, just
+             * set up a dummy stack pointing to the left end of the parent
+             * tree level, from which _bt_getstackbuf will walk right to the
+             * parent page. Painful, but we don't care too much about
+             * performance in this scenario.
+             */
+            pbuf = _bt_get_endpoint(rel, targetlevel + 1, false);
+            stack = (BTStack) palloc(sizeof(BTStackData));
+            stack->bts_blkno = BufferGetBlockNumber(pbuf);
+            stack->bts_offset = InvalidOffsetNumber;
+            /* bts_btentry will be initialized below */
+            stack->bts_parent = NULL;
+            _bt_relbuf(rel, pbuf);
+        }
     }
 
+    /*
+     * We cannot delete a page that is the rightmost child of its immediate
+     * parent, unless it is the only child --- in which case the parent has
+     * to be deleted too, and the same condition applies recursively to it.
+     * We have to check this condition all the way up before trying to delete.
+     * We don't need to re-test when deleting a non-leaf page, though.
+     */
+    if (targetlevel == 0 &&
+        !_bt_parent_deletion_safe(rel, target, stack))
+        return 0;
+
     /*
      * We have to lock the pages we need to modify in the standard order:
      * moving right, then up. Else we will deadlock against other writers.
@@ -898,15 +1035,16 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
     ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
     pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
     if (pbuf == InvalidBuffer)
-        elog(ERROR, "failed to re-find parent key in \"%s\"",
-             RelationGetRelationName(rel));
+        elog(ERROR, "failed to re-find parent key in \"%s\" for deletion target page %u",
+             RelationGetRelationName(rel), target);
     parent = stack->bts_blkno;
     poffset = stack->bts_offset;
 
     /*
      * If the target is the rightmost child of its parent, then we can't
      * delete, unless it's also the only child --- in which case the parent
-     * changes to half-dead status.
+     * changes to half-dead status. The "can't delete" case should have been
+     * detected by _bt_parent_deletion_safe, so complain if we see it now.
      */
     page = BufferGetPage(pbuf);
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -918,14 +1056,8 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
         if (poffset == P_FIRSTDATAKEY(opaque))
             parent_half_dead = true;
         else
-        {
-            _bt_relbuf(rel, pbuf);
-            _bt_relbuf(rel, rbuf);
-            _bt_relbuf(rel, buf);
-            if (BufferIsValid(lbuf))
-                _bt_relbuf(rel, lbuf);
-            return 0;
-        }
+            elog(ERROR, "failed to delete rightmost child %u of %u in \"%s\"",
+                 target, parent, RelationGetRelationName(rel));
     }
     else
     {
@@ -940,10 +1072,13 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
      * might be possible to push the fast root even further down, but the odds
      * of doing so are slim, and the locking considerations daunting.)
      *
+     * We don't support handling this in the case where the parent is
+     * becoming half-dead, even though it theoretically could occur.
+     *
      * We can safely acquire a lock on the metapage here --- see comments for
      * _bt_newroot().
      */
-    if (leftsib == P_NONE)
+    if (leftsib == P_NONE && !parent_half_dead)
     {
         page = BufferGetPage(rbuf);
         opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1031,6 +1166,7 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
      */
     page = BufferGetPage(buf);
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+    opaque->btpo_flags &= ~BTP_HALF_DEAD;
     opaque->btpo_flags |= BTP_DELETED;
     opaque->btpo.xact =
         vacuum_full ? FrozenTransactionId : ReadNewTransactionId();
@@ -1085,6 +1221,8 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
             nextrdata++;
             xlinfo = XLOG_BTREE_DELETE_PAGE_META;
         }
+        else if (parent_half_dead)
+            xlinfo = XLOG_BTREE_DELETE_PAGE_HALF;
         else
             xlinfo = XLOG_BTREE_DELETE_PAGE;
 
@@ -1138,34 +1276,52 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
 
     END_CRIT_SECTION();
 
-    /* release buffers; send out relcache inval if metapage changed */
+    /* release metapage; send out relcache inval if metapage changed */
     if (BufferIsValid(metabuf))
     {
         CacheInvalidateRelcache(rel);
         _bt_relbuf(rel, metabuf);
     }
-    _bt_relbuf(rel, pbuf);
-    _bt_relbuf(rel, rbuf);
-    _bt_relbuf(rel, buf);
+    /* can always release leftsib immediately */
     if (BufferIsValid(lbuf))
         _bt_relbuf(rel, lbuf);
 
     /*
-     * If parent became half dead, recurse to try to delete it. Otherwise, if
+     * If parent became half dead, recurse to delete it. Otherwise, if
      * right sibling is empty and is now the last child of the parent, recurse
      * to try to delete it. (These cases cannot apply at the same time,
      * though the second case might itself recurse to the first.)
+     *
+     * When recursing to parent, we hold the lock on the target page until
+     * done. This delays any insertions into the keyspace that was just
+     * effectively reassigned to the parent's right sibling. If we allowed
+     * that, and there were enough such insertions before we finish deleting
+     * the parent, page splits within that keyspace could lead to inserting
+     * out-of-order keys into the grandparent level. It is thought that that
+     * wouldn't have any serious consequences, but it still seems like a
+     * pretty bad idea.
      */
     if (parent_half_dead)
     {
-        buf = _bt_getbuf(rel, parent, BT_READ);
-        return _bt_pagedel(rel, buf, vacuum_full) + 1;
+        /* recursive call will release pbuf */
+        _bt_relbuf(rel, rbuf);
+        result = _bt_pagedel(rel, pbuf, stack->bts_parent, vacuum_full) + 1;
+        _bt_relbuf(rel, buf);
     }
-    if (parent_one_child && rightsib_empty)
+    else if (parent_one_child && rightsib_empty)
     {
-        buf = _bt_getbuf(rel, rightsib, BT_READ);
-        return _bt_pagedel(rel, buf, vacuum_full) + 1;
+        _bt_relbuf(rel, pbuf);
+        _bt_relbuf(rel, buf);
+        /* recursive call will release rbuf */
+        result = _bt_pagedel(rel, rbuf, stack, vacuum_full) + 1;
     }
+    else
+    {
+        _bt_relbuf(rel, pbuf);
+        _bt_relbuf(rel, buf);
+        _bt_relbuf(rel, rbuf);
+        result = 1;
+    }
 
-    return 1;
+    return result;
 }
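The pre-check added in `_bt_parent_deletion_safe` reduces to a small recursive rule. A non-rightmost child is safe to delete. A rightmost child with siblings is not. An only child is safe only if its parent is neither rightmost on its level nor the root, and the same test passes one level up. The sketch below models that rule on a toy array-based tree. `ToyNode`, its fields, and `toy_parent_deletion_safe` are hypothetical names, and the buffer locking, stack maintenance, and `InRecovery` shortcut of the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy tree node; every field is an assumption made for this sketch. */
typedef struct ToyNode
{
    int  parent;            /* index of parent node, or -1 for the root */
    int  nchildren;         /* number of downlinks held by this node */
    int  pos_in_parent;     /* 0-based position of our downlink in the parent */
    bool rightmost;         /* true if rightmost page on its tree level */
} ToyNode;

/*
 * Would deleting "target" leave its parent in a valid or deletable state?
 * Follows the same three-way decision as the diff above.
 */
static bool
toy_parent_deletion_safe(const ToyNode *nodes, int target)
{
    int            parent = nodes[target].parent;
    const ToyNode *p;

    if (parent < 0)
        return false;       /* defensive: the root itself is never a target */
    p = &nodes[parent];

    if (nodes[target].pos_in_parent < p->nchildren - 1)
        return true;        /* not the rightmost child, so safe */
    if (p->nchildren > 1)
        return false;       /* rightmost child that has siblings */

    /* Only child: the parent must itself be removable, so check upward. */
    if (p->rightmost || p->parent < 0)
        return false;
    return toy_parent_deletion_safe(nodes, parent);
}

/*
 * Example: root 0 has children 1 and 2; node 1 has the single leaf 3;
 * node 2 has leaves 4 and 5.
 */
static bool
toy_example(void)
{
    ToyNode t[6] = {
        {-1, 2, 0, true},
        {0, 1, 0, false},
        {0, 2, 1, true},
        {1, 0, 0, false},
        {2, 0, 0, false},
        {2, 0, 1, true}
    };

    /* Leaf 3 is an only child, and its parent 1 is a non-rightmost child. */
    return toy_parent_deletion_safe(t, 3);
}
```

Running the check all the way up before deleting anything is what lets the deletion code in the diff treat a later "rightmost child" surprise as an error rather than a quiet failure.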
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -12,7 +12,7 @@
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.152 2006/10/04 00:29:49 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.153 2006/11/01 19:43:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -804,8 +804,7 @@ restart:
     if (blkno != orig_blkno)
     {
         if (_bt_page_recyclable(page) ||
-            P_ISDELETED(opaque) ||
-            (opaque->btpo_flags & BTP_HALF_DEAD) ||
+            P_IGNORE(opaque) ||
             !P_ISLEAF(opaque) ||
             opaque->btpo_cycleid != vstate->cycleid)
         {
@@ -828,7 +827,7 @@ restart:
         /* Already deleted, but can't recycle yet */
         stats->pages_deleted++;
     }
-    else if (opaque->btpo_flags & BTP_HALF_DEAD)
+    else if (P_ISHALFDEAD(opaque))
     {
         /* Half-dead, try to delete */
         delete_now = true;
@@ -939,7 +938,7 @@ restart:
         MemoryContextReset(vstate->pagedelcontext);
         oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
 
-        ndel = _bt_pagedel(rel, buf, info->vacuum_full);
+        ndel = _bt_pagedel(rel, buf, NULL, info->vacuum_full);
 
         /* count only this page, else may double-count parent */
         if (ndel)
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -8,7 +8,7 @@
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.38 2006/10/04 00:29:49 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.39 2006/11/01 19:43:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -22,31 +22,41 @@
  * them manually if they are not seen in the WAL log during replay. This
  * makes it safe for page insertion to be a multiple-WAL-action process.
  *
+ * Similarly, deletion of an only child page and deletion of its parent page
+ * form multiple WAL log entries, and we have to be prepared to follow through
+ * with the deletion if the log ends between.
+ *
  * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split to remain incomplete for long.
+ * since we don't expect a page split or multi deletion to remain incomplete
+ * for long. In any case we need to respect the order of operations.
  */
-typedef struct bt_incomplete_split
+typedef struct bt_incomplete_action
 {
     RelFileNode node; /* the index */
+    bool is_split; /* T = pending split, F = pending delete */
+    /* these fields are for a split: */
+    bool is_root; /* we split the root */
     BlockNumber leftblk; /* left half of split */
     BlockNumber rightblk; /* right half of split */
-    bool is_root; /* we split the root */
-} bt_incomplete_split;
+    /* these fields are for a delete: */
+    BlockNumber delblk; /* parent block to be deleted */
+} bt_incomplete_action;
 
-static List *incomplete_splits;
+static List *incomplete_actions;
 
 
 static void
 log_incomplete_split(RelFileNode node, BlockNumber leftblk,
                      BlockNumber rightblk, bool is_root)
 {
-    bt_incomplete_split *split = palloc(sizeof(bt_incomplete_split));
+    bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
 
-    split->node = node;
-    split->leftblk = leftblk;
-    split->rightblk = rightblk;
-    split->is_root = is_root;
-    incomplete_splits = lappend(incomplete_splits, split);
+    action->node = node;
+    action->is_split = true;
+    action->is_root = is_root;
+    action->leftblk = leftblk;
+    action->rightblk = rightblk;
+    incomplete_actions = lappend(incomplete_actions, action);
 }
 
 static void
@@ -54,17 +64,50 @@ forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
 {
     ListCell *l;
 
-    foreach(l, incomplete_splits)
+    foreach(l, incomplete_actions)
     {
-        bt_incomplete_split *split = (bt_incomplete_split *) lfirst(l);
+        bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
 
-        if (RelFileNodeEquals(node, split->node) &&
-            downlink == split->rightblk)
+        if (RelFileNodeEquals(node, action->node) &&
+            action->is_split &&
+            downlink == action->rightblk)
         {
-            if (is_root != split->is_root)
+            if (is_root != action->is_root)
                 elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
-                     split->is_root, is_root);
-            incomplete_splits = list_delete_ptr(incomplete_splits, split);
+                     action->is_root, is_root);
+            incomplete_actions = list_delete_ptr(incomplete_actions, action);
+            pfree(action);
+            break; /* need not look further */
+        }
+    }
+}
+
+static void
+log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
+{
+    bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
+
+    action->node = node;
+    action->is_split = false;
+    action->delblk = delblk;
+    incomplete_actions = lappend(incomplete_actions, action);
+}
+
+static void
+forget_matching_deletion(RelFileNode node, BlockNumber delblk)
+{
+    ListCell *l;
+
+    foreach(l, incomplete_actions)
+    {
+        bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
+
+        if (RelFileNodeEquals(node, action->node) &&
+            !action->is_split &&
+            delblk == action->delblk)
+        {
+            incomplete_actions = list_delete_ptr(incomplete_actions, action);
+            pfree(action);
             break; /* need not look further */
         }
     }
@@ -389,8 +432,7 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
 }
 
 static void
-btree_xlog_delete_page(bool ismeta,
-                       XLogRecPtr lsn, XLogRecord *record)
+btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 {
     xl_btree_delete_page *xlrec = (xl_btree_delete_page *) XLogRecGetData(record);
     Relation reln;
@@ -427,6 +469,7 @@ btree_xlog_delete_page(bool ismeta,
                 poffset = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
                 if (poffset >= PageGetMaxOffsetNumber(page))
                 {
+                    Assert(info == XLOG_BTREE_DELETE_PAGE_HALF);
                     Assert(poffset == P_FIRSTDATAKEY(pageop));
                     PageIndexTupleDelete(page, poffset);
                     pageop->btpo_flags |= BTP_HALF_DEAD;
@@ -437,6 +480,7 @@ btree_xlog_delete_page(bool ismeta,
                     IndexTuple itup;
                     OffsetNumber nextoffset;
 
+                    Assert(info != XLOG_BTREE_DELETE_PAGE_HALF);
                     itemid = PageGetItemId(page, poffset);
                     itup = (IndexTuple) PageGetItem(page, itemid);
                     ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
@@ -523,7 +567,7 @@ btree_xlog_delete_page(bool ismeta,
     UnlockReleaseBuffer(buffer);
 
     /* Update metapage if needed */
-    if (ismeta)
+    if (info == XLOG_BTREE_DELETE_PAGE_META)
     {
         xl_btree_metadata md;
 
@@ -533,6 +577,13 @@ btree_xlog_delete_page(bool ismeta,
                          md.root, md.level,
                          md.fastroot, md.fastlevel);
     }
+
+    /* Forget any completed deletion */
+    forget_matching_deletion(xlrec->target.node, target);
+
+    /* If parent became half-dead, remember it for deletion */
+    if (info == XLOG_BTREE_DELETE_PAGE_HALF)
+        log_incomplete_deletion(xlrec->target.node, parent);
 }
 
 static void
@@ -620,10 +671,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
             btree_xlog_delete(lsn, record);
             break;
         case XLOG_BTREE_DELETE_PAGE:
-            btree_xlog_delete_page(false, lsn, record);
-            break;
         case XLOG_BTREE_DELETE_PAGE_META:
-            btree_xlog_delete_page(true, lsn, record);
+        case XLOG_BTREE_DELETE_PAGE_HALF:
+            btree_xlog_delete_page(info, lsn, record);
             break;
         case XLOG_BTREE_NEWROOT:
             btree_xlog_newroot(lsn, record);
@@ -724,6 +774,7 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
             }
         case XLOG_BTREE_DELETE_PAGE:
         case XLOG_BTREE_DELETE_PAGE_META:
+        case XLOG_BTREE_DELETE_PAGE_HALF:
             {
                 xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
 
@@ -752,7 +803,7 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
 void
 btree_xlog_startup(void)
 {
-    incomplete_splits = NIL;
+    incomplete_actions = NIL;
 }
 
 void
@@ -760,45 +811,60 @@ btree_xlog_cleanup(void)
 {
     ListCell *l;
 
-    foreach(l, incomplete_splits)
+    foreach(l, incomplete_actions)
     {
-        bt_incomplete_split *split = (bt_incomplete_split *) lfirst(l);
+        bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
         Relation reln;
-        Buffer lbuf,
-                    rbuf;
-        Page lpage,
-                    rpage;
-        BTPageOpaque lpageop,
-                    rpageop;
-        bool is_only;
 
-        reln = XLogOpenRelation(split->node);
-        lbuf = XLogReadBuffer(reln, split->leftblk, false);
-        /* failure should be impossible because we wrote this page earlier */
-        if (!BufferIsValid(lbuf))
-            elog(PANIC, "btree_xlog_cleanup: left block unfound");
-        lpage = (Page) BufferGetPage(lbuf);
-        lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
-        rbuf = XLogReadBuffer(reln, split->rightblk, false);
-        /* failure should be impossible because we wrote this page earlier */
-        if (!BufferIsValid(rbuf))
-            elog(PANIC, "btree_xlog_cleanup: right block unfound");
-        rpage = (Page) BufferGetPage(rbuf);
-        rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+        reln = XLogOpenRelation(action->node);
+        if (action->is_split)
+        {
+            /* finish an incomplete split */
+            Buffer lbuf,
+                        rbuf;
+            Page lpage,
+                        rpage;
+            BTPageOpaque lpageop,
+                        rpageop;
+            bool is_only;
 
-        /* if the two pages are all of their level, it's a only-page split */
-        is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
+            lbuf = XLogReadBuffer(reln, action->leftblk, false);
+            /* failure is impossible because we wrote this page earlier */
+            if (!BufferIsValid(lbuf))
+                elog(PANIC, "btree_xlog_cleanup: left block unfound");
+            lpage = (Page) BufferGetPage(lbuf);
+            lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+            rbuf = XLogReadBuffer(reln, action->rightblk, false);
+            /* failure is impossible because we wrote this page earlier */
+            if (!BufferIsValid(rbuf))
+                elog(PANIC, "btree_xlog_cleanup: right block unfound");
+            rpage = (Page) BufferGetPage(rbuf);
+            rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
 
-        _bt_insert_parent(reln, lbuf, rbuf, NULL,
-                          split->is_root, is_only);
+            /* if the pages are all of their level, it's a only-page split */
+            is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
+
+            _bt_insert_parent(reln, lbuf, rbuf, NULL,
+                              action->is_root, is_only);
+        }
+        else
+        {
+            /* finish an incomplete deletion (of a half-dead page) */
+            Buffer buf;
+
+            buf = XLogReadBuffer(reln, action->delblk, false);
+            if (BufferIsValid(buf))
+                if (_bt_pagedel(reln, buf, NULL, true) == 0)
+                    elog(PANIC, "btree_xlog_cleanup: _bt_pagdel failed");
+        }
     }
-    incomplete_splits = NIL;
+    incomplete_actions = NIL;
 }
 
 bool
 btree_safe_restartpoint(void)
 {
-    if (incomplete_splits)
+    if (incomplete_actions)
         return false;
     return true;
 }
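The replay-side bookkeeping in this file follows a "log, then forget" pattern. A record that starts a multi-record operation appends an entry to a list, the record that completes it removes the matching entry, and whatever is left at the end of replay must be finished by hand. A minimal sketch of that pattern follows, using a fixed-size array in place of PostgreSQL's `List`. `ToyAction`, `toy_log`, `toy_forget`, and `toy_safe_restartpoint` are hypothetical names, and the sketch tracks a single block number per entry rather than the full set of fields in `bt_incomplete_action`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_ACTIONS 16

typedef struct ToyAction
{
    bool     is_split;      /* true = pending split, false = pending delete */
    unsigned blk;           /* right half of the split, or parent to delete */
} ToyAction;

static ToyAction toy_actions[TOY_MAX_ACTIONS];
static int       toy_nactions = 0;

/* Append an entry; insertion order is kept, as the diff's comment requires. */
static void
toy_log(bool is_split, unsigned blk)
{
    assert(toy_nactions < TOY_MAX_ACTIONS);
    toy_actions[toy_nactions].is_split = is_split;
    toy_actions[toy_nactions].blk = blk;
    toy_nactions++;
}

/* Remove the first entry of the given kind naming "blk", keeping order. */
static void
toy_forget(bool is_split, unsigned blk)
{
    for (int i = 0; i < toy_nactions; i++)
    {
        if (toy_actions[i].is_split == is_split && toy_actions[i].blk == blk)
        {
            memmove(&toy_actions[i], &toy_actions[i + 1],
                    (size_t) (toy_nactions - i - 1) * sizeof(ToyAction));
            toy_nactions--;
            break;          /* need not look further */
        }
    }
}

/* A restartpoint is safe only when no multi-record operation is pending. */
static bool
toy_safe_restartpoint(void)
{
    return toy_nactions == 0;
}
```

Matching on both the entry kind and the block number mirrors the `is_split` test in the diff, so a pending split and a pending deletion that happen to name the same block number cannot cancel each other.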
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.105 2006/10/04 00:30:07 momjian Exp $
+ * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.106 2006/11/01 19:43:17 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -163,6 +163,7 @@ typedef struct BTMetaPageData
 #define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF)
 #define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
 #define P_ISDELETED(opaque) ((opaque)->btpo_flags & BTP_DELETED)
+#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
 #define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
 #define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
 
@@ -203,8 +204,10 @@ typedef struct BTMetaPageData
 #define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */
 #define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuple */
 #define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */
-#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, plus update metapage */
+#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */
 #define XLOG_BTREE_NEWROOT 0xA0 /* new root page */
+#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes
+                                          * parent half-dead */
 
 /*
  * All that we need to find changed index tuple
@@ -501,7 +504,8 @@ extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems(Relation rel, Buffer buf,
              OffsetNumber *itemnos, int nitems);
-extern int _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full);
+extern int _bt_pagedel(Relation rel, Buffer buf,
+            BTStack stack, bool vacuum_full);
 
 /*
  * prototypes for functions in nbtsearch.c
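With the new `XLOG_BTREE_DELETE_PAGE_HALF` opcode, one replay routine receives the record's info code and branches on it, where it previously took a boolean `ismeta`. The sketch below shows one way to read the three delete-page opcodes. The opcode values are copied from the header diff, while `ToyDeleteEffects` and `toy_classify_delete` are hypothetical names that do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Opcode values as they appear in the nbtree.h hunk. */
#define XLOG_BTREE_DELETE_PAGE      0x80
#define XLOG_BTREE_DELETE_PAGE_META 0x90
#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0

/* Hypothetical summary of what replaying a delete-page record implies. */
typedef struct ToyDeleteEffects
{
    bool valid;             /* one of the three delete-page opcodes? */
    bool updates_metapage;  /* metapage contents follow in the record */
    bool parent_half_dead;  /* parent must be deleted by a later record */
} ToyDeleteEffects;

static ToyDeleteEffects
toy_classify_delete(uint8_t info)
{
    ToyDeleteEffects e = {false, false, false};

    switch (info)
    {
        case XLOG_BTREE_DELETE_PAGE:
            e.valid = true;
            break;
        case XLOG_BTREE_DELETE_PAGE_META:
            e.valid = true;
            e.updates_metapage = true;
            break;
        case XLOG_BTREE_DELETE_PAGE_HALF:
            e.valid = true;
            e.parent_half_dead = true;
            break;
        default:
            break;          /* some other btree record type */
    }
    return e;
}
```

In this sketch the metapage and half-dead cases never overlap. That appears consistent with the `leftsib == P_NONE && !parent_half_dead` test in the page-deletion diff, which skips the metapage update whenever the parent is becoming half-dead.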