Add some internals documentation for hash indexes, including an
explanation of the remarkably confusing page addressing scheme.  The file
also includes my planned-but-not-yet-implemented revision of the hash
index locking scheme.
commit eaeb8621f8 (parent 2fb6e84bae)
src/backend/access/hash/README (new file, 414 lines)
@@ -0,0 +1,414 @@
$Header: /cvsroot/pgsql/src/backend/access/hash/README,v 1.1 2003/09/01 20:24:49 tgl Exp $

This directory contains an implementation of hash indexing for Postgres.

A hash index consists of two or more "buckets", into which tuples are
placed whenever their hash key maps to the bucket number. The
key-to-bucket-number mapping is chosen so that the index can be
incrementally expanded. When a new bucket is to be added to the index,
exactly one existing bucket will need to be "split", with some of its
tuples being transferred to the new bucket according to the updated
key-to-bucket-number mapping. This is essentially the same hash table
management technique embodied in src/backend/utils/hash/dynahash.c for
in-memory hash tables.

Each bucket in the hash index comprises one or more index pages. The
bucket's first page is permanently assigned to it when the bucket is
created. Additional pages, called "overflow pages", are added if the
bucket receives too many tuples to fit in the primary bucket page.
The pages of a bucket are chained together in a doubly-linked list
using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by
rebuilding it with REINDEX. Overflow pages can be recycled for reuse
in other buckets, but we never give them back to the operating system.
There is no provision for reducing the number of buckets, either.


Page addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero),
which contains statically allocated control information; primary bucket
pages; overflow pages; and bitmap pages, which keep track of overflow
pages that have been freed and are available for re-use. For addressing
purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since
any given index might need more or fewer overflow pages relative to its
number of buckets). The hash code uses an interesting set of addressing
rules to support a variable number of overflow pages while not having to
move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in
power-of-2 groups, called "split points" in the code. Buckets 0 and 1
are created when the index is initialized. At the first split, buckets 2
and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
when bucket 8 is needed, buckets 8-15 are allocated; etc. All the bucket
pages of a power-of-2 group appear consecutively in the index. This
addressing scheme allows the physical location of a bucket page to be
computed from the bucket number relatively easily, using only a small
amount of control information. We take the log2() of the bucket number
to determine which split point S the bucket belongs to, and then simply
add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
metapage) to compute the physical address. hashm_spares[S] can be
interpreted as the total number of overflow pages that have been allocated
before the bucket pages of splitpoint S. hashm_spares[0] is always 0,
so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
block numbers 1 and 2, just after the meta page. We always have
hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
former. The difference between the two represents the number of overflow
pages appearing between the bucket page groups of splitpoints N and N+1.
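
These rules are easy to get lost in, so here is a minimal standalone C
sketch of the bucket-number-to-block-number computation. It is
illustrative only: the function names and the example spares array are
not the actual source code, just a restatement of the rules above.

    #include <stdio.h>
    #include <stdint.h>

    /* Splitpoint a bucket belongs to: 0 for buckets 0 and 1,
     * otherwise floor(log2(bucket)). */
    static int
    bucket_to_splitpoint(uint32_t bucket)
    {
        int s = 0;

        while (bucket > 1)
        {
            bucket >>= 1;
            s++;
        }
        return s;
    }

    /* Physical block number of a bucket's primary page: skip the metapage
     * (block 0) plus all overflow/bitmap pages allocated before this
     * bucket's splitpoint group. */
    static uint32_t
    bucket_to_blkno(const uint32_t *spares, uint32_t bucket)
    {
        return bucket + spares[bucket_to_splitpoint(bucket)] + 1;
    }

    int
    main(void)
    {
        /* hypothetical spares[]: 0, 1 and 3 overflow pages allocated
         * before splitpoints 1, 2 and 3 respectively */
        uint32_t spares[] = {0, 0, 1, 3};

        for (uint32_t b = 0; b <= 8; b++)
            printf("bucket %u -> block %u\n",
                   (unsigned) b, (unsigned) bucket_to_blkno(spares, b));
        return 0;
    }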

When S splitpoints exist altogether, the array entries hashm_spares[0]
through hashm_spares[S] are valid; hashm_spares[S] records the current
total number of overflow pages. New overflow pages are created as needed
at the end of the index, and recorded by incrementing hashm_spares[S].
When it is time to create a new splitpoint's worth of bucket pages, we
copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
stored in the hashm_ovflpoint field of the meta page). This has the
effect of reserving the correct number of bucket pages at the end of the
index, and preparing to allocate additional overflow pages after those
bucket pages. hashm_spares[] entries before S cannot change anymore,
since that would require moving already-created bucket pages.
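
The metapage bookkeeping just described amounts to two tiny operations; a
toy sketch (an illustrative struct, not the real metapage layout, and with
no bounds checking):

    #include <stdint.h>

    #define MAX_SPLITPOINTS 32          /* illustrative limit */

    typedef struct
    {
        uint32_t spares[MAX_SPLITPOINTS];  /* overflow pages before each splitpoint */
        int      ovflpoint;                /* current splitpoint S */
    } toy_metapage;

    /* A new overflow page is always added at the end of the index,
     * so it is counted in the current splitpoint's spares entry. */
    void
    note_new_overflow_page(toy_metapage *m)
    {
        m->spares[m->ovflpoint]++;
    }

    /* Starting splitpoint S+1 reserves its bucket pages at the end of the
     * index: carry the running overflow-page count forward and bump S.
     * Entries before the new S never change again. */
    void
    start_new_splitpoint(toy_metapage *m)
    {
        m->spares[m->ovflpoint + 1] = m->spares[m->ovflpoint];
        m->ovflpoint++;
    }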

Since overflow pages may be recycled if enough tuples are deleted from
their bucket, we need a way to keep track of currently-free overflow
pages. The state of each overflow page (0 = available, 1 = not available)
is recorded in "bitmap" pages dedicated to this purpose. The entries in
the bitmap are indexed by "bit number", a zero-based count in which every
overflow page has a unique entry. We can convert between an overflow
page's physical block number and its bit number using the information in
hashm_spares[] (see hashovfl.c for details). The bit number sequence
includes the bitmap pages, which is the reason for saying that bitmap
pages are a subset of the overflow pages. It turns out in fact that each
bitmap page's first bit represents itself --- this is not an essential
property, but falls out of the fact that we only allocate another bitmap
page when we really need one. Bit number zero always corresponds to block
number 3, which is the first bitmap page and is allocated during index
creation.
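
Locating a given bit number within the bitmap pages is then just division
and remainder; a minimal sketch, assuming some fixed number of bits per
bitmap page (the constant and names are illustrative, not the real code):

    #include <stdint.h>

    /* Assume each bitmap page holds this many bits; in the real code the
     * figure is derived from the page size.  Illustrative value only. */
    #define BITS_PER_BITMAP_PAGE 32768

    /* Which bitmap page (0-based, in order of creation) holds this bit? */
    uint32_t
    bitno_to_bitmap_page(uint32_t bitno)
    {
        return bitno / BITS_PER_BITMAP_PAGE;
    }

    /* Position of the bit within that bitmap page. */
    uint32_t
    bitno_to_bit_in_page(uint32_t bitno)
    {
        return bitno % BITS_PER_BITMAP_PAGE;
    }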


Lock definitions
----------------

We use both lmgr locks ("heavyweight" locks) and buffer context locks
(LWLocks) to control access to a hash index. lmgr locks are needed for
long-term locking since there is a (small) risk of deadlock, which we must
be able to detect. Buffer context locks are used for short-term access
control to individual pages of the index.

We define the following lmgr locks for a hash index:

LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
mapping. A process attempting to enlarge the hash table by splitting a
bucket must exclusive-lock this lock before modifying the metapage data
representing the mapping. Processes intending to access a particular
bucket must share-lock this lock until they have acquired lock on the
correct target bucket.

LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket. A process
splitting a bucket must exclusive-lock both old and new halves of the
bucket until it is done. A process doing VACUUM must exclusive-lock the
bucket it is currently purging tuples from. Processes doing scans or
insertions must share-lock the bucket they are scanning or inserting into.
(It is okay to allow concurrent scans and insertions.)

The lmgr lock IDs corresponding to overflow pages are currently unused.
These are available for possible future refinements.

Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share. A process must also obtain
read or write buffer lock on the metapage or bucket page before accessing
said page.

Processes performing hash index scans must hold share lock on the bucket
they are scanning throughout the scan. This seems to be essential, since
there is no reasonable way for a scan to cope with its bucket being split
underneath it. This creates a possibility of deadlock external to the
hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process. If that process
then does something that requires exclusive lock on the bucket, we have
deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
can be detected and recovered from. This also forces the page-zero lock
to be an lmgr lock, because as we'll see below it is held while attempting
to acquire a bucket lock, and so it could also participate in a deadlock.

Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
To prevent deadlock we enforce these coding rules: no buffer lock may be
held long term (across index AM calls), nor may any buffer lock be held
while waiting for an lmgr lock, nor may more than one buffer lock
be held at a time by any one process. (The third restriction is probably
stronger than necessary, but it makes the proof of no deadlock obvious.)


Pseudocode algorithms
---------------------

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets). Bucket splitting is
done at conclusion of any insertion that leaves the hash table more full
than the target load factor, but it is convenient to consider it as an
independent operation. Note that we do not have a bucket-merge operation
--- the number of buckets never shrinks. Insertion, splitting, and
garbage collection may all need access to freelist management, which keeps
track of available overflow pages.

The reader algorithm is:

    share-lock page 0 (to prevent active split)
    read/sharelock meta page
    compute bucket number for target hash key
    release meta page
    share-lock bucket page (to prevent split/compact of this bucket)
    release page 0 share-lock
    -- then, per read request:
    read/sharelock current page of bucket
    step to next page if necessary (no chaining of locks)
    get tuple
    release current page
    -- at scan shutdown:
    release bucket share-lock

By holding the page-zero lock until lock on the target bucket is obtained,
the reader ensures that the target bucket calculation is valid (otherwise
the bucket might be split before the reader arrives at it, and the target
entries might go into the new bucket). Holding the bucket sharelock for
the remainder of the scan prevents the reader's current-tuple pointer from
being invalidated by other processes. Notice though that the reader need
not prevent other buckets from being split or compacted.

The insertion algorithm is rather similar:

    share-lock page 0 (to prevent active split)
    read/sharelock meta page
    compute bucket number for target hash key
    release meta page
    share-lock bucket page (to prevent split/compact this bucket)
    release page 0 share-lock
    -- (so far same as reader)
    read/exclusive-lock current page of bucket
    if full, release, read/exclusive-lock next page; repeat as needed
    >> see below if no space in any page of bucket
    insert tuple
    write/release current page
    release bucket share-lock
    read/exclusive-lock meta page
    increment tuple count, decide if split needed
    write/release meta page
    done if no split needed, else enter Split algorithm below

It is okay for an insertion to take place in a bucket that is being
actively scanned, because it does not change the position of any existing
item in the bucket, so scan states are not invalidated. We only need the
short-term buffer locks to ensure that readers do not see a
partially-updated page.

It is clearly impossible for readers and inserters to deadlock, and in
fact this algorithm allows them a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

    exclusive-lock page 0 (assert the right to begin a split)
    read/exclusive-lock meta page
    check split still needed
    if split not needed anymore, drop locks and exit
    decide which bucket to split
    Attempt to X-lock new bucket number (shouldn't fail, but...)
    Attempt to X-lock old bucket number (definitely could fail)
    if above fail, drop locks and exit
    update meta page to reflect new number of buckets
    write/release meta page
    release X-lock on page 0
    -- now, accesses to all other buckets can proceed.
    Perform actual split of bucket, moving tuples as needed
    >> see below about acquiring needed extra space
    Release X-locks of old and new buckets

Note the page zero and metapage locks are not held while the actual tuple
rearrangement is performed, so accesses to other buckets can proceed in
parallel; in fact, it's possible for multiple bucket splits to proceed
in parallel.

Split's attempt to X-lock the old bucket number could fail if another
process holds S-lock on it. We do not want to wait if that happens, first
because we don't want to wait while holding the metapage exclusive-lock,
and second because it could very easily result in deadlock. (The other
process might be out of the hash AM altogether, and could do something
that blocks on another lock this process holds; so even if the hash
algorithm itself is deadlock-free, a user-induced deadlock could occur.)
So, this is a conditional LockAcquire operation, and if it fails we just
abandon the attempt to split. This is all right since the index is
overfull but perfectly functional. Every subsequent inserter will try to
split, and eventually one will succeed. Even if several inserters fail in
a row and the index stays overfull for a while, enough splits will
eventually succeed that the index is no longer overfull and the attempts
stop. (We could make a successful splitter loop to see if the index is
still overfull, but it seems better to distribute the split overhead
across successive insertions.)
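
The essential point is "try the lock, and give up rather than wait". As a
rough standalone analogy only (pthread_mutex_trylock standing in for the
conditional LockAcquire; this is not hash-index code):

    #include <pthread.h>
    #include <stdbool.h>

    /* Try to claim the old bucket for splitting.  If the lock is not
     * immediately available, report failure instead of blocking, so the
     * caller can simply abandon the split attempt. */
    bool
    try_claim_old_bucket(pthread_mutex_t *old_bucket_lock)
    {
        return pthread_mutex_trylock(old_bucket_lock) == 0;
    }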

It may be wise to make the initial exclusive-lock-page-zero operation a
conditional one as well, although the odds of a deadlock failure are quite
low. (AFAICS it could only deadlock against a VACUUM operation that is
trying to X-lock a bucket that the current process has a stopped indexscan
in.)

A problem is that if a split fails partway through (eg due to insufficient
disk space) the index is left corrupt. The probability of that could be
made quite low if we grab a free page or two before we update the meta
page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
go-round.

The fourth operation is garbage collection (bulk deletion):

    next bucket := 0
    read/sharelock meta page
    fetch current max bucket number
    release meta page
    while next bucket <= max bucket do
        Acquire X lock on target bucket
        Scan and remove tuples, compact free space as needed
        Release X lock
        next bucket ++
    end loop
    exclusive-lock meta page
    check if number of buckets changed
    if so, release lock and return to for-each-bucket loop
    else update metapage tuple count
    write/release meta page

Note that this is designed to allow concurrent splits. If a split occurs,
tuples relocated into the new bucket will be visited twice by the scan,
but that does no harm. (We must however be careful about the statistics
reported by the VACUUM operation. What we can do is count the number of
tuples scanned, and believe this in preference to the stored tuple count
if the stored tuple count and number of buckets did *not* change at any
time during the scan. This provides a way of correcting the stored tuple
count if it gets out of sync for some reason. But if a split or insertion
does occur concurrently, the scan count is untrustworthy; instead,
subtract the number of tuples deleted from the stored tuple count and
use that.)
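
The statistics rule in that parenthetical note reduces to a small
decision; a sketch under the stated assumptions (toy parameter names, not
the actual VACUUM code):

    #include <stdint.h>

    /* Decide what tuple count to store back in the metapage after bulk
     * deletion.  If neither the stored tuple count nor the bucket count
     * changed during the scan, the scan's own count of surviving tuples is
     * the most trustworthy figure (and corrects any drift).  Otherwise a
     * concurrent insert or split happened, so fall back to adjusting the
     * stored count by the number of deletions. */
    uint64_t
    new_stored_tuple_count(uint64_t tuples_counted_by_scan,
                           uint64_t tuples_deleted,
                           uint64_t stored_count_at_start,
                           uint64_t stored_count_at_end,
                           uint32_t nbuckets_at_start,
                           uint32_t nbuckets_at_end)
    {
        if (stored_count_at_start == stored_count_at_end &&
            nbuckets_at_start == nbuckets_at_end)
            return tuples_counted_by_scan;

        return stored_count_at_end - tuples_deleted;
    }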

The exclusive lock request could deadlock in some strange scenarios, but
we can just error out without any great harm being done.


Free space management
---------------------

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:

    read/exclusive-lock meta page
    determine next bitmap page number; if none, exit loop
    release meta page lock
    read/exclusive-lock bitmap page
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        write/release bitmap page
        read/exclusive-lock meta page
        if first-free-bit value did not change,
            update it and write meta page
        release meta page
        return page number
    else (not found):
        release bitmap page
        loop back to try next bitmap page, if any
    -- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another bitmap page; update meta information
    write/release meta page
    return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index. We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)
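
The "search for a free page (zero bit in bitmap)" step above is an
ordinary first-zero-bit scan; a self-contained sketch, assuming the bitmap
is simply an array of 32-bit words (names and word size are illustrative,
not the hashovfl.c code):

    #include <stdint.h>

    /* Return the bit number of the first 0 bit in the bitmap, or -1 if
     * every bit is set (i.e. no free overflow page on this bitmap page). */
    int32_t
    find_first_free_bit(const uint32_t *bitmap, uint32_t nwords)
    {
        for (uint32_t w = 0; w < nwords; w++)
        {
            if (bitmap[w] != UINT32_MAX)    /* at least one clear bit here */
            {
                uint32_t word = bitmap[w];

                for (int b = 0; b < 32; b++)
                {
                    if ((word & ((uint32_t) 1 << b)) == 0)
                        return (int32_t) (w * 32 + b);
                }
            }
        }
        return -1;
    }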

The portion of tuple insertion that calls the above subroutine looks
like this:

    -- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    call free-page-acquire routine
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    update (former) last page to point to new page
    write-lock and initialize new page, with back link to former last page
    write and release former last page
    insert tuple into new page
    -- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket. They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
exclusive lock on the new bucket.

Freeing an overflow page is done by garbage collection and by bucket
splitting (the old bucket may contain no-longer-needed overflow pages).
In both cases, the process holds exclusive lock on the containing bucket,
so need not worry about other accessors of pages in the bucket. The
algorithm is:

    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    read/share-lock meta page
    determine which bitmap page contains the free space bit for page
    release meta page
    read/exclusive-lock bitmap page
    update bitmap bit
    write/release bitmap page
    if page number is less than what we saw as first-free-bit in meta:
        read/exclusive-lock meta page
        if page number is still less than first-free-bit,
            update first-free-bit field and write meta page
        release meta page

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field.

All the freespace operations should be called while holding no buffer
locks. Since they need no lmgr locks, deadlock is not possible.


Other notes
-----------

All the shenanigans with locking prevent a split occurring while *another*
process is stopped in a given bucket. They do not ensure that one of
our *own* backend's scans is not stopped in the bucket, because lmgr
doesn't consider a process's own locks to conflict. So the Split
algorithm must check for that case separately before deciding it can go
ahead with the split. VACUUM does not have this problem since nothing
else can be happening within the vacuuming backend.

Should we instead try to fix the state of any conflicting local scan?
Seems mighty ugly --- got to move the held bucket S-lock as well as lots
of other messiness. For now, just punt and don't split.