Mirror of https://git.postgresql.org/git/postgresql.git (synced 2025-01-06 15:24:56 +08:00)
commit 78475b0eca
parent f3008c31f0

Update README-SSI. Add a section to describe the "dangerous structure" that
SSI is based on, as well as the optimizations about relative commit times and
read-only transactions. Plus a bunch of other misc fixes and improvements.

Dan Ports
@ -51,13 +51,13 @@ if a transaction can be shown to always do the right thing when it is
|
|||||||
run alone (before or after any other transaction), it will always do
|
run alone (before or after any other transaction), it will always do
|
||||||
the right thing in any mix of concurrent serializable transactions.
|
the right thing in any mix of concurrent serializable transactions.
|
||||||
Where conflicts with other transactions would result in an
|
Where conflicts with other transactions would result in an
|
||||||
inconsistent state within the database, or an inconsistent view of
|
inconsistent state within the database or an inconsistent view of
|
||||||
the data, a serializable transaction will block or roll back to
|
the data, a serializable transaction will block or roll back to
|
||||||
prevent the anomaly. The SQL standard provides a specific SQLSTATE
|
prevent the anomaly. The SQL standard provides a specific SQLSTATE
|
||||||
for errors generated when a transaction rolls back for this reason,
|
for errors generated when a transaction rolls back for this reason,
|
||||||
so that transactions can be retried automatically.
|
so that transactions can be retried automatically.
|
||||||
|
|
||||||
Before version 9.1 PostgreSQL did not support a full serializable
|
Before version 9.1, PostgreSQL did not support a full serializable
|
||||||
isolation level. A request for serializable transaction isolation
|
isolation level. A request for serializable transaction isolation
|
||||||
actually provided snapshot isolation. This has well known anomalies
|
actually provided snapshot isolation. This has well known anomalies
|
||||||
which can allow data corruption or inconsistent views of the data
|
which can allow data corruption or inconsistent views of the data
|
||||||
@ -77,7 +77,7 @@ Serializable Isolation Implementation Strategies
|
|||||||
|
|
||||||
Techniques for implementing full serializable isolation have been
|
Techniques for implementing full serializable isolation have been
|
||||||
published and in use in many database products for decades. The
|
published and in use in many database products for decades. The
|
||||||
primary technique which has been used is Strict 2 Phase Locking
|
primary technique which has been used is Strict Two-Phase Locking
|
||||||
(S2PL), which operates by blocking writes against data which has been
|
(S2PL), which operates by blocking writes against data which has been
|
||||||
read by concurrent transactions and blocking any access (read or
|
read by concurrent transactions and blocking any access (read or
|
||||||
write) against data which has been written by concurrent
|
write) against data which has been written by concurrent
|
||||||
@ -112,54 +112,90 @@ visualize the difference between the serializable implementations
|
|||||||
described above, is to consider that among transactions executing at
|
described above, is to consider that among transactions executing at
|
||||||
the serializable transaction isolation level, the results are
|
the serializable transaction isolation level, the results are
|
||||||
required to be consistent with some serial (one-at-a-time) execution
|
required to be consistent with some serial (one-at-a-time) execution
|
||||||
of the transactions[1]. How is that order determined in each?
|
of the transactions [1]. How is that order determined in each?
|
||||||
|
|
||||||
S2PL locks rows used by the transaction in a way which blocks
|
In S2PL, each transaction locks any data it accesses. It holds the
|
||||||
conflicting access, so that at the moment of a successful commit it
|
locks until committing, preventing other transactions from making
|
||||||
is certain that no conflicting access has occurred. Some transactions
|
conflicting accesses to the same data in the interim. Some
|
||||||
may have blocked, essentially being partially serialized with the
|
transactions may have to be rolled back to prevent deadlock. But
|
||||||
committing transaction, to allow this. Some transactions may have
|
successful transactions can always be viewed as having occurred
|
||||||
been rolled back, due to cycles in the blocking. But with S2PL,
|
sequentially, in the order they committed.
|
||||||
transactions can always be viewed as having occurred serially, in the
|
|
||||||
order of successful commit.
|
|
||||||
|
|
||||||
With snapshot isolation, reads never block writes, nor vice versa, so
|
With snapshot isolation, reads never block writes, nor vice versa, so
|
||||||
there is much less actual serialization. The order in which
|
more concurrency is possible. The order in which transactions appear
|
||||||
transactions appear to have executed is determined by something more
|
to have executed is determined by something more subtle than in S2PL:
|
||||||
subtle than in S2PL: read/write dependencies. If a transaction
|
read/write dependencies. If a transaction reads data, it appears to
|
||||||
attempts to read data which is not visible to it because the
|
execute after the transaction that wrote the data it is reading.
|
||||||
transaction which wrote it (or will later write it) is concurrent
|
Similarly, if it updates data, it appears to execute after the
|
||||||
(one of them was running when the other acquired its snapshot), then
|
transaction that wrote the previous version. These dependencies, which
|
||||||
the reading transaction appears to have executed first, regardless of
|
we call "wr-dependencies" and "ww-dependencies", are consistent with
|
||||||
the actual sequence of transaction starts or commits (since it sees a
|
the commit order, because the first transaction must have committed
|
||||||
database state prior to that in which the other transaction leaves
|
before the second starts. However, there can also be dependencies
|
||||||
it). If one transaction has both rw-dependencies in (meaning that a
|
between two *concurrent* transactions, i.e. where one was running when
|
||||||
concurrent transaction attempts to read data it writes) and out
|
the other acquired its snapshot. These "rw-conflicts" occur when one
|
||||||
(meaning it attempts to read data a concurrent transaction writes),
|
transaction attempts to read data which is not visible to it because
|
||||||
and a couple other conditions are met, there can appear to be a cycle
|
the transaction which wrote it (or will later write it) is
|
||||||
in execution order of the transactions. This is when the anomalies
|
concurrent. The reading transaction appears to have executed first,
|
||||||
occur.
|
regardless of the actual sequence of transaction starts or commits,
|
||||||
|
because it sees a database state prior to that in which the other
|
||||||
|
transaction leaves it.
|
||||||
|
|
||||||
SSI works by watching for the conditions mentioned above, and rolling
|
Anomalies occur when a cycle is created in the graph of dependencies:
|
||||||
back a transaction when needed to prevent any anomaly. The apparent
|
when a dependency or series of dependencies causes transaction A to
|
||||||
order of execution will always be consistent with any actual
|
appear to have executed before transaction B, but another series of
|
||||||
serialization (i.e., a transaction which run by itself can always be
|
dependencies causes B to appear before A. If that's the case, then
|
||||||
considered to have run after any transactions committed before it
|
the results can't be consistent with any serial execution of the
|
||||||
started and before any transacton which starts after it commits); but
|
transactions.
|
||||||
among concurrent transactions it will appear that the transaction on
|
|
||||||
the read side of a rw-dependency executed before the transaction on
|
|
||||||
the write side.
|
SSI Algorithm
|
||||||
|
-------------
|
||||||
|
|
||||||
|
Serializable transactions in PostgreSQL are implemented using
|
||||||
|
Serializable Snapshot Isolation (SSI), based on the work of Cahill
|
||||||
|
et al. Fundamentally, this allows snapshot isolation to run as it
|
||||||
|
has, while monitoring for conditions which could create a serialization
|
||||||
|
anomaly.
|
||||||
|
|
||||||
|
SSI is based on the observation [2] that each snapshot isolation
|
||||||
|
anomaly corresponds to a cycle that contains a "dangerous structure"
|
||||||
|
of two adjacent rw-conflict edges:
|
||||||
|
|
||||||
|
Tin ------> Tpivot ------> Tout
|
||||||
|
rw rw
|
||||||
|
|
||||||
|
SSI works by watching for this dangerous structure, and rolling
|
||||||
|
back a transaction when needed to prevent any anomaly. This means it
|
||||||
|
only needs to track rw-conflicts between concurrent transactions, not
|
||||||
|
wr- and ww-dependencies. It also means there is a risk of false
|
||||||
|
positives, because not every dangerous structure corresponds to an
|
||||||
|
actual serialization failure.
|
||||||
|
|
||||||
|
The PostgreSQL implementation uses two additional optimizations:
|
||||||
|
|
||||||
|
* Tout must commit before any other transaction in the cycle
|
||||||
|
(see proof of Theorem 2.1 of [2]). We only roll back a transaction
|
||||||
|
if Tout commits before Tpivot and Tin.
|
||||||
|
|
||||||
|
* if Tin is read-only, there can only be an anomaly if Tout committed
|
||||||
|
before Tin takes its snapshot. This optimization is an original
|
||||||
|
one. Proof:
|
||||||
|
|
||||||
|
- Because there is a cycle, there must be some transaction T0 that
|
||||||
|
precedes Tin in the serial order. (T0 might be the same as Tout).
|
||||||
|
|
||||||
|
- The dependency between T0 and Tin can't be a rw-conflict,
|
||||||
|
because Tin was read-only, so it must be a wr-dependency.
|
||||||
|
Those can only occur if T0 committed before Tin started.
|
||||||
|
|
||||||
|
- Because Tout must commit before any other transaction in the
|
||||||
|
cycle, it must commit before T0 commits -- and thus before Tin
|
||||||
|
starts.
|
||||||
|
|
||||||
|
|
||||||
PostgreSQL Implementation
|
PostgreSQL Implementation
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
The implementation of serializable transactions for PostgreSQL is
|
|
||||||
accomplished through Serializable Snapshot Isolation (SSI), based on
|
|
||||||
the work of Cahill, et al. Fundamentally, this allows snapshot
|
|
||||||
isolation to run as it has, while monitoring for conditions which
|
|
||||||
could create a serialization anomaly.
|
|
||||||
|
|
||||||
* Since this technique is based on Snapshot Isolation (SI), those
|
* Since this technique is based on Snapshot Isolation (SI), those
|
||||||
areas in PostgreSQL which don't use SI can't be brought under SSI.
|
areas in PostgreSQL which don't use SI can't be brought under SSI.
|
||||||
This includes system tables, temporary tables, sequences, hint bit
|
This includes system tables, temporary tables, sequences, hint bit
|
||||||
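The dangerous-structure rule described in the "SSI Algorithm" section of the hunk above can be illustrated with a small sketch. This is not PostgreSQL's code: the Txn class, its logical timestamps, and the is_dangerous function are hypothetical names introduced here, and the sketch assumes the two rw-conflict edges Tin -> Tpivot -> Tout have already been detected. It only applies the two optimizations stated in the text: Tout must be the first of the three to commit, and a read-only Tin matters only if Tout committed before Tin took its snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Txn:
    """Hypothetical transaction record using logical timestamps."""
    name: str
    snapshot_time: int                  # when the snapshot was taken
    commit_time: Optional[int] = None   # None while still in progress
    read_only: bool = False


def is_dangerous(t_in: Txn, t_pivot: Txn, t_out: Txn) -> bool:
    """Decide whether Tin -rw-> Tpivot -rw-> Tout calls for a rollback.

    Assumes both rw-conflict edges are already known to exist.
    """
    # Optimization 1: Tout must commit before Tpivot and Tin.
    if t_out.commit_time is None:
        return False
    for other in (t_pivot, t_in):
        if other.commit_time is not None and other.commit_time < t_out.commit_time:
            return False

    # Optimization 2: a read-only Tin can only see an anomaly if Tout
    # committed before Tin took its snapshot.
    if t_in.read_only and t_out.commit_time >= t_in.snapshot_time:
        return False

    return True
```

Under these assumptions, a structure whose Tout is still in progress is never reported, which matches the statement that not every pair of adjacent rw-conflicts needs to cause a rollback.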
@ -180,7 +216,7 @@ lock or to use SELECT FOR SHARE or SELECT FOR UPDATE.
|
|||||||
* Those who want to continue to use snapshot isolation without
|
* Those who want to continue to use snapshot isolation without
|
||||||
the additional protections of SSI (and the associated costs of
|
the additional protections of SSI (and the associated costs of
|
||||||
enforcing those protections), can use the REPEATABLE READ transaction
|
enforcing those protections), can use the REPEATABLE READ transaction
|
||||||
isolation level. This level will retain its legacy behavior, which
|
isolation level. This level retains its legacy behavior, which
|
||||||
is identical to the old SERIALIZABLE implementation and fully
|
is identical to the old SERIALIZABLE implementation and fully
|
||||||
consistent with the standard's requirements for the REPEATABLE READ
|
consistent with the standard's requirements for the REPEATABLE READ
|
||||||
transaction isolation level.
|
transaction isolation level.
|
||||||
@ -236,7 +272,7 @@ in PostgreSQL, but tailored to the needs of SIREAD predicate locking,
|
|||||||
are used. These refer to physical objects actually accessed in the
|
are used. These refer to physical objects actually accessed in the
|
||||||
course of executing the query, to model the predicates through
|
course of executing the query, to model the predicates through
|
||||||
inference. Anyone interested in this subject should review the
|
inference. Anyone interested in this subject should review the
|
||||||
Hellerstein, Stonebraker and Hamilton paper[2], along with the
|
Hellerstein, Stonebraker and Hamilton paper [3], along with the
|
||||||
locking papers referenced from that and the Cahill papers.
|
locking papers referenced from that and the Cahill papers.
|
||||||
|
|
||||||
Because the SIREAD locks don't block, traditional locking techniques
|
Because the SIREAD locks don't block, traditional locking techniques
|
||||||
@ -273,6 +309,15 @@ transaction already holds a write lock on any tuple representing the
|
|||||||
row, since a rw-dependency would also create a ww-dependency which
|
row, since a rw-dependency would also create a ww-dependency which
|
||||||
has more aggressive enforcement and will thus prevent any anomaly.
|
has more aggressive enforcement and will thus prevent any anomaly.
|
||||||
|
|
||||||
|
* Modifying a heap tuple creates a rw-conflict with any transaction
|
||||||
|
that holds a SIREAD lock on that tuple, or on the page or relation
|
||||||
|
that contains it.
|
||||||
|
|
||||||
|
* Inserting a new tuple creates a rw-conflict with any transaction
|
||||||
|
holding a SIREAD lock on the entire relation. It doesn't conflict with
|
||||||
|
page-level locks, because page-level locks are only used to aggregate
|
||||||
|
tuple locks. Unlike index page locks, they don't lock "gaps" on the page.
|
||||||
|
|
||||||
|
|
||||||
Index AM implementations
|
Index AM implementations
|
||||||
------------------------
|
------------------------
|
||||||
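The two heap rules added in the hunk above (modifying a tuple versus inserting a new one) can be sketched as a lookup against a table of SIREAD locks. The Target tuple, the SireadLocks class, and both conflict functions are hypothetical names used only for illustration, and excluding the writer's own locks is an assumption of this sketch rather than something the text states.

```python
from typing import Dict, NamedTuple, Optional, Set


class Target(NamedTuple):
    """Hypothetical lock target: a relation, a page in it, or a tuple on that page."""
    relation: str
    page: Optional[int] = None
    tuple_off: Optional[int] = None


class SireadLocks:
    """Toy table mapping lock targets to the transactions holding them."""

    def __init__(self) -> None:
        self._holders: Dict[Target, Set[str]] = {}

    def acquire(self, txn: str, target: Target) -> None:
        self._holders.setdefault(target, set()).add(txn)

    def holders(self, target: Target) -> Set[str]:
        return set(self._holders.get(target, ()))


def conflicts_on_update(locks: SireadLocks, writer: str,
                        relation: str, page: int, tuple_off: int) -> Set[str]:
    # Modifying a tuple conflicts with SIREAD locks on the tuple itself,
    # on the page that contains it, and on the whole relation.
    readers = (locks.holders(Target(relation, page, tuple_off))
               | locks.holders(Target(relation, page))
               | locks.holders(Target(relation)))
    return readers - {writer}  # assumption: no conflict with oneself


def conflicts_on_insert(locks: SireadLocks, writer: str, relation: str) -> Set[str]:
    # Inserting conflicts only with relation-level locks; heap page locks
    # merely aggregate tuple locks and do not cover "gaps" on the page.
    return locks.holders(Target(relation)) - {writer}
```

The asymmetry between the two functions is the point of the sketch: an insert checks one level, an update checks three.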
@ -296,13 +341,13 @@ need not generate a conflict, although an update which "moves" a row
|
|||||||
into the scan must generate a conflict. While correctness allows
|
into the scan must generate a conflict. While correctness allows
|
||||||
false positives, they should be minimized for performance reasons.
|
false positives, they should be minimized for performance reasons.
|
||||||
|
|
||||||
Several optimizations are possible:
|
Several optimizations are possible, though not all implemented yet:
|
||||||
|
|
||||||
* An index scan which is just finding the right position for an
|
* An index scan which is just finding the right position for an
|
||||||
index insertion or deletion need not acquire a predicate lock.
|
index insertion or deletion need not acquire a predicate lock.
|
||||||
|
|
||||||
* An index scan which is comparing for equality on the entire key
|
* An index scan which is comparing for equality on the entire key
|
||||||
for a unique index need not acquire a predicate lock as long as a key
|
for a unique index need not acquire a predicate lock as long as a key
|
||||||
is found corresponding to a visible tuple which has not been modified
|
is found corresponding to a visible tuple which has not been modified
|
||||||
by another transaction -- there are no "between or around" gaps to
|
by another transaction -- there are no "between or around" gaps to
|
||||||
cover.
|
cover.
|
||||||
@ -317,10 +362,10 @@ x = 1 AND x = 2), then no predicate lock is needed.
|
|||||||
|
|
||||||
Other index AM implementation considerations:
|
Other index AM implementation considerations:
|
||||||
|
|
||||||
* If a btree search discovers that no root page has yet been
|
* B-tree index searches acquire predicate locks only on the
|
||||||
created, a predicate lock on the index relation is required;
|
index *leaf* pages needed to lock the appropriate index range. If,
|
||||||
otherwise btree searches must get to the leaf level to determine
|
however, a search discovers that no root page has yet been created, a
|
||||||
which tuples match, so predicate locks go there.
|
predicate lock on the index relation is required.
|
||||||
|
|
||||||
* GiST searches can determine that there are no matches at any
|
* GiST searches can determine that there are no matches at any
|
||||||
level of the index, so there must be a predicate lock at each index
|
level of the index, so there must be a predicate lock at each index
|
||||||
@ -346,11 +391,6 @@ to be added from scratch.
|
|||||||
|
|
||||||
2. The existing in-memory lock structures were not suitable for
|
2. The existing in-memory lock structures were not suitable for
|
||||||
tracking SIREAD locks.
|
tracking SIREAD locks.
|
||||||
* The database products used for the prototype
|
|
||||||
implementations for the papers used update-in-place with a rollback
|
|
||||||
log for their MVCC implementations, while PostgreSQL leaves the old
|
|
||||||
version of a row in place and adds a new tuple to represent the row
|
|
||||||
at a new location.
|
|
||||||
* In PostgreSQL, tuple level locks are not held in RAM for
|
* In PostgreSQL, tuple level locks are not held in RAM for
|
||||||
any length of time; lock information is written to the tuples
|
any length of time; lock information is written to the tuples
|
||||||
involved in the transactions.
|
involved in the transactions.
|
||||||
@ -450,18 +490,19 @@ there can't be a rw-conflict from T3 to T0.
|
|||||||
|
|
||||||
o In both cases, we didn't need the T1 -> T3 edge.
|
o In both cases, we didn't need the T1 -> T3 edge.
|
||||||
|
|
||||||
* Predicate locking in PostgreSQL will start at the tuple level
|
* Predicate locking in PostgreSQL starts at the tuple level
|
||||||
when possible, with automatic conversion of multiple fine-grained
|
when possible. Multiple fine-grained locks are promoted to a single
|
||||||
locks to coarser granularity as need to avoid resource exhaustion.
|
coarser-granularity lock as needed to avoid resource exhaustion. The
|
||||||
The amount of memory used for these structures will be configurable,
|
amount of memory used for these structures is configurable, to balance
|
||||||
to balance RAM usage against SIREAD lock granularity.
|
RAM usage against SIREAD lock granularity.
|
||||||
|
|
||||||
* A process-local copy of locks held by a process and the coarser
|
* Each backend keeps a process-local table of the locks it holds.
|
||||||
covering locks with counts, are kept to support granularity promotion
|
To support granularity promotion decisions with low CPU and locking
|
||||||
decisions with low CPU and locking overhead.
|
overhead, this table also includes the coarser covering locks and the
|
||||||
|
number of finer-granularity locks they cover.
|
||||||
|
|
||||||
* Conflicts will be identified by looking for predicate locks
|
* Conflicts are identified by looking for predicate locks
|
||||||
when tuples are written and looking at the MVCC information when
|
when tuples are written, and by looking at the MVCC information when
|
||||||
tuples are read. There is no matching between two RAM-based locks.
|
tuples are read. There is no matching between two RAM-based locks.
|
||||||
|
|
||||||
* Because write locks are stored in the heap tuples rather than a
|
* Because write locks are stored in the heap tuples rather than a
|
||||||
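The granularity promotion and process-local lock table described in the hunk above can be sketched as follows. This is a simplified illustration, not the backend's data structure: LocalLockTable and max_tuple_locks_per_page are hypothetical names, only tuple-to-page promotion is modeled, and the threshold policy is an assumption made for the example.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

PageKey = Tuple[str, int]          # (relation, page)
TupleKey = Tuple[str, int, int]    # (relation, page, tuple offset)


class LocalLockTable:
    """Toy per-backend table of predicate locks with covering-lock counts."""

    def __init__(self, max_tuple_locks_per_page: int = 2) -> None:
        self.max_tuple_locks_per_page = max_tuple_locks_per_page
        self.tuple_locks: Set[TupleKey] = set()
        self.page_locks: Set[PageKey] = set()
        # Number of tuple locks currently held under each page.
        self.children: DefaultDict[PageKey, int] = defaultdict(int)

    def lock_tuple(self, relation: str, page: int, tuple_off: int) -> None:
        parent = (relation, page)
        if parent in self.page_locks:
            return  # already covered by a coarser lock
        key = (relation, page, tuple_off)
        if key in self.tuple_locks:
            return
        self.tuple_locks.add(key)
        self.children[parent] += 1
        if self.children[parent] > self.max_tuple_locks_per_page:
            self._promote(parent)

    def _promote(self, parent: PageKey) -> None:
        # Replace all tuple locks on the page with one page-level lock.
        self.tuple_locks = {k for k in self.tuple_locks if k[:2] != parent}
        del self.children[parent]
        self.page_locks.add(parent)
```

Keeping the child counts locally means the promotion decision needs no scan of shared state, which is the low-overhead property the text attributes to the process-local table.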
@ -493,12 +534,12 @@ to be READ ONLY.)
|
|||||||
o We can more aggressively clean up conflicts, predicate
|
o We can more aggressively clean up conflicts, predicate
|
||||||
locks, and SSI transaction information.
|
locks, and SSI transaction information.
|
||||||
|
|
||||||
* Allow a READ ONLY transaction to "opt out" of SSI if there are
|
* We allow a READ ONLY transaction to "opt out" of SSI if there are
|
||||||
no READ WRITE transactions which could cause the READ ONLY
|
no READ WRITE transactions which could cause the READ ONLY
|
||||||
transaction to ever become part of a "dangerous structure" of
|
transaction to ever become part of a "dangerous structure" of
|
||||||
overlapping transaction dependencies.
|
overlapping transaction dependencies.
|
||||||
|
|
||||||
* Allow the user to request that a READ ONLY transaction wait
|
* We allow the user to request that a READ ONLY transaction wait
|
||||||
until the conditions are right for it to start in the "opt out" state
|
until the conditions are right for it to start in the "opt out" state
|
||||||
described above. We add a DEFERRABLE state to transactions, which is
|
described above. We add a DEFERRABLE state to transactions, which is
|
||||||
specified and maintained in a way similar to READ ONLY. It is
|
specified and maintained in a way similar to READ ONLY. It is
|
||||||
@ -538,12 +579,6 @@ address it?
|
|||||||
replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc.
|
replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc.
|
||||||
This is related to the "WAL file replay" issue.
|
This is related to the "WAL file replay" issue.
|
||||||
|
|
||||||
* Weak-memory-ordering machines. Make sure that shared memory
|
|
||||||
access which involves visibility across multiple transactions uses
|
|
||||||
locks as needed to avoid problems. On the other hand, ensure that we
|
|
||||||
really need volatile where we're using it.
|
|
||||||
http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php
|
|
||||||
|
|
||||||
* UNIQUE btree search for equality on all columns. Since a search
|
* UNIQUE btree search for equality on all columns. Since a search
|
||||||
of a UNIQUE index using equality tests on all columns will lock the
|
of a UNIQUE index using equality tests on all columns will lock the
|
||||||
heap tuple if an entry is found, it appears that there is no need to
|
heap tuple if an entry is found, it appears that there is no need to
|
||||||
@ -551,15 +586,6 @@ get a predicate lock on the index in that case. A predicate lock is
|
|||||||
still needed for such a search if a matching index entry which points
|
still needed for such a search if a matching index entry which points
|
||||||
to a visible tuple is not found.
|
to a visible tuple is not found.
|
||||||
|
|
||||||
* Planner index probes. To avoid problems with data skew at the
|
|
||||||
ends of an index which have historically caused bad plans, the
|
|
||||||
planner now probes the end of an index to see what the maximum or
|
|
||||||
minimum value is when a query appears to be requesting a range of
|
|
||||||
data outside what statistics shows is present. These planner checks
|
|
||||||
don't require predicate locking, but there's currently no easy way to
|
|
||||||
avoid it. What can we do to avoid predicate locking for such planner
|
|
||||||
activity?
|
|
||||||
|
|
||||||
* Minimize touching of shared memory. Should lists in shared
|
* Minimize touching of shared memory. Should lists in shared
|
||||||
memory push entries which have just been returned to the front of the
|
memory push entries which have just been returned to the front of the
|
||||||
available list, so they will be popped back off soon and some memory
|
available list, so they will be popped back off soon and some memory
|
||||||
@ -573,13 +599,17 @@ Footnotes
|
|||||||
[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
|
[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
|
||||||
Search for serial execution to find the relevant section.
|
Search for serial execution to find the relevant section.
|
||||||
|
|
||||||
[2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
|
[2] A. Fekete et al. Making Snapshot Isolation Serializable. In ACM
|
||||||
Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
|
Transactions on Database Systems 30:2, Jun. 2005.
|
||||||
|
http://dx.doi.org/10.1145/1071610.1071615
|
||||||
|
|
||||||
|
[3] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
|
||||||
Architecture of a Database System. Foundations and Trends(R) in
|
Architecture of a Database System. Foundations and Trends(R) in
|
||||||
Databases Vol. 1, No. 2 (2007) 141-259.
|
Databases Vol. 1, No. 2 (2007) 141-259.
|
||||||
|
http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
|
||||||
Of particular interest:
|
Of particular interest:
|
||||||
* 6.1 A Note on ACID
|
* 6.1 A Note on ACID
|
||||||
* 6.2 A Brief Review of Serializability
|
* 6.2 A Brief Review of Serializability
|
||||||
* 6.3 Locking and Latching
|
* 6.3 Locking and Latching
|
||||||
* 6.3.1 Transaction Isolation Levels
|
* 6.3.1 Transaction Isolation Levels
|
||||||
* 6.5.3 Next-Key Locking: Physical Surrogates for Logical
|
* 6.5.3 Next-Key Locking: Physical Surrogates for Logical Properties
|
||||||
|