Mirror of https://git.postgresql.org/git/postgresql.git, synced 2024-12-15 08:20:16 +08:00
Update assorted TOAST-related documentation.
While working on documentation for expanded arrays, I noticed a number of details in the TOAST-related documentation that were already inaccurate or obsolete. This should be fixed independently of whether expanded arrays get in or not. One issue is that the already existing indirect-pointer facility was not documented at all. Also, the documentation says that you only need to use VARSIZE/SET_VARSIZE if you've made your variable-length type TOAST-aware, but actually we've forced that business on all varlena types even if they've opted out of TOAST by setting storage = plain. Wordsmith a few other things too, like an amusingly archaic claim that there are few 64-bit machines.

I thought about back-patching this, but since all this doco is oriented to hackers and C-coded extension authors, fixing it in HEAD is probably good enough.
parent 56a79a869b
commit 9bb955c828
@@ -329,15 +329,17 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 to <literal>VARIABLE</literal>. (Internally, this is represented
 by setting <literal>typlen</> to -1.) The internal representation of all
 variable-length types must start with a 4-byte integer giving the total
-length of this value of the type.
+length of this value of the type. (Note that the length field is often
+encoded, as described in <xref linkend="storage-toast">; it's unwise
+to access it directly.)
 </para>
 
 <para>
 The optional flag <literal>PASSEDBYVALUE</literal> indicates that
 values of this data type are passed by value, rather than by
-reference. You cannot pass by value types whose internal
-representation is larger than the size of the <type>Datum</> type
-(4 bytes on most machines, 8 bytes on a few).
+reference. Types passed by value must be fixed-length, and their internal
+representation cannot be larger than the size of the <type>Datum</> type
+(4 bytes on some machines, 8 bytes on others).
 </para>
 
 <para>
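The reworded PASSEDBYVALUE rule in the hunk above amounts to a simple size check. The following is a minimal sketch, assuming `Datum` is pointer-sized (which is what the "4 bytes on some machines, 8 bytes on others" remark suggests); `model_datum` and `fits_by_value` are hypothetical names for illustration, not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's Datum; assumed here to be pointer-sized. */
typedef uintptr_t model_datum;

/*
 * A fixed-length type may be passed by value only if its internal
 * representation is no larger than a Datum.  Variable-length types
 * (typlen = -1) never qualify, so callers must pass a positive length.
 */
bool
fits_by_value(size_t type_len)
{
    assert(type_len > 0);
    return type_len <= sizeof(model_datum);
}
```

Under this model, a 4-byte integer type qualifies on any platform, while a 16-byte fixed-length type does not on either 32-bit or 64-bit builds.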
@@ -367,6 +369,17 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 <literal>external</literal> items.)
 </para>
 
+<para>
+All <replaceable class="parameter">storage</replaceable> values other
+than <literal>plain</literal> imply that the functions of the data type
+can handle values that have been <firstterm>toasted</>, as described
+in <xref linkend="storage-toast"> and <xref linkend="xtypes-toast">.
+The specific other value given merely determines the default TOAST
+storage strategy for columns of a toastable data type; users can pick
+other strategies for individual columns using <literal>ALTER TABLE
+SET STORAGE</>.
+</para>
+
 <para>
 The <replaceable class="parameter">like_type</replaceable> parameter
 provides an alternative method for specifying the basic representation
@@ -465,8 +478,8 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 identical things, and you want to allow these things to be accessed
 directly by subscripting, in addition to whatever operations you plan
 to provide for the type as a whole. For example, type <type>point</>
-is represented as just two floating-point numbers, each can be accessed using
-<literal>point[0]</> and <literal>point[1]</>.
+is represented as just two floating-point numbers, which can be accessed
+using <literal>point[0]</> and <literal>point[1]</>.
 Note that
 this facility only works for fixed-length types whose internal form
 is exactly a sequence of identical fixed-length fields. A subscriptable
@@ -303,25 +303,33 @@ Oversized-Attribute Storage Technique).
 
 <para>
 <productname>PostgreSQL</productname> uses a fixed page size (commonly
 8 kB), and does not allow tuples to span multiple pages. Therefore, it is
 not possible to store very large field values directly. To overcome
 this limitation, large field values are compressed and/or broken up into
 multiple physical rows. This happens transparently to the user, with only
 small impact on most of the backend code. The technique is affectionately
 known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
+The <acronym>TOAST</> infrastructure is also used to improve handling of
+large data values in-memory.
 </para>
 
 <para>
 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
 impose the overhead on data types that cannot produce large field values.
 To support <acronym>TOAST</>, a data type must have a variable-length
-(<firstterm>varlena</>) representation, in which the first 32-bit word of any
-stored value contains the total length of the value in bytes (including
-itself). <acronym>TOAST</> does not constrain the rest of the representation.
-All the C-level functions supporting a <acronym>TOAST</>-able data type must
-be careful to handle <acronym>TOAST</>ed input values. (This is normally done
-by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
-value, but in some cases more efficient approaches are possible.)
+(<firstterm>varlena</>) representation, in which, ordinarily, the first
+four-byte word of any stored value contains the total length of the value in
+bytes (including itself). <acronym>TOAST</> does not constrain the rest
+of the data type's representation. The special representations collectively
+called <firstterm><acronym>TOAST</>ed values</firstterm> work by modifying or
+reinterpreting this initial length word. Therefore, the C-level functions
+supporting a <acronym>TOAST</>-able data type must be careful about how they
+handle potentially <acronym>TOAST</>ed input values: an input might not
+actually consist of a four-byte length word and contents until after it's
+been <firstterm>detoasted</>. (This is normally done by invoking
+<function>PG_DETOAST_DATUM</> before doing anything with an input value,
+but in some cases more efficient approaches are possible.
+See <xref linkend="xtypes-toast"> for more detail.)
 </para>
 
 <para>
@@ -333,58 +341,84 @@ the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
 the remaining bits of the length word give the total datum size (including
 length word) in bytes. When the highest-order or lowest-order bit is set,
 the value has only a single-byte header instead of the normal four-byte
-header, and the remaining bits give the total datum size (including length
-byte) in bytes. As a special case, if the remaining bits are all zero
-(which would be impossible for a self-inclusive length), the value is a
-pointer to out-of-line data stored in a separate TOAST table. (The size of
-a TOAST pointer is given in the second byte of the datum.)
-Values with single-byte headers aren't aligned on any particular
-boundary, either. Lastly, when the highest-order or lowest-order bit is
-clear but the adjacent bit is set, the content of the datum has been
-compressed and must be decompressed before use. In this case the remaining
-bits of the length word give the total size of the compressed datum, not the
+header, and the remaining bits of that byte give the total datum size
+(including length byte) in bytes. This alternative supports space-efficient
+storage of values shorter than 127 bytes, while still allowing the data type
+to grow to 1 GB at need. Values with single-byte headers aren't aligned on
+any particular boundary, whereas values with four-byte headers are aligned on
+at least a four-byte boundary; this omission of alignment padding provides
+additional space savings that is significant compared to short values.
+As a special case, if the remaining bits of a single-byte header are all
+zero (which would be impossible for a self-inclusive length), the value is
+a pointer to out-of-line data, with several possible alternatives as
+described below. The type and size of such a <firstterm>TOAST pointer</>
+are determined by a code stored in the second byte of the datum.
+Lastly, when the highest-order or lowest-order bit is clear but the adjacent
+bit is set, the content of the datum has been compressed and must be
+decompressed before use. In this case the remaining bits of the four-byte
+length word give the total size of the compressed datum, not the
 original data. Note that compression is also possible for out-of-line data
 but the varlena header does not tell whether it has occurred &mdash;
-the content of the TOAST pointer tells that, instead.
+the content of the <acronym>TOAST</> pointer tells that, instead.
 </para>
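The header rules in the paragraph above can be sketched as a small classifier. This is a simplified model, not PostgreSQL's own macros: it assumes the "lowest-order bit" variant of the encoding, with the four-byte length word stored least-significant byte first, and every name in it (`header_kind`, `classify_header`, `header_size_field`) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
    HDR_4B_PLAIN,       /* four-byte header, ordinary un-TOASTed contents */
    HDR_4B_COMPRESSED,  /* four-byte header, contents compressed in-line */
    HDR_1B_SHORT,       /* single-byte header, short in-line value */
    HDR_TOAST_POINTER   /* single-byte header whose length bits are all zero */
} header_kind;

/* Decide which representation a datum uses by looking at its first byte. */
header_kind
classify_header(const uint8_t *datum)
{
    uint8_t first = datum[0];

    if (first & 0x01)       /* flag bit set: single-byte header */
        return (first >> 1) == 0 ? HDR_TOAST_POINTER : HDR_1B_SHORT;

    /* flag bit clear: the adjacent bit says whether contents are compressed */
    return (first & 0x02) ? HDR_4B_COMPRESSED : HDR_4B_PLAIN;
}

/*
 * Return the size stored in the header (which includes the header itself).
 * A TOAST pointer takes its size from the code in the second byte instead,
 * so it is rejected here.
 */
uint32_t
header_size_field(const uint8_t *datum)
{
    header_kind kind = classify_header(datum);
    uint32_t    word;

    assert(kind != HDR_TOAST_POINTER);

    if (kind == HDR_1B_SHORT)
        return (uint32_t) (datum[0] >> 1);   /* remaining 7 bits */

    /* Assemble the little-endian word, then drop the two flag bits. */
    word = (uint32_t) datum[0]
         | ((uint32_t) datum[1] << 8)
         | ((uint32_t) datum[2] << 16)
         | ((uint32_t) datum[3] << 24);
    return word >> 2;                        /* remaining 30 bits */
}
```

The seven length bits of a single-byte header and the thirty length bits of a four-byte header correspond to the short-value and 1 GB limits mentioned in the text. On a platform using the "highest-order bit" variant the masks and shifts would differ, which is one reason the documentation advises against reading the length field directly.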
 
+<para>
+As mentioned, there are multiple types of <acronym>TOAST</> pointer datums.
+The oldest and most common type is a pointer to out-of-line data stored in
+a <firstterm><acronym>TOAST</> table</firstterm> that is separate from, but
+associated with, the table containing the <acronym>TOAST</> pointer datum
+itself. These <firstterm>on-disk</> pointer datums are created by the
+<acronym>TOAST</> management code (in <filename>access/heap/tuptoaster.c</>)
+when a tuple to be stored on disk is too large to be stored as-is.
+Further details appear in <xref linkend="storage-toast-ondisk">.
+Alternatively, a <acronym>TOAST</> pointer datum can contain a pointer to
+out-of-line data that appears elsewhere in memory. Such datums are
+necessarily short-lived, and will never appear on-disk, but they are very
+useful for avoiding copying and redundant processing of large data values.
+Further details appear in <xref linkend="storage-toast-inmemory">.
+</para>
+
+<para>
+The compression technique used for either in-line or out-of-line compressed
+data is a fairly simple and very fast member
+of the LZ family of compression techniques. See
+<filename>src/common/pg_lzcompress.c</> for the details.
+</para>
+
+<sect2 id="storage-toast-ondisk">
+<title>Out-of-line, on-disk TOAST storage</title>
+
 <para>
 If any of the columns of a table are <acronym>TOAST</>-able, the table will
 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
-<structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
+<structname>pg_class</>.<structfield>reltoastrelid</> entry. On-disk
 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
 described in more detail below.
 </para>
 
-<para>
-The compression technique used is a fairly simple and very fast member
-of the LZ family of compression techniques. See
-<filename>src/common/pg_lzcompress.c</> for the details.
-</para>
-
 <para>
 Out-of-line values are divided (after compression if used) into chunks of at
 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
 so that four chunk rows will fit on a page, making it about 2000 bytes).
-Each chunk is stored
-as a separate row in the <acronym>TOAST</> table for the owning table. Every
+Each chunk is stored as a separate row in the <acronym>TOAST</> table
+belonging to the owning table. Every
 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
 identifying the particular <acronym>TOAST</>ed value),
 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
 and <structfield>chunk_data</> (the actual data of the chunk). A unique index
 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
-retrieval of the values. A pointer datum representing an out-of-line
+retrieval of the values. A pointer datum representing an out-of-line on-disk
 <acronym>TOAST</>ed value therefore needs to store the OID of the
 <acronym>TOAST</> table in which to look and the OID of the specific value
 (its <structfield>chunk_id</>). For convenience, pointer datums also store the
-logical datum size (original uncompressed data length) and actual stored size
+logical datum size (original uncompressed data length) and physical stored size
 (different if compression was applied). Allowing for the varlena header bytes,
-the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
-regardless of the actual size of the represented value.
+the total size of an on-disk <acronym>TOAST</> pointer datum is therefore 18
+bytes regardless of the actual size of the represented value.
 </para>
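The 18-byte figure and the chunking rule in the paragraph above can be checked with a little arithmetic. The sketch below assumes that each of the four stored items (logical size, physical size, value OID, TOAST table OID) occupies four bytes and that the varlena header of a pointer datum takes two bytes (the header byte plus the code byte). That breakdown is consistent with the stated total but is an assumption here, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Payload of a model on-disk TOAST pointer: four 4-byte fields. */
typedef struct
{
    uint32_t raw_size;       /* logical (uncompressed) datum size */
    uint32_t stored_size;    /* physical stored size */
    uint32_t value_oid;      /* chunk_id of the value in the TOAST table */
    uint32_t toast_rel_oid;  /* OID of the TOAST table to look in */
} model_ondisk_pointer;

/* Assumed header overhead: single-byte header plus the type/size code byte. */
enum { MODEL_POINTER_HEADER_BYTES = 2 };

/* Total size of the pointer datum, independent of the value it represents. */
size_t
model_pointer_datum_size(void)
{
    return MODEL_POINTER_HEADER_BYTES + sizeof(model_ondisk_pointer);
}

/* Number of chunk rows needed for a value: ceiling division by chunk size. */
uint32_t
model_chunk_count(uint32_t stored_size, uint32_t max_chunk_size)
{
    assert(max_chunk_size > 0);
    return stored_size / max_chunk_size
         + (stored_size % max_chunk_size != 0 ? 1 : 0);
}
```

With a chunk size of about 2000 bytes, as the text describes for the default, a 10000-byte stored value would occupy five chunk rows, numbered by `chunk_seq` and sharing one `chunk_id`.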
 
 <para>
-The <acronym>TOAST</> code is triggered only
+The <acronym>TOAST</> management code is triggered only
 when a row value to be stored in a table is wider than
 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
 The <acronym>TOAST</> code will compress and/or move
@@ -397,8 +431,8 @@ none of the out-of-line values change.
 </para>
 
 <para>
-The <acronym>TOAST</> code recognizes four different strategies for storing
-<acronym>TOAST</>-able columns:
+The <acronym>TOAST</> management code recognizes four different strategies
+for storing <acronym>TOAST</>-able columns on disk:
 
 <itemizedlist>
 <listitem>
@@ -460,6 +494,41 @@ pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
 </para>
 
+</sect2>
+
+<sect2 id="storage-toast-inmemory">
+<title>Out-of-line, in-memory TOAST storage</title>
+
+<para>
+<acronym>TOAST</> pointers can point to data that is not on disk, but is
+elsewhere in the memory of the current server process. Such pointers
+obviously cannot be long-lived, but they are nonetheless useful. There
+is currently just one sub-case:
+pointers to <firstterm>indirect</> data.
+</para>
+
+<para>
+Indirect <acronym>TOAST</> pointers simply point at a non-indirect varlena
+value stored somewhere in memory. This case was originally created merely
+as a proof of concept, but it is currently used during logical decoding to
+avoid possibly having to create physical tuples exceeding 1 GB (as pulling
+all out-of-line field values into the tuple might do). The case is of
+limited use since the creator of the pointer datum is entirely responsible
+that the referenced data survives for as long as the pointer could exist,
+and there is no infrastructure to help with this.
+</para>
+
+<para>
+For all types of in-memory <acronym>TOAST</> pointer, the <acronym>TOAST</>
+management code ensures that no such pointer datum can accidentally get
+stored on disk. In-memory <acronym>TOAST</> pointers are automatically
+expanded to normal in-line varlena values before storage &mdash; and then
+possibly converted to on-disk <acronym>TOAST</> pointers, if the containing
+tuple would otherwise be too big.
+</para>
+
+</sect2>
+
 </sect1>
 
 <sect1 id="storage-fsm">
@@ -234,35 +234,49 @@ CREATE TYPE complex (
 </para>
 
 <para>
+If the internal representation of the data type is variable-length, the
+internal representation must follow the standard layout for variable-length
+data: the first four bytes must be a <type>char[4]</type> field which is
+never accessed directly (customarily named <structfield>vl_len_</>). You
+must use the <function>SET_VARSIZE()</function> macro to store the total
+size of the datum (including the length field itself) in this field
+and <function>VARSIZE()</function> to retrieve it. (These macros exist
+because the length field may be encoded depending on platform.)
+</para>
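The layout rule in the paragraph above (an opaque `char[4]` length field that is only touched through accessor macros) can be illustrated with a self-contained model. This sketch does not use PostgreSQL's headers: `model_set_varsize` and `model_varsize` are hypothetical stand-ins for `SET_VARSIZE()` and `VARSIZE()`, and the encoding they use (total size shifted left by two bits, in host byte order) is only one assumed possibility.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Model of a variable-length type following the standard layout. */
typedef struct
{
    char vl_len_[4];   /* encoded length word; never accessed directly */
    char data[];       /* type-specific contents */
} model_blob;

#define MODEL_HDRSZ 4u  /* size of the length field itself */

/* Stand-in for SET_VARSIZE(): store the total size, header included. */
void
model_set_varsize(model_blob *blob, uint32_t total_size)
{
    uint32_t word;

    assert(total_size >= MODEL_HDRSZ && total_size < (1u << 30));
    word = total_size << 2;                 /* assumed encoding */
    memcpy(blob->vl_len_, &word, sizeof word);
}

/* Stand-in for VARSIZE(): decode the total size from the length field. */
uint32_t
model_varsize(const model_blob *blob)
{
    uint32_t word;

    memcpy(&word, blob->vl_len_, sizeof word);
    return word >> 2;
}

/* Build a value from len bytes of src; the caller frees the result. */
model_blob *
model_blob_make(const char *src, uint32_t len)
{
    model_blob *blob = malloc(MODEL_HDRSZ + len);

    if (blob == NULL)
        return NULL;
    model_set_varsize(blob, MODEL_HDRSZ + len);
    memcpy(blob->data, src, len);
    return blob;
}
```

In this model, a value built from five payload bytes reports a size of nine through the accessor, while the raw bytes of `vl_len_` read as a plain integer give a different number. That mismatch is the reason direct access is discouraged. Real extension code would allocate with `palloc` and use the actual macros rather than these stand-ins.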
+
+<para>
+For further details see the description of the
+<xref linkend="sql-createtype"> command.
+</para>
+
+<sect2 id="xtypes-toast">
+<title>TOAST Considerations</title>
 <indexterm>
 <primary>TOAST</primary>
 <secondary>and user-defined types</secondary>
 </indexterm>
-If the values of your data type vary in size (in internal form), you should
-make the data type <acronym>TOAST</>-able (see <xref
-linkend="storage-toast">). You should do this even if the data are always
+
+<para>
+If the values of your data type vary in size (in internal form), it's
+usually desirable to make the data type <acronym>TOAST</>-able (see <xref
+linkend="storage-toast">). You should do this even if the values are always
 too small to be compressed or stored externally, because
 <acronym>TOAST</> can save space on small data too, by reducing header
 overhead.
 </para>
 
 <para>
-To do this, the internal representation must follow the standard layout for
-variable-length data: the first four bytes must be a <type>char[4]</type>
-field which is never accessed directly (customarily named
-<structfield>vl_len_</>). You
-must use <function>SET_VARSIZE()</function> to store the size of the datum
-in this field and <function>VARSIZE()</function> to retrieve it. The C
-functions operating on the data type must always be careful to unpack any
-toasted values they are handed, by using <function>PG_DETOAST_DATUM</>.
-(This detail is customarily hidden by defining type-specific
-<function>GETARG_DATATYPE_P</function> macros.) Then, when running the
-<command>CREATE TYPE</command> command, specify the internal length as
-<literal>variable</> and select the appropriate storage option.
+To support <acronym>TOAST</> storage, the C functions operating on the data
+type must always be careful to unpack any toasted values they are handed
+by using <function>PG_DETOAST_DATUM</>. (This detail is customarily hidden
+by defining type-specific <function>GETARG_DATATYPE_P</function> macros.)
+Then, when running the <command>CREATE TYPE</command> command, specify the
+internal length as <literal>variable</> and select some appropriate storage
+option other than <literal>plain</>.
 </para>
 
 <para>
-If the alignment is unimportant (either just for a specific function or
+If data alignment is unimportant (either just for a specific function or
 because the data type specifies byte alignment anyway) then it's possible
 to avoid some of the overhead of <function>PG_DETOAST_DATUM</>. You can use
 <function>PG_DETOAST_DATUM_PACKED</> instead (customarily hidden by
@@ -286,8 +300,6 @@ CREATE TYPE complex (
 </para>
 </note>
 
-<para>
-For further details see the description of the
-<xref linkend="sql-createtype"> command.
-</para>
+</sect2>
+
 </sect1>