mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-12-15 08:20:16 +08:00
Update assorted TOAST-related documentation.
While working on documentation for expanded arrays, I noticed a number of details in the TOAST-related documentation that were already inaccurate or obsolete. This should be fixed independently of whether expanded arrays get in or not. One issue is that the already existing indirect-pointer facility was not documented at all. Also, the documentation says that you only need to use VARSIZE/SET_VARSIZE if you've made your variable-length type TOAST-aware, but actually we've forced that business on all varlena types even if they've opted out of TOAST by setting storage = plain. Wordsmith a few other things too, like an amusingly archaic claim that there are few 64-bit machines.

I thought about back-patching this, but since all this doco is oriented to hackers and C-coded extension authors, fixing it in HEAD is probably good enough.
parent 56a79a869b
commit 9bb955c828
@@ -329,15 +329,17 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 to <literal>VARIABLE</literal>. (Internally, this is represented
 by setting <literal>typlen</> to -1.) The internal representation of all
 variable-length types must start with a 4-byte integer giving the total
-length of this value of the type.
+length of this value of the type. (Note that the length field is often
+encoded, as described in <xref linkend="storage-toast">; it's unwise
+to access it directly.)
 </para>

 <para>
 The optional flag <literal>PASSEDBYVALUE</literal> indicates that
 values of this data type are passed by value, rather than by
-reference. You cannot pass by value types whose internal
-representation is larger than the size of the <type>Datum</> type
-(4 bytes on most machines, 8 bytes on a few).
+reference. Types passed by value must be fixed-length, and their internal
+representation cannot be larger than the size of the <type>Datum</> type
+(4 bytes on some machines, 8 bytes on others).
 </para>

 <para>
@@ -367,6 +369,17 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 <literal>external</literal> items.)
 </para>

+<para>
+All <replaceable class="parameter">storage</replaceable> values other
+than <literal>plain</literal> imply that the functions of the data type
+can handle values that have been <firstterm>toasted</>, as described
+in <xref linkend="storage-toast"> and <xref linkend="xtypes-toast">.
+The specific other value given merely determines the default TOAST
+storage strategy for columns of a toastable data type; users can pick
+other strategies for individual columns using <literal>ALTER TABLE
+SET STORAGE</>.
+</para>
+
 <para>
 The <replaceable class="parameter">like_type</replaceable> parameter
 provides an alternative method for specifying the basic representation
@@ -465,8 +478,8 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 identical things, and you want to allow these things to be accessed
 directly by subscripting, in addition to whatever operations you plan
 to provide for the type as a whole. For example, type <type>point</>
-is represented as just two floating-point numbers, each can be accessed using
-<literal>point[0]</> and <literal>point[1]</>.
+is represented as just two floating-point numbers, which can be accessed
+using <literal>point[0]</> and <literal>point[1]</>.
 Note that
 this facility only works for fixed-length types whose internal form
 is exactly a sequence of identical fixed-length fields. A subscriptable
@@ -309,19 +309,27 @@ this limitation, large field values are compressed and/or broken up into
 multiple physical rows. This happens transparently to the user, with only
 small impact on most of the backend code. The technique is affectionately
 known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
+The <acronym>TOAST</> infrastructure is also used to improve handling of
+large data values in-memory.
 </para>

 <para>
 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
 impose the overhead on data types that cannot produce large field values.
 To support <acronym>TOAST</>, a data type must have a variable-length
-(<firstterm>varlena</>) representation, in which the first 32-bit word of any
-stored value contains the total length of the value in bytes (including
-itself). <acronym>TOAST</> does not constrain the rest of the representation.
-All the C-level functions supporting a <acronym>TOAST</>-able data type must
-be careful to handle <acronym>TOAST</>ed input values. (This is normally done
-by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
-value, but in some cases more efficient approaches are possible.)
+(<firstterm>varlena</>) representation, in which, ordinarily, the first
+four-byte word of any stored value contains the total length of the value in
+bytes (including itself). <acronym>TOAST</> does not constrain the rest
+of the data type's representation. The special representations collectively
+called <firstterm><acronym>TOAST</>ed values</firstterm> work by modifying or
+reinterpreting this initial length word. Therefore, the C-level functions
+supporting a <acronym>TOAST</>-able data type must be careful about how they
+handle potentially <acronym>TOAST</>ed input values: an input might not
+actually consist of a four-byte length word and contents until after it's
+been <firstterm>detoasted</>. (This is normally done by invoking
+<function>PG_DETOAST_DATUM</> before doing anything with an input value,
+but in some cases more efficient approaches are possible.
+See <xref linkend="xtypes-toast"> for more detail.)
 </para>

 <para>
@@ -333,58 +341,84 @@ the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
 the remaining bits of the length word give the total datum size (including
 length word) in bytes. When the highest-order or lowest-order bit is set,
 the value has only a single-byte header instead of the normal four-byte
-header, and the remaining bits give the total datum size (including length
-byte) in bytes. As a special case, if the remaining bits are all zero
-(which would be impossible for a self-inclusive length), the value is a
-pointer to out-of-line data stored in a separate TOAST table. (The size of
-a TOAST pointer is given in the second byte of the datum.)
-Values with single-byte headers aren't aligned on any particular
-boundary, either. Lastly, when the highest-order or lowest-order bit is
-clear but the adjacent bit is set, the content of the datum has been
-compressed and must be decompressed before use. In this case the remaining
-bits of the length word give the total size of the compressed datum, not the
+header, and the remaining bits of that byte give the total datum size
+(including length byte) in bytes. This alternative supports space-efficient
+storage of values shorter than 127 bytes, while still allowing the data type
+to grow to 1 GB at need. Values with single-byte headers aren't aligned on
+any particular boundary, whereas values with four-byte headers are aligned on
+at least a four-byte boundary; this omission of alignment padding provides
+additional space savings that is significant compared to short values.
+As a special case, if the remaining bits of a single-byte header are all
+zero (which would be impossible for a self-inclusive length), the value is
+a pointer to out-of-line data, with several possible alternatives as
+described below. The type and size of such a <firstterm>TOAST pointer</>
+are determined by a code stored in the second byte of the datum.
+Lastly, when the highest-order or lowest-order bit is clear but the adjacent
+bit is set, the content of the datum has been compressed and must be
+decompressed before use. In this case the remaining bits of the four-byte
+length word give the total size of the compressed datum, not the
 original data. Note that compression is also possible for out-of-line data
 but the varlena header does not tell whether it has occurred &mdash;
-the content of the TOAST pointer tells that, instead.
+the content of the <acronym>TOAST</> pointer tells that, instead.
 </para>

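As a rough illustration of the header encodings described in the paragraph above, the following standalone C sketch classifies a datum from its first byte. It assumes the big-endian convention, with the flag bits in the two highest-order positions of the first byte. The enum and the functions `classify_header` and `inline_size` are hypothetical stand-ins invented for this sketch; they are not PostgreSQL's macros, and real extension code should use the macros from the server headers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a varlena header (illustration only). */
typedef enum
{
    HDR_4B_PLAIN,       /* four-byte header, uncompressed contents */
    HDR_4B_COMPRESSED,  /* four-byte header, compressed contents */
    HDR_1B_SHORT,       /* single-byte header, short in-line value */
    HDR_1B_POINTER      /* single-byte header, zero length: TOAST pointer */
} HeaderKind;

/* Inspect the first byte, assuming flag bits in the high-order positions. */
static HeaderKind
classify_header(const unsigned char *datum)
{
    assert(datum != NULL);
    if (datum[0] & 0x80)        /* highest-order bit set: one-byte header */
        return (datum[0] & 0x7F) == 0 ? HDR_1B_POINTER : HDR_1B_SHORT;
    /* highest-order bit clear: the adjacent bit marks compression */
    return (datum[0] & 0x40) ? HDR_4B_COMPRESSED : HDR_4B_PLAIN;
}

/* Total in-line size in bytes (header included) for non-pointer datums. */
static uint32_t
inline_size(const unsigned char *datum)
{
    HeaderKind  kind = classify_header(datum);

    /* A pointer's size is determined by the code in its second byte. */
    assert(kind != HDR_1B_POINTER);
    if (kind == HDR_1B_SHORT)
        return (uint32_t) (datum[0] & 0x7F);
    /* Four-byte header: the remaining 30 bits hold the length. */
    return ((uint32_t) (datum[0] & 0x3F) << 24) |
           ((uint32_t) datum[1] << 16) |
           ((uint32_t) datum[2] << 8) |
           (uint32_t) datum[3];
}
```

For a compressed datum, `inline_size` yields the size of the compressed form, not of the original data, matching the description above. On a little-endian machine the flag bits would sit in the low-order positions instead, which is one reason to leave the decoding to the server's own macros.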
 <para>
+As mentioned, there are multiple types of <acronym>TOAST</> pointer datums.
+The oldest and most common type is a pointer to out-of-line data stored in
+a <firstterm><acronym>TOAST</> table</firstterm> that is separate from, but
+associated with, the table containing the <acronym>TOAST</> pointer datum
+itself. These <firstterm>on-disk</> pointer datums are created by the
+<acronym>TOAST</> management code (in <filename>access/heap/tuptoaster.c</>)
+when a tuple to be stored on disk is too large to be stored as-is.
+Further details appear in <xref linkend="storage-toast-ondisk">.
+Alternatively, a <acronym>TOAST</> pointer datum can contain a pointer to
+out-of-line data that appears elsewhere in memory. Such datums are
+necessarily short-lived, and will never appear on-disk, but they are very
+useful for avoiding copying and redundant processing of large data values.
+Further details appear in <xref linkend="storage-toast-inmemory">.
+</para>
+
+<para>
+The compression technique used for either in-line or out-of-line compressed
+data is a fairly simple and very fast member
+of the LZ family of compression techniques. See
+<filename>src/common/pg_lzcompress.c</> for the details.
+</para>
+
+<sect2 id="storage-toast-ondisk">
+<title>Out-of-line, on-disk TOAST storage</title>
+
+<para>
 If any of the columns of a table are <acronym>TOAST</>-able, the table will
 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
-<structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
+<structname>pg_class</>.<structfield>reltoastrelid</> entry. On-disk
 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
 described in more detail below.
 </para>

-<para>
-The compression technique used is a fairly simple and very fast member
-of the LZ family of compression techniques. See
-<filename>src/common/pg_lzcompress.c</> for the details.
-</para>
-
 <para>
 Out-of-line values are divided (after compression if used) into chunks of at
 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
 so that four chunk rows will fit on a page, making it about 2000 bytes).
-Each chunk is stored
-as a separate row in the <acronym>TOAST</> table for the owning table. Every
+Each chunk is stored as a separate row in the <acronym>TOAST</> table
+belonging to the owning table. Every
 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
 identifying the particular <acronym>TOAST</>ed value),
 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
 and <structfield>chunk_data</> (the actual data of the chunk). A unique index
 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
-retrieval of the values. A pointer datum representing an out-of-line
+retrieval of the values. A pointer datum representing an out-of-line on-disk
 <acronym>TOAST</>ed value therefore needs to store the OID of the
 <acronym>TOAST</> table in which to look and the OID of the specific value
 (its <structfield>chunk_id</>). For convenience, pointer datums also store the
-logical datum size (original uncompressed data length) and actual stored size
+logical datum size (original uncompressed data length) and physical stored size
 (different if compression was applied). Allowing for the varlena header bytes,
-the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
-regardless of the actual size of the represented value.
+the total size of an on-disk <acronym>TOAST</> pointer datum is therefore 18
+bytes regardless of the actual size of the represented value.
 </para>

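To make the arithmetic in the paragraph above concrete, here is a small standalone C sketch. The struct `OnDiskPointerSketch`, its field order, and the helper functions are hypothetical names chosen for illustration. The sketch assumes that each of the four stored items occupies four bytes and that the "varlena header bytes" are the one-byte header plus the code byte that follows it. The chunk size of 2000 is only the approximate figure quoted above, not the actual value of `TOAST_MAX_CHUNK_SIZE`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload of an on-disk TOAST pointer: four 4-byte items. */
typedef struct
{
    uint32_t    raw_size;       /* logical (uncompressed) data length */
    uint32_t    stored_size;    /* physical stored size */
    uint32_t    value_oid;      /* chunk_id identifying the value */
    uint32_t    toast_rel_oid;  /* OID of the TOAST table to look in */
} OnDiskPointerSketch;

#define SKETCH_HEADER_BYTES 2       /* assumed: header byte plus code byte */
#define SKETCH_CHUNK_SIZE   2000u   /* assumed: "about 2000 bytes" */

/* Total size of the pointer datum, whatever the size of the value. */
static size_t
on_disk_pointer_size(void)
{
    return SKETCH_HEADER_BYTES + sizeof(OnDiskPointerSketch);
}

/* Number of chunk rows needed for a value of the given stored size. */
static uint32_t
toast_chunk_count(uint32_t stored_size)
{
    assert(stored_size > 0);
    /* Round up without risking overflow in the addition. */
    return (stored_size - 1) / SKETCH_CHUNK_SIZE + 1;
}
```

Under these assumptions `on_disk_pointer_size()` comes to 18 bytes, and a 10000-byte stored value would occupy five chunk rows, each found through the index on `chunk_id` and `chunk_seq`.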
 <para>
-The <acronym>TOAST</> code is triggered only
+The <acronym>TOAST</> management code is triggered only
 when a row value to be stored in a table is wider than
 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
 The <acronym>TOAST</> code will compress and/or move
@@ -397,8 +431,8 @@ none of the out-of-line values change.
 </para>

 <para>
-The <acronym>TOAST</> code recognizes four different strategies for storing
-<acronym>TOAST</>-able columns:
+The <acronym>TOAST</> management code recognizes four different strategies
+for storing <acronym>TOAST</>-able columns on disk:

 <itemizedlist>
 <listitem>
@@ -460,6 +494,41 @@ pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
 </para>

+</sect2>
+
+<sect2 id="storage-toast-inmemory">
+<title>Out-of-line, in-memory TOAST storage</title>
+
+<para>
+<acronym>TOAST</> pointers can point to data that is not on disk, but is
+elsewhere in the memory of the current server process. Such pointers
+obviously cannot be long-lived, but they are nonetheless useful. There
+is currently just one sub-case:
+pointers to <firstterm>indirect</> data.
+</para>
+
+<para>
+Indirect <acronym>TOAST</> pointers simply point at a non-indirect varlena
+value stored somewhere in memory. This case was originally created merely
+as a proof of concept, but it is currently used during logical decoding to
+avoid possibly having to create physical tuples exceeding 1 GB (as pulling
+all out-of-line field values into the tuple might do). The case is of
+limited use since the creator of the pointer datum is entirely responsible
+that the referenced data survives for as long as the pointer could exist,
+and there is no infrastructure to help with this.
+</para>
+
+<para>
+For all types of in-memory <acronym>TOAST</> pointer, the <acronym>TOAST</>
+management code ensures that no such pointer datum can accidentally get
+stored on disk. In-memory <acronym>TOAST</> pointers are automatically
+expanded to normal in-line varlena values before storage &mdash; and then
+possibly converted to on-disk <acronym>TOAST</> pointers, if the containing
+tuple would otherwise be too big.
+</para>
+
+</sect2>
+
 </sect1>

 <sect1 id="storage-fsm">
@@ -234,35 +234,49 @@ CREATE TYPE complex (
 </para>

 <para>
+If the internal representation of the data type is variable-length, the
+internal representation must follow the standard layout for variable-length
+data: the first four bytes must be a <type>char[4]</type> field which is
+never accessed directly (customarily named <structfield>vl_len_</>). You
+must use the <function>SET_VARSIZE()</function> macro to store the total
+size of the datum (including the length field itself) in this field
+and <function>VARSIZE()</function> to retrieve it. (These macros exist
+because the length field may be encoded depending on platform.)
+</para>
+
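The layout rule in the paragraph above can be pictured with a standalone C sketch. The type `MyTextSketch` and the functions `sketch_set_varsize` and `sketch_varsize` are hypothetical stand-ins that only play the roles of `SET_VARSIZE()` and `VARSIZE()`. The encoding they use (a 30-bit length stored most significant byte first) is an assumption made so the example can run outside the server, not a description of what the real macros do on any given platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable-length type following the layout described above. */
typedef struct
{
    char        vl_len_[4];     /* length word; never accessed directly */
    char        data[];         /* type-specific contents */
} MyTextSketch;

#define SKETCH_HDRSZ ((uint32_t) offsetof(MyTextSketch, data))

/* Stand-in for SET_VARSIZE(): store the total size in encoded form. */
static void
sketch_set_varsize(void *ptr, uint32_t total)
{
    unsigned char *p = ptr;

    assert(p != NULL && total >= SKETCH_HDRSZ && total <= 0x3FFFFFFF);
    p[0] = (unsigned char) (total >> 24);
    p[1] = (unsigned char) ((total >> 16) & 0xFF);
    p[2] = (unsigned char) ((total >> 8) & 0xFF);
    p[3] = (unsigned char) (total & 0xFF);
}

/* Stand-in for VARSIZE(): decode the total size, length word included. */
static uint32_t
sketch_varsize(const void *ptr)
{
    const unsigned char *p = ptr;

    assert(p != NULL);
    return ((uint32_t) (p[0] & 0x3F) << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Build a value from a C string; the caller frees the result. */
static MyTextSketch *
make_sketch(const char *s)
{
    size_t      len = strlen(s);
    MyTextSketch *v = malloc(SKETCH_HDRSZ + len);

    if (v == NULL)
        return NULL;
    sketch_set_varsize(v, (uint32_t) (SKETCH_HDRSZ + len));
    memcpy(v->data, s, len);
    return v;
}
```

The point of routing every access through the two helper functions is that callers never depend on how the four bytes are encoded, which mirrors the reason the text gives for the real macros.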
+<para>
+For further details see the description of the
+<xref linkend="sql-createtype"> command.
+</para>
+
+<sect2 id="xtypes-toast">
+<title>TOAST Considerations</title>
 <indexterm>
 <primary>TOAST</primary>
 <secondary>and user-defined types</secondary>
 </indexterm>
-If the values of your data type vary in size (in internal form), you should
-make the data type <acronym>TOAST</>-able (see <xref
-linkend="storage-toast">). You should do this even if the data are always
+
+<para>
+If the values of your data type vary in size (in internal form), it's
+usually desirable to make the data type <acronym>TOAST</>-able (see <xref
+linkend="storage-toast">). You should do this even if the values are always
 too small to be compressed or stored externally, because
 <acronym>TOAST</> can save space on small data too, by reducing header
 overhead.
 </para>

 <para>
-To do this, the internal representation must follow the standard layout for
-variable-length data: the first four bytes must be a <type>char[4]</type>
-field which is never accessed directly (customarily named
-<structfield>vl_len_</>). You
-must use <function>SET_VARSIZE()</function> to store the size of the datum
-in this field and <function>VARSIZE()</function> to retrieve it. The C
-functions operating on the data type must always be careful to unpack any
-toasted values they are handed, by using <function>PG_DETOAST_DATUM</>.
-(This detail is customarily hidden by defining type-specific
-<function>GETARG_DATATYPE_P</function> macros.) Then, when running the
-<command>CREATE TYPE</command> command, specify the internal length as
-<literal>variable</> and select the appropriate storage option.
+To support <acronym>TOAST</> storage, the C functions operating on the data
+type must always be careful to unpack any toasted values they are handed
+by using <function>PG_DETOAST_DATUM</>. (This detail is customarily hidden
+by defining type-specific <function>GETARG_DATATYPE_P</function> macros.)
+Then, when running the <command>CREATE TYPE</command> command, specify the
+internal length as <literal>variable</> and select some appropriate storage
+option other than <literal>plain</>.
 </para>

 <para>
-If the alignment is unimportant (either just for a specific function or
+If data alignment is unimportant (either just for a specific function or
 because the data type specifies byte alignment anyway) then it's possible
 to avoid some of the overhead of <function>PG_DETOAST_DATUM</>. You can use
 <function>PG_DETOAST_DATUM_PACKED</> instead (customarily hidden by
@@ -286,8 +300,6 @@ CREATE TYPE complex (
 </para>
 </note>

-<para>
-For further details see the description of the
-<xref linkend="sql-createtype"> command.
-</para>
+</sect2>
+
 </sect1>