Update assorted TOAST-related documentation.

While working on documentation for expanded arrays, I noticed a number of
details in the TOAST-related documentation that were already inaccurate or
obsolete.  This should be fixed independently of whether expanded arrays
get in or not.  One issue is that the already existing indirect-pointer
facility was not documented at all.  Also, the documentation says that you
only need to use VARSIZE/SET_VARSIZE if you've made your variable-length
type TOAST-aware, but actually we've forced that business on all varlena
types even if they've opted out of TOAST by setting storage = plain.
Wordsmith a few other things too, like an amusingly archaic claim that
there are few 64-bit machines.

I thought about back-patching this, but since all this doco is oriented
to hackers and C-coded extension authors, fixing it in HEAD is probably
good enough.
Tom Lane 2015-02-18 22:33:39 -05:00
parent 56a79a869b
commit 9bb955c828
3 changed files with 158 additions and 64 deletions

@@ -329,15 +329,17 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 to <literal>VARIABLE</literal>. (Internally, this is represented
 by setting <literal>typlen</> to -1.) The internal representation of all
 variable-length types must start with a 4-byte integer giving the total
-length of this value of the type.
+length of this value of the type. (Note that the length field is often
+encoded, as described in <xref linkend="storage-toast">; it's unwise
+to access it directly.)
 </para>
 <para>
 The optional flag <literal>PASSEDBYVALUE</literal> indicates that
 values of this data type are passed by value, rather than by
-reference. You cannot pass by value types whose internal
-representation is larger than the size of the <type>Datum</> type
-(4 bytes on most machines, 8 bytes on a few).
+reference. Types passed by value must be fixed-length, and their internal
+representation cannot be larger than the size of the <type>Datum</> type
+(4 bytes on some machines, 8 bytes on others).
 </para>
 <para>
@@ -367,6 +369,17 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 <literal>external</literal> items.)
 </para>
+<para>
+All <replaceable class="parameter">storage</replaceable> values other
+than <literal>plain</literal> imply that the functions of the data type
+can handle values that have been <firstterm>toasted</>, as described
+in <xref linkend="storage-toast"> and <xref linkend="xtypes-toast">.
+The specific other value given merely determines the default TOAST
+storage strategy for columns of a toastable data type; users can pick
+other strategies for individual columns using <literal>ALTER TABLE
+SET STORAGE</>.
+</para>
 <para>
 The <replaceable class="parameter">like_type</replaceable> parameter
 provides an alternative method for specifying the basic representation
@@ -465,8 +478,8 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
 identical things, and you want to allow these things to be accessed
 directly by subscripting, in addition to whatever operations you plan
 to provide for the type as a whole. For example, type <type>point</>
-is represented as just two floating-point numbers, each can be accessed using
-<literal>point[0]</> and <literal>point[1]</>.
+is represented as just two floating-point numbers, which can be accessed
+using <literal>point[0]</> and <literal>point[1]</>.
 Note that
 this facility only works for fixed-length types whose internal form
 is exactly a sequence of identical fixed-length fields. A subscriptable

@@ -303,25 +303,33 @@ Oversized-Attribute Storage Technique).
 <para>
 <productname>PostgreSQL</productname> uses a fixed page size (commonly
 8 kB), and does not allow tuples to span multiple pages. Therefore, it is
 not possible to store very large field values directly. To overcome
 this limitation, large field values are compressed and/or broken up into
 multiple physical rows. This happens transparently to the user, with only
 small impact on most of the backend code. The technique is affectionately
 known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
+The <acronym>TOAST</> infrastructure is also used to improve handling of
+large data values in-memory.
 </para>
 <para>
 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
 impose the overhead on data types that cannot produce large field values.
 To support <acronym>TOAST</>, a data type must have a variable-length
-(<firstterm>varlena</>) representation, in which the first 32-bit word of any
-stored value contains the total length of the value in bytes (including
-itself). <acronym>TOAST</> does not constrain the rest of the representation.
-All the C-level functions supporting a <acronym>TOAST</>-able data type must
-be careful to handle <acronym>TOAST</>ed input values. (This is normally done
-by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
-value, but in some cases more efficient approaches are possible.)
+(<firstterm>varlena</>) representation, in which, ordinarily, the first
+four-byte word of any stored value contains the total length of the value in
+bytes (including itself). <acronym>TOAST</> does not constrain the rest
+of the data type's representation. The special representations collectively
+called <firstterm><acronym>TOAST</>ed values</firstterm> work by modifying or
+reinterpreting this initial length word. Therefore, the C-level functions
+supporting a <acronym>TOAST</>-able data type must be careful about how they
+handle potentially <acronym>TOAST</>ed input values: an input might not
+actually consist of a four-byte length word and contents until after it's
+been <firstterm>detoasted</>. (This is normally done by invoking
+<function>PG_DETOAST_DATUM</> before doing anything with an input value,
+but in some cases more efficient approaches are possible.
+See <xref linkend="xtypes-toast"> for more detail.)
 </para>
 <para>
@@ -333,58 +341,84 @@ the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
 the remaining bits of the length word give the total datum size (including
 length word) in bytes. When the highest-order or lowest-order bit is set,
 the value has only a single-byte header instead of the normal four-byte
-header, and the remaining bits give the total datum size (including length
-byte) in bytes. As a special case, if the remaining bits are all zero
-(which would be impossible for a self-inclusive length), the value is a
-pointer to out-of-line data stored in a separate TOAST table. (The size of
-a TOAST pointer is given in the second byte of the datum.)
-Values with single-byte headers aren't aligned on any particular
-boundary, either. Lastly, when the highest-order or lowest-order bit is
-clear but the adjacent bit is set, the content of the datum has been
-compressed and must be decompressed before use. In this case the remaining
-bits of the length word give the total size of the compressed datum, not the
+header, and the remaining bits of that byte give the total datum size
+(including length byte) in bytes. This alternative supports space-efficient
+storage of values shorter than 127 bytes, while still allowing the data type
+to grow to 1 GB at need. Values with single-byte headers aren't aligned on
+any particular boundary, whereas values with four-byte headers are aligned on
+at least a four-byte boundary; this omission of alignment padding provides
+additional space savings that is significant compared to short values.
+As a special case, if the remaining bits of a single-byte header are all
+zero (which would be impossible for a self-inclusive length), the value is
+a pointer to out-of-line data, with several possible alternatives as
+described below. The type and size of such a <firstterm>TOAST pointer</>
+are determined by a code stored in the second byte of the datum.
+Lastly, when the highest-order or lowest-order bit is clear but the adjacent
+bit is set, the content of the datum has been compressed and must be
+decompressed before use. In this case the remaining bits of the four-byte
+length word give the total size of the compressed datum, not the
 original data. Note that compression is also possible for out-of-line data
 but the varlena header does not tell whether it has occurred &mdash;
-the content of the TOAST pointer tells that, instead.
+the content of the <acronym>TOAST</> pointer tells that, instead.
 </para>
+<para>
+As mentioned, there are multiple types of <acronym>TOAST</> pointer datums.
+The oldest and most common type is a pointer to out-of-line data stored in
+a <firstterm><acronym>TOAST</> table</firstterm> that is separate from, but
+associated with, the table containing the <acronym>TOAST</> pointer datum
+itself. These <firstterm>on-disk</> pointer datums are created by the
+<acronym>TOAST</> management code (in <filename>access/heap/tuptoaster.c</>)
+when a tuple to be stored on disk is too large to be stored as-is.
+Further details appear in <xref linkend="storage-toast-ondisk">.
+Alternatively, a <acronym>TOAST</> pointer datum can contain a pointer to
+out-of-line data that appears elsewhere in memory. Such datums are
+necessarily short-lived, and will never appear on-disk, but they are very
+useful for avoiding copying and redundant processing of large data values.
+Further details appear in <xref linkend="storage-toast-inmemory">.
+</para>
+<para>
+The compression technique used for either in-line or out-of-line compressed
+data is a fairly simple and very fast member
+of the LZ family of compression techniques. See
+<filename>src/common/pg_lzcompress.c</> for the details.
+</para>
+<sect2 id="storage-toast-ondisk">
+<title>Out-of-line, on-disk TOAST storage</title>
 <para>
 If any of the columns of a table are <acronym>TOAST</>-able, the table will
 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
-<structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
+<structname>pg_class</>.<structfield>reltoastrelid</> entry. On-disk
 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
 described in more detail below.
 </para>
-<para>
-The compression technique used is a fairly simple and very fast member
-of the LZ family of compression techniques. See
-<filename>src/common/pg_lzcompress.c</> for the details.
-</para>
 <para>
 Out-of-line values are divided (after compression if used) into chunks of at
 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
 so that four chunk rows will fit on a page, making it about 2000 bytes).
-Each chunk is stored
-as a separate row in the <acronym>TOAST</> table for the owning table. Every
+Each chunk is stored as a separate row in the <acronym>TOAST</> table
+belonging to the owning table. Every
 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
 identifying the particular <acronym>TOAST</>ed value),
 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
 and <structfield>chunk_data</> (the actual data of the chunk). A unique index
 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
-retrieval of the values. A pointer datum representing an out-of-line
+retrieval of the values. A pointer datum representing an out-of-line on-disk
 <acronym>TOAST</>ed value therefore needs to store the OID of the
 <acronym>TOAST</> table in which to look and the OID of the specific value
 (its <structfield>chunk_id</>). For convenience, pointer datums also store the
-logical datum size (original uncompressed data length) and actual stored size
+logical datum size (original uncompressed data length) and physical stored size
 (different if compression was applied). Allowing for the varlena header bytes,
-the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
-regardless of the actual size of the represented value.
+the total size of an on-disk <acronym>TOAST</> pointer datum is therefore 18
+bytes regardless of the actual size of the represented value.
 </para>
 <para>
-The <acronym>TOAST</> code is triggered only
+The <acronym>TOAST</> management code is triggered only
 when a row value to be stored in a table is wider than
 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
 The <acronym>TOAST</> code will compress and/or move
@@ -397,8 +431,8 @@ none of the out-of-line values change.
 </para>
 <para>
-The <acronym>TOAST</> code recognizes four different strategies for storing
-<acronym>TOAST</>-able columns:
+The <acronym>TOAST</> management code recognizes four different strategies
+for storing <acronym>TOAST</>-able columns on disk:
 <itemizedlist>
 <listitem>
@@ -460,6 +494,41 @@ pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
 </para>
+</sect2>
+<sect2 id="storage-toast-inmemory">
+<title>Out-of-line, in-memory TOAST storage</title>
+<para>
+<acronym>TOAST</> pointers can point to data that is not on disk, but is
+elsewhere in the memory of the current server process. Such pointers
+obviously cannot be long-lived, but they are nonetheless useful. There
+is currently just one sub-case:
+pointers to <firstterm>indirect</> data.
+</para>
+<para>
+Indirect <acronym>TOAST</> pointers simply point at a non-indirect varlena
+value stored somewhere in memory. This case was originally created merely
+as a proof of concept, but it is currently used during logical decoding to
+avoid possibly having to create physical tuples exceeding 1 GB (as pulling
+all out-of-line field values into the tuple might do). The case is of
+limited use since the creator of the pointer datum is entirely responsible
+that the referenced data survives for as long as the pointer could exist,
+and there is no infrastructure to help with this.
+</para>
+<para>
+For all types of in-memory <acronym>TOAST</> pointer, the <acronym>TOAST</>
+management code ensures that no such pointer datum can accidentally get
+stored on disk. In-memory <acronym>TOAST</> pointers are automatically
+expanded to normal in-line varlena values before storage &mdash; and then
+possibly converted to on-disk <acronym>TOAST</> pointers, if the containing
+tuple would otherwise be too big.
+</para>
+</sect2>
 </sect1>
 <sect1 id="storage-fsm">
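The header encoding described in the varlena paragraph of the storage documentation diff above can be modeled in a few lines of C. This is a minimal sketch, not PostgreSQL source. It assumes the lowest-order-bit variant of the flag layout (the text allows either the highest-order or lowest-order bits), and it treats the header as a numeric value whose low 8 bits are the first header byte. The names `header_kind`, `classify_header`, `make_short_header`, `make_plain_header`, and `header_total_size` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the varlena header kinds described in the diff.
 * Assumption: the flag bits are the lowest-order bits of the first header
 * byte; the text says the highest-order bits may be used instead.
 * None of these names come from PostgreSQL.
 */
typedef enum {
    HDR_PLAIN_4B,      /* both flag bits clear: ordinary four-byte header */
    HDR_COMPRESSED_4B, /* first bit clear, adjacent bit set: compressed */
    HDR_SHORT_1B,      /* first bit set: single-byte header */
    HDR_TOAST_POINTER  /* single-byte header whose remaining bits are zero */
} header_kind;

/* Classify a datum by looking only at its first header byte. */
static header_kind classify_header(uint8_t first_byte)
{
    if (first_byte & 0x01)
        return (first_byte >> 1) == 0 ? HDR_TOAST_POINTER : HDR_SHORT_1B;
    if (first_byte & 0x02)
        return HDR_COMPRESSED_4B;
    return HDR_PLAIN_4B;
}

/* Build a single-byte header for a datum of total_size bytes (1..127). */
static uint8_t make_short_header(uint8_t total_size)
{
    assert(total_size >= 1 && total_size <= 127);
    return (uint8_t)((total_size << 1) | 0x01);
}

/* Build a four-byte header value for an uncompressed datum (below 1 GB). */
static uint32_t make_plain_header(uint32_t total_size)
{
    assert(total_size < (UINT32_C(1) << 30));
    return total_size << 2;
}

/*
 * Recover the total datum size from a header value whose low 8 bits are the
 * first header byte. Returns 0 for a TOAST pointer, whose size is determined
 * by the code in the second byte rather than by the header itself.
 */
static uint32_t header_total_size(uint32_t header)
{
    uint8_t first_byte = (uint8_t)(header & 0xFF);

    switch (classify_header(first_byte)) {
    case HDR_SHORT_1B:
        return (uint32_t)(first_byte >> 1);
    case HDR_TOAST_POINTER:
        return 0;
    default:
        return header >> 2; /* plain or compressed four-byte header */
    }
}
```

In this model a zero self-inclusive length is impossible for a real value, which is why the all-zero remaining bits of a single-byte header are free to mark a TOAST pointer.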


@@ -234,35 +234,49 @@ CREATE TYPE complex (
 </para>
 <para>
+If the internal representation of the data type is variable-length, the
+internal representation must follow the standard layout for variable-length
+data: the first four bytes must be a <type>char[4]</type> field which is
+never accessed directly (customarily named <structfield>vl_len_</>). You
+must use the <function>SET_VARSIZE()</function> macro to store the total
+size of the datum (including the length field itself) in this field
+and <function>VARSIZE()</function> to retrieve it. (These macros exist
+because the length field may be encoded depending on platform.)
+</para>
+<para>
+For further details see the description of the
+<xref linkend="sql-createtype"> command.
+</para>
+<sect2 id="xtypes-toast">
+<title>TOAST Considerations</title>
 <indexterm>
 <primary>TOAST</primary>
 <secondary>and user-defined types</secondary>
 </indexterm>
-If the values of your data type vary in size (in internal form), you should
-make the data type <acronym>TOAST</>-able (see <xref
-linkend="storage-toast">). You should do this even if the data are always
+<para>
+If the values of your data type vary in size (in internal form), it's
+usually desirable to make the data type <acronym>TOAST</>-able (see <xref
+linkend="storage-toast">). You should do this even if the values are always
 too small to be compressed or stored externally, because
 <acronym>TOAST</> can save space on small data too, by reducing header
 overhead.
 </para>
 <para>
-To do this, the internal representation must follow the standard layout for
-variable-length data: the first four bytes must be a <type>char[4]</type>
-field which is never accessed directly (customarily named
-<structfield>vl_len_</>). You
-must use <function>SET_VARSIZE()</function> to store the size of the datum
-in this field and <function>VARSIZE()</function> to retrieve it. The C
-functions operating on the data type must always be careful to unpack any
-toasted values they are handed, by using <function>PG_DETOAST_DATUM</>.
-(This detail is customarily hidden by defining type-specific
-<function>GETARG_DATATYPE_P</function> macros.) Then, when running the
-<command>CREATE TYPE</command> command, specify the internal length as
-<literal>variable</> and select the appropriate storage option.
+To support <acronym>TOAST</> storage, the C functions operating on the data
+type must always be careful to unpack any toasted values they are handed
+by using <function>PG_DETOAST_DATUM</>. (This detail is customarily hidden
+by defining type-specific <function>GETARG_DATATYPE_P</function> macros.)
+Then, when running the <command>CREATE TYPE</command> command, specify the
+internal length as <literal>variable</> and select some appropriate storage
+option other than <literal>plain</>.
 </para>
 <para>
-If the alignment is unimportant (either just for a specific function or
+If data alignment is unimportant (either just for a specific function or
 because the data type specifies byte alignment anyway) then it's possible
 to avoid some of the overhead of <function>PG_DETOAST_DATUM</>. You can use
 <function>PG_DETOAST_DATUM_PACKED</> instead (customarily hidden by
@@ -286,8 +300,6 @@ CREATE TYPE complex (
 </para>
 </note>
-<para>
-For further details see the description of the
-<xref linkend="sql-createtype"> command.
-</para>
+</sect2>
 </sect1>
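The rule added in the last file's diff, that the length field is never accessed directly and must go through `SET_VARSIZE()` and `VARSIZE()`, is easier to see with a toy model. The sketch below is not the PostgreSQL API: `toy_varlena`, `toy_set_varsize`, `toy_varsize`, and `toy_make` are hypothetical stand-ins, and the shift-by-two encoding is only an assumed illustration of an encoded length word. In a real extension the macros come from the server headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_HDRSZ 4u /* size of the length field, as in the documented layout */

/* Hypothetical variable-length type following the documented layout. */
typedef struct {
    char vl_len_[4]; /* encoded total size; never read or written directly */
    char data[];     /* payload bytes */
} toy_varlena;

/* Stand-in for SET_VARSIZE(): store the total size in encoded form. */
static void toy_set_varsize(toy_varlena *v, uint32_t total_size)
{
    uint32_t word;

    assert(total_size < (UINT32_C(1) << 30));
    word = total_size << 2; /* assumed encoding: two low bits kept as flags */
    memcpy(v->vl_len_, &word, sizeof word);
}

/* Stand-in for VARSIZE(): decode the total size, length field included. */
static uint32_t toy_varsize(const toy_varlena *v)
{
    uint32_t word;

    memcpy(&word, v->vl_len_, sizeof word);
    return word >> 2;
}

/* Allocate a value holding n payload bytes; the caller frees the result. */
static toy_varlena *toy_make(const char *payload, uint32_t n)
{
    toy_varlena *v = malloc(TOY_HDRSZ + n);

    if (v == NULL)
        return NULL;
    toy_set_varsize(v, TOY_HDRSZ + n);
    memcpy(v->data, payload, n);
    return v;
}
```

Reading `vl_len_` as a plain integer in this model yields the encoded word rather than the size, which is the mistake the accessor macros exist to prevent, including for types declared with plain storage.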