Add some real documentation about TOAST (finally). Combine this with

the old 'page' chapter and the recently added 'filelayout' chapter to
make a coherent chapter about PostgreSQL's physical storage layout.
This commit is contained in:
Tom Lane 2005-01-10 00:04:43 +00:00
parent 521e8888e9
commit ab3bb9cf78
9 changed files with 704 additions and 535 deletions

View File

@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/diskusage.sgml,v 1.13 2004/12/28 19:08:58 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/diskusage.sgml,v 1.14 2005/01/10 00:04:38 tgl Exp $
-->
<chapter id="diskusage">
@ -22,12 +22,12 @@ $PostgreSQL: pgsql/doc/src/sgml/diskusage.sgml,v 1.13 2004/12/28 19:08:58 tgl Ex
stored. If the table has any columns with potentially-wide values,
there is also a <acronym>TOAST</> file associated with the table,
which is used to store values too wide to fit comfortably in the main
table. There will be one index on the
table (see <xref linkend="storage-toast">). There will be one index on the
<acronym>TOAST</> table, if present. There may also be indexes associated
with the base table. Each table and index is stored in a separate disk
file &mdash; possibly more than one file, if the file would exceed one
gigabyte. Naming conventions for these files are described in <xref
linkend="file-layout">.
linkend="storage-file-layout">.
</para>
<para>

View File

@ -1,161 +0,0 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/filelayout.sgml,v 1.2 2004/11/16 15:00:36 tgl Exp $
-->
<chapter id="file-layout">
<title>Database File Layout</title>
<abstract>
<para>
A description of the database physical storage layout.
</para>
</abstract>
<para>
This section provides an overview of the physical format used by
<productname>PostgreSQL</productname> databases.
</para>
<para>
All the data needed for a database cluster is stored within the cluster's data
directory, commonly referred to as <varname>PGDATA</> (after the name of the
environment variable that can be used to define it). A common location for
<varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,
managed by different postmasters, can exist on the same machine.
</para>
<para>
The <varname>PGDATA</> directory contains several subdirectories and control
files, as shown in <xref linkend="pgdata-contents-table">. In addition to
these required items, the cluster configuration files
<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
<filename>pg_ident.conf</filename> are traditionally stored in
<varname>PGDATA</> (although beginning in
<productname>PostgreSQL</productname> 8.0 it is possible to keep them
elsewhere).
</para>
<table tocentry="1" id="pgdata-contents-table">
<title>Contents of <varname>PGDATA</></title>
<tgroup cols="2">
<thead>
<row>
<entry>
Item
</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry><filename>PG_VERSION</></entry>
<entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
</row>
<row>
<entry><filename>base</></entry>
<entry>Subdirectory containing per-database subdirectories</entry>
</row>
<row>
<entry><filename>global</></entry>
<entry>Subdirectory containing cluster-wide tables, such as
<structname>pg_database</></entry>
</row>
<row>
<entry><filename>pg_clog</></entry>
<entry>Subdirectory containing transaction commit status data</entry>
</row>
<row>
<entry><filename>pg_subtrans</></entry>
<entry>Subdirectory containing subtransaction status data</entry>
</row>
<row>
<entry><filename>pg_tblspc</></entry>
<entry>Subdirectory containing symbolic links to tablespaces</entry>
</row>
<row>
<entry><filename>pg_xlog</></entry>
<entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
</row>
<row>
<entry><filename>postmaster.opts</></entry>
<entry>A file recording the command-line options the postmaster was
last started with</entry>
</row>
<row>
<entry><filename>postmaster.pid</></entry>
<entry>A lock file recording the current postmaster PID and shared memory
segment ID (not present after postmaster shutdown)</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
For each database in the cluster there is a subdirectory within
<varname>PGDATA</><filename>/base</>, named after the database's OID in
<structname>pg_database</>. This subdirectory is the default location
for the database's files; in particular, its system catalogs are stored
there.
</para>
<para>
Each table and index is stored in a separate file, named after the table
or index's <firstterm>filenode</> number, which can be found in
<structname>pg_class</>.<structfield>relfilenode</>.
</para>
<caution>
<para>
Note that while a table's filenode often matches its OID, this is
<emphasis>not</> necessarily the case; some operations, like
<command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
Avoid assuming that filenode and table OID are the same.
</para>
</caution>
<para>
When a table or index exceeds 1Gb, it is divided into gigabyte-sized
<firstterm>segments</>. The first segment's file name is the same as the
filenode; subsequent segments are named filenode.1, filenode.2, etc.
This arrangement avoids problems on platforms that have file size limitations.
The contents of tables and indexes are discussed further in
<xref linkend="page">.
</para>
<para>
A table that has columns with potentially large entries will have an
associated <firstterm>TOAST</> table, which is used for out-of-line storage of
field values that are too large to keep in the table rows proper.
<structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
its TOAST table, if any.
</para>
<para>
Tablespaces make the scenario more complicated. Each user-defined tablespace
has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
directory, which points to the physical tablespace directory (as specified in
its <command>CREATE TABLESPACE</> command). The symbolic link is named after
the tablespace's OID. Inside the physical tablespace directory there is
a subdirectory for each database that has elements in the tablespace, named
after the database's OID. Tables within that directory follow the filenode
naming scheme. The <literal>pg_default</> tablespace is not accessed through
<filename>pg_tblspc</>, but corresponds to
<varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>
tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
<varname>PGDATA</><filename>/global</>.
</para>
</chapter>

View File

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.40 2004/12/03 05:50:18 momjian Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.41 2005/01/10 00:04:38 tgl Exp $ -->
<!entity history SYSTEM "history.sgml">
<!entity info SYSTEM "info.sgml">
@ -75,15 +75,14 @@
<!entity arch-dev SYSTEM "arch-dev.sgml">
<!entity bki SYSTEM "bki.sgml">
<!entity catalogs SYSTEM "catalogs.sgml">
<!entity filelayout SYSTEM "filelayout.sgml">
<!entity geqo SYSTEM "geqo.sgml">
<!entity gist SYSTEM "gist.sgml">
<!entity indexcost SYSTEM "indexcost.sgml">
<!entity nls SYSTEM "nls.sgml">
<!entity page SYSTEM "page.sgml">
<!entity plhandler SYSTEM "plhandler.sgml">
<!entity protocol SYSTEM "protocol.sgml">
<!entity sources SYSTEM "sources.sgml">
<!entity storage SYSTEM "storage.sgml">
<!-- appendixes -->
<!entity contacts SYSTEM "contacts.sgml">

View File

@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/lobj.sgml,v 1.35 2005/01/08 22:13:33 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/lobj.sgml,v 1.36 2005/01/10 00:04:38 tgl Exp $
-->
<chapter id="largeObjects">
@ -51,9 +51,11 @@ $PostgreSQL: pgsql/doc/src/sgml/lobj.sgml,v 1.35 2005/01/08 22:13:33 tgl Exp $
</para>
<para>
<indexterm><primary>TOAST</></>
<indexterm><primary>sliced bread</><see>TOAST</></indexterm>
<productname>PostgreSQL 7.1</productname> introduced a mechanism
<indexterm>
<primary>TOAST</primary>
<secondary>versus large objects</secondary>
</indexterm>
<productname>PostgreSQL</productname> 7.1 introduced a mechanism
(nicknamed <quote><acronym>TOAST</acronym></quote>) that allows
data values to be much larger than single pages. This
makes the large object facility partially obsolete. One

View File

@ -1,354 +0,0 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/page.sgml,v 1.19 2004/11/12 21:50:53 tgl Exp $
-->
<chapter id="page">
<title>Database Page Layout</title>
<abstract>
<para>
A description of the database file page format.
</para>
</abstract>
<para>
This section provides an overview of the page format used by
<productname>PostgreSQL</productname> tables and indexes.<footnote>
<para>
Actually, index access methods need not use this page format.
All the existing index methods do use this basic format,
but the data kept on index metapages usually doesn't follow
the item layout rules.
</para>
</footnote>
TOAST tables and sequences are formatted just like a regular table.
</para>
<para>
In the following explanation, a
<firstterm>byte</firstterm>
is assumed to contain 8 bits. In addition, the term
<firstterm>item</firstterm>
refers to an individual data value that is stored on a page. In a table,
an item is a row; in an index, an item is an index entry.
</para>
<para>
Every table and index is stored as an array of <firstterm>pages</> of a
fixed size (usually 8K, although a different page size can be selected
when compiling the server). In a table, all the pages are logically
equivalent, so a particular item (row) can be stored in any page. In
indexes, the first page is generally reserved as a <firstterm>metapage</>
holding control information, and there may be different types of pages
within the index, depending on the index access method.
</para>
<para>
<xref linkend="page-table"> shows the overall layout of a page.
There are five parts to each page.
</para>
<table tocentry="1" id="page-table">
<title>Overall Page Layout</title>
<titleabbrev>Page Layout</titleabbrev>
<tgroup cols="2">
<thead>
<row>
<entry>
Item
</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>PageHeaderData</entry>
<entry>20 bytes long. Contains general information about the page, including
free space pointers.</entry>
</row>
<row>
<entry>ItemPointerData</entry>
<entry>Array of (offset,length) pairs pointing to the actual items.
4 bytes per item.</entry>
</row>
<row>
<entry>Free space</entry>
<entry>The unallocated space. New item pointers are allocated from the start
of this area, new items from the end.</entry>
</row>
<row>
<entry>Items</entry>
<entry>The actual items themselves.</entry>
</row>
<row>
<entry>Special space</entry>
<entry>Index access method specific data. Different methods store different
data. Empty in ordinary tables.</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
The first 20 bytes of each page consists of a page header
(PageHeaderData). Its format is detailed in <xref
linkend="pageheaderdata-table">. The first two fields track the most
recent WAL entry related to this page. They are followed by three 2-byte
integer fields
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
and <structfield>pd_special</structfield>). These contain byte offsets
from the page start to the start
of unallocated space, to the end of unallocated space, and to the start of
the special space.
The last 2 bytes of the page header,
<structfield>pd_pagesize_version</structfield>, store both the page size
and a version indicator. Beginning with
<productname>PostgreSQL</productname> 8.0 the version number is 2;
<productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
prior releases used version number 0.
(The basic page layout and header format has not changed in these versions,
but the layout of heap row headers has.) The page size
is basically only present as a cross-check; there is no support for having
more than one page size in an installation.
</para>
<table tocentry="1" id="pageheaderdata-table">
<title>PageHeaderData Layout</title>
<titleabbrev>PageHeaderData Layout</titleabbrev>
<tgroup cols="4">
<thead>
<row>
<entry>Field</entry>
<entry>Type</entry>
<entry>Length</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>pd_lsn</entry>
<entry>XLogRecPtr</entry>
<entry>8 bytes</entry>
<entry>LSN: next byte after last byte of xlog record for last change
to this page</entry>
</row>
<row>
<entry>pd_tli</entry>
<entry>TimeLineID</entry>
<entry>4 bytes</entry>
<entry>TLI of last change</entry>
</row>
<row>
<entry>pd_lower</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to start of free space</entry>
</row>
<row>
<entry>pd_upper</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to end of free space</entry>
</row>
<row>
<entry>pd_special</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to start of special space</entry>
</row>
<row>
<entry>pd_pagesize_version</entry>
<entry>uint16</entry>
<entry>2 bytes</entry>
<entry>Page size and layout version number information</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
All the details may be found in
<filename>src/include/storage/bufpage.h</filename>.
</para>
<para>
Following the page header are item identifiers
(<type>ItemIdData</type>), each requiring four bytes.
An item identifier contains a byte-offset to
the start of an item, its length in bytes, and a few attribute bits
which affect its interpretation.
New item identifiers are allocated
as needed from the beginning of the unallocated space.
The number of item identifiers present can be determined by looking at
<structfield>pd_lower</>, which is increased to allocate a new identifier.
Because an item
identifier is never moved until it is freed, its index may be used on a
long-term basis to reference an item, even when the item itself is moved
around on the page to compact free space. In fact, every pointer to an
item (<type>ItemPointer</type>, also known as
<type>CTID</type>) created by
<productname>PostgreSQL</productname> consists of a page number and the
index of an item identifier.
</para>
<para>
The items themselves are stored in space allocated backwards from the end
of unallocated space. The exact structure varies depending on what the
table is to contain. Tables and sequences both use a structure named
<type>HeapTupleHeaderData</type>, described below.
</para>
<para>
The final section is the <quote>special section</quote> which may
contain anything the access method wishes to store. For example,
b-tree indexes store links to the page's left and right siblings,
as well as some other data relevant to the index structure.
Ordinary tables do not use a special section at all (indicated by setting
<structfield>pd_special</> to equal the page size).
</para>
<para>
All table rows are structured in the same way. There is a fixed-size
header (occupying 27 bytes on most machines), followed by an optional null
bitmap, an optional object ID field, and the user data. The header is
detailed
in <xref linkend="heaptupleheaderdata-table">. The actual user data
(columns of the row) begins at the offset indicated by
<structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
distance for the platform.
The null bitmap is
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
<structfield>t_infomask</structfield>. If it is present it begins just after
the fixed header and occupies enough bytes to have one bit per data column
(that is, <structfield>t_natts</> bits altogether). In this list of bits, a
1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
present, all columns are assumed not-null.
The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
is set in <structfield>t_infomask</structfield>. If present, it appears just
before the <structfield>t_hoff</> boundary. Any padding needed to make
<structfield>t_hoff</> a MAXALIGN multiple will appear between the null
bitmap and the object ID. (This in turn ensures that the object ID is
suitably aligned.)
</para>
<table tocentry="1" id="heaptupleheaderdata-table">
<title>HeapTupleHeaderData Layout</title>
<titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
<tgroup cols="4">
<thead>
<row>
<entry>Field</entry>
<entry>Type</entry>
<entry>Length</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>t_xmin</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>insert XID stamp</entry>
</row>
<row>
<entry>t_cmin</entry>
<entry>CommandId</entry>
<entry>4 bytes</entry>
<entry>insert CID stamp</entry>
</row>
<row>
<entry>t_xmax</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>delete XID stamp</entry>
</row>
<row>
<entry>t_cmax</entry>
<entry>CommandId</entry>
<entry>4 bytes</entry>
<entry>delete CID stamp (overlays with t_xvac)</entry>
</row>
<row>
<entry>t_xvac</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>XID for VACUUM operation moving a row version</entry>
</row>
<row>
<entry>t_ctid</entry>
<entry>ItemPointerData</entry>
<entry>6 bytes</entry>
<entry>current TID of this or newer row version</entry>
</row>
<row>
<entry>t_natts</entry>
<entry>int16</entry>
<entry>2 bytes</entry>
<entry>number of attributes</entry>
</row>
<row>
<entry>t_infomask</entry>
<entry>uint16</entry>
<entry>2 bytes</entry>
<entry>various flag bits</entry>
</row>
<row>
<entry>t_hoff</entry>
<entry>uint8</entry>
<entry>1 byte</entry>
<entry>offset to user data</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
All the details may be found in
<filename>src/include/access/htup.h</filename>.
</para>
<para>
Interpreting the actual data can only be done with information obtained
from other tables, mostly <structname>pg_attribute</structname>. The
key values needed to identify field locations are
<structfield>attlen</structfield> and <structfield>attalign</structfield>.
There is no way to directly get a
particular attribute, except when there are only fixed width fields and no
NULLs. All this trickery is wrapped up in the functions
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
and <firstterm>heap_getsysattr</firstterm>.
</para>
<para>
To read the data you need to examine each attribute in turn. First check
whether the field is NULL according to the null bitmap. If it is, go to
the next. Then make sure you have the right alignment. If the field is a
fixed width field, then all the bytes are simply placed. If it's a
variable length field (attlen = -1) then it's a bit more complicated.
All variable-length datatypes share the common header structure
<type>varattrib</type>, which includes the total length of the stored
value and some flag bits. Depending on the flags, the data may be either
inline or in another table (TOAST); it might be compressed, too.
</para>
</chapter>

View File

@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.72 2004/12/30 03:13:56 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.73 2005/01/10 00:04:38 tgl Exp $
-->
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
@ -237,8 +237,7 @@ $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.72 2004/12/30 03:13:56 tgl Exp
&geqo;
&indexcost;
&gist;
&filelayout;
&page;
&storage;
&bki;
</part>

View File

@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/ref/alter_table.sgml,v 1.75 2005/01/04 00:39:53 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/ref/alter_table.sgml,v 1.76 2005/01/10 00:04:43 tgl Exp $
PostgreSQL documentation
-->
@ -153,10 +153,14 @@ where <replaceable class="PARAMETER">action</replaceable> is one of:
inline, uncompressed. <literal>MAIN</literal> is for inline,
compressible data. <literal>EXTERNAL</literal> is for external,
uncompressed data, and <literal>EXTENDED</literal> is for external,
compressed data. <literal>EXTENDED</literal> is the default for all
data types that support it. Use of <literal>EXTERNAL</literal> will
compressed data. <literal>EXTENDED</literal> is the default for most
data types that support non-<literal>PLAIN</literal> storage.
Use of <literal>EXTERNAL</literal> will
make substring operations on <type>text</type> and <type>bytea</type>
columns faster, at the penalty of increased storage space.
columns faster, at the penalty of increased storage space. Note that
<literal>SET STORAGE</> doesn't itself change anything in the table,
it just sets the strategy to be pursued during future table updates.
See <xref linkend="storage-toast"> for more information.
</para>
</listitem>
</varlistentry>

676
doc/src/sgml/storage.sgml Normal file
View File

@ -0,0 +1,676 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/storage.sgml,v 1.4 2005/01/10 00:04:38 tgl Exp $
-->
<chapter id="storage">
<title>Database Physical Storage</title>
<para>
This chapter provides an overview of the physical storage format used by
<productname>PostgreSQL</productname> databases.
</para>
<sect1 id="storage-file-layout">
<title>Database File Layout</title>
<para>
This section describes the storage format at the level of files and
directories.
</para>
<para>
All the data needed for a database cluster is stored within the cluster's data
directory, commonly referred to as <varname>PGDATA</> (after the name of the
environment variable that can be used to define it). A common location for
<varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,
managed by different postmasters, can exist on the same machine.
</para>
<para>
The <varname>PGDATA</> directory contains several subdirectories and control
files, as shown in <xref linkend="pgdata-contents-table">. In addition to
these required items, the cluster configuration files
<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
<filename>pg_ident.conf</filename> are traditionally stored in
<varname>PGDATA</> (although beginning in
<productname>PostgreSQL</productname> 8.0 it is possible to keep them
elsewhere).
</para>
<table tocentry="1" id="pgdata-contents-table">
<title>Contents of <varname>PGDATA</></title>
<tgroup cols="2">
<thead>
<row>
<entry>
Item
</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry><filename>PG_VERSION</></entry>
<entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
</row>
<row>
<entry><filename>base</></entry>
<entry>Subdirectory containing per-database subdirectories</entry>
</row>
<row>
<entry><filename>global</></entry>
<entry>Subdirectory containing cluster-wide tables, such as
<structname>pg_database</></entry>
</row>
<row>
<entry><filename>pg_clog</></entry>
<entry>Subdirectory containing transaction commit status data</entry>
</row>
<row>
<entry><filename>pg_subtrans</></entry>
<entry>Subdirectory containing subtransaction status data</entry>
</row>
<row>
<entry><filename>pg_tblspc</></entry>
<entry>Subdirectory containing symbolic links to tablespaces</entry>
</row>
<row>
<entry><filename>pg_xlog</></entry>
<entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
</row>
<row>
<entry><filename>postmaster.opts</></entry>
<entry>A file recording the command-line options the postmaster was
last started with</entry>
</row>
<row>
<entry><filename>postmaster.pid</></entry>
<entry>A lock file recording the current postmaster PID and shared memory
segment ID (not present after postmaster shutdown)</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
For each database in the cluster there is a subdirectory within
<varname>PGDATA</><filename>/base</>, named after the database's OID in
<structname>pg_database</>. This subdirectory is the default location
for the database's files; in particular, its system catalogs are stored
there.
</para>
<para>
Each table and index is stored in a separate file, named after the table
or index's <firstterm>filenode</> number, which can be found in
<structname>pg_class</>.<structfield>relfilenode</>.
</para>
<caution>
<para>
Note that while a table's filenode often matches its OID, this is
<emphasis>not</> necessarily the case; some operations, like
<command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
Avoid assuming that filenode and table OID are the same.
</para>
</caution>
<para>
When a table or index exceeds 1Gb, it is divided into gigabyte-sized
<firstterm>segments</>. The first segment's file name is the same as the
filenode; subsequent segments are named filenode.1, filenode.2, etc.
This arrangement avoids problems on platforms that have file size limitations.
The contents of tables and indexes are discussed further in
<xref linkend="storage-page-layout">.
</para>
<para>
A table that has columns with potentially large entries will have an
associated <firstterm>TOAST</> table, which is used for out-of-line storage of
field values that are too large to keep in the table rows proper.
<structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
its <acronym>TOAST</> table, if any.
See <xref linkend="storage-toast"> for more information.
</para>
<para>
Tablespaces make the scenario more complicated. Each user-defined tablespace
has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
directory, which points to the physical tablespace directory (as specified in
its <command>CREATE TABLESPACE</> command). The symbolic link is named after
the tablespace's OID. Inside the physical tablespace directory there is
a subdirectory for each database that has elements in the tablespace, named
after the database's OID. Tables within that directory follow the filenode
naming scheme. The <literal>pg_default</> tablespace is not accessed through
<filename>pg_tblspc</>, but corresponds to
<varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>
tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
<varname>PGDATA</><filename>/global</>.
</para>
</sect1>
<sect1 id="storage-toast">
<title>TOAST</title>
<indexterm>
<primary>TOAST</primary>
</indexterm>
<indexterm><primary>sliced bread</><see>TOAST</></indexterm>
<para>
This section provides an overview of <acronym>TOAST</> (The
Oversized-Attribute Storage Technique).
</para>
<para>
Since <productname>PostgreSQL</productname> uses a fixed page size (commonly
8Kb), and does not allow tuples to span multiple pages, it's not possible to
store very large field values directly. Before <productname>PostgreSQL</> 7.1
there was a hard limit of just under one page on the total amount of data that
could be put into a table row. In release 7.1 and later, this limit is
overcome by allowing large field values to be compressed and/or broken up into
multiple physical rows. This happens transparently to the user, with only
small impact on most of the backend code. The technique is affectionately
known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
</para>
<para>
Only certain data types support <acronym>TOAST</> &mdash; there is no need to
impose the overhead on data types that cannot produce large field values.
To support <acronym>TOAST</>, a data type must have a variable-length
(<firstterm>varlena</>) representation, in which the first 32-bit word of any
stored value contains the total length of the value in bytes (including
itself). <acronym>TOAST</> does not constrain the rest of the representation.
All the C-level functions supporting a <acronym>TOAST</>-able data type must
be careful to handle <acronym>TOAST</>ed input values. (This is normally done
by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
value; but in some cases more efficient approaches are possible.)
</para>
<para>
<acronym>TOAST</> usurps the high-order two bits of the varlena length word,
thereby limiting the logical size of any value of a <acronym>TOAST</>-able
data type to 1Gb (2<superscript>30</> - 1 bytes). When both bits are zero,
the value is an ordinary un-<acronym>TOAST</>ed value of the data type. One
of these bits, if set, indicates that the value has been compressed and must
be decompressed before use. The other bit, if set, indicates that the value
has been stored out-of-line. In this case the remainder of the value is
actually just a pointer, and the correct data has to be found elsewhere. When
both bits are set, the out-of-line data has been compressed too. In each case
the length in the low-order bits of the varlena word indicates the actual size
of the datum, not the size of the logical value that would be extracted by
decompression or fetching of the out-of-line data.
</para>
<para>
If any of the columns of a table are <acronym>TOAST</>-able, the table will
have an associated <acronym>TOAST</> table, whose OID is stored in the table's
<structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
<acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
described in more detail below.
</para>
<para>
The compression technique used is a fairly simple and very fast member
of the LZ family of compression techniques. See
<filename>src/backend/utils/adt/pg_lzcompress.c</> for the details.
</para>
<para>
Out-of-line values are divided (after compression if used) into chunks of at
most <literal>TOAST_MAX_CHUNK_SIZE</> bytes (this value is a little less than
<literal>BLCKSZ/4</>, or about 2000 bytes by default). Each chunk is stored
as a separate row in the <acronym>TOAST</> table for the owning table. Every
<acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
identifying the particular <acronym>TOAST</>ed value),
<structfield>chunk_seq</> (a sequence number for the chunk within its value),
and <structfield>chunk_data</> (the actual data of the chunk). A unique index
on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
retrieval of the values. A pointer datum representing an out-of-line
<acronym>TOAST</>ed value therefore needs to store the OID of the
<acronym>TOAST</> table in which to look and the OID of the specific value
(its <structfield>chunk_id</>). For convenience, pointer datums also store the
logical datum size (original uncompressed data length) and actual stored size
(different if compression was applied). Allowing for the varlena header word,
the total size of a <acronym>TOAST</> pointer datum is therefore 20 bytes
regardless of the actual size of the represented value.
</para>
<para>
The <acronym>TOAST</> code is triggered only
when a row value to be stored in a table is wider than <literal>BLCKSZ/4</>
bytes (normally 2Kb). The <acronym>TOAST</> code will compress and/or move
field values out-of-line until the row value is shorter than
<literal>BLCKSZ/4</> bytes or no more gains can be had. During an UPDATE
operation, values of unchanged fields are normally preserved as-is; so an
UPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs if
none of the out-of-line values change.
</para>
<para>
The <acronym>TOAST</> code recognizes four different strategies for storing
<acronym>TOAST</>-able columns:
<itemizedlist>
<listitem>
<para>
<literal>PLAIN</literal> prevents either compression or
out-of-line storage. This is the only possible strategy for
columns of non-<acronym>TOAST</>-able data types.
</para>
</listitem>
<listitem>
<para>
<literal>EXTENDED</literal> allows both compression and out-of-line
storage. This is the default for most <acronym>TOAST</>-able data types.
Compression will be attempted first, then out-of-line storage if
the row is still too big.
</para>
</listitem>
<listitem>
<para>
<literal>EXTERNAL</literal> allows out-of-line storage but not
compression. Use of <literal>EXTERNAL</literal> will
make substring operations on wide <type>text</type> and
<type>bytea</type> columns faster (at the penalty of increased storage
space) because these operations are optimized to fetch only the
required parts of the out-of-line value when it is not compressed.
</para>
</listitem>
<listitem>
<para>
<literal>MAIN</literal> allows compression but not out-of-line
storage. (Actually, out-of-line storage will still be performed
for such columns, but only as a last resort when there is no other
way to make the row small enough.)
</para>
</listitem>
</itemizedlist>
Each <acronym>TOAST</>-able data type specifies a default strategy for columns
of that data type, but the strategy for a given table column can be altered
with <command>ALTER TABLE SET STORAGE</>.
</para>
<para>
This scheme has a number of advantages compared to a more straightforward
approach such as allowing row values to span pages. Assuming that queries are
usually qualified by comparisons against relatively small key values, most of
the work of the executor will be done using the main row entry. The big values
of <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)
at the time the result set is sent to the client. Thus, the main table is much
smaller and more of its rows fit in the shared buffer cache than would be the
case without any out-of-line storage. Sort sets shrink also, and sorts will
more often be done entirely in memory. A little test showed that a table
containing typical HTML pages and their URLs was stored in about half of the
raw data size including the <acronym>TOAST</> table, and that the main table
contained only about 10% of the entire data (the URLs and some small HTML
pages). There was no runtime difference compared to an un-<acronym>TOAST</>ed
comparison table, in which all the HTML pages were cut down to 7Kb to fit.
</para>
</sect1>
<sect1 id="storage-page-layout">
<title>Database Page Layout</title>
<para>
This section provides an overview of the page format used within
<productname>PostgreSQL</productname> tables and indexes.<footnote>
<para>
Actually, index access methods need not use this page format.
All the existing index methods do use this basic format,
but the data kept on index metapages usually doesn't follow
the item layout rules.
</para>
</footnote>
Sequences and <acronym>TOAST</> tables are formatted just like a regular table.
</para>
<para>
In the following explanation, a
<firstterm>byte</firstterm>
is assumed to contain 8 bits. In addition, the term
<firstterm>item</firstterm>
refers to an individual data value that is stored on a page. In a table,
an item is a row; in an index, an item is an index entry.
</para>
<para>
Every table and index is stored as an array of <firstterm>pages</> of a
fixed size (usually 8Kb, although a different page size can be selected
when compiling the server). In a table, all the pages are logically
equivalent, so a particular item (row) can be stored in any page. In
indexes, the first page is generally reserved as a <firstterm>metapage</>
holding control information, and there may be different types of pages
within the index, depending on the index access method.
</para>
<para>
<xref linkend="page-table"> shows the overall layout of a page.
There are five parts to each page.
</para>
<table tocentry="1" id="page-table">
<title>Overall Page Layout</title>
<titleabbrev>Page Layout</titleabbrev>
<tgroup cols="2">
<thead>
<row>
<entry>
Item
</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>PageHeaderData</entry>
<entry>20 bytes long. Contains general information about the page, including
free space pointers.</entry>
</row>
<row>
<entry>ItemPointerData</entry>
<entry>Array of (offset,length) pairs pointing to the actual items.
4 bytes per item.</entry>
</row>
<row>
<entry>Free space</entry>
<entry>The unallocated space. New item pointers are allocated from the start
of this area, new items from the end.</entry>
</row>
<row>
<entry>Items</entry>
<entry>The actual items themselves.</entry>
</row>
<row>
<entry>Special space</entry>
<entry>Index access method specific data. Different methods store different
data. Empty in ordinary tables.</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
The first 20 bytes of each page consists of a page header
(PageHeaderData). Its format is detailed in <xref
linkend="pageheaderdata-table">. The first two fields track the most
recent WAL entry related to this page. They are followed by three 2-byte
integer fields
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
and <structfield>pd_special</structfield>). These contain byte offsets
from the page start to the start
of unallocated space, to the end of unallocated space, and to the start of
the special space.
The last 2 bytes of the page header,
<structfield>pd_pagesize_version</structfield>, store both the page size
and a version indicator. Beginning with
<productname>PostgreSQL</productname> 8.0 the version number is 2;
<productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
prior releases used version number 0.
(The basic page layout and header format has not changed in these versions,
but the layout of heap row headers has.) The page size
is basically only present as a cross-check; there is no support for having
more than one page size in an installation.
</para>
<table tocentry="1" id="pageheaderdata-table">
<title>PageHeaderData Layout</title>
<titleabbrev>PageHeaderData Layout</titleabbrev>
<tgroup cols="4">
<thead>
<row>
<entry>Field</entry>
<entry>Type</entry>
<entry>Length</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>pd_lsn</entry>
<entry>XLogRecPtr</entry>
<entry>8 bytes</entry>
<entry>LSN: next byte after last byte of xlog record for last change
to this page</entry>
</row>
<row>
<entry>pd_tli</entry>
<entry>TimeLineID</entry>
<entry>4 bytes</entry>
<entry>TLI of last change</entry>
</row>
<row>
<entry>pd_lower</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to start of free space</entry>
</row>
<row>
<entry>pd_upper</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to end of free space</entry>
</row>
<row>
<entry>pd_special</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to start of special space</entry>
</row>
<row>
<entry>pd_pagesize_version</entry>
<entry>uint16</entry>
<entry>2 bytes</entry>
<entry>Page size and layout version number information</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
All the details may be found in
<filename>src/include/storage/bufpage.h</filename>.
</para>
<para>
Following the page header are item identifiers
(<type>ItemIdData</type>), each requiring four bytes.
An item identifier contains a byte-offset to
the start of an item, its length in bytes, and a few attribute bits
which affect its interpretation.
New item identifiers are allocated
as needed from the beginning of the unallocated space.
The number of item identifiers present can be determined by looking at
<structfield>pd_lower</>, which is increased to allocate a new identifier.
Because an item
identifier is never moved until it is freed, its index may be used on a
long-term basis to reference an item, even when the item itself is moved
around on the page to compact free space. In fact, every pointer to an
item (<type>ItemPointer</type>, also known as
<type>CTID</type>) created by
<productname>PostgreSQL</productname> consists of a page number and the
index of an item identifier.
</para>
<para>
The items themselves are stored in space allocated backwards from the end
of unallocated space. The exact structure varies depending on what the
table is to contain. Tables and sequences both use a structure named
<type>HeapTupleHeaderData</type>, described below.
</para>
<para>
The final section is the <quote>special section</quote> which may
contain anything the access method wishes to store. For example,
b-tree indexes store links to the page's left and right siblings,
as well as some other data relevant to the index structure.
Ordinary tables do not use a special section at all (indicated by setting
<structfield>pd_special</> to equal the page size).
</para>
<para>
All table rows are structured in the same way. There is a fixed-size
header (occupying 27 bytes on most machines), followed by an optional null
bitmap, an optional object ID field, and the user data. The header is
detailed
in <xref linkend="heaptupleheaderdata-table">. The actual user data
(columns of the row) begins at the offset indicated by
<structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
distance for the platform.
The null bitmap is
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
<structfield>t_infomask</structfield>. If it is present it begins just after
the fixed header and occupies enough bytes to have one bit per data column
(that is, <structfield>t_natts</> bits altogether). In this list of bits, a
1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
present, all columns are assumed not-null.
The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
is set in <structfield>t_infomask</structfield>. If present, it appears just
before the <structfield>t_hoff</> boundary. Any padding needed to make
<structfield>t_hoff</> a MAXALIGN multiple will appear between the null
bitmap and the object ID. (This in turn ensures that the object ID is
suitably aligned.)
</para>
<table tocentry="1" id="heaptupleheaderdata-table">
<title>HeapTupleHeaderData Layout</title>
<titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
<tgroup cols="4">
<thead>
<row>
<entry>Field</entry>
<entry>Type</entry>
<entry>Length</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>t_xmin</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>insert XID stamp</entry>
</row>
<row>
<entry>t_cmin</entry>
<entry>CommandId</entry>
<entry>4 bytes</entry>
<entry>insert CID stamp</entry>
</row>
<row>
<entry>t_xmax</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>delete XID stamp</entry>
</row>
<row>
<entry>t_cmax</entry>
<entry>CommandId</entry>
<entry>4 bytes</entry>
<entry>delete CID stamp (overlays with t_xvac)</entry>
</row>
<row>
<entry>t_xvac</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>XID for VACUUM operation moving a row version</entry>
</row>
<row>
<entry>t_ctid</entry>
<entry>ItemPointerData</entry>
<entry>6 bytes</entry>
<entry>current TID of this or newer row version</entry>
</row>
<row>
<entry>t_natts</entry>
<entry>int16</entry>
<entry>2 bytes</entry>
<entry>number of attributes</entry>
</row>
<row>
<entry>t_infomask</entry>
<entry>uint16</entry>
<entry>2 bytes</entry>
<entry>various flag bits</entry>
</row>
<row>
<entry>t_hoff</entry>
<entry>uint8</entry>
<entry>1 byte</entry>
<entry>offset to user data</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
All the details may be found in
<filename>src/include/access/htup.h</filename>.
</para>
<para>
Interpreting the actual data can only be done with information obtained
from other tables, mostly <structname>pg_attribute</structname>. The
key values needed to identify field locations are
<structfield>attlen</structfield> and <structfield>attalign</structfield>.
There is no way to directly get a
particular attribute, except when there are only fixed width fields and no
NULLs. All this trickery is wrapped up in the functions
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
and <firstterm>heap_getsysattr</firstterm>.
</para>
<para>
To read the data you need to examine each attribute in turn. First check
whether the field is NULL according to the null bitmap. If it is, go to
the next. Then make sure you have the right alignment. If the field is a
fixed width field, then all the bytes are simply placed. If it's a
variable length field (attlen = -1) then it's a bit more complicated.
All variable-length datatypes share the common header structure
<type>varattrib</type>, which includes the total length of the stored
value and some flag bits. Depending on the flags, the data may be either
inline or in a <acronym>TOAST</> table;
it might be compressed, too (see <xref linkend="storage-toast">).
</para>
</sect1>
</chapter>

View File

@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/xtypes.sgml,v 1.24 2005/01/08 22:13:38 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/xtypes.sgml,v 1.25 2005/01/10 00:04:38 tgl Exp $
-->
<sect1 id="xtypes">
@ -232,10 +232,14 @@ CREATE TYPE complex (
</para>
<para>
<indexterm>
<primary>TOAST</primary>
<secondary>and user-defined types</secondary>
</indexterm>
If the values of your data type might exceed a few hundred bytes in
size (in internal form), you should make the data type
TOAST-able.<indexterm><primary>TOAST</primary><secondary>and
user-defined types</secondary></indexterm> To do this, the internal
<acronym>TOAST</>-able (see <xref linkend="storage-toast">).
To do this, the internal
representation must follow the standard layout for variable-length
data: the first four bytes must be an <type>int32</type> containing
the total length in bytes of the datum (including itself). The C