From ab3bb9cf784c0d7057add6a3a95b296c678cd029 Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Mon, 10 Jan 2005 00:04:43 +0000
Subject: [PATCH] Add some real documentation about TOAST (finally). Combine this with the old 'page' chapter and the recently added 'filelayout' chapter to make a coherent chapter about PostgreSQL's physical storage layout.
---
 doc/src/sgml/diskusage.sgml | 6 +-
 doc/src/sgml/filelayout.sgml | 161 -------
 doc/src/sgml/filelist.sgml | 5 +-
 doc/src/sgml/lobj.sgml | 10 +-
 doc/src/sgml/page.sgml | 354 ----------------
 doc/src/sgml/postgres.sgml | 5 +-
 doc/src/sgml/ref/alter_table.sgml | 12 +-
 doc/src/sgml/storage.sgml | 676 ++++++++++++++++++++++++++++++
 doc/src/sgml/xtypes.sgml | 10 +-
 9 files changed, 704 insertions(+), 535 deletions(-)
 delete mode 100644 doc/src/sgml/filelayout.sgml
 delete mode 100644 doc/src/sgml/page.sgml
 create mode 100644 doc/src/sgml/storage.sgml

diff --git a/doc/src/sgml/diskusage.sgml b/doc/src/sgml/diskusage.sgml
index 65cc6f651f..385fd93ef1 100644
--- a/doc/src/sgml/diskusage.sgml
+++ b/doc/src/sgml/diskusage.sgml
@@ -1,5 +1,5 @@
@@ -22,12 +22,12 @@ $PostgreSQL: pgsql/doc/src/sgml/diskusage.sgml,v 1.13 2004/12/28 19:08:58 tgl Ex
 stored. If the table has any columns with potentially-wide values, there is also a TOAST file associated with the table, which is used to store values too wide to fit comfortably in the main
- table. There will be one index on the
+ table (see ). There will be one index on the
 TOAST table, if present. There may also be indexes associated with the base table. Each table and index is stored in a separate disk file, possibly more than one file, if the file would exceed one gigabyte. Naming conventions for these files are described in .
+ linkend="storage-file-layout">.
diff --git a/doc/src/sgml/filelayout.sgml b/doc/src/sgml/filelayout.sgml deleted file mode 100644 index 6aa856fd62..0000000000 --- a/doc/src/sgml/filelayout.sgml +++ /dev/null @@ -1,161 +0,0 @@ - - - - -Database File Layout - - - -A description of the database physical storage layout. - - - - -This section provides an overview of the physical format used by -PostgreSQL databases. - - - -All the data needed for a database cluster is stored within the cluster's data -directory, commonly referred to as PGDATA (after the name of the -environment variable that can be used to define it). A common location for -PGDATA is /var/lib/pgsql/data. Multiple clusters, -managed by different postmasters, can exist on the same machine. - - - -The PGDATA directory contains several subdirectories and control -files, as shown in . In addition to -these required items, the cluster configuration files -postgresql.conf, pg_hba.conf, and -pg_ident.conf are traditionally stored in -PGDATA (although beginning in -PostgreSQL 8.0 it is possible to keep them -elsewhere). - - - -Contents of <varname>PGDATA</> - - - - -Item - -Description - - - - - - - PG_VERSION - A file containing the major version number of PostgreSQL - - - - base - Subdirectory containing per-database subdirectories - - - - global - Subdirectory containing cluster-wide tables, such as - pg_database - - - - pg_clog - Subdirectory containing transaction commit status data - - - - pg_subtrans - Subdirectory containing subtransaction status data - - - - pg_tblspc - Subdirectory containing symbolic links to tablespaces - - - - pg_xlog - Subdirectory containing WAL (Write Ahead Log) files - - - - postmaster.opts - A file recording the command-line options the postmaster was -last started with - - - - postmaster.pid - A lock file recording the current postmaster PID and shared memory -segment ID (not present after postmaster shutdown) - - - - -
- - -For each database in the cluster there is a subdirectory within -PGDATA/base, named after the database's OID in -pg_database. This subdirectory is the default location -for the database's files; in particular, its system catalogs are stored -there. - - - -Each table and index is stored in a separate file, named after the table -or index's filenode number, which can be found in -pg_class.relfilenode. - - - - -Note that while a table's filenode often matches its OID, this is -not necessarily the case; some operations, like -TRUNCATE, REINDEX, CLUSTER and some forms -of ALTER TABLE, can change the filenode while preserving the OID. -Avoid assuming that filenode and table OID are the same. - - - - -When a table or index exceeds 1Gb, it is divided into gigabyte-sized -segments. The first segment's file name is the same as the -filenode; subsequent segments are named filenode.1, filenode.2, etc. -This arrangement avoids problems on platforms that have file size limitations. -The contents of tables and indexes are discussed further in -. - - - -A table that has columns with potentially large entries will have an -associated TOAST table, which is used for out-of-line storage of -field values that are too large to keep in the table rows proper. -pg_class.reltoastrelid links from a table to -its TOAST table, if any. - - - -Tablespaces make the scenario more complicated. Each user-defined tablespace -has a symbolic link inside the PGDATA/pg_tblspc -directory, which points to the physical tablespace directory (as specified in -its CREATE TABLESPACE command). The symbolic link is named after -the tablespace's OID. Inside the physical tablespace directory there is -a subdirectory for each database that has elements in the tablespace, named -after the database's OID. Tables within that directory follow the filenode -naming scheme. The pg_default tablespace is not accessed through -pg_tblspc, but corresponds to -PGDATA/base. 
Similarly, the pg_global -tablespace is not accessed through pg_tblspc, but corresponds to -PGDATA/global. - - -
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml index 3dc13f44fb..21e8db881b 100644 --- a/doc/src/sgml/filelist.sgml +++ b/doc/src/sgml/filelist.sgml @@ -1,4 +1,4 @@ - + @@ -75,15 +75,14 @@ - - + diff --git a/doc/src/sgml/lobj.sgml b/doc/src/sgml/lobj.sgml index 1c40835db4..82ca839efb 100644 --- a/doc/src/sgml/lobj.sgml +++ b/doc/src/sgml/lobj.sgml @@ -1,5 +1,5 @@ @@ -51,9 +51,11 @@ $PostgreSQL: pgsql/doc/src/sgml/lobj.sgml,v 1.35 2005/01/08 22:13:33 tgl Exp $
- TOAST - sliced breadTOAST - PostgreSQL 7.1 introduced a mechanism + + TOAST + versus large objects + + PostgreSQL 7.1 introduced a mechanism (nicknamed TOAST) that allows data values to be much larger than single pages. This makes the large object facility partially obsolete. One diff --git a/doc/src/sgml/page.sgml b/doc/src/sgml/page.sgml deleted file mode 100644 index 8f2388af6a..0000000000 --- a/doc/src/sgml/page.sgml +++ /dev/null @@ -1,354 +0,0 @@ - - - - -Database Page Layout - - - -A description of the database file page format. - - - - -This section provides an overview of the page format used by -PostgreSQL tables and indexes. - - Actually, index access methods need not use this page format. - All the existing index methods do use this basic format, - but the data kept on index metapages usually doesn't follow - the item layout rules. - - -TOAST tables and sequences are formatted just like a regular table. - - - -In the following explanation, a -byte -is assumed to contain 8 bits. In addition, the term -item -refers to an individual data value that is stored on a page. In a table, -an item is a row; in an index, an item is an index entry. - - - -Every table and index is stored as an array of pages of a -fixed size (usually 8K, although a different page size can be selected -when compiling the server). In a table, all the pages are logically -equivalent, so a particular item (row) can be stored in any page. In -indexes, the first page is generally reserved as a metapage -holding control information, and there may be different types of pages -within the index, depending on the index access method. - - - - shows the overall layout of a page. -There are five parts to each page. - - - -Overall Page Layout -Page Layout - - - - -Item - -Description - - - - - - - PageHeaderData - 20 bytes long. Contains general information about the page, including -free space pointers. - - - -ItemPointerData -Array of (offset,length) pairs pointing to the actual items. 
-4 bytes per item. - - - -Free space -The unallocated space. New item pointers are allocated from the start -of this area, new items from the end. - - - -Items -The actual items themselves. - - - -Special space -Index access method specific data. Different methods store different -data. Empty in ordinary tables. - - - - -
- - - - The first 20 bytes of each page consists of a page header - (PageHeaderData). Its format is detailed in . The first two fields track the most - recent WAL entry related to this page. They are followed by three 2-byte - integer fields - (pd_lower, pd_upper, - and pd_special). These contain byte offsets - from the page start to the start - of unallocated space, to the end of unallocated space, and to the start of - the special space. - The last 2 bytes of the page header, - pd_pagesize_version, store both the page size - and a version indicator. Beginning with - PostgreSQL 8.0 the version number is 2; - PostgreSQL 7.3 and 7.4 used version number 1; - prior releases used version number 0. - (The basic page layout and header format has not changed in these versions, - but the layout of heap row headers has.) The page size - is basically only present as a cross-check; there is no support for having - more than one page size in an installation. - - - - - PageHeaderData Layout - PageHeaderData Layout - - - - Field - Type - Length - Description - - - - - pd_lsn - XLogRecPtr - 8 bytes - LSN: next byte after last byte of xlog record for last change - to this page - - - pd_tli - TimeLineID - 4 bytes - TLI of last change - - - pd_lower - LocationIndex - 2 bytes - Offset to start of free space - - - pd_upper - LocationIndex - 2 bytes - Offset to end of free space - - - pd_special - LocationIndex - 2 bytes - Offset to start of special space - - - pd_pagesize_version - uint16 - 2 bytes - Page size and layout version number information - - - -
- - - All the details may be found in - src/include/storage/bufpage.h. - - - - - Following the page header are item identifiers - (ItemIdData), each requiring four bytes. - An item identifier contains a byte-offset to - the start of an item, its length in bytes, and a few attribute bits - which affect its interpretation. - New item identifiers are allocated - as needed from the beginning of the unallocated space. - The number of item identifiers present can be determined by looking at - pd_lower, which is increased to allocate a new identifier. - Because an item - identifier is never moved until it is freed, its index may be used on a - long-term basis to reference an item, even when the item itself is moved - around on the page to compact free space. In fact, every pointer to an - item (ItemPointer, also known as - CTID) created by - PostgreSQL consists of a page number and the - index of an item identifier. - - - - - - The items themselves are stored in space allocated backwards from the end - of unallocated space. The exact structure varies depending on what the - table is to contain. Tables and sequences both use a structure named - HeapTupleHeaderData, described below. - - - - - - The final section is the special section which may - contain anything the access method wishes to store. For example, - b-tree indexes store links to the page's left and right siblings, - as well as some other data relevant to the index structure. - Ordinary tables do not use a special section at all (indicated by setting - pd_special to equal the page size). - - - - - - All table rows are structured in the same way. There is a fixed-size - header (occupying 27 bytes on most machines), followed by an optional null - bitmap, an optional object ID field, and the user data. The header is - detailed - in . The actual user data - (columns of the row) begins at the offset indicated by - t_hoff, which must always be a multiple of the MAXALIGN - distance for the platform. 
- The null bitmap is - only present if the HEAP_HASNULL bit is set in - t_infomask. If it is present it begins just after - the fixed header and occupies enough bytes to have one bit per data column - (that is, t_natts bits altogether). In this list of bits, a - 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not - present, all columns are assumed not-null. - The object ID is only present if the HEAP_HASOID bit - is set in t_infomask. If present, it appears just - before the t_hoff boundary. Any padding needed to make - t_hoff a MAXALIGN multiple will appear between the null - bitmap and the object ID. (This in turn ensures that the object ID is - suitably aligned.) - - - - - HeapTupleHeaderData Layout - HeapTupleHeaderData Layout - - - - Field - Type - Length - Description - - - - - t_xmin - TransactionId - 4 bytes - insert XID stamp - - - t_cmin - CommandId - 4 bytes - insert CID stamp - - - t_xmax - TransactionId - 4 bytes - delete XID stamp - - - t_cmax - CommandId - 4 bytes - delete CID stamp (overlays with t_xvac) - - - t_xvac - TransactionId - 4 bytes - XID for VACUUM operation moving a row version - - - t_ctid - ItemPointerData - 6 bytes - current TID of this or newer row version - - - t_natts - int16 - 2 bytes - number of attributes - - - t_infomask - uint16 - 2 bytes - various flag bits - - - t_hoff - uint8 - 1 byte - offset to user data - - - -
- - - All the details may be found in - src/include/access/htup.h. - - - - - Interpreting the actual data can only be done with information obtained - from other tables, mostly pg_attribute. The - key values needed to identify field locations are - attlen and attalign. - There is no way to directly get a - particular attribute, except when there are only fixed width fields and no - NULLs. All this trickery is wrapped up in the functions - heap_getattr, fastgetattr - and heap_getsysattr. - - - - - To read the data you need to examine each attribute in turn. First check - whether the field is NULL according to the null bitmap. If it is, go to - the next. Then make sure you have the right alignment. If the field is a - fixed width field, then all the bytes are simply placed. If it's a - variable length field (attlen = -1) then it's a bit more complicated. - All variable-length datatypes share the common header structure - varattrib, which includes the total length of the stored - value and some flag bits. Depending on the flags, the data may be either - inline or in another table (TOAST); it might be compressed, too. - - -
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml index 36d58f4d89..8ec6262226 100644 --- a/doc/src/sgml/postgres.sgml +++ b/doc/src/sgml/postgres.sgml @@ -1,5 +1,5 @@ diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 91793e0362..477ccf99f0 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -1,5 +1,5 @@ @@ -153,10 +153,14 @@ where action is one of: inline, uncompressed. MAIN is for inline, compressible data. EXTERNAL is for external, uncompressed data, and EXTENDED is for external, - compressed data. EXTENDED is the default for all - data types that support it. Use of EXTERNAL will + compressed data. EXTENDED is the default for most + data types that support non-PLAIN storage. + Use of EXTERNAL will make substring operations on text and bytea - columns faster, at the penalty of increased storage space. + columns faster, at the penalty of increased storage space. Note that + SET STORAGE doesn't itself change anything in the table, + it just sets the strategy to be pursued during future table updates. + See for more information.
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml new file mode 100644 index 0000000000..118d6604d2 --- /dev/null +++ b/doc/src/sgml/storage.sgml @@ -0,0 +1,676 @@ + + + + +Database Physical Storage + + +This chapter provides an overview of the physical storage format used by +PostgreSQL databases. + + + + +Database File Layout + + +This section describes the storage format at the level of files and +directories. + + + +All the data needed for a database cluster is stored within the cluster's data +directory, commonly referred to as PGDATA (after the name of the +environment variable that can be used to define it). A common location for +PGDATA is /var/lib/pgsql/data. Multiple clusters, +managed by different postmasters, can exist on the same machine. + + + +The PGDATA directory contains several subdirectories and control +files, as shown in . In addition to +these required items, the cluster configuration files +postgresql.conf, pg_hba.conf, and +pg_ident.conf are traditionally stored in +PGDATA (although beginning in +PostgreSQL 8.0 it is possible to keep them +elsewhere). + + + +Contents of <varname>PGDATA</> + + + + +Item + +Description + + + + + + + PG_VERSION + A file containing the major version number of PostgreSQL + + + + base + Subdirectory containing per-database subdirectories + + + + global + Subdirectory containing cluster-wide tables, such as + pg_database + + + + pg_clog + Subdirectory containing transaction commit status data + + + + pg_subtrans + Subdirectory containing subtransaction status data + + + + pg_tblspc + Subdirectory containing symbolic links to tablespaces + + + + pg_xlog + Subdirectory containing WAL (Write Ahead Log) files + + + + postmaster.opts + A file recording the command-line options the postmaster was +last started with + + + + postmaster.pid + A lock file recording the current postmaster PID and shared memory +segment ID (not present after postmaster shutdown) + + + + +
+ + +For each database in the cluster there is a subdirectory within +PGDATA/base, named after the database's OID in +pg_database. This subdirectory is the default location +for the database's files; in particular, its system catalogs are stored +there. + + + +Each table and index is stored in a separate file, named after the table +or index's filenode number, which can be found in +pg_class.relfilenode. + + + + +Note that while a table's filenode often matches its OID, this is +not necessarily the case; some operations, like +TRUNCATE, REINDEX, CLUSTER and some forms +of ALTER TABLE, can change the filenode while preserving the OID. +Avoid assuming that filenode and table OID are the same. + + + + +When a table or index exceeds 1Gb, it is divided into gigabyte-sized +segments. The first segment's file name is the same as the +filenode; subsequent segments are named filenode.1, filenode.2, etc. +This arrangement avoids problems on platforms that have file size limitations. +The contents of tables and indexes are discussed further in +. + + + +A table that has columns with potentially large entries will have an +associated TOAST table, which is used for out-of-line storage of +field values that are too large to keep in the table rows proper. +pg_class.reltoastrelid links from a table to +its TOAST table, if any. +See for more information. + + + +Tablespaces make the scenario more complicated. Each user-defined tablespace +has a symbolic link inside the PGDATA/pg_tblspc +directory, which points to the physical tablespace directory (as specified in +its CREATE TABLESPACE command). The symbolic link is named after +the tablespace's OID. Inside the physical tablespace directory there is +a subdirectory for each database that has elements in the tablespace, named +after the database's OID. Tables within that directory follow the filenode +naming scheme. The pg_default tablespace is not accessed through +pg_tblspc, but corresponds to +PGDATA/base. 
Similarly, the pg_global +tablespace is not accessed through pg_tblspc, but corresponds to +PGDATA/global. + + +
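The file naming rules above can be sketched in C. This is only an illustration under stated assumptions, not code from the PostgreSQL sources: the helper names segment_number and relation_segment_path are hypothetical, only the default tablespace (PGDATA/base) is covered, and the segment size is taken to be exactly 2^30 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: "gigabyte-sized segments" means exactly 2^30 bytes per file. */
#define SEGMENT_BYTES ((uint64_t) 1 << 30)

/* Hypothetical helper: index of the segment file holding a byte offset. */
unsigned
segment_number(uint64_t byte_offset)
{
    return (unsigned) (byte_offset / SEGMENT_BYTES);
}

/*
 * Hypothetical helper: build the path, relative to PGDATA, of one segment
 * of a table or index stored in the default tablespace.  The first segment
 * is named after the filenode alone; later ones get a ".N" suffix.
 * Returns the length of the formatted path, as snprintf does.
 */
int
relation_segment_path(char *buf, size_t buflen,
                      unsigned db_oid, unsigned filenode, unsigned segno)
{
    assert(buf != NULL && buflen > 0);
    if (segno == 0)
        return snprintf(buf, buflen, "base/%u/%u", db_oid, filenode);
    return snprintf(buf, buflen, "base/%u/%u.%u", db_oid, filenode, segno);
}
```

With made-up OIDs 17229 (database) and 17231 (filenode), segment 2 would be named base/17229/17231.2. The filenode should be read from pg_class.relfilenode rather than assumed equal to the table OID.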
+ + + +TOAST + + + TOAST + + sliced bread (see TOAST) + + +This section provides an overview of TOAST (The +Oversized-Attribute Storage Technique). + + + +Since PostgreSQL uses a fixed page size (commonly +8Kb), and does not allow tuples to span multiple pages, it's not possible to +store very large field values directly. Before PostgreSQL 7.1 +there was a hard limit of just under one page on the total amount of data that +could be put into a table row. In release 7.1 and later, this limit is +overcome by allowing large field values to be compressed and/or broken up into +multiple physical rows. This happens transparently to the user, with only +small impact on most of the backend code. The technique is affectionately +known as TOAST (or the best thing since sliced bread). + + + +Only certain data types support TOAST; there is no need to +impose the overhead on data types that cannot produce large field values. +To support TOAST, a data type must have a variable-length +(varlena) representation, in which the first 32-bit word of any +stored value contains the total length of the value in bytes (including +itself). TOAST does not constrain the rest of the representation. +All the C-level functions supporting a TOAST-able data type must +be careful to handle TOASTed input values. (This is normally done +by invoking PG_DETOAST_DATUM before doing anything with an input +value; but in some cases more efficient approaches are possible.) + + + +TOAST usurps the high-order two bits of the varlena length word, +thereby limiting the logical size of any value of a TOAST-able +data type to 1Gb (2^30 - 1 bytes). When both bits are zero, +the value is an ordinary un-TOASTed value of the data type. One +of these bits, if set, indicates that the value has been compressed and must +be decompressed before use. The other bit, if set, indicates that the value +has been stored out-of-line.
In this case the remainder of the value is +actually just a pointer, and the correct data has to be found elsewhere. When +both bits are set, the out-of-line data has been compressed too. In each case +the length in the low-order bits of the varlena word indicates the actual size +of the datum, not the size of the logical value that would be extracted by +decompression or fetching of the out-of-line data. + + + +If any of the columns of a table are TOAST-able, the table will +have an associated TOAST table, whose OID is stored in the table's +pg_class.reltoastrelid entry. Out-of-line +TOASTed values are kept in the TOAST table, as +described in more detail below. + + + +The compression technique used is a fairly simple and very fast member +of the LZ family of compression techniques. See +src/backend/utils/adt/pg_lzcompress.c for the details. + + + +Out-of-line values are divided (after compression if used) into chunks of at +most TOAST_MAX_CHUNK_SIZE bytes (this value is a little less than +BLCKSZ/4, or about 2000 bytes by default). Each chunk is stored +as a separate row in the TOAST table for the owning table. Every +TOAST table has the columns chunk_id (an OID +identifying the particular TOASTed value), +chunk_seq (a sequence number for the chunk within its value), +and chunk_data (the actual data of the chunk). A unique index +on chunk_id and chunk_seq provides fast +retrieval of the values. A pointer datum representing an out-of-line +TOASTed value therefore needs to store the OID of the +TOAST table in which to look and the OID of the specific value +(its chunk_id). For convenience, pointer datums also store the +logical datum size (original uncompressed data length) and actual stored size +(different if compression was applied). Allowing for the varlena header word, +the total size of a TOAST pointer datum is therefore 20 bytes +regardless of the actual size of the represented value. 
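The length-word flags, the chunking rule, and the 20-byte pointer datum described above can be illustrated with a small C sketch. Several details here are assumptions rather than facts taken from this text: the text does not say which of the two high-order bits means "compressed" and which means "out-of-line", so the mask assignment is a guess; the field order of the pointer struct is illustrative; and all type and function names are hypothetical. The chunk size is passed in as a parameter because the exact value of TOAST_MAX_CHUNK_SIZE is not given.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit assignment for the two high-order flag bits. */
#define HDR_FLAG_EXTERNAL   UINT32_C(0x80000000)
#define HDR_FLAG_COMPRESSED UINT32_C(0x40000000)
#define HDR_SIZE_MASK       UINT32_C(0x3FFFFFFF)   /* low-order 30 bits */

typedef struct
{
    int      external;     /* nonzero: value is stored out-of-line */
    int      compressed;   /* nonzero: value (or its chunks) is compressed */
    uint32_t stored_size;  /* size of the datum as stored, in bytes */
} header_info;

/* Split a varlena length word into its flags and its 30-bit size. */
header_info
decode_header(uint32_t word)
{
    header_info info;

    info.external = (word & HDR_FLAG_EXTERNAL) != 0;
    info.compressed = (word & HDR_FLAG_COMPRESSED) != 0;
    info.stored_size = word & HDR_SIZE_MASK;
    return info;
}

/* Illustrative layout only: five 4-byte fields make up the 20 bytes. */
typedef struct
{
    uint32_t header;         /* varlena length word, flag bits included */
    uint32_t raw_size;       /* logical (uncompressed) size */
    uint32_t stored_size;    /* size as stored in the TOAST table */
    uint32_t value_oid;      /* chunk_id of the value */
    uint32_t toast_rel_oid;  /* OID of the TOAST table to look in */
} toast_pointer_sketch;

/* Number of TOAST rows needed for a value of the given stored size. */
uint32_t
chunk_count(uint32_t stored_size, uint32_t max_chunk_size)
{
    assert(max_chunk_size > 0);
    return stored_size / max_chunk_size
        + (stored_size % max_chunk_size != 0);
}
```

Under these assumptions, a 5000-byte stored value with a 2000-byte chunk limit occupies three rows in the TOAST table, while the main table row holds only the 20-byte pointer.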
+ + + +The TOAST code is triggered only +when a row value to be stored in a table is wider than BLCKSZ/4 +bytes (normally 2Kb). The TOAST code will compress and/or move +field values out-of-line until the row value is shorter than +BLCKSZ/4 bytes or no more gains can be had. During an UPDATE +operation, values of unchanged fields are normally preserved as-is; so an +UPDATE of a row with out-of-line values incurs no TOAST costs if +none of the out-of-line values change. + + + +The TOAST code recognizes four different strategies for storing +TOAST-able columns: + + + + + PLAIN prevents either compression or + out-of-line storage. This is the only possible strategy for + columns of non-TOAST-able data types. + + + + + EXTENDED allows both compression and out-of-line + storage. This is the default for most TOAST-able data types. + Compression will be attempted first, then out-of-line storage if + the row is still too big. + + + + + EXTERNAL allows out-of-line storage but not + compression. Use of EXTERNAL will + make substring operations on wide text and + bytea columns faster (at the penalty of increased storage + space) because these operations are optimized to fetch only the + required parts of the out-of-line value when it is not compressed. + + + + + MAIN allows compression but not out-of-line + storage. (Actually, out-of-line storage will still be performed + for such columns, but only as a last resort when there is no other + way to make the row small enough.) + + + + +Each TOAST-able data type specifies a default strategy for columns +of that data type, but the strategy for a given table column can be altered +with ALTER TABLE SET STORAGE. + + + +This scheme has a number of advantages compared to a more straightforward +approach such as allowing row values to span pages. Assuming that queries are +usually qualified by comparisons against relatively small key values, most of +the work of the executor will be done using the main row entry. 
The big values +of TOASTed attributes will only be pulled out (if selected at all) +at the time the result set is sent to the client. Thus, the main table is much +smaller and more of its rows fit in the shared buffer cache than would be the +case without any out-of-line storage. Sort sets shrink also, and sorts will +more often be done entirely in memory. A little test showed that a table +containing typical HTML pages and their URLs was stored in about half of the +raw data size including the TOAST table, and that the main table +contained only about 10% of the entire data (the URLs and some small HTML +pages). There was no runtime difference compared to an un-TOASTed +comparison table, in which all the HTML pages were cut down to 7Kb to fit. + + + + + + +Database Page Layout + + +This section provides an overview of the page format used within +PostgreSQL tables and indexes. + + Actually, index access methods need not use this page format. + All the existing index methods do use this basic format, + but the data kept on index metapages usually doesn't follow + the item layout rules. + + +Sequences and TOAST tables are formatted just like a regular table. + + + +In the following explanation, a +byte +is assumed to contain 8 bits. In addition, the term +item +refers to an individual data value that is stored on a page. In a table, +an item is a row; in an index, an item is an index entry. + + + +Every table and index is stored as an array of pages of a +fixed size (usually 8Kb, although a different page size can be selected +when compiling the server). In a table, all the pages are logically +equivalent, so a particular item (row) can be stored in any page. In +indexes, the first page is generally reserved as a metapage +holding control information, and there may be different types of pages +within the index, depending on the index access method. + + + + shows the overall layout of a page. +There are five parts to each page. 
+ + + +Overall Page Layout +Page Layout + + + + +Item + +Description + + + + + + + PageHeaderData + 20 bytes long. Contains general information about the page, including +free space pointers. + + + +ItemPointerData +Array of (offset,length) pairs pointing to the actual items. +4 bytes per item. + + + +Free space +The unallocated space. New item pointers are allocated from the start +of this area, new items from the end. + + + +Items +The actual items themselves. + + + +Special space +Index access method specific data. Different methods store different +data. Empty in ordinary tables. + + + + +
+ + + + The first 20 bytes of each page consists of a page header + (PageHeaderData). Its format is detailed in . The first two fields track the most + recent WAL entry related to this page. They are followed by three 2-byte + integer fields + (pd_lower, pd_upper, + and pd_special). These contain byte offsets + from the page start to the start + of unallocated space, to the end of unallocated space, and to the start of + the special space. + The last 2 bytes of the page header, + pd_pagesize_version, store both the page size + and a version indicator. Beginning with + PostgreSQL 8.0 the version number is 2; + PostgreSQL 7.3 and 7.4 used version number 1; + prior releases used version number 0. + (The basic page layout and header format has not changed in these versions, + but the layout of heap row headers has.) The page size + is basically only present as a cross-check; there is no support for having + more than one page size in an installation. + + + + + PageHeaderData Layout + PageHeaderData Layout + + + + Field + Type + Length + Description + + + + + pd_lsn + XLogRecPtr + 8 bytes + LSN: next byte after last byte of xlog record for last change + to this page + + + pd_tli + TimeLineID + 4 bytes + TLI of last change + + + pd_lower + LocationIndex + 2 bytes + Offset to start of free space + + + pd_upper + LocationIndex + 2 bytes + Offset to end of free space + + + pd_special + LocationIndex + 2 bytes + Offset to start of special space + + + pd_pagesize_version + uint16 + 2 bytes + Page size and layout version number information + + + +
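The arithmetic implied by the three offset fields can be sketched in C. This is a simplified illustration and not the declaration in bufpage.h: the struct below carries only the three offsets, the function names are hypothetical, and the constants come from the sizes quoted in this section (a 20-byte header and 4-byte item identifiers).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_HEADER_BYTES 20   /* size of the page header, per the text */
#define ITEM_ID_BYTES      4   /* size of one item identifier, per the text */

/* Simplified stand-in for the offset fields of the page header. */
typedef struct
{
    uint16_t pd_lower;    /* offset to start of free space */
    uint16_t pd_upper;    /* offset to end of free space */
    uint16_t pd_special;  /* offset to start of special space */
} page_offsets;

/* Item identifiers fill the gap between the header and pd_lower. */
unsigned
item_id_count(const page_offsets *p)
{
    assert(p->pd_lower >= PAGE_HEADER_BYTES);
    return (unsigned) (p->pd_lower - PAGE_HEADER_BYTES) / ITEM_ID_BYTES;
}

/* Unallocated space lies between pd_lower and pd_upper. */
unsigned
free_bytes(const page_offsets *p)
{
    assert(p->pd_upper >= p->pd_lower);
    return (unsigned) (p->pd_upper - p->pd_lower);
}

/* A page has no special space when pd_special equals the page size. */
int
has_special_space(const page_offsets *p, unsigned page_size)
{
    return p->pd_special < page_size;
}
```

For an 8192-byte table page holding three items, pd_lower would be 20 + 3 * 4 = 32, and pd_special would be 8192 because ordinary tables use no special space.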
+ + + All the details may be found in + src/include/storage/bufpage.h. + + + + + Following the page header are item identifiers + (ItemIdData), each requiring four bytes. + An item identifier contains a byte-offset to + the start of an item, its length in bytes, and a few attribute bits + which affect its interpretation. + New item identifiers are allocated + as needed from the beginning of the unallocated space. + The number of item identifiers present can be determined by looking at + pd_lower, which is increased to allocate a new identifier. + Because an item + identifier is never moved until it is freed, its index may be used on a + long-term basis to reference an item, even when the item itself is moved + around on the page to compact free space. In fact, every pointer to an + item (ItemPointer, also known as + CTID) created by + PostgreSQL consists of a page number and the + index of an item identifier. + + + + + + The items themselves are stored in space allocated backwards from the end + of unallocated space. The exact structure varies depending on what the + table is to contain. Tables and sequences both use a structure named + HeapTupleHeaderData, described below. + + + + + + The final section is the special section which may + contain anything the access method wishes to store. For example, + b-tree indexes store links to the page's left and right siblings, + as well as some other data relevant to the index structure. + Ordinary tables do not use a special section at all (indicated by setting + pd_special to equal the page size). + + + + + + All table rows are structured in the same way. There is a fixed-size + header (occupying 27 bytes on most machines), followed by an optional null + bitmap, an optional object ID field, and the user data. The header is + detailed + in . The actual user data + (columns of the row) begins at the offset indicated by + t_hoff, which must always be a multiple of the MAXALIGN + distance for the platform. 
+ The null bitmap is + only present if the HEAP_HASNULL bit is set in + t_infomask. If it is present it begins just after + the fixed header and occupies enough bytes to have one bit per data column + (that is, t_natts bits altogether). In this list of bits, a + 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not + present, all columns are assumed not-null. + The object ID is only present if the HEAP_HASOID bit + is set in t_infomask. If present, it appears just + before the t_hoff boundary. Any padding needed to make + t_hoff a MAXALIGN multiple will appear between the null + bitmap and the object ID. (This in turn ensures that the object ID is + suitably aligned.) + + + + + HeapTupleHeaderData Layout + HeapTupleHeaderData Layout + + + + Field + Type + Length + Description + + + + + t_xmin + TransactionId + 4 bytes + insert XID stamp + + + t_cmin + CommandId + 4 bytes + insert CID stamp + + + t_xmax + TransactionId + 4 bytes + delete XID stamp + + + t_cmax + CommandId + 4 bytes + delete CID stamp (overlays with t_xvac) + + + t_xvac + TransactionId + 4 bytes + XID for VACUUM operation moving a row version + + + t_ctid + ItemPointerData + 6 bytes + current TID of this or newer row version + + + t_natts + int16 + 2 bytes + number of attributes + + + t_infomask + uint16 + 2 bytes + various flag bits + + + t_hoff + uint8 + 1 byte + offset to user data + + + +
+ + + All the details may be found in + src/include/access/htup.h. + + + + + Interpreting the actual data can only be done with information obtained + from other tables, mostly pg_attribute. The + key values needed to identify field locations are + attlen and attalign. + There is no way to directly get a + particular attribute, except when there are only fixed width fields and no + NULLs. All this trickery is wrapped up in the functions + heap_getattr, fastgetattr + and heap_getsysattr. + + + + + To read the data you need to examine each attribute in turn. First check + whether the field is NULL according to the null bitmap. If it is, go to + the next. Then make sure you have the right alignment. If the field is a + fixed width field, then all the bytes are simply placed. If it's a + variable length field (attlen = -1) then it's a bit more complicated. + All variable-length datatypes share the common header structure + varattrib, which includes the total length of the stored + value and some flag bits. Depending on the flags, the data may be either + inline or in a TOAST table; + it might be compressed, too (see ). + + +
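The row header rules can be sketched in C as follows. This is an illustration under assumptions, not the code in htup.h: the function names are hypothetical, the 27-byte fixed header size is the "most machines" figure quoted above, MAXALIGN is passed in as a parameter and assumed to be a power of two, and the bit order within each null bitmap byte (least significant bit first) is an assumption that this text does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIXED_HEADER_BYTES 27  /* fixed header size on most machines */
#define OID_BYTES           4  /* size of the optional object ID */

/* One bit per column, rounded up to whole bytes. */
size_t
null_bitmap_bytes(unsigned natts)
{
    return ((size_t) natts + 7) / 8;
}

/* Round len up to the next multiple of maxalign (a power of two). */
size_t
align_up(size_t len, size_t maxalign)
{
    assert(maxalign > 0 && (maxalign & (maxalign - 1)) == 0);
    return (len + maxalign - 1) & ~(maxalign - 1);
}

/* Offset at which user data starts, i.e. the value stored in t_hoff. */
size_t
user_data_offset(unsigned natts, int has_nulls, int has_oid, size_t maxalign)
{
    size_t len = FIXED_HEADER_BYTES;

    if (has_nulls)
        len += null_bitmap_bytes(natts);
    if (has_oid)
        len += OID_BYTES;
    return align_up(len, maxalign);
}

/*
 * A 1 bit means not-null and a 0 bit means null.  Assumption: column 0
 * maps to the least significant bit of the first bitmap byte.
 */
int
column_is_null(const uint8_t *bitmap, unsigned col)
{
    return !(bitmap[col >> 3] & (1u << (col & 7)));
}
```

With MAXALIGN of 4, a three-column row with no nulls and no object ID starts its user data at offset 28; adding a one-byte null bitmap and an object ID brings the unpadded length to exactly 32, so no padding is needed in that case.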
+ +
diff --git a/doc/src/sgml/xtypes.sgml b/doc/src/sgml/xtypes.sgml index 6061571c62..22d11a6300 100644 --- a/doc/src/sgml/xtypes.sgml +++ b/doc/src/sgml/xtypes.sgml @@ -1,5 +1,5 @@ @@ -232,10 +232,14 @@ CREATE TYPE complex ( + + TOAST + and user-defined types + If the values of your data type might exceed a few hundred bytes in size (in internal form), you should make the data type - TOAST-able.TOASTand - user-defined types To do this, the internal + TOAST-able (see ). + To do this, the internal representation must follow the standard layout for variable-length data: the first four bytes must be an int32 containing the total length in bytes of the datum (including itself). The C