Copy-editing for recent documentation changes relevant to WAL, full_page_writes, etc.
Tom Lane 2005-10-22 21:56:07 +00:00
parent 6d6c3722fb
commit f72a342fb7
3 changed files with 132 additions and 146 deletions


@ -1,5 +1,5 @@
<!-- <!--
$PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.71 2005/10/15 01:15:33 alvherre Exp $ $PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.72 2005/10/22 21:56:07 tgl Exp $
--> -->
<chapter id="backup"> <chapter id="backup">
<title>Backup and Restore</title> <title>Backup and Restore</title>
@ -1148,21 +1148,20 @@ restore_command = 'copy /mnt/server/archivedir/%f "%p"' # Windows
<para> <para>
It should also be noted that the default <acronym>WAL</acronym> It should also be noted that the default <acronym>WAL</acronym>
format is fairly bulky since it includes many disk page snapshots. The pages format is fairly bulky since it includes many disk page snapshots.
are partially compressed, using the simple expedient of removing the These page snapshots are designed to support crash recovery,
empty space (if any) within each block. You can significantly reduce since we may need to fix partially-written disk pages. Depending
on your system hardware and software, the risk of partial writes may
be small enough to ignore, in which case you can significantly reduce
the total volume of archived logs by turning off page snapshots the total volume of archived logs by turning off page snapshots
using the <xref linkend="guc-full-page-writes"> parameter, using the <xref linkend="guc-full-page-writes"> parameter.
though you should read the notes and warnings in (Read the notes and warnings in
<xref linkend="reliability"> before you do so. <xref linkend="reliability"> before you do so.)
These page snapshots are designed to allow crash recovery, Turning off page snapshots does not prevent use of the logs for PITR
since we may need to fix partially-written disk pages. It is not operations.
necessary to store these page copies for PITR operations, however.
If you turn off <xref linkend="guc-full-page-writes">, your PITR
backup and recovery operations will continue to work successfully.
An area for future development is to compress archived WAL data by An area for future development is to compress archived WAL data by
removing unnecessary page copies when <xref linkend="guc-full-page-writes"> removing unnecessary page copies even when <varname>full_page_writes</>
is turned on. In the meantime, administrators is on. In the meantime, administrators
may wish to reduce the number of page snapshots included in WAL by may wish to reduce the number of page snapshots included in WAL by
increasing the checkpoint interval parameters as much as feasible. increasing the checkpoint interval parameters as much as feasible.
</para> </para>
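
Reviewer note: the paragraph above says that increasing the checkpoint interval reduces the number of page snapshots in WAL. The toy model below is a minimal sketch of why, not PostgreSQL code. The function name `wal_bytes`, the 100-byte row-record size, and the idea of a checkpoint after a fixed number of modifications are all assumptions made for illustration; only the 8192-byte page size comes from the documentation.

```python
PAGE_SIZE = 8192   # page size mentioned in the WAL documentation
ROW_RECORD = 100   # ASSUMED size of a row-level WAL record, for illustration only


def wal_bytes(page_writes, checkpoint_every, full_page_writes=True):
    """Estimate WAL volume for a sequence of page modifications (toy model).

    page_writes:      iterable of page numbers, one entry per modification
    checkpoint_every: a checkpoint happens after this many modifications
    """
    imaged = set()  # pages already logged in full since the last checkpoint
    total = 0
    for i, page in enumerate(page_writes):
        if i > 0 and i % checkpoint_every == 0:
            imaged.clear()  # checkpoint: the next change to any page is "first" again
        if full_page_writes and page not in imaged:
            total += PAGE_SIZE  # first change after a checkpoint logs the whole page
            imaged.add(page)
        else:
            total += ROW_RECORD  # later changes log only the row-level data
    return total


# 100 modifications cycling over 4 pages
workload = [0, 1, 2, 3] * 25
frequent = wal_bytes(workload, checkpoint_every=10)
rare = wal_bytes(workload, checkpoint_every=50)
no_images = wal_bytes(workload, checkpoint_every=10, full_page_writes=False)
```

In this model, less frequent checkpoints mean fewer "first modifications", so fewer full page images are logged; turning page snapshots off removes them entirely, at the crash-safety cost described in the Reliability chapter.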


@ -1,5 +1,5 @@
<!-- <!--
$PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.31 2005/10/15 20:12:32 neilc Exp $ $PostgreSQL: pgsql/doc/src/sgml/config.sgml,v 1.32 2005/10/22 21:56:07 tgl Exp $
--> -->
<chapter Id="runtime-config"> <chapter Id="runtime-config">
<title>Run-time Configuration</title> <title>Run-time Configuration</title>
@ -1251,14 +1251,15 @@ SET ENABLE_SEQSCAN TO OFF;
<listitem> <listitem>
<para> <para>
If this option is on, the <productname>PostgreSQL</> server If this option is on, the <productname>PostgreSQL</> server
will use the <function>fsync()</> system call in several places will try to make sure that updates are physically written to
to make sure that updates are physically written to disk. This disk, by issuing <function>fsync()</> system calls or various
insures that a database cluster will recover to a equivalent methods (see <xref linkend="guc-wal-sync-method">).
This ensures that the database cluster can recover to a
consistent state after an operating system or hardware crash. consistent state after an operating system or hardware crash.
</para> </para>
<para> <para>
However, using <function>fsync()</function> results in a However, using <varname>fsync</varname> results in a
performance penalty: when a transaction is committed, performance penalty: when a transaction is committed,
<productname>PostgreSQL</productname> must wait for the <productname>PostgreSQL</productname> must wait for the
operating system to flush the write-ahead log to disk. When operating system to flush the write-ahead log to disk. When
@ -1268,7 +1269,7 @@ SET ENABLE_SEQSCAN TO OFF;
However, if the system crashes, the results of the last few However, if the system crashes, the results of the last few
committed transactions may be lost in part or whole. In the committed transactions may be lost in part or whole. In the
worst case, unrecoverable data corruption may occur. worst case, unrecoverable data corruption may occur.
(Crashes of the database server itself are <emphasis>not</> (Crashes of the database software itself are <emphasis>not</>
a risk factor here. Only an operating-system-level crash a risk factor here. Only an operating-system-level crash
creates a risk of corruption.) creates a risk of corruption.)
</para> </para>
@ -1277,8 +1278,8 @@ SET ENABLE_SEQSCAN TO OFF;
Due to the risks involved, there is no universally correct Due to the risks involved, there is no universally correct
setting for <varname>fsync</varname>. Some administrators setting for <varname>fsync</varname>. Some administrators
always disable <varname>fsync</varname>, while others only always disable <varname>fsync</varname>, while others only
turn it off for bulk loads, where there is a clear restart turn it off during initial bulk data loads, where there is a clear
point if something goes wrong, whereas some administrators restart point if something goes wrong. Others
always leave <varname>fsync</varname> enabled. The default is always leave <varname>fsync</varname> enabled. The default is
to enable <varname>fsync</varname>, for maximum reliability. to enable <varname>fsync</varname>, for maximum reliability.
If you trust your operating system, your hardware, and your If you trust your operating system, your hardware, and your
@ -1288,9 +1289,9 @@ SET ENABLE_SEQSCAN TO OFF;
<para> <para>
This option can only be set at server start or in the This option can only be set at server start or in the
<filename>postgresql.conf</filename> file. If this option <filename>postgresql.conf</filename> file. If you turn
is <literal>off</>, consider also turning off this option off, also consider turning off
<varname>guc-full-page-writes</>. <xref linkend="guc-full-page-writes">.
</para> </para>
</listitem> </listitem>
</varlistentry> </varlistentry>
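
Reviewer note: the `fsync` entry above describes forcing updates out of the operating system's cache at commit time. The sketch below illustrates that general pattern with Python's standard `os.fsync()`; it is not how the server itself is implemented, and the helper name `durable_append` and the file name `wal.log` are hypothetical.

```python
import os
import tempfile


def durable_append(path, record):
    """Append a record and return only after the OS reports it flushed.

    Without the fsync call, the write may sit in the operating system's
    buffer cache and be lost in an OS crash or power failure.
    """
    with open(path, "ab") as f:
        f.write(record)
        f.flush()              # move data from Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push it to the storage device


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "wal.log")  # hypothetical file
    durable_append(demo_path, b"commit 1\n")
    print(os.path.getsize(demo_path))
```

Note that `os.fsync()` only covers the operating system's cache. As the Reliability chapter explains, write-back caches in the controller or the drive can still lose data unless they are battery-backed or disabled.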
@ -1302,8 +1303,10 @@ SET ENABLE_SEQSCAN TO OFF;
</indexterm> </indexterm>
<listitem> <listitem>
<para> <para>
Method used for forcing WAL updates out to disk. Possible Method used for forcing WAL updates out to disk.
values are: If <varname>fsync</varname> is off then this setting is irrelevant,
since updates will not be forced out at all.
Possible values are:
</para> </para>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
@ -1313,7 +1316,12 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem> </listitem>
<listitem> <listitem>
<para> <para>
<literal>fdatasync</> (call <function>fdatasync()</> at each commit), <literal>fdatasync</> (call <function>fdatasync()</> at each commit)
</para>
</listitem>
<listitem>
<para>
<literal>fsync_writethrough</> (call <function>fsync()</> at each commit, forcing write-through of any disk write cache)
</para> </para>
</listitem> </listitem>
<listitem> <listitem>
@ -1322,11 +1330,6 @@ SET ENABLE_SEQSCAN TO OFF;
</para> </para>
</listitem> </listitem>
<listitem> <listitem>
<para>
<literal>fsync_writethrough</> (force write-through of any disk write cache)
</para>
</listitem>
<listitem>
<para> <para>
<literal>open_sync</> (write WAL files with <function>open()</> option <symbol>O_SYNC</>) <literal>open_sync</> (write WAL files with <function>open()</> option <symbol>O_SYNC</>)
</para> </para>
@ -1334,8 +1337,7 @@ SET ENABLE_SEQSCAN TO OFF;
</itemizedlist> </itemizedlist>
<para> <para>
Not all of these choices are available on all platforms. Not all of these choices are available on all platforms.
The top-most supported option is used as the default. The default is the first method in the above list that is supported.
If <varname>fsync</varname> is off then this setting is irrelevant.
This option can only be set at server start or in the This option can only be set at server start or in the
<filename>postgresql.conf</filename> file. <filename>postgresql.conf</filename> file.
</para> </para>
@ -1349,21 +1351,37 @@ SET ENABLE_SEQSCAN TO OFF;
<term><varname>full_page_writes</varname> (<type>boolean</type>)</term> <term><varname>full_page_writes</varname> (<type>boolean</type>)</term>
<listitem> <listitem>
<para> <para>
A page write in process during an operating system crash might When this option is on, the <productname>PostgreSQL</> server
be only partially written to disk, leading to an on-disk page writes the entire content of each disk page to WAL during the
that contains a mix of old and new data. During recovery, the first modification of that page after a checkpoint.
row changes stored in WAL are not enough to completely restore This is needed because
the page. a page write that is in process during an operating system crash might
be only partially completed, leading to an on-disk page
that contains a mix of old and new data. The row-level change data
normally stored in WAL will not be enough to completely restore
such a page during post-crash recovery. Storing the full page image
guarantees that the page can be correctly restored, but at a price
in increasing the amount of data that must be written to WAL.
(Because WAL replay always starts from a checkpoint, it is sufficient
to do this during the first change of each page after a checkpoint.
Therefore, one way to reduce the cost of full-page writes is to
increase the checkpoint interval parameters.)
</para> </para>
<para> <para>
When this option is on, the <productname>PostgreSQL</> server Turning this option off speeds normal operation, but
writes full pages to WAL when they are first modified after a might lead to a corrupt database after an operating system crash
checkpoint so crash recovery is possible. Turning this option off or power failure. The risks are similar to turning off
might lead to a corrupt system after an operating system crash <varname>fsync</>, though smaller. It may be safe to turn off
or power failure because uncorrected partial pages might contain this option if you have hardware (such as a battery-backed disk
inconsistent or corrupt data. The risks are less but similar to controller) or filesystem software (e.g., Reiser4) that reduces
<varname>fsync</>. the risk of partial page writes to an acceptably low level.
</para>
<para>
Turning off this option does not affect use of
WAL archiving for point-in-time recovery (PITR)
(see <xref linkend="backup-online">).
</para> </para>
<para> <para>
@ -1384,7 +1402,7 @@ SET ENABLE_SEQSCAN TO OFF;
Number of disk-page buffers allocated in shared memory for WAL data. Number of disk-page buffers allocated in shared memory for WAL data.
The default is 8. The setting need only be large enough to hold The default is 8. The setting need only be large enough to hold
the amount of WAL data generated by one typical transaction, since the amount of WAL data generated by one typical transaction, since
the data is flushed to disk at every transaction commit. the data is written out to disk at every transaction commit.
This option can only be set at server start. This option can only be set at server start.
</para> </para>
@ -1481,8 +1499,9 @@ SET ENABLE_SEQSCAN TO OFF;
<para> <para>
Write a message to the server log if checkpoints caused by Write a message to the server log if checkpoints caused by
the filling of checkpoint segment files happen closer together the filling of checkpoint segment files happen closer together
than this many seconds. The default is 30 seconds. than this many seconds (which suggests that
Zero turns off the warning. <varname>checkpoint_segments</> ought to be raised). The default is
30 seconds. Zero disables the warning.
</para> </para>
</listitem> </listitem>
</varlistentry> </varlistentry>


@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.36 2005/10/13 17:32:42 momjian Exp $ --> <!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.37 2005/10/22 21:56:07 tgl Exp $ -->
<chapter id="reliability"> <chapter id="reliability">
<title>Reliability</title> <title>Reliability</title>
@ -7,12 +7,12 @@
Reliability is a major feature of any serious database system, and Reliability is a major feature of any serious database system, and
<productname>PostgreSQL</> does everything possible to guarantee <productname>PostgreSQL</> does everything possible to guarantee
reliable operation. One aspect of reliable operation is that all data reliable operation. One aspect of reliable operation is that all data
recorded by a transaction should be stored in a non-volatile area recorded by a committed transaction should be stored in a non-volatile area
that is safe from power loss, operating system failure, and hardware that is safe from power loss, operating system failure, and hardware
failure (unrelated to the non-volatile area itself). To accomplish failure (except failure of the non-volatile area itself, of course).
this, <productname>PostgreSQL</> uses the magnetic platters of modern Successfully writing the data to the computer's permanent storage
disk drives for permanent storage that is immune to the failures (disk drive or equivalent) ordinarily meets this requirement.
listed above. In fact, even if a computer is fatally damaged, if In fact, even if a computer is fatally damaged, if
the disk drives survive they can be moved to another computer with the disk drives survive they can be moved to another computer with
similar hardware and all committed transactions will remain intact. similar hardware and all committed transactions will remain intact.
</para> </para>
@ -21,60 +21,64 @@
While forcing data periodically to the disk platters might seem like While forcing data periodically to the disk platters might seem like
a simple operation, it is not. Because disk drives are dramatically a simple operation, it is not. Because disk drives are dramatically
slower than main memory and CPUs, several layers of caching exist slower than main memory and CPUs, several layers of caching exist
between the computer's main memory and the disk drive platters. between the computer's main memory and the disk platters.
First, there is the operating system kernel cache, which caches First, there is the operating system's buffer cache, which caches
frequently requested disk blocks and delays disk writes. Fortunately, frequently requested disk blocks and combines disk writes. Fortunately,
all operating systems give applications a way to force writes from all operating systems give applications a way to force writes from
the kernel cache to disk, and <productname>PostgreSQL</> uses those the buffer cache to disk, and <productname>PostgreSQL</> uses those
features. In fact, the <xref linkend="guc-wal-sync-method"> parameter features. (See the <xref linkend="guc-wal-sync-method"> parameter
controls how this is done. to adjust how this is done.)
</para> </para>
<para> <para>
Secondly, there is an optional disk drive controller cache, Next, there may be a cache in the disk drive controller; this is
particularly popular on <acronym>RAID</> controller cards. Some of particularly common on <acronym>RAID</> controller cards. Some of
these caches are <literal>write-through</>, meaning writes are passed these caches are <firstterm>write-through</>, meaning writes are passed
along to the drive as soon as they arrive. Others are along to the drive as soon as they arrive. Others are
<literal>write-back</>, meaning data is passed on to the drive at <firstterm>write-back</>, meaning data is passed on to the drive at
some later time. Such caches can be a reliability problem because the some later time. Such caches can be a reliability hazard because the
disk controller card cache is volatile, unlike the disk driver memory in the disk controller cache is volatile, and will lose its
platters, unless the disk drive controller has a battery-backed contents in a power failure. Better controller cards have
cache, meaning the card has a battery that maintains power to the <firstterm>battery-backed</> caches, meaning the card has a battery that
cache in case of server power loss. When the disk drives are later maintains power to the cache in case of system power loss. After power
accessible, the data is written to the drives. is restored the data will be written to the disk drives.
</para> </para>
<para> <para>
And finally, most disk drives have caches. Some are write-through And finally, most disk drives have caches. Some are write-through
(typically SCSI), and some are write-back(typically IDE), and the while some are write-back, and the
same concerns about data loss exist for write-back drive caches as same concerns about data loss exist for write-back drive caches as
exist for disk controller caches. To have reliability, all exist for disk controller caches. Consumer-grade IDE drives are
storage subsystems must be reliable in their storage characteristics. particularly likely to contain write-back caches that will not
When the operating system sends a write request to the drive platters, survive a power failure.
there is little it can do to make sure the data has arrived at a
non-volatile store area on the system. Rather, it is the
administrator's responsibility to be sure that all storage components
have reliable characteristics.
</para> </para>
<para> <para>
One other area of potential data loss are the disk platter writes When the operating system sends a write request to the disk hardware,
themselves. Disk platters are internally made up of 512-byte sectors. there is little it can do to make sure the data has arrived at a truly
non-volatile storage area. Rather, it is the
administrator's responsibility to be sure that all storage components
ensure data integrity. Avoid disk controllers that have non-battery-backed
write caches. At the drive level, disable write-back caching if the
drive cannot guarantee the data will be written before shutdown.
</para>
<para>
Another risk of data loss is posed by the disk platter write
operations themselves. Disk platters are divided into sectors,
commonly 512 bytes each. Every physical read or write operation
processes a whole sector.
When a write request arrives at the drive, it might be for 512 bytes, When a write request arrives at the drive, it might be for 512 bytes,
1024 bytes, or 8192 bytes, and the process of writing could fail due 1024 bytes, or 8192 bytes, and the process of writing could fail due
to power loss at any time, meaning some of the 512-byte sectors were to power loss at any time, meaning some of the 512-byte sectors were
written, and others were not, or the first half of a 512-byte sector written, and others were not. To guard against such failures,
has new data, and the remainder has the original data. Obviously, on
startup, <productname>PostgreSQL</> would not be able to deal with
these partially written cases. To guard against that,
<productname>PostgreSQL</> periodically writes full page images to <productname>PostgreSQL</> periodically writes full page images to
permanent storage <emphasis>before</> modifying the actual page on permanent storage <emphasis>before</> modifying the actual page on
disk. By doing this, during crash recovery <productname>PostgreSQL</> can disk. By doing this, during crash recovery <productname>PostgreSQL</> can
restore partially-written pages. If you have a battery-backed disk restore partially-written pages. If you have a battery-backed disk
controller or filesystem (e.g. Reiser4) that prevents partial page writes, controller or filesystem software (e.g., Reiser4) that prevents partial
you can turn off this page imaging by using the page writes, you can turn off this page imaging by using the
<xref linkend="guc-full-page-writes"> parameter. This parameter has no <xref linkend="guc-full-page-writes"> parameter.
effect on the successful use of Point in Time Recovery (PITR),
described in <xref linkend="backup-online">.
</para> </para>
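
Reviewer note: the paragraph above describes how a power loss can leave only some 512-byte sectors of a page written, and how a full page image lets recovery repair it. The following is a toy simulation under stated assumptions, not the server's recovery code; `torn_write` and `recover` are hypothetical helpers invented for this illustration.

```python
SECTOR = 512   # common sector size cited in the text
PAGE = 8192    # one of the write sizes cited in the text


def torn_write(old_page, new_page, sectors_written):
    """Return the on-disk page if power fails after `sectors_written` sectors."""
    cut = sectors_written * SECTOR
    # The leading sectors hold new data; the rest still hold the old data.
    return new_page[:cut] + old_page[cut:]


def recover(on_disk, full_page_image=None):
    """Toy crash recovery: a logged full page image replaces the disk page."""
    if full_page_image is not None:
        return full_page_image  # the torn content is simply overwritten
    return on_disk              # without an image, the mixed page remains


old = b"A" * PAGE
new = b"B" * PAGE
torn = torn_write(old, new, sectors_written=5)
repaired = recover(torn, full_page_image=new)
unrepaired = recover(torn)
```

Here `torn` is a mix of old and new sectors that matches neither version. With a full page image the page is restored exactly; without one, the mixed page is left in place, which is the risk accepted when `full_page_writes` is turned off.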
<para> <para>
@ -111,11 +115,7 @@
</para> </para>
<para> <para>
WAL brings three major benefits: A major benefit of using <acronym>WAL</acronym> is a
</para>
<para>
The first major benefit of using <acronym>WAL</acronym> is a
significantly reduced number of disk writes, because only the log significantly reduced number of disk writes, because only the log
file needs to be flushed to disk at the time of transaction file needs to be flushed to disk at the time of transaction
commit, rather than every data file changed by the transaction. commit, rather than every data file changed by the transaction.
@ -129,30 +129,7 @@
</para> </para>
<para> <para>
The next benefit is crash recovery protection. The truth is <acronym>WAL</acronym> also makes it possible to support on-line
that, before <acronym>WAL</acronym> was introduced back in release 7.1,
<productname>PostgreSQL</productname> was never able to guarantee
consistency in the case of a crash. Now,
<acronym>WAL</acronym> protects fully against the following problems:
<orderedlist>
<listitem>
<simpara>index rows pointing to nonexistent table rows</simpara>
</listitem>
<listitem>
<simpara>index rows lost in split operations</simpara>
</listitem>
<listitem>
<simpara>totally corrupted table or index page content, because
of partially written data pages</simpara>
</listitem>
</orderedlist>
</para>
<para>
Finally, <acronym>WAL</acronym> makes it possible to support on-line
backup and point-in-time recovery, as described in <xref backup and point-in-time recovery, as described in <xref
linkend="backup-online">. By archiving the WAL data we can support linkend="backup-online">. By archiving the WAL data we can support
reverting to any time instant covered by the available WAL data: reverting to any time instant covered by the available WAL data:
@ -169,7 +146,7 @@
<title><acronym>WAL</acronym> Configuration</title> <title><acronym>WAL</acronym> Configuration</title>
<para> <para>
There are several <acronym>WAL</acronym>-related configuration parameters that There are several <acronym>WAL</>-related configuration parameters that
affect database performance. This section explains their use. affect database performance. This section explains their use.
Consult <xref linkend="runtime-config"> for general information about Consult <xref linkend="runtime-config"> for general information about
setting server configuration parameters. setting server configuration parameters.
@ -178,16 +155,17 @@
<para> <para>
<firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></> <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
are points in the sequence of transactions at which it is guaranteed are points in the sequence of transactions at which it is guaranteed
that the data files have been updated with all information logged before that the data files have been updated with all information written before
the checkpoint. At checkpoint time, all dirty data pages are flushed to the checkpoint. At checkpoint time, all dirty data pages are flushed to
disk and a special checkpoint record is written to the log file. As a disk and a special checkpoint record is written to the log file.
result, in the event of a crash, the crash recovery procedure knows from In the event of a crash, the crash recovery procedure looks at the latest
what point in the log (known as the redo record) it should start the checkpoint record to determine the point in the log (known as the redo
REDO operation, since any changes made to data files before that point record) from which it should start the REDO operation. Any changes made to
are already on disk. After a checkpoint has been made, any log segments data files before that point are known to be already on disk. Hence, after
written before the redo record are no longer needed and can be recycled a checkpoint has been made, any log segments preceding the one containing
or removed. (When <acronym>WAL</acronym> archiving is being done, the the redo record are no longer needed and can be recycled or removed. (When
log segments must be archived before being recycled or removed.) <acronym>WAL</acronym> archiving is being done, the log segments must be
archived before being recycled or removed.)
</para> </para>
<para> <para>
@ -206,7 +184,7 @@
more often. This allows faster after-crash recovery (since less work more often. This allows faster after-crash recovery (since less work
will need to be redone). However, one must balance this against the will need to be redone). However, one must balance this against the
increased cost of flushing dirty data pages more often. If increased cost of flushing dirty data pages more often. If
<xref linkend="guc-full-page-writes"> is set (the default), there is <xref linkend="guc-full-page-writes"> is set (as is the default), there is
another factor to consider. To ensure data page consistency, another factor to consider. To ensure data page consistency,
the first modification of a data page after each checkpoint results in the first modification of a data page after each checkpoint results in
logging the entire page content. In that case, logging the entire page content. In that case,
@ -228,8 +206,9 @@
<varname>checkpoint_segments</varname>. Occasional appearance of such <varname>checkpoint_segments</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such checkpoint control parameters should be increased. Bulk operations such
as a COPY, INSERT SELECT etc. may cause a number of such warnings if you as large <command>COPY</> transfers may cause a number of such warnings
do not set <xref linkend="guc-checkpoint-segments"> high enough. to appear if you have not set <varname>checkpoint_segments</> high
enough.
</para> </para>
<para> <para>
@ -273,8 +252,7 @@
correspondingly increase shared memory usage. When correspondingly increase shared memory usage. When
<xref linkend="guc-full-page-writes"> is set and the system is very busy, <xref linkend="guc-full-page-writes"> is set and the system is very busy,
setting this value higher will help smooth response times during the setting this value higher will help smooth response times during the
period immediately following each checkpoint. As a guide, a setting of 1024 period immediately following each checkpoint.
would be considered to be high.
</para> </para>
<para> <para>
@ -310,8 +288,7 @@
(provided that <productname>PostgreSQL</productname> has been (provided that <productname>PostgreSQL</productname> has been
compiled with support for it) will result in each compiled with support for it) will result in each
<function>LogInsert</function> and <function>LogFlush</function> <function>LogInsert</function> and <function>LogFlush</function>
<acronym>WAL</acronym> call being logged to the server log. The output <acronym>WAL</acronym> call being logged to the server log. This
is too verbose for use as a guide to performance tuning. This
option may be replaced by a more general mechanism in the future. option may be replaced by a more general mechanism in the future.
</para> </para>
</sect1> </sect1>
@ -340,15 +317,6 @@
available stock of numbers. available stock of numbers.
</para> </para>
<para>
The <acronym>WAL</acronym> buffers and control structure are in
shared memory and are handled by the server child processes; they
are protected by lightweight locks. The demand on shared memory is
dependent on the number of buffers. The default size of the
<acronym>WAL</acronym> buffers is 8 buffers of 8 kB each, or 64 kB
total.
</para>
<para> <para>
It is of advantage if the log is located on another disk than the It is of advantage if the log is located on another disk than the
main database files. This may be achieved by moving the directory main database files. This may be achieved by moving the directory