Enhance standby documentation.

Original patch by Fujii Masao, with heavy editing and bitrot-fixing
after my other commit.
Heikki Linnakangas 2010-03-31 20:35:09 +00:00
parent 259f60e9b6
commit ec9ee9381f


@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/high-availability.sgml,v 1.55 2010/03/31 19:13:01 heikki Exp $ --> <!-- $PostgreSQL: pgsql/doc/src/sgml/high-availability.sgml,v 1.56 2010/03/31 20:35:09 heikki Exp $ -->
<chapter id="high-availability"> <chapter id="high-availability">
<title>High Availability, Load Balancing, and Replication</title> <title>High Availability, Load Balancing, and Replication</title>
@ -622,7 +622,8 @@ protocol to make nodes agree on a serializable transactional order.
<title>Preparing Master for Standby Servers</title> <title>Preparing Master for Standby Servers</title>
<para> <para>
Set up continuous archiving to a WAL archive on the master, as described Set up continuous archiving on the primary to an archive directory
accessible from the standby, as described
in <xref linkend="continuous-archiving">. The archive location should be in <xref linkend="continuous-archiving">. The archive location should be
accessible from the standby even when the master is down, i.e. it should accessible from the standby even when the master is down, i.e. it should
reside on the standby server itself or another trusted server, not on reside on the standby server itself or another trusted server, not on
@ -646,11 +647,11 @@ protocol to make nodes agree on a serializable transactional order.
<para> <para>
To set up the standby server, restore the base backup taken from primary To set up the standby server, restore the base backup taken from primary
server (see <xref linkend="backup-pitr-recovery">). In the recovery command file server (see <xref linkend="backup-pitr-recovery">). Create a recovery
<filename>recovery.conf</> in the standby's cluster data directory, command file <filename>recovery.conf</> in the standby's cluster data
turn on <varname>standby_mode</>. Set <varname>restore_command</> to directory, and turn on <varname>standby_mode</>. Set
a simple command to copy files from the WAL archive. If you want to <varname>restore_command</> to a simple command to copy files from
use streaming replication, set <varname>primary_conninfo</>. the WAL archive.
</para> </para>
<note> <note>
@ -664,17 +665,38 @@ protocol to make nodes agree on a serializable transactional order.
</note> </note>
<para> <para>
You can use restartpoint_command to prune the archive of files no longer If you want to use streaming replication, fill in
needed by the standby. <varname>primary_conninfo</> with a libpq connection string, including
the host name (or IP address) and any additional details needed to
connect to the primary server. If the primary needs a password for
authentication, the password needs to be specified in
<varname>primary_conninfo</> as well.
</para>
<para>
You can use <varname>restartpoint_command</> to prune the archive of
files no longer needed by the standby.
</para> </para>
<para> <para>
If you're setting up the standby server for high availability purposes, If you're setting up the standby server for high availability purposes,
set up WAL archiving, connections and authentication like the primary set up WAL archiving, connections and authentication like the primary
server, because the standby server will work as a primary server after server, because the standby server will work as a primary server after
failover. If you're setting up the standby server for reporting failover. You will also need to set <varname>trigger_file</> to make
purposes, with no plans to fail over to it, configure the standby it possible to fail over.
accordingly. If you're setting up the standby server for reporting
purposes, with no plans to fail over to it, <varname>trigger_file</>
is not required.
</para>
<para>
A simple example of a <filename>recovery.conf</> is:
<programlisting>
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
restore_command = 'cp /path/to/archive/%f %p'
trigger_file = '/path/to/trigger_file'
</programlisting>
</para> </para>
<para> <para>
@ -731,7 +753,7 @@ protocol to make nodes agree on a serializable transactional order.
On systems that support the keepalive socket option, setting On systems that support the keepalive socket option, setting
<xref linkend="guc-tcp-keepalives-idle">, <xref linkend="guc-tcp-keepalives-idle">,
<xref linkend="guc-tcp-keepalives-interval"> and <xref linkend="guc-tcp-keepalives-interval"> and
<xref linkend="guc-tcp-keepalives-count"> helps the master promptly <xref linkend="guc-tcp-keepalives-count"> helps the primary promptly
notice a broken connection. notice a broken connection.
</para> </para>
@ -798,6 +820,29 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
<varname>primary_conninfo</varname> then a FATAL error will be raised. <varname>primary_conninfo</varname> then a FATAL error will be raised.
</para> </para>
</sect3> </sect3>
<sect3 id="streaming-replication-monitoring">
<title>Monitoring</title>
<para>
The WAL files required for the standby's recovery are not deleted from
the <filename>pg_xlog</> directory on the primary while the standby is
connected. If the standby lags far behind the primary, many WAL files
will accumulate there and can fill up the disk. It is therefore
important to monitor the lag, both to ensure the health of the standby
and to avoid running out of disk space on the primary.
You can calculate the lag by comparing the current WAL write
location on the primary with the last WAL location received by the
standby. They can be retrieved using
<function>pg_current_xlog_location</> on the primary and
<function>pg_last_xlog_receive_location</> on the standby,
respectively (see <xref linkend="functions-admin-backup-table"> and
<xref linkend="functions-recovery-info-table"> for details).
The last WAL receive location on the standby is also shown in the
process status of the WAL receiver process, as displayed by the
<command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
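<para>
The lag comparison described above could be scripted roughly as follows.
This is only a sketch: the helper names <literal>xlog_to_bytes</> and
<literal>xlog_lag_bytes</> are hypothetical, and it assumes the default
WAL layout in which each log ID covers 0xFF000000 bytes (255 segments of
16 MB each). The host names in the commented usage lines are placeholders.

```shell
#!/bin/sh
# Convert a WAL location of the form "xlogid/xrecoff" (both hexadecimal)
# into an absolute byte position.
# Assumption: each xlogid spans 0xFF000000 bytes (255 segments of 16 MB).
xlog_to_bytes() {
    xlogid=${1%%/*}     # part before the slash
    xrecoff=${1##*/}    # part after the slash
    echo $(( 0x$xlogid * 0xFF000000 + 0x$xrecoff ))
}

# Print how many bytes the standby ($2) is behind the primary ($1).
xlog_lag_bytes() {
    echo $(( $(xlog_to_bytes "$1") - $(xlog_to_bytes "$2") ))
}

# Possible usage (host names are placeholders, not from the text above):
#   primary_loc=$(psql -h primary-host -At -c "select pg_current_xlog_location()")
#   standby_loc=$(psql -h standby-host -At -c "select pg_last_xlog_receive_location()")
#   xlog_lag_bytes "$primary_loc" "$standby_loc"
```

The result is a byte count, so an alerting threshold can be chosen
relative to the free space available for <filename>pg_xlog</> on the
primary.
</para>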
</sect3>
</sect2> </sect2>
</sect1> </sect1>
@ -1898,16 +1943,64 @@ LOG: database system is ready to accept read only connections
updated backup than from the original base backup. updated backup than from the original base backup.
</para> </para>
<para>
The procedure for taking a file system backup of the standby server's
data directory while it's processing logs shipped from the primary is:
<orderedlist>
<listitem>
<para>
Perform the backup, without using <function>pg_start_backup</> and
<function>pg_stop_backup</>. Note that the <filename>pg_control</>
file must be backed up <emphasis>first</>, as in:
<programlisting>
cp /var/lib/pgsql/data/global/pg_control /tmp
cp -r /var/lib/pgsql/data /path/to/backup
mv /tmp/pg_control /path/to/backup/data/global
</programlisting>
<filename>pg_control</> contains the location where WAL replay will
begin after restoring from the backup; backing it up first ensures
that it points to the last restartpoint when the backup started, not
some later restartpoint that happened while files were copied to the
backup.
</para>
</listitem>
<listitem>
<para>
Make note of the backup ending WAL location by calling the <function>
pg_last_xlog_replay_location</> function at the end of the backup,
and keep it with the backup.
<programlisting>
psql -c "select pg_last_xlog_replay_location();" > /path/to/backup/end_location
</programlisting>
When recovering from the incrementally updated backup, the server
can begin accepting connections and complete the recovery successfully
before the database has become consistent. To avoid that, you must
ensure the database is consistent before users try to connect to the
server and when the recovery ends. You can do that by comparing the
progress of the recovery with the stored backup ending WAL location:
the server is not consistent until recovery has reached the backup end
location. The progress of the recovery can also be observed with the
<function>pg_last_xlog_replay_location</> function, but that requires
connecting to the server while it might not be consistent yet, so
care should be taken with that method.
</para>
<para>
</para>
</listitem>
</orderedlist>
</para>
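<para>
The two steps above might be wrapped in a small script along these
lines. It is a sketch under stated assumptions, not a tested backup
tool: the function names <literal>backup_standby</> and
<literal>xlog_reached</> are hypothetical, the backup parent directory
is assumed to exist already, the data directory is passed without a
trailing slash, and WAL locations are compared assuming the default
layout of 0xFF000000 bytes per log ID.

```shell
#!/bin/sh
# Copy a standby data directory, saving pg_control before anything else.
#   $1 = data directory (no trailing slash)
#   $2 = existing directory that will receive the copy
backup_standby() {
    data_dir=$1
    backup_parent=$2
    saved_control=$(mktemp) || return 1
    # Save pg_control first, so it refers to the restartpoint that was
    # current when the backup started.
    cp "$data_dir/global/pg_control" "$saved_control" || return 1
    # Copy everything; this may pick up a newer pg_control.
    cp -r "$data_dir" "$backup_parent" || return 1
    # Overwrite the copied pg_control with the one saved at the start.
    mv "$saved_control" \
       "$backup_parent/$(basename "$data_dir")/global/pg_control"
}

# Convert "xlogid/xrecoff" (hexadecimal) to a byte position.
# Assumption: each xlogid spans 0xFF000000 bytes.
xlog_to_bytes() {
    echo $(( 0x${1%%/*} * 0xFF000000 + 0x${1##*/} ))
}

# Succeed if the replay location $1 has reached the backup end location $2.
xlog_reached() {
    [ "$(xlog_to_bytes "$1")" -ge "$(xlog_to_bytes "$2")" ]
}

# Possible usage (paths taken from the example above):
#   backup_standby /var/lib/pgsql/data /path/to/backup
#   psql -At -c "select pg_last_xlog_replay_location();" \
#       > /path/to/backup/end_location
```

Using <literal>psql -At</> when recording the end location stores only
the bare value, which makes the later comparison with
<literal>xlog_reached</> straightforward.
</para>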
<para> <para>
Since the standby server is not <quote>live</>, it is not possible to Since the standby server is not <quote>live</>, it is not possible to
use <function>pg_start_backup()</> and <function>pg_stop_backup()</> use <function>pg_start_backup()</> and <function>pg_stop_backup()</>
to manage the backup process; it will be up to you to determine how to manage the backup process; it will be up to you to determine how
far back you need to keep WAL segment files to have a recoverable far back you need to keep WAL segment files to have a recoverable
backup. You can do this by running <application>pg_controldata</> backup. That is determined by the last restartpoint when the backup
on the standby server to inspect the control file and determine the was taken; any WAL older than that can be deleted from the archive
current checkpoint WAL location, or by using the once the backup is complete. You can determine the last restartpoint
<varname>log_checkpoints</> option to print values to the standby's by running <application>pg_controldata</> on the standby server before
server log. taking the backup, or by using the <varname>log_checkpoints</> option
to print values to the standby's server log.
</para> </para>
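<para>
As a rough illustration of the <application>pg_controldata</> approach,
the last restartpoint's REDO location can be extracted from its output
and mapped to a WAL segment file name. The helper names
<literal>redo_location</> and <literal>wal_segment_name</> are
hypothetical, and the sketch assumes the default 16 MB segment size and
the usual 24-digit segment file naming.

```shell
#!/bin/sh
# Read pg_controldata output on stdin and print the REDO location of the
# latest checkpoint (on a standby, the latest restartpoint).
redo_location() {
    sed -n "s/^Latest checkpoint's REDO location:[[:space:]]*//p"
}

# Print the WAL segment file name that contains a given location.
#   $1 = timeline ID (decimal, as printed by pg_controldata)
#   $2 = WAL location in "xlogid/xrecoff" form (hexadecimal)
# Assumption: 16 MB (0x1000000 bytes) segments.
wal_segment_name() {
    xlogid=$(( 0x${2%%/*} ))
    segno=$(( 0x${2##*/} / 0x1000000 ))
    printf '%08X%08X%08X\n' "$1" "$xlogid" "$segno"
}

# Possible usage (data directory taken from the example above; the
# timeline ID 1 is a placeholder):
#   loc=$(pg_controldata /var/lib/pgsql/data | redo_location)
#   wal_segment_name 1 "$loc"
```

Run before taking the backup, this gives the name of the oldest segment
that has to be kept; segments that sort before it are the candidates
for removal from the archive once the backup is complete.
</para>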
</sect1> </sect1>