From owner-pgsql-hackers@hub.org Fri Nov 13 13:24:37 1998
Received: from hub.org (majordom@hub.org [209.47.148.200])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA13457
    for <maillist@candle.pha.pa.us>; Fri, 13 Nov 1998 13:24:35 -0500 (EST)
Received: from localhost (majordom@localhost)
    by hub.org (8.9.1/8.9.1) with SMTP id NAA02464;
    Fri, 13 Nov 1998 13:22:52 -0500 (EST)
    (envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 13 Nov 1998 13:21:14 +0000 (EST)
Received: (from majordom@localhost)
    by hub.org (8.9.1/8.9.1) id NAA02331
    for pgsql-hackers-outgoing; Fri, 13 Nov 1998 13:21:12 -0500 (EST)
    (envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
    by hub.org (8.9.1/8.9.1) with SMTP id NAA02316
    for <pgsql-hackers@postgreSQL.org>; Fri, 13 Nov 1998 13:21:06 -0500 (EST)
    (envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de
    for pgsql-hackers@postgreSQL.org
    id m0zeOEf-000EBPC; Fri, 13 Nov 98 19:46 MET
Message-Id: <m0zeOEf-000EBPC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: [HACKERS] shmem limits and redolog
To: pgsql-hackers@postgreSQL.org (PostgreSQL HACKERS)
Date: Fri, 13 Nov 1998 19:46:20 +0100 (MET)
Reply-To: jwieck@debis.com (Jan Wieck)
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

Hi,

I'm currently hacking around on a solution for logging all
database operations at query level, so that a crashed
database can be recovered from the last successful backup by
redoing all the commands.

Well, I wanted it to be as flexible as possible. So I decided
to make it configurable per database. One could say which
databases are logged and, for each logged database, whether
it is logged sync or async (in sync mode, every COMMIT forces
an fsync of the current logfile and controlfiles).

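The difference between the two modes can be sketched as follows. This is a toy in Python, not PostgreSQL code: the class RedoLog, its log_commit() method and the fsync_calls counter are all made up for illustration, and the controlfiles are left out.

```python
import os
import tempfile


class RedoLog:
    """Hypothetical per-database redo log; names are illustrative only."""

    def __init__(self, path, mode):
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.mode = mode
        self.fsync_calls = 0
        self._file = open(path, "ab")

    def log_commit(self, query):
        # Append one committed query per line.
        self._file.write(query.encode("utf-8") + b"\n")
        if self.mode == "sync":
            # Sync mode: force the record to disk before COMMIT returns.
            self._file.flush()
            os.fsync(self._file.fileno())
            self.fsync_calls += 1
        # Async mode: the record stays buffered and is written out later.

    def close(self):
        self._file.flush()
        self._file.close()


def demo():
    """Log three commits in each mode and count the forced fsyncs."""
    counts = {}
    with tempfile.TemporaryDirectory() as tmp:
        for mode in ("sync", "async"):
            log = RedoLog(os.path.join(tmp, mode + ".log"), mode)
            for i in range(3):
                log.log_commit("UPDATE t SET x = %d" % i)
            log.close()
            counts[mode] = log.fsync_calls
    return counts
```

Calling demo() logs three commits per mode; only the sync log pays one fsync per COMMIT.
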
To make async mode as fast as possible, I'm using a shared
memory segment of 32K per database (not per backend) that is
used as a wrap-around buffer in which the backends place
their query information. So the log writer can fall a little
behind if there are many backends doing different things that
don't lock each other.

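A minimal sketch of such a wrap-around buffer, assuming a 4-byte length prefix per record (the record layout is not described in the mail). It is plain single-process Python; the real thing would live in a shared memory segment and need the semaphores for synchronization, which are omitted here. The class and method names are hypothetical.

```python
class WrapAroundBuffer:
    """Fixed-size byte ring; each record is a 4-byte length plus payload."""

    def __init__(self, capacity=32 * 1024):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0   # position of the next byte the log writer reads
        self.used = 0   # bytes waiting to be written to the logfile

    def _copy_in(self, data):
        tail = (self.head + self.used) % self.capacity
        first = min(len(data), self.capacity - tail)
        self.buf[tail:tail + first] = data[:first]
        # Whatever did not fit before the end wraps to the start.
        self.buf[0:len(data) - first] = data[first:]
        self.used += len(data)

    def _copy_out(self, n):
        first = min(n, self.capacity - self.head)
        data = bytes(self.buf[self.head:self.head + first])
        data += bytes(self.buf[0:n - first])
        self.head = (self.head + n) % self.capacity
        self.used -= n
        return data

    def put(self, query):
        """Backend side: False means the log writer is too far behind."""
        payload = query.encode("utf-8")
        if len(payload) + 4 > self.capacity - self.used:
            return False
        self._copy_in(len(payload).to_bytes(4, "big") + payload)
        return True

    def get(self):
        """Log writer side: oldest queued query, or None if empty."""
        if self.used == 0:
            return None
        length = int.from_bytes(self._copy_out(4), "big")
        return self._copy_out(length).decode("utf-8")
```

When put() returns False, a backend would have to wait for the log writer to catch up; that is the point where the writer can no longer "fall a little behind".
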
Now I'm a little in doubt given the shared memory limits
reported. Was it a good decision to use shared memory? Would
I be better off using sockets?

The bad thing in what I have up to now (it's far from
complete) is that even if a database isn't currently logged,
a redolog writer is started and creates the 32K shmem segment
(plus a semaphore set with 5 semaphores). This is because I
plan to create commands like

    ALTER DATABASE LOG MODE=ASYNC LOGDIR='/somewhere/dbname';

and the like that can be used at runtime (while more than one
backend is connected to the database) to turn logging on/off,
switch to/from backup mode (all other activity is stopped)
etc.

So every 32 databases will require another megabyte of shared
memory. The logging master tracks which databases have
activity and kills redolog writers after some time of
inactivity, and the shmem is freed then. But it can hurt if
someone really has many, many databases that are all in use
at the same time.

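The arithmetic behind that figure, as a small Python helper. The constants come from the numbers above (32K per segment, 5 semaphores per set); the function name is made up.

```python
SEGMENT_BYTES = 32 * 1024   # one 32K shmem segment per active database
SEMS_PER_DB = 5             # one semaphore set with 5 semaphores each


def redolog_ipc_cost(active_databases):
    """Return (shared memory in bytes, semaphores) for that many databases."""
    return active_databases * SEGMENT_BYTES, active_databases * SEMS_PER_DB
```

For 32 active databases this gives 1048576 bytes (one megabyte) and 160 semaphores, so on systems with tight IPC limits the semaphores may matter as much as the memory.
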
What do the others say?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

From owner-pgsql-hackers@hub.org Wed Dec 16 15:46:41 1998
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA00521
    for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:46:40 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id PAA08772 for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:10:01 -0500 (EST)
Received: from localhost (majordom@localhost)
    by hub.org (8.9.1/8.9.1) with SMTP id PAA01254;
    Wed, 16 Dec 1998 15:06:56 -0500 (EST)
    (envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 16 Dec 1998 14:58:11 +0000 (EST)
Received: (from majordom@localhost)
    by hub.org (8.9.1/8.9.1) id OAA00660
    for pgsql-hackers-outgoing; Wed, 16 Dec 1998 14:58:10 -0500 (EST)
    (envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
    by hub.org (8.9.1/8.9.1) with SMTP id OAA00643
    for <pgsql-hackers@postgreSQL.org>; Wed, 16 Dec 1998 14:58:05 -0500 (EST)
    (envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de
    for pgsql-hackers@postgreSQL.org
    id m0zqNDo-000EBTC; Wed, 16 Dec 98 21:07 MET
Message-Id: <m0zqNDo-000EBTC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: Re: [HACKERS] redolog - for discussion
To: vadim@krs.ru (Vadim Mikheev)
Date: Wed, 16 Dec 1998 21:07:00 +0100 (MET)
Cc: jwieck@debis.com, pgsql-hackers@postgreSQL.org
Reply-To: jwieck@debis.com (Jan Wieck)
In-Reply-To: <3677B71D.C67462B3@krs.ru> from "Vadim Mikheev" at Dec 16, 98 08:35:25 pm
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: RO

Vadim wrote:

>
> Jan Wieck wrote:
> >
> > RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> > For the others, the backend starts the recovery program
> > which reads the redolog files, establishes database
> > connections as required and reruns all the commands in
>                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > them. If a required logfile isn't found, it tells the
>   ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
> and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transaction will be x = 2
> if the 1st transaction updated row _after_ 2nd transaction
> and x = 3 if the 2nd transaction gets row after 1st one.
> Order of updates is not defined by order in which commands
> begun and so order in which commands should be rerun
> will be unknown...

Yepp, the order in which commands began is absolutely not of
interest. Locking could already delay the execution of one
command until another one, started later, has finished and
released the lock. It's a classic race condition.

Thus, my plan was to log the queries just before the call to
CommitTransactionCommand() in tcop. This has the advantage
that queries which bail out with errors don't get into the
log at all and need not be rerun. And I can set a static
flag to false before starting the command, which is set to
true in the buffer manager when a buffer is written (marked
dirty), so filtering out queries that do no updates at all is
easy.

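A toy model of that plan, in Python rather than backend C, and with every query treated as its own auto-committed transaction. The ACTIONS table, execute() and replay() are invented for the sketch; the boolean each action returns stands in for the "buffer was dirtied" flag, and the snapshot copy stands in for a real abort.

```python
def set_x(value):
    def action(table):
        table["x"] = value
        return True          # a buffer was dirtied
    return action


def read_x(table):
    return False             # read-only: the flag stays false


def failing(table):
    table["x"] = 99
    raise RuntimeError("aborted in the middle")


# Query text -> behaviour; purely illustrative.
ACTIONS = {
    "update test set x = 2 where y = 1": set_x(2),
    "update test set x = 3 where y = 1": set_x(3),
    "select x from test": read_x,
    "update test set x = 99": failing,
}


def execute(table, redolog, query):
    """Run one query; log it only if it commits and changed something."""
    before = dict(table)
    try:
        dirtied = ACTIONS[query](table)
    except RuntimeError:
        table.clear()
        table.update(before)      # aborted queries never reach the log
        return
    if dirtied:
        redolog.append(query)     # appended at commit, so in commit order


def replay(backup, redolog):
    """Recover by rerunning the logged queries on top of the backup."""
    table = dict(backup)
    for query in redolog:
        ACTIONS[query](table)
    return table
```

Because entries are appended at commit time, the log order is the commit order, and replaying it on the backup gives the same final x no matter which command started first.
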
Unfortunately query level logging gets hit by the current
implementation of sequence numbers. If a query that gets
aborted somewhere in the middle (maybe by a trigger) called
nextval() for rows processed earlier, the sequence number
isn't advanced at recovery time, because the query is
suppressed altogether. And sequences aren't locked, so for
concurrently running queries getting numbers from the same
sequence, the results aren't reproducible. If some
application selects a value resulting from a sequence and
uses that later in another query, how could the redolog know
that this has changed? It's a Const in the query logged, and
all that corrupts the whole thing.

All that is painful and I don't see another solution yet than
to hook into nextval(), log the numbers generated in normal
operation and get back the same numbers in redo mode.

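One way such a nextval() hook could look, sketched in Python. Tagging each logged number with a query id is an assumption made here so that numbers eaten by aborted queries can be skipped during redo; the mail only says the numbers get logged. LoggedSequence and start_redo() are hypothetical names.

```python
from collections import defaultdict, deque


class LoggedSequence:
    """Toy sequence that records nextval() results and replays them."""

    def __init__(self, start=1):
        self.next_value = start
        self.log = []        # (query_id, value) pairs from normal operation
        self._redo = None    # query_id -> deque of values while in redo mode

    def nextval(self, query_id):
        if self._redo is not None:
            # Redo mode: hand back exactly what this query got before.
            return self._redo[query_id].popleft()
        value = self.next_value
        self.next_value += 1
        self.log.append((query_id, value))
        return value

    def start_redo(self, logged):
        self._redo = defaultdict(deque)
        for query_id, value in logged:
            self._redo[query_id].append(value)
            # Also advance past numbers used by queries that are not rerun.
            self.next_value = max(self.next_value, value + 1)
```

In redo mode only the committed queries are rerun and each gets its original numbers back, while the sequence still ends up past the values consumed by aborted queries.
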
The whole thing gets more and more complicated :-(


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

From owner-pgsql-hackers@hub.org Wed Jun 16 09:29:31 1999
Received: from hub.org (hub.org [209.167.229.1])
    by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA22504
    for <maillist@candle.pha.pa.us>; Wed, 16 Jun 1999 09:29:29 -0400 (EDT)
Received: from hub.org (hub.org [209.167.229.1])
    by hub.org (8.9.3/8.9.3) with ESMTP id JAA02132;
    Wed, 16 Jun 1999 09:18:20 -0400 (EDT)
    (envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 16 Jun 1999 09:14:07 +0000 (EDT)
Received: (from majordom@localhost)
    by hub.org (8.9.3/8.9.3) id JAA01318
    for pgsql-hackers-outgoing; Wed, 16 Jun 1999 09:14:06 -0400 (EDT)
    (envelope-from owner-pgsql-hackers@postgreSQL.org)
X-Authentication-Warning: hub.org: majordom set sender to owner-pgsql-hackers@postgreSQL.org using -f
Received: from sunpine.krs.ru (SunPine.krs.ru [195.161.16.37])
    by hub.org (8.9.3/8.9.3) with ESMTP id JAA01278
    for <hackers@postgreSQL.org>; Wed, 16 Jun 1999 09:13:48 -0400 (EDT)
    (envelope-from vadim@krs.ru)
Received: from krs.ru (dune.krs.ru [195.161.16.38])
    by sunpine.krs.ru (8.8.8/8.8.8) with ESMTP id VAA06276
    for <hackers@postgreSQL.org>; Wed, 16 Jun 1999 21:12:49 +0800 (KRSS)
Message-ID: <3767A2CF.E6E4A5F9@krs.ru>
Date: Wed, 16 Jun 1999 21:12:47 +0800
From: Vadim Mikheev <vadim@krs.ru>
Organization: OJSC Rostelecom (Krasnoyarsk)
X-Mailer: Mozilla 4.5 [en] (X11; I; FreeBSD 3.0-RELEASE i386)
X-Accept-Language: ru, en
MIME-Version: 1.0
To: PostgreSQL Developers List <hackers@postgreSQL.org>
Subject: [HACKERS] Savepoints...
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

To have them I need to add a tuple id (6 bytes) to the heap
tuple header. Are there objections? Though it's not good to
increase the tuple header size, subj is, imho, a very nice
feature...

Implementation is, hm, "easy":

- heap_insert/heap_delete/heap_replace/heap_mark4update will
  remember the updated tid (and current command id) in the
  relation cache and store the previously updated tid
  (remembered in the relation cache) in the additional heap
  header tid;
- lmgr will remember the command id when a lock was acquired;
- for a savepoint we will just store the command id when
  the savepoint was set;
- when going to sleep due to a concurrent update of the same
  row, the backend will store MyProc and the tuple id in a
  shmem hash table.

When rolling back to a savepoint, the backend will:

- release locks acquired after the savepoint;
- for a relation updated after the savepoint, get the last
  updated tid from the relation cache, walk through the
  relation, set HEAP_XMIN_INVALID/HEAP_XMAX_INVALID in all
  tuples updated after the savepoint and wake up concurrent
  writers blocked on these tuples (using the shmem hash table
  mentioned above).

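The command-id bookkeeping can be illustrated with a small Python model. It is a simplification under stated assumptions: the rollback scans the whole toy heap instead of following the chain of previously updated tids, and lock release and waking blocked writers are left out entirely. HeapTuple, Transaction and the two boolean flags are stand-ins, not the real structures.

```python
class HeapTuple:
    def __init__(self, value, cmin):
        self.value = value
        self.cmin = cmin            # command id that inserted the tuple
        self.cmax = None            # command id that deleted it, if any
        self.xmin_invalid = False   # stands in for HEAP_XMIN_INVALID
        self.xmax_invalid = False   # stands in for HEAP_XMAX_INVALID


class Transaction:
    def __init__(self):
        self.heap = []
        self.command_id = 0
        self.savepoints = {}        # name -> command id when it was set

    def _visible(self, tup):
        deleted = tup.cmax is not None and not tup.xmax_invalid
        return not tup.xmin_invalid and not deleted

    def insert(self, value):
        self.command_id += 1
        self.heap.append(HeapTuple(value, self.command_id))

    def delete(self, value):
        self.command_id += 1
        for tup in self.heap:
            if tup.value == value and self._visible(tup):
                tup.cmax = self.command_id
                tup.xmax_invalid = False

    def savepoint(self, name):
        self.savepoints[name] = self.command_id

    def rollback_to(self, name):
        cid = self.savepoints[name]
        for tup in self.heap:
            if tup.cmin > cid:
                tup.xmin_invalid = True   # undo inserts done after it
            if tup.cmax is not None and tup.cmax > cid:
                tup.xmax_invalid = True   # undo deletes done after it
        # Savepoints set later than this one are gone as well.
        self.savepoints = {n: c for n, c in self.savepoints.items()
                           if c <= cid}

    def visible_values(self):
        return [t.value for t in self.heap if self._visible(t)]
```

The only thing a savepoint stores is a command id; everything done by a later command is found by comparing ids, which is why the proposal needs so little per-savepoint state.
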
The last feature (waking up concurrent writers) is the hardest
part to implement. AFAIK, Oracle 7.3 was not able to do it.
Can someone comment on whether this feature is implemented in
Oracle 8.X or other DBMSes?

Now about implicit savepoints. The backend will place them
before the execution of each user statement. In the case of
failure, the transaction state will be rolled back to the one
before execution of the query. As a side effect, this means
that we'll get rid of complaints about the entire transaction
being aborted in the case of mistyping that causes an abort
due to parser errors...

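A rough Python illustration of the implicit savepoint idea. The statement syntax ("add N N ...") and all function names are invented; the transaction state is just a list, so the savepoint is its length before the statement runs.

```python
class StatementError(Exception):
    pass


def run_statement(state, statement):
    """Toy statement 'add N N ...': appends numbers, fails on a bad token."""
    tokens = statement.split()
    if not tokens or tokens[0] != "add":
        raise StatementError("parse error: " + statement)
    for arg in tokens[1:]:
        if not arg.isdigit():
            raise StatementError("bad value: " + arg)
        state.append(int(arg))


def run_transaction(statements):
    """Run all statements; a failing one undoes only its own changes."""
    state, errors = [], []
    for statement in statements:
        implicit_savepoint = len(state)      # state before this statement
        try:
            run_statement(state, statement)
        except StatementError as exc:
            del state[implicit_savepoint:]   # roll back this statement only
            errors.append(str(exc))
    return state, errors
```

For example, run_transaction(["add 1 2", "ad 3", "add 4 x 5", "add 6"]) keeps the work of the first and last statements, while the mistyped one and the half-executed one are undone without aborting the rest.
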
Comments?

Vadim