mirror of
https://git.postgresql.org/git/postgresql.git
synced 2025-01-24 18:55:04 +08:00
1476 lines
66 KiB
Plaintext
From cjs@cynic.net Sat Jun 22 04:41:54 2002
Return-path: <cjs@cynic.net>
Received: from academic.cynic.net ([63.144.177.3])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5M8fpF04711
    for <pgman@candle.pha.pa.us>; Sat, 22 Jun 2002 04:41:53 -0400 (EDT)
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
    by academic.cynic.net (Postfix) with ESMTP
    id 415C8F820; Sat, 22 Jun 2002 08:41:33 +0000 (UTC)
Date: Sat, 22 Jun 2002 17:41:30 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Michael Loftis <mloftis@wgops.com>,
    mlw <markw@mohawksoft.com>,
    PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
In-Reply-To: <19332.1024668861@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0206221731130.1091-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO

On Fri, 21 Jun 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > And now you know which parts of your page got written, and which
> > parts didn't.
>
> Yes ... and what do you *do* about it?

Ok. Here's the extract from _Inside Microsoft SQL Server 7.0_, page 207:

    torn page detection When TRUE, this option causes a bit to be
    flipped for each 512-byte sector in a database page (8 KB)
    whenever the page is written to disk. This option allows
    SQL Server to detect incomplete I/O operations caused by
    power failures or other system outages. If a bit is in the
    wrong state when the page is later read by SQL Server, this
    means the page was written incorrectly; a torn page has
    been detected. Although SQL Server database pages are 8
    KB, disks perform I/O operations using 512-byte sectors.
    Therefore, 16 sectors are written per database page. A
    torn page can occur if the system crashes (for example,
    because of power failure) between the time the operating
    system writes the first 512-byte sector to disk and the
    completion of the 8-KB I/O operation. If the first sector
    of a database page is successfully written before the crash,
    it will appear that the database page on disk was updated,
    although it might not have succeeded. Using battery-backed
    disk caches can ensure that data is [sic] successfully
    written to disk or not written at all. In this case, don't
    set torn page detection to TRUE, as it isn't needed. If a
    torn page is detected, the database will need to be restored
    from backup because it will be physically inconsistent.

As I understand it, this is not a problem for postgres because the
entire page is written to the log. So postgres is safe, but quite
inefficient. (It would be much more efficient to write just the
changed tuple, or even just the changed values within the tuple,
to the log.)

Adding these torn bits would allow postgres at least to write to
the log just the 512-byte sectors that have changed, rather than
the entire 8 KB page.

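The torn-bit scheme described in the extract can be sketched in a few
lines. This is only an illustration under stated assumptions: it uses one
flag byte at the end of each 512-byte sector (the book says only "a bit"),
it overwrites that payload byte instead of saving it elsewhere, and the
names stamp_page and is_torn are hypothetical. It is not SQL Server's
actual on-disk format.

```python
PAGE_SIZE = 8192
SECTOR_SIZE = 512
SECTORS = PAGE_SIZE // SECTOR_SIZE  # 16 sectors per 8 KB page


def stamp_page(page: bytes, generation: int) -> bytes:
    """Return a copy of the page whose per-sector flag byte holds the low
    bit of `generation`, which the caller flips on every write."""
    assert len(page) == PAGE_SIZE
    out = bytearray(page)
    flag = generation & 1
    for s in range(SECTORS):
        # Assumption: the flag lives in the last byte of each sector.
        out[(s + 1) * SECTOR_SIZE - 1] = flag
    return bytes(out)


def is_torn(page: bytes) -> bool:
    """A page is torn if its sectors disagree about the generation bit."""
    flags = {page[(s + 1) * SECTOR_SIZE - 1] for s in range(SECTORS)}
    return len(flags) > 1


old = stamp_page(bytes(PAGE_SIZE), generation=0)
new = stamp_page(b"\x07" * PAGE_SIZE, generation=1)
# Simulate a crash after only the first 5 sectors of the new image hit disk.
torn = new[: 5 * SECTOR_SIZE] + old[5 * SECTOR_SIZE:]
```

Note that this only detects the tear. As the extract says, repairing the
page still needs a good copy from somewhere else, such as a backup or a
full page image in the log.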
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

From pgsql-hackers-owner+M24100@postgresql.org Mon Jun 24 13:13:41 2002
Return-path: <pgsql-hackers-owner+M24100@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OHDeF08564
    for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 13:13:40 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP
    id 05602475CBE; Mon, 24 Jun 2002 13:11:10 -0400 (EDT)
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 13:11:10 2002
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by postgresql.org (Postfix) with SMTP
    id 929A247633B; Mon, 24 Jun 2002 09:26:54 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP id 962C147631A
    for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:31:43 -0400 (EDT)
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:31:43 2002
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
    by postgresql.org (Postfix) with ESMTP id C112D475C3C
    for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:35:20 -0400 (EDT)
Received: (from pgman@localhost)
    by candle.pha.pa.us (8.11.6/8.10.1) id g5NJYtL07449;
    Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200206231934.g5NJYtL07449@candle.pha.pa.us>
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain>
To: "J. R. Nield" <jrnield@usol.com>
Date: Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
cc: Tom Lane <tgl@sss.pgh.pa.us>, Curt Sampson <cjs@cynic.net>,
    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Status: No, hits=-3.4 required=5.0
    tests=IN_REP_TO
    version=2.30
Status: RO

J. R. Nield wrote:
> So since we have all this buffering designed especially to meet our
> needs, and since the OS buffering is in the way, can someone explain to
> me why postgresql would ever open a file without the O_DSYNC flag if the
> platform supports it?

We sync only WAL, not the other pages, except for the sync() call we do
during checkpoint when we discard old WAL files.

> > I concur with Bruce: the reason we keep page images in WAL is to
> > minimize the number of places we have to fsync, and thus the amount of
> > head movement required for a commit. Putting the page images elsewhere
> > cannot be a win AFAICS.
>
>
> Why not put all the page images in a single pre-allocated file and treat
> it as a ring? How could this be any worse than flushing them in the WAL
> log?
>
> Maybe fsync would be slower with two files, but I don't see how
> fdatasync would be, and most platforms support that.

We have an fdatasync option for WAL in postgresql.conf.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002
Return-path: <pgsql-hackers-owner+M24091@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208
    for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 12:54:22 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP
    id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT)
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 09:48:51 2002
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by postgresql.org (Postfix) with SMTP
    id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP id 2769E4762E3
    for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:27:39 -0400 (EDT)
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:27:39 2002
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
    by postgresql.org (Postfix) with ESMTP id ED459475C61
    for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:37:08 -0400 (EDT)
Received: (from pgman@localhost)
    by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642;
    Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us>
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
In-Reply-To: <Pine.NEB.4.43.0206240307550.511-100000@angelic.cynic.net>
To: Curt Sampson <cjs@cynic.net>
Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
cc: "J. R. Nield" <jrnield@usol.com>, Tom Lane <tgl@sss.pgh.pa.us>,
    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Status: No, hits=-3.4 required=5.0
    tests=IN_REP_TO
    version=2.30
Status: RO

Curt Sampson wrote:
> On 23 Jun 2002, J. R. Nield wrote:
>
> > So since we have all this buffering designed especially to meet our
> > needs, and since the OS buffering is in the way, can someone explain to
> > me why postgresql would ever open a file without the O_DSYNC flag if the
> > platform supports it?
>
> It's more code, if there are platforms out there that don't support
> O_DSYNC. (We still have to keep the old fsync code.) On the other hand,
> O_DSYNC could save us a disk arm movement over fsync() because it
> appears to me that fsync is also going to force a metadata update, which
> means that the inode blocks have to be written as well.

Again, see postgresql.conf:

    #wal_sync_method = fsync        # the default varies across platforms:
    #                               # fsync, fdatasync, open_sync, or open_datasync

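The four values in that comment correspond to two ways of forcing a write
to disk: calling a sync function after write(), or opening the file with a
flag that makes each write() synchronous. A minimal sketch of the
difference follows. The function durable_write and the file name are
hypothetical, and the mapping of setting names to system calls is written
from the names themselves, not taken from the postgres source.

```python
import os
import tempfile


def durable_write(path: str, data: bytes, method: str = "fsync") -> None:
    """Append data to path and force it to disk using the named method."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    if method == "open_sync":
        # Every write() waits for data and metadata (0 if unsupported).
        flags |= getattr(os, "O_SYNC", 0)
    elif method == "open_datasync":
        # Every write() waits for data only (0 if unsupported).
        flags |= getattr(os, "O_DSYNC", 0)
    elif method not in ("fsync", "fdatasync"):
        raise ValueError(f"unknown sync method: {method}")
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, data)  # small record; short writes are ignored here
        if method == "fsync":
            os.fsync(fd)  # flushes data and file metadata
        elif method == "fdatasync":
            # Flushes data only; fall back to fsync where it is missing.
            getattr(os, "fdatasync", os.fsync)(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "wal.log")
    for m in ("fsync", "fdatasync", "open_sync", "open_datasync"):
        durable_write(log, b"record\n", m)
    size = os.path.getsize(log)
```

Which method is fastest depends on the platform and filesystem, which is
presumably why the default is described as varying.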
>
> > Maybe fsync would be slower with two files, but I don't see how
> > fdatasync would be, and most platforms support that.
>
> Because, if both files are on the same disk, you still have to move
> the disk arm from the cylinder at the current log file write point
> to the cylinder at the current ping-pong file write point. And then back
> again to the log file write point cylinder.
>
> In the end, having a ping-pong file as well seems to me unnecessary
> complexity, especially when anyone interested in really good
> performance is going to buy a disk subsystem that guarantees no
> torn pages and thus will want to turn off the ping-pong file writes
> entirely, anyway.

Yes, I don't see writing to two files vs. one to be any win, especially
when we need to fsync both of them. What I would really like is to
avoid the double I/O of writing to WAL and to the data file; improving
that would be a huge win.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From cjs@cynic.net Sun Jun 23 23:40:59 2002
Return-path: <cjs@cynic.net>
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O3evF17903
    for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 23:40:58 -0400 (EDT)
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
    by academic.cynic.net (Postfix) with ESMTP
    id 37F36F820; Mon, 24 Jun 2002 03:40:54 +0000 (UTC)
Date: Mon, 24 Jun 2002 12:40:51 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: "J. R. Nield" <jrnield@usol.com>
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain>
Message-ID: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO

On 23 Jun 2002, J. R. Nield wrote:

> It is impossible to do what you want. You can not protect against
> partial writes without writing pages twice and calling fdatasync
> between them while going through a generic filesystem.

I agree with this.

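The "write the page twice with a sync in between" idea can be sketched as
follows. This is an illustration only, assuming a single scratch file that
holds one page image at a time. The names careful_write and recover are
hypothetical, and a real implementation would also checksum the scratch
copy so that a tear in the scratch file itself can be recognized.

```python
import os
import tempfile

PAGE_SIZE = 8192


def _sync(fd: int) -> None:
    # Prefer fdatasync; fall back to fsync where it is unavailable.
    getattr(os, "fdatasync", os.fsync)(fd)


def careful_write(data_path: str, scratch_path: str,
                  page_no: int, page: bytes) -> None:
    """Write a page twice: first to a scratch file, then in place."""
    assert len(page) == PAGE_SIZE
    header = page_no.to_bytes(8, "big")
    # 1. Save the new image, tagged with its destination, and sync it.
    fd = os.open(scratch_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, header + page)
        _sync(fd)
    finally:
        os.close(fd)
    # 2. Only now overwrite the real page. A tear at this point can be
    #    repaired from the scratch copy.
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.lseek(fd, page_no * PAGE_SIZE, os.SEEK_SET)
        os.write(fd, page)
        _sync(fd)
    finally:
        os.close(fd)


def recover(data_path: str, scratch_path: str) -> None:
    """After a crash, re-apply the scratch copy over the data file."""
    if not os.path.exists(scratch_path):
        return
    with open(scratch_path, "rb") as f:
        blob = f.read()
    if len(blob) != 8 + PAGE_SIZE:
        return  # scratch copy incomplete, so step 2 never started
    page_no = int.from_bytes(blob[:8], "big")
    with open(data_path, "r+b") as f:
        f.seek(page_no * PAGE_SIZE)
        f.write(blob[8:])
```

The cost is visible in the code: two writes and two syncs per page, at two
different places on the disk.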
> The best disk array will not protect you if the operating system does
> not align block writes to the structure of the underlying device.

This I don't quite understand. Assuming you're using a SCSI drive
(and this mostly applies to ATAPI/IDE, too), you can do naught but
align block writes to the structure of the underlying device. When you
initiate a SCSI WRITE command, you start by telling the device at which
block to start writing and how many blocks you intend to write. Then you
start passing the data.

(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter
details for the SCSI WRITE(10) command. You may find the SCSI 2
specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful
reference here.)

> Even with raw devices, you need special support or knowledge of the
> operating system and/or the disk device to ensure that each write
> request will be atomic to the underlying hardware.

Well, so here I guess you're talking about two things:

    1. When you request, say, an 8K block write, will the OS really
    write it to disk in a single 8K or multiple of 8K SCSI write
    command?

    2. Does the SCSI device you're writing to consider these writes to
    be transactional? That is, if the write is interrupted before being
    completed, does the SCSI device guarantee that the partially-sent
    data is not written, and the old data is maintained? And of course,
    does it guarantee that, when it acknowledges a write, that write is
    now in stable storage and will never go away?

Neither of these is hard to guarantee, actually. For a BSD-based OS,
for example, just make sure that your filesystem block size is the
same as or a multiple of the database block size. BSD will never write
anything other than a block or a sequence of blocks to a disk in a
single SCSI transaction (unless you've got a really odd SCSI driver).
And for your disk, buy a Baydel or Clarion disk array, or something
similar.

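The block-size condition in the last paragraph is simple arithmetic and
can be checked mechanically. A small sketch, assuming an 8 KB database
block: the names fs_block_fits and check_path are hypothetical, and
whether the f_bsize value reported by statvfs is the size the OS actually
uses for a single disk transfer depends on the filesystem.

```python
import os

DB_BLOCK_SIZE = 8192  # assumed database block size


def fs_block_fits(fs_block_size: int,
                  db_block_size: int = DB_BLOCK_SIZE) -> bool:
    """True if the filesystem block size is the same as, or a multiple
    of, the database block size."""
    return (fs_block_size >= db_block_size
            and fs_block_size % db_block_size == 0)


def check_path(path: str) -> bool:
    """Apply the check to the filesystem holding path (Unix only)."""
    st = os.statvfs(path)
    return fs_block_fits(st.f_bsize)
```

For example, fs_block_fits(16384) is true, while fs_block_fits(4096) is
false: an 8 KB database block would then span two filesystem blocks.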
Given that it's not hard to set up a system that meets these criteria,
and this is in fact commonly done for database servers, it would seem a
good idea for postgres to have the option to take advantage of the time
and money spent and adjust its performance upward appropriately.

> All other systems rely on the fact that you can recover a damaged file
> using the log archive.

Not exactly. For MS SQL Server, at any rate, if it detects a page tear
you cannot restore based on the log file alone. You need a full or
partial backup that includes that entire torn block.

> This means downtime in the rare case, but no data loss. Until
> PostgreSQL can do this, then it will not be acceptable for real
> critical production use.

It seems to me that it is doing this right now. In fact, it's more
reliable than some commercial systems (such as SQL Server) because it can
recover from a torn block with just the logfile.

> But at the end of the day, unless you have complete understanding of
> the I/O system from write(2) through to the disk system, the only sure
> ways to protect against partial writes are by "careful writes" (in
> the WAL log or elsewhere, writing pages twice), or by requiring (and
> allowing) users to do log-replay recovery when a file is corrupted by
> a partial write.

I don't understand how, without a copy of the old data that was in the
torn block, you can restore that block from just log file entries. Can
you explain this to me? Take, as an example, a block with ten tuples,
only one of which has been changed "recently." (I.e., only that change
is in the log files.)

> If we log pages to WAL, they are useless when archived (after a
> checkpoint). So either we have a separate "log" for them (the
> ping-pong file), or we should at least remove them when archived,
> which makes log archiving more complex but is perfectly doable.

Right. That seems to me a better option, since we've now got only one
write point on the disk rather than two.

> Finally, I would love to hear why we are using the operating system
> buffer manager at all. The OS is acting as a secondary buffer manager
> for us. Why is that? What flaw in our I/O system does this reveal?

It's acting as a "second-level" buffer manager, yes, but to say it's
"secondary" may be a bit misleading. On most of the systems I've set
up, the OS buffer cache is doing the vast majority of the work, and the
postgres buffering is fairly minimal.

There are some good (and some perhaps not-so-good) reasons to do it this
way. I'll list them more or less in the order of best to worst:

    1. The OS knows where the blocks physically reside on disk, and
    postgres does not. Therefore it's in the interest of postgresql to
    dispatch write responsibility back to the OS as quickly as possible
    so that the OS can prioritize requests appropriately. Most operating
    systems use an "elevator" algorithm to minimize disk head movement;
    but if the OS does not have a block that it could write while the
    head is "on the way" to another request, it can't write it in that
    head pass.

    2. Postgres does not know about any "bank-switching" tricks for
    mapping more physical memory than it has address space. Thus, on
    32-bit machines, postgres might be limited to mapping 2 or 3 GB of
    memory, even though the machine has, say, 6 GB of physical RAM. The
    OS can use all of the available memory for caching; postgres cannot.

    3. A lot of work has been put into the seek algorithms, read-ahead
    algorithms, block allocation algorithms, etc. in the OS. Why
    duplicate all that work again in postgres?

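The elevator ordering mentioned in point 1 can be sketched as follows.
This is a simplified SCAN pass over block numbers, not the scheduler of
any particular OS, and the function names are hypothetical.

```python
def elevator_order(pending, head, direction=1):
    """Order pending block numbers the way a simple elevator pass would
    serve them: sweep in the current direction, then reverse."""
    ahead = sorted(b for b in pending if (b - head) * direction >= 0)
    behind = sorted(b for b in pending if (b - head) * direction < 0)
    if direction > 0:
        return ahead + behind[::-1]
    return ahead[::-1] + behind


def travel(order, head):
    """Total head movement needed to serve blocks in the given order."""
    total = 0
    for block in order:
        total += abs(block - head)
        head = block
    return total


requests = [10, 95, 52, 70, 30]
swept = elevator_order(requests, head=50)
fifo_cost = travel(requests, head=50)
swept_cost = travel(swept, head=50)
```

With the head at block 50 and moving upward, the sweep serves 52, 70 and
95 on the way up and 30 and 10 on the way back, for 130 blocks of travel
against 226 in arrival order. A block that has not yet been handed to the
OS when the head passes it has to wait for the next pass, which is the
argument for dispatching writes early.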
When you say things like the following:

> We should only be writing blocks when they need to be on disk. We
> should not be expecting the OS to write them "sometime later" and
> avoid blocking (as long) for the write. If we need that, then our
> buffer management is wrong and we need to fix it.

you appear to be making the argument that we should take the route of
other database systems, and use raw devices and our own management of
disk block allocation. If so, you might want first to look back through
the archives at the discussion I and several others had about this a
month or two ago. After looking in detail at what NetBSD, at least, does
in terms of its disk I/O algorithms and buffering, I've pretty much come
around, at least for the moment, to the attitude that we should stick
with using the OS. I wouldn't mind seeing postgres be able to manage all
of this stuff, but it's a *lot* of work for not all that much benefit
that I can see.

> The ORACLE people were not kidding when they said that they could not
> certify Linux for production use until it supported O_DSYNC. Can you
> explain why that was the case?

I'm suspecting it's because Linux at the time had no raw devices, so
O_DSYNC was the only other possible method of making sure that disk
writes actually got to disk.

You certainly don't want to use O_DSYNC if you can use another method,
because O_DSYNC still goes through the operating system's buffer
cache, wasting memory and double-caching things. If you're doing your
own management, you need either to use a raw device or open files with
the flag that indicates that the buffer cache should not be used at all
for reads from and writes to that file.

> However, this discussion and a search of the pgsql-hackers archives
> reveals this problem to be the KEY area of PostgreSQL's failing, and
> general misunderstanding, when compared to its commercial competitors.

No, I think it's just that you're under a few minor misapprehensions
here about what postgres and the OS are actually doing. As I said, I
went through this whole exact argument a month or two ago, on this very
list, and I came around to the idea that what postgres is doing now
works quite well, at least on NetBSD. (Most other OSes have disk I/O
algorithms that are pretty much as good or better.) There might be a
very slight advantage to doing all one's own I/O management, but it's
a huge amount of work, and I think that much effort could be much more
usefully applied to other areas.

Just as a side note, I've been a NetBSD developer since about '96,
and have been delving into the details of OS design since well before
that time, so I'm coming to this with what I hope is reasonably good
knowledge of how disks work and how operating systems use them. (Not
that this should stop you from pointing out holes in my arguments. :-))

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

From pgsql-hackers-owner+M24112@postgresql.org Mon Jun 24 18:16:36 2002
Return-path: <pgsql-hackers-owner+M24112@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OMGaF00910
    for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 18:16:36 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP
    id A2EF1476475; Mon, 24 Jun 2002 16:43:38 -0400 (EDT)
Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 16:43:38 2002
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by postgresql.org (Postfix) with SMTP
    id BA57D476148; Mon, 24 Jun 2002 14:14:00 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP id 93D6A477214
    for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 13:59:17 -0400 (EDT)
Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 13:59:17 2002
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
    by postgresql.org (Postfix) with ESMTP id D70AA476401
    for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 10:06:26 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
    by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OE6J117666;
    Mon, 24 Jun 2002 10:06:19 -0400 (EDT)
To: Curt Sampson <cjs@cynic.net>
cc: Bruce Momjian <pgman@candle.pha.pa.us>, "J. R. Nield" <jrnield@usol.com>,
    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
In-Reply-To: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
References: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
    message dated "Mon, 24 Jun 2002 09:09:30 +0900"
Date: Mon, 24 Jun 2002 10:06:19 -0400
Message-ID: <17663.1024927579@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Status: No, hits=-5.3 required=5.0
    tests=IN_REP_TO,X_NOT_PRESENT
    version=2.30
Status: RO

> On Sun, 23 Jun 2002, Bruce Momjian wrote:
>> Yes, I don't see writing to two files vs. one to be any win, especially
>> when we need to fsync both of them. What I would really like is to
>> avoid the double I/O of writing to WAL and to the data file; improving
>> that would be a huge win.

I don't believe it's possible to eliminate the double I/O. Keep in mind
though that in the ideal case (plenty of shared buffers) you are only
paying two writes per modified block per checkpoint interval --- one to
the WAL during the first write of the interval, and then a write to the
real datafile issued by the checkpoint process. Anything that requires
transaction commits to write data blocks will likely result in more I/O
not less, at least for blocks that are modified by several successive
transactions.

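The "two writes per modified block per checkpoint interval" accounting
can be made concrete with a small counting model. It is a sketch under the
ideal-case assumption stated above (enough shared buffers that dirty
blocks are written only at checkpoints). It counts page-sized writes only
and ignores the small per-change WAL records. The function name and the
way checkpoints are triggered (every N modifications) are hypothetical.

```python
def count_page_writes(modifications, checkpoint_every):
    """Count page-sized writes for a stream of modified block numbers.

    Returns (wal_page_images, datafile_writes).
    """
    wal_images = 0
    datafile_writes = 0
    dirty = set()
    for i, block in enumerate(modifications, start=1):
        if block not in dirty:
            # First change to this block since the last checkpoint:
            # the whole page image goes to WAL.
            wal_images += 1
            dirty.add(block)
        if i % checkpoint_every == 0:
            # The checkpoint writes each dirty block to the datafile once.
            datafile_writes += len(dirty)
            dirty.clear()
    datafile_writes += len(dirty)  # final checkpoint for what is left
    return wal_images, datafile_writes


# One hot block modified by 100 successive transactions, with a
# checkpoint after every 50 of them.
hot = count_page_writes([7] * 100, checkpoint_every=50)
```

Here the hot block costs 2 WAL page images and 2 datafile writes, 4 page
writes in all, where writing the data block at every commit would cost 100.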
The only thing I've been able to think of that seems like it might
improve matters is to make the WAL writing logic aware of the layout
of buffer pages --- specifically, to know that our pages generally
contain an uninteresting "hole" in the middle, and not write the hole.
Optimistically this might reduce the WAL data volume by something
approaching 50%; though pessimistically (if most pages are near full)
it wouldn't help much.

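A sketch of what "not writing the hole" could look like. It assumes the
page records where its free space starts and ends, and that the bytes in
between carry no information, so they can be restored as zeros. The
offsets are passed in as plain parameters `lower` and `upper`, and the
4-byte prefix format is made up for the example.

```python
PAGE_SIZE = 8192


def pack_page(page: bytes, lower: int, upper: int) -> bytes:
    """Drop the unused bytes in [lower, upper) and prefix both offsets."""
    assert len(page) == PAGE_SIZE
    assert 0 <= lower <= upper <= PAGE_SIZE
    prefix = lower.to_bytes(2, "big") + upper.to_bytes(2, "big")
    return prefix + page[:lower] + page[upper:]


def unpack_page(packed: bytes) -> bytes:
    """Rebuild the full page, filling the hole with zero bytes."""
    lower = int.from_bytes(packed[:2], "big")
    upper = int.from_bytes(packed[2:4], "big")
    body = packed[4:]
    return body[:lower] + bytes(upper - lower) + body[lower:]


# A page with 200 bytes at the front, 3000 bytes of data at the end,
# and an empty region between them.
page = b"H" * 200 + bytes(PAGE_SIZE - 3200) + b"T" * 3000
packed = pack_page(page, lower=200, upper=PAGE_SIZE - 3000)
```

For this page the logged image shrinks from 8192 to 3204 bytes. For a
nearly full page, where `lower` and `upper` are close together, the saving
approaches zero, which matches the pessimistic case above.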
This was not very feasible when the WAL code was designed because the
buffer manager needed to cope with both normal pages and pg_log pages,
but as of 7.2 I think it'd be safe to assume that all pages have the
standard layout.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

From pgsql-hackers-owner+M24116@postgresql.org Mon Jun 24 20:32:07 2002
Return-path: <pgsql-hackers-owner+M24116@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0W7F10985
    for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 20:32:07 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP
    id EBCE547632E; Mon, 24 Jun 2002 18:54:34 -0400 (EDT)
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 18:54:34 2002
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by postgresql.org (Postfix) with SMTP
    id 3EB93476D85; Mon, 24 Jun 2002 17:12:18 -0400 (EDT)
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
    by localhost (Postfix) with ESMTP id EBC20476E2E
    for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 14:54:40 -0400 (EDT)
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 14:54:40 2002
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
    by postgresql.org (Postfix) with ESMTP id 1C8874760C2
    for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 12:40:53 -0400 (EDT)
Received: (from pgman@localhost)
    by candle.pha.pa.us (8.11.6/8.10.1) id g5OGeVY06116;
    Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
From: Bruce Momjian <pgman@candle.pha.pa.us>
Message-ID: <200206241640.g5OGeVY06116@candle.pha.pa.us>
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
In-Reply-To: <17663.1024927579@sss.pgh.pa.us>
To: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
cc: Curt Sampson <cjs@cynic.net>, "J. R. Nield" <jrnield@usol.com>,
    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Status: No, hits=-3.4 required=5.0
    tests=IN_REP_TO
    version=2.30
Status: RO

Tom Lane wrote:
> > On Sun, 23 Jun 2002, Bruce Momjian wrote:
> >> Yes, I don't see writing to two files vs. one to be any win, especially
> >> when we need to fsync both of them. What I would really like is to
> >> avoid the double I/O of writing to WAL and to the data file; improving
> >> that would be a huge win.
>
> I don't believe it's possible to eliminate the double I/O. Keep in mind
> though that in the ideal case (plenty of shared buffers) you are only
> paying two writes per modified block per checkpoint interval --- one to
> the WAL during the first write of the interval, and then a write to the
> real datafile issued by the checkpoint process. Anything that requires
> transaction commits to write data blocks will likely result in more I/O
> not less, at least for blocks that are modified by several successive
> transactions.
>
> The only thing I've been able to think of that seems like it might
> improve matters is to make the WAL writing logic aware of the layout
> of buffer pages --- specifically, to know that our pages generally
> contain an uninteresting "hole" in the middle, and not write the hole.
> Optimistically this might reduce the WAL data volume by something
> approaching 50%; though pessimistically (if most pages are near full)
> it wouldn't help much.

Good idea. How about putting the page through our TOAST compression
routine before writing it to WAL? Should be pretty easy and fast and
doesn't require any knowledge of the page format.

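The compression alternative can be sketched in the same spirit. The
sketch uses zlib from the Python standard library purely as a stand-in;
it is not the TOAST compression routine, and the sizes it produces say
nothing about what that routine would achieve on real pages.

```python
import zlib

PAGE_SIZE = 8192


def compress_page(page: bytes) -> bytes:
    """Compress a whole page image without looking at its layout."""
    assert len(page) == PAGE_SIZE
    return zlib.compress(page, 1)  # level 1: favour speed over ratio


def decompress_page(blob: bytes) -> bytes:
    """Recover the original page image."""
    return zlib.decompress(blob)


# A page that is mostly an empty, zero-filled region.
page = b"H" * 200 + bytes(PAGE_SIZE - 3200) + b"T" * 3000
blob = compress_page(page)
```

A zero-filled hole compresses to almost nothing, so this gets most of the
benefit of skipping the hole with no knowledge of the page format, at the
cost of CPU time on every logged page. It relies on the hole actually
being zeroed; stale bytes left in the free space would compress poorly.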
--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

From pgsql-hackers-owner+M24128@postgresql.org Mon Jun 24 22:01:58 2002
|
|
Return-path: <pgsql-hackers-owner+M24128@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P21vF19918
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:01:57 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 540B8475B33; Mon, 24 Jun 2002 21:34:40 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 21:34:40 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 0A13F476965; Mon, 24 Jun 2002 19:30:14 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id B4F62476E4A
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 18:53:59 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 18:53:59 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 36043475BF6
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:25:28 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5OLPFG26140;
|
|
Mon, 24 Jun 2002 17:25:15 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206242125.g5OLPFG26140@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
Date: Mon, 24 Jun 2002 17:25:14 -0400 (EDT)
|
|
cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
J. R. Nield wrote:
|
|
> > This I don't quite understand. Assuming you're using a SCSI drive
|
|
> > (and this mostly applies to ATAPI/IDE, too), you can do naught but
|
|
> > align block writes to the structure of the underlying device. When you
|
|
> > initiate a SCSI WRITE command, you start by telling the device at which
|
|
> > block to start writing and how many blocks you intend to write. Then you
|
|
> > start passing the data.
|
|
> >
|
|
>
|
|
> All I'm saying is that the entire postgresql block write must be
|
|
> converted into exactly one SCSI write command in all cases, and I don't
|
|
> know a portable way to ensure this.
|
|
|
|
...
|
|
|
|
> I agree with this. My point was only that you need to know what
|
|
> guarantees your operating system/hardware combination provides on a
|
|
> case-by-case basis, and there is no standard way for a program to
|
|
> discover this. Most system administrators are not going to know this
|
|
> either, unless databases are their main responsibility.
|
|
|
|
Yes, agreed. <1% are going to know the answer to this question so we
|
|
have to assume worst case.
|
|
|
|
> > It seems to me that it is doing this right now. In fact, it's more
|
|
> > reliable than some commercial systems (such as SQL Server) because it can
|
|
> > recover from a torn block with just the logfile.
|
|
>
|
|
> Again, what I meant to say is that the commercial systems can recover
|
|
> with an old file backup + logs. How old the backup can be depends only
|
|
> on how much time you are willing to spend playing the logs forward. So
|
|
> if you do a full backup once a week, and multiplex and backup the logs,
|
|
> then even if a backup tape gets destroyed you can still survive. It just
|
|
> takes longer.
|
|
>
|
|
> Also, PostgreSQL can't recover from any other type of block corruption,
|
|
> while the commercial systems can. That's what I meant by the "critical
|
|
> production use" comment, which was sort-of unfair.
|
|
>
|
|
> So I would say they are equally reliable for torn pages (but not bad
|
|
> blocks), and the commercial systems let you trade potential recovery
|
|
> time for not having to write the blocks twice. You do need to back-up
|
|
> the log archives though.
|
|
|
|
Yes, good tradeoff analysis. We recover from partial writes quicker,
|
|
and don't require saving of log files, _but_ we don't recover from bad
|
|
disk blocks. Good summary.
|
|
|
|
> I'll back off on that. I don't know if we want to use the OS buffer
|
|
> manager, but shouldn't we try to have our buffer manager group writes
|
|
> together by files, and pro-actively get them out to disk? Right now, it
|
|
> looks like all our write requests are delayed as long as possible and
|
|
> the order in which they are written is pretty-much random, as is the
|
|
> backend that writes the block, so there is no locality of reference even
|
|
> when the blocks are adjacent on disk, and the write calls are spread-out
|
|
> over all the backends.
|
|
>
|
|
> Would it not be the case that things like read-ahead, grouping writes,
|
|
> and caching written data are probably best done by PostgreSQL, because
|
|
> only our buffer manager can understand when they will be useful or when
|
|
> they will thrash the cache?
|
|
|
|
The OS should handle all of this. We are doing main table writes but no
|
|
sync until checkpoint, so the OS can keep those blocks around and write
|
|
them at its convenience. It knows the size of the buffer cache and when
|
|
stuff is forced to disk. We can't second-guess that.
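The write pattern described here (plain writes during normal operation, a sync only at checkpoint) can be sketched roughly as below. This is a toy illustration, not PostgreSQL's buffer manager: the names write_block, checkpoint and demo are hypothetical, and a temporary file stands in for a table file.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default page size

def write_block(fd, blockno, data):
    """Write one block; it lands in the OS cache, with no sync requested."""
    assert len(data) == BLOCK_SIZE
    os.lseek(fd, blockno * BLOCK_SIZE, os.SEEK_SET)
    os.write(fd, data)

def checkpoint(fd):
    """Ask the OS to force everything written so far to stable storage."""
    os.fsync(fd)

def demo():
    """Write a few blocks in arbitrary order, sync once, return the file size."""
    fd, path = tempfile.mkstemp()
    try:
        for blockno in (3, 0, 1):  # arbitrary order, as separate backends might produce
            write_block(fd, blockno, bytes([blockno]) * BLOCK_SIZE)
        checkpoint(fd)  # the only point at which durability is demanded
        return os.path.getsize(path)
    finally:
        os.close(fd)
        os.remove(path)
```

Between the writes and the fsync, the kernel is free to reorder and group the dirty blocks as it sees fit, which is the behaviour being relied on above.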
|
|
|
|
> I may likely be wrong on this, and I haven't done any performance
|
|
> testing. I shouldn't have brought this up alongside the logging issues,
|
|
> but there seemed to be some question about whether the OS was actually
|
|
> doing all these things behind the scene.
|
|
|
|
It had better. Looking at the kernel source is the way to know.
|
|
|
|
> Does anyone know what the major barriers to infinite log replay are in
|
|
> PostgreSQL? I'm trying to look for everything that might need to be
|
|
> changed outside xlog.c, but surely this has come up before. Searching
|
|
> the archives hasn't revealed much.
|
|
|
|
This has been brought up. Could we just save WAL files and get replay?
|
|
I believe some things have to be added to WAL to allow this, but it
|
|
seems possible. However, the pg_dump is just a data dump and does not
|
|
have the file offsets and things. Somehow you would need a tar-type
|
|
backup of the database, and with a running db, it is hard to get a valid
|
|
snapshot of that.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
|
|
|
|
|
|
From tgl@sss.pgh.pa.us Mon Jun 24 17:31:57 2002
|
|
Return-path: <tgl@sss.pgh.pa.us>
|
|
Received: from sss.pgh.pa.us (root@[192.204.191.242])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLVuF26684
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
|
|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
|
|
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLVu121485;
|
|
Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <200206242125.g5OLPFG26140@candle.pha.pa.us>
|
|
References: <200206242125.g5OLPFG26140@candle.pha.pa.us>
|
|
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
|
|
message dated "Mon, 24 Jun 2002 17:25:14 -0400"
|
|
Date: Mon, 24 Jun 2002 17:31:56 -0400
|
|
Message-ID: <21482.1024954316@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Status: ROr
|
|
|
|
Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
>> Does anyone know what the major barriers to infinite log replay are in
|
|
>> PostgreSQL? I'm trying to look for everything that might need to be
|
|
>> changed outside xlog.c, but surely this has come up before. Searching
|
|
>> the archives hasn't revealed much.
|
|
|
|
> This has been brought up. Could we just save WAL files and get replay?
|
|
> I believe some things have to be added to WAL to allow this, but it
|
|
> seems possible.
|
|
|
|
The Red Hat group has been looking at this somewhat; so far there seem
|
|
to be some minor tweaks that would be needed, but no showstoppers.
|
|
|
|
> Somehow you would need a tar-type
|
|
> backup of the database, and with a running db, it is hard to get a valid
|
|
> snapshot of that.
|
|
|
|
But you don't *need* a "valid snapshot", only a correct copy of
|
|
every block older than the first checkpoint in your WAL log series.
|
|
Any inconsistencies in your tar dump will look like repairable damage;
|
|
replaying the WAL log will fix 'em.
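A toy model of why an inconsistent file-level copy can still be repaired, assuming (as in the torn-page discussion earlier in this thread) that the first change to a page after a checkpoint logs a full image of that page. The record layout and the names replay, backup and wal are hypothetical and are not PostgreSQL's actual WAL format.

```python
def replay(backup_pages, wal_records):
    """Apply WAL records, in order, on top of a possibly inconsistent backup."""
    pages = dict(backup_pages)  # leave the caller's backup untouched
    for rec in wal_records:
        if rec["kind"] == "full_page":
            # A full page image overwrites whatever the backup captured,
            # torn or not.
            pages[rec["page"]] = rec["image"]
        elif rec["kind"] == "append":
            # An incremental change, valid because the page is now known-good.
            pages[rec["page"]] += rec["data"]
        else:
            raise ValueError("unknown record kind: %r" % rec["kind"])
    return pages

# Page 0 was not touched after the checkpoint, so its copy must be correct.
# Page 1 was being rewritten while tar copied it, so the copy is garbage.
backup = {0: "stable", 1: "to<torn>"}
wal = [
    {"kind": "full_page", "page": 1, "image": "v1"},
    {"kind": "append", "page": 1, "data": "+x"},
]
restored = replay(backup, wal)
```

In this model only pages that are never modified after the checkpoint need to be copied correctly; everything else is overwritten during replay.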
|
|
|
|
regards, tom lane
|
|
|
|
From pgsql-hackers-owner+M24133@postgresql.org Mon Jun 24 22:19:55 2002
|
|
Return-path: <pgsql-hackers-owner+M24133@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P2JsF21543
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:19:54 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 42391476E53; Mon, 24 Jun 2002 22:09:49 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 22:09:49 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 191654774EB; Mon, 24 Jun 2002 20:26:08 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 8EB90476101
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 19:43:19 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 19:43:19 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 08018476931
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:33:53 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5OLXhl26908;
|
|
Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206242133.g5OLXhl26908@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <21482.1024954316@sss.pgh.pa.us>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Date: Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
Tom Lane wrote:
|
|
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
> >> Does anyone know what the major barriers to infinite log replay are in
|
|
> >> PostgreSQL? I'm trying to look for everything that might need to be
|
|
> >> changed outside xlog.c, but surely this has come up before. Searching
|
|
> >> the archives hasn't revealed much.
|
|
>
|
|
> > This has been brought up. Could we just save WAL files and get replay?
|
|
> > I believe some things have to be added to WAL to allow this, but it
|
|
> > seems possible.
|
|
>
|
|
> The Red Hat group has been looking at this somewhat; so far there seem
|
|
> to be some minor tweaks that would be needed, but no showstoppers.
|
|
|
|
|
|
Good.
|
|
|
|
> > Somehow you would need a tar-type
|
|
> > backup of the database, and with a running db, it is hard to get a valid
|
|
> > snapshot of that.
|
|
>
|
|
> But you don't *need* a "valid snapshot", only a correct copy of
|
|
> every block older than the first checkpoint in your WAL log series.
|
|
> Any inconsistencies in your tar dump will look like repairable damage;
|
|
> replaying the WAL log will fix 'em.
|
|
|
|
Yes, my point was that you need physical file backups, not pg_dump, and
|
|
you have to be tricky about the files changing during the backup. You
|
|
_can_ work around changes to the files during backup.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M24139@postgresql.org Tue Jun 25 00:00:22 2002
|
|
Return-path: <pgsql-hackers-owner+M24139@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P40LF00838
|
|
for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 00:00:21 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id CBAE8476E94; Mon, 24 Jun 2002 23:44:51 -0400 (EDT)
|
|
Mailbox-Line: From jrnield@usol.com Mon Jun 24 23:44:51 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id C5076476871; Mon, 24 Jun 2002 22:25:46 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 8DF57476979
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 22:08:31 -0400 (EDT)
|
|
Mailbox-Line: From jrnield@usol.com Mon Jun 24 22:08:31 2002
|
|
Received: from hades.usol.com (hades.usol.com [208.232.58.41])
|
|
by postgresql.org (Postfix) with ESMTP id 298D2476101
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 20:27:46 -0400 (EDT)
|
|
Received: from 08-159.024.popsite.net (08-159.024.popsite.net [66.19.4.159])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5P0RbV01261;
|
|
Mon, 24 Jun 2002 20:27:37 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
|
|
<1024951786.1793.865.camel@localhost.localdomain>
|
|
<21376.1024953361@sss.pgh.pa.us>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 24 Jun 2002 20:28:00 -0400
|
|
Message-ID: <1024964884.3031.876.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
On Mon, 2002-06-24 at 17:16, Tom Lane wrote:
|
|
|
|
> I think you have been missing the point...
|
|
Yes, this appears to be the case. Thanks especially to Curt for clearing
|
|
things up for me.
|
|
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
From cjs@cynic.net Tue Jun 25 01:09:02 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P591F07292
|
|
for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 01:09:01 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 517BEF820; Tue, 25 Jun 2002 05:09:02 +0000 (UTC)
|
|
Date: Tue, 25 Jun 2002 14:08:59 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: ROr
|
|
|
|
On Mon, 24 Jun 2002, Tom Lane wrote:
|
|
|
|
> There are a lot of other things we desperately need to spend time
|
|
> on that would not amount to re-engineering large quantities of OS-level
|
|
> code. Given that most Unixen have perfectly respectable disk management
|
|
> subsystems, we prefer to tune our code to make use of that stuff, rather
|
|
> than follow the "conventional wisdom" that databases need to bypass it.
|
|
> ...
|
|
> Oracle can afford to do that sort of thing because they have umpteen
|
|
> thousand developers available. Postgres does not.
|
|
|
|
Well, Oracle also started out, a long long time ago, on systems without
|
|
unified buffer cache and so on, and so they *had* to write this stuff
|
|
because otherwise data would not be cached. So Oracle can also afford to
|
|
maintain it now because the code already exists.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From pgsql-hackers-owner+M24154@postgresql.org Tue Jun 25 09:22:38 2002
|
|
Return-path: <pgsql-hackers-owner+M24154@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PDMbF03932
|
|
for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 09:22:37 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id C12C3475E4A; Tue, 25 Jun 2002 09:22:32 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 09:22:32 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 65471475C7A; Tue, 25 Jun 2002 09:22:23 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 97C8C475A7C
|
|
for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:20 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 09:22:20 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 42C0B475A64
|
|
for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:19 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5PDM5B03772;
|
|
Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206251322.g5PDM5B03772@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
Date: Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>, "J. R. Nield" <jrnield@usol.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
Curt Sampson wrote:
|
|
> On Mon, 24 Jun 2002, Tom Lane wrote:
|
|
>
|
|
> > There are a lot of other things we desperately need to spend time
|
|
> > on that would not amount to re-engineering large quantities of OS-level
|
|
> > code. Given that most Unixen have perfectly respectable disk management
|
|
> > subsystems, we prefer to tune our code to make use of that stuff, rather
|
|
> > than follow the "conventional wisdom" that databases need to bypass it.
|
|
> > ...
|
|
> > Oracle can afford to do that sort of thing because they have umpteen
|
|
> > thousand developers available. Postgres does not.
|
|
>
|
|
> Well, Oracle also started out, a long long time ago, on systems without
|
|
> unified buffer cache and so on, and so they *had* to write this stuff
|
|
> because otherwise data would not be cached. So Oracle can also afford to
|
|
> maintain it now because the code already exists.
|
|
|
|
Well, actually, it isn't unified buffer cache that is the issue, but
|
|
rather the older SysV file system had pretty poor performance so
|
|
bypassing it was a bigger win than it is today.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M31893@postgresql.org Fri Nov 15 11:25:58 2002
|
|
Return-path: <pgsql-hackers-owner+M31893@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id gAFHPvR10276
|
|
for <pgman@candle.pha.pa.us>; Fri, 15 Nov 2002 12:25:57 -0500 (EST)
|
|
Received: from localhost (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with ESMTP
|
|
id A2D5A4774A1; Fri, 15 Nov 2002 11:34:54 -0500 (EST)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 5E898477132; Fri, 15 Nov 2002 11:15:45 -0500 (EST)
|
|
Received: from localhost (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with ESMTP id 90CF1475B85
|
|
for <pgsql-hackers@postgresql.org>; Mon, 11 Nov 2002 15:33:47 -0500 (EST)
|
|
Received: from Curtis-Vaio (unknown [63.164.0.45])
|
|
by postgresql.org (Postfix) with SMTP id C6CB1475A3F
|
|
for <pgsql-hackers@postgresql.org>; Mon, 11 Nov 2002 15:33:46 -0500 (EST)
|
|
Received: from [127.0.0.1] by Curtis-Vaio
|
|
(ArGoSoft Mail Server Freeware, Version 1.8 (1.8.1.7)); Mon, 11 Nov 2002 16:33:42 -0400
|
|
From: "Curtis Faith" <curtis@galtcapital.com>
|
|
To: <pgsql-hackers@postgresql.org>
|
|
Subject: [HACKERS] 500 tpsQL + WAL log implementation
|
|
Date: Mon, 11 Nov 2002 16:33:41 -0400
|
|
Message-ID: <DMEEJMCDOJAKPPFACMPMCEBMCFAA.curtis@galtcapital.com>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain;
|
|
charset="iso-8859-1"
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Priority: 3 (Normal)
|
|
X-MSMail-Priority: Normal
|
|
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
|
|
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
|
|
Importance: Normal
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Status: ROr
|
|
|
|
I have been experimenting with empirical tests of file system and device
|
|
level writes to determine the actual constraints in order to speed up the WAL
|
|
logging code.
|
|
|
|
Using a raw file partition and a time-based technique for determining the
|
|
optimal write position, I am able to get 8K writes physically written to disk
|
|
synchronously in the range of 500 to 650 writes per second using FreeBSD raw
|
|
device partitions on IDE disks (with write cache disabled). I will be
|
|
testing it soon under Linux with 10,000 RPM SCSI which should be even better.
|
|
It is my belief that the mechanism used to achieve these speeds could be
|
|
incorporated into the existing WAL logging code as an abstraction that looks
|
|
to the WAL code just like the file level access currently used. The current
|
|
speeds are limited by the speed of a single disk rotation. For a 7,200 RPM
|
|
disk this is 120/second, for a 10,000 RPM disk this is 166.66/second.
|
|
|
|
The mechanism works by adjusting the seek offset of the write by using
|
|
gettimeofday to determine approximately where the disk head is in its
|
|
rotation. The mechanism does not use any AIO calls.
|
|
|
|
Assuming the following:
|
|
|
|
1) Disk rotation time is 8.333ms or 8333us (7200 RPM).
|
|
|
|
2) A write at offset 1,500K completes at system time 103s 000ms 000us
|
|
|
|
3) A new write is requested at system time 103s 004ms 166us
|
|
|
|
4) A 390K per rotation alignment of the data on the disk.
|
|
|
|
5) A write must be sent at least 20K ahead of the current head position to
|
|
ensure that it is written in less than one rotation.
|
|
|
|
It can be determined from the above that a write for an offset of something
|
|
slightly more than 195K past the last write, or offset 1,695K will be ahead
|
|
of the current location of the head and will therefore complete in less than
|
|
a single rotation's time.
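The arithmetic behind that estimate can be sketched as follows. This is only an illustration under the five assumptions listed above; the function and parameter names are hypothetical, and the model treats the disk as a simple linear layout of 390K per rotation, which real drives with zoned recording and track skew will not match exactly. With 4,166us elapsed (about half a rotation) the head has moved roughly 195K, to about offset 1,695K; if the 20K lead from assumption 5 is added on top, the earliest safe target lands near 1,715K.

```python
def rotation_time_us(rpm):
    """Time for one full platter rotation, in microseconds."""
    return 60_000_000 / rpm

def head_offset_kb(last_offset_kb, elapsed_us, rpm=7200, kb_per_rotation=390.0):
    """Estimate where the head is now, given where the last write completed."""
    fraction_of_rotation = elapsed_us / rotation_time_us(rpm)
    return last_offset_kb + kb_per_rotation * fraction_of_rotation

def earliest_write_offset_kb(last_offset_kb, elapsed_us, lead_kb=20.0, **disk):
    """Head estimate plus a safety lead, so the write is not just missed."""
    return head_offset_kb(last_offset_kb, elapsed_us, **disk) + lead_kb

# Last write completed at 103s 000ms 000us; new request at 103s 004ms 166us.
elapsed = 4_166
head = head_offset_kb(1_500, elapsed)
target = earliest_write_offset_kb(1_500, elapsed)
print(f"head near {head:.0f}K, earliest target near {target:.0f}K")

# The one-write-per-rotation ceilings quoted earlier, for comparison.
print(f"{7200 / 60:.2f} and {10_000 / 60:.2f} rotations per second")
```

The per-disk constants (rpm, kb_per_rotation, lead_kb) are exactly the metrics the tester program described below would have to measure.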
|
|
|
|
The disk specific metrics (rotation speed, bytes per rotation, base write
|
|
time, etc.) can be derived empirically through a tester program that would
|
|
take a few minutes to run and which could be run at log setup time.
|
|
|
|
The obvious problem with the above mechanism is that the WAL log needs to be
|
|
able to read from the log file in transaction order during recovery. This
|
|
could be provided for using an abstraction that prepends the logical order
|
|
for each block written to the disk and makes sure that the log blocks contain
|
|
either a valid logical order number or some other marker indicating that the
|
|
block is not being used.
|
|
|
|
A bitmap of blocks that have already been used would be kept in memory for
|
|
quickly determining the next set of possible unused blocks but this bitmap
|
|
would not need to be written to disk except during normal shutdown since in
|
|
the event of a failure the bitmaps would be reconstructed by reading all the
|
|
blocks from the disk.
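A minimal sketch of the recovery side of this idea, assuming each physical block carries either a logical order number or an "unused" marker. The in-memory representation and the name scan_log are hypothetical; a real implementation would read raw blocks from the device and would also have to validate each block.

```python
UNUSED = None  # marker for a block that holds no log data

def scan_log(blocks):
    """Rebuild the used-block bitmap and the logical replay order.

    blocks is a list of (order_number_or_UNUSED, payload) tuples in
    physical on-disk order.
    """
    used_bitmap = [order is not UNUSED for order, _ in blocks]
    # Sort the physical positions of used blocks by their logical order number.
    by_order = sorted(
        (order, pos) for pos, (order, _) in enumerate(blocks) if order is not UNUSED
    )
    replay_order = [blocks[pos][1] for _, pos in by_order]
    return used_bitmap, replay_order

# Blocks were written wherever the head happened to be, so the physical
# order differs from the transaction order.
on_disk = [(2, "rec-b"), (UNUSED, None), (1, "rec-a"), (3, "rec-c")]
bitmap, ordered = scan_log(on_disk)
```

The cost is a full scan of the log area at recovery time, which is the price of not writing the bitmap to disk during normal operation.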
|
|
|
|
Checkpointing and something akin to log rotation could be handled using this
|
|
mechanism as well.
|
|
|
|
So, MY REAL QUESTION is whether or not this is the sort of speed improvement
|
|
that warrants the work of writing the required abstraction layer and making
|
|
this very robust. The WAL code should remain essentially unchanged, with
|
|
perhaps new calls for the five or six routines used to access the log files,
|
|
and handle the equivalent of log rotation for raw device access. These new
|
|
calls would either use the current file based implementation or the new
|
|
logging mechanism depending on the configuration.
|
|
|
|
I anticipate that the extra work required for a PostgreSQL administrator to
|
|
use the proposed logging mechanism would be to:
|
|
|
|
1) Create a raw device partition of the appropriate size
|
|
2) Run the metrics tester for that device partition
|
|
3) Set the appropriate configuration parameters to indicate raw WAL logging
|
|
|
|
I anticipate that the additional space requirements for this system would be
|
|
on the order of 10% to 15% beyond the current file-based implementation's
|
|
requirements.
|
|
|
|
So, is this worth doing? Would a robust implementation likely be accepted for
|
|
7.4 assuming it can demonstrate speed improvements in the range of 500tps?
|
|
|
|
- Curtis
|
|
|
|
|
|
|
|
On Sun, 2005-04-10 at 21:12 -0400, Bruce Momjian wrote:
|
|
> Jim C. Nasby wrote:
|
|
> > Maybe better for -hackers, but here it goes anyway...
|
|
> >
|
|
> > Has anyone looked at compressing WAL's before writing to disk? On a
|
|
> > system generating a lot of WAL it seems there might be some gains to be
|
|
> > had if WAL data could be compressed before going to disk, since today's
|
|
> > machines are generally more I/O bound than CPU bound. And unlike the
|
|
> > base tables, you generally don't need to read the WAL, so you don't
|
|
> > really need to worry about not being able to quickly scan through the
|
|
> > data without decompressing it.
|
|
>
|
|
> I have never heard anyone talk about it, but it seems useful. I think
|
|
> compressing the page images written on first page modification since
|
|
> checkpoint would be a big win.
|
|
|
|
Well it was discussed 2-3 years ago as part of the PITR preamble. You
|
|
may be surprised to read that over...
|
|
|
|
A summary of thoughts to date on this:
|
|
|
|
xlog.c XLogInsert places backup blocks into the wal buffers before
|
|
insertion, so is the right place to do this. It would be possible to do
|
|
this before any LWlocks are taken, so would not necessarily impair
|
|
scalability.
|
|
|
|
Currently XLogInsert is a severe CPU bottleneck around the CRC
|
|
calculation, as identified recently by Tom. Digging further, the code
|
|
used seems to cause processor stalls on Intel CPUs, possibly responsible
|
|
for much of the CPU time. Discussions to move to a 32-bit CRC would also
|
|
be affected by this because of the byte-by-byte nature of the algorithm,
|
|
whatever the length of the generating polynomial. PostgreSQL's CRC
|
|
algorithm is the fastest BSD code available. Until improvement is made
|
|
there, I would not investigate compression further. Some input from
|
|
hardware tuning specialists is required...
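
To illustrate what "byte-by-byte" means here, the following is a minimal
sketch of a table-driven CRC. It assumes the common reflected CRC-32
polynomial 0xEDB88320 (the one zlib uses), which is not necessarily the
polynomial or width used in xlog.c; the function names are illustrative
only. The point is the loop: each step needs the CRC value produced by the
previous byte, so the work forms a serial dependency chain regardless of the
polynomial length.

```python
import zlib


def make_crc32_table():
    """Build the 256-entry table for the reflected polynomial 0xEDB88320."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if a 1 fell off.
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table


CRC32_TABLE = make_crc32_table()


def crc32_bytewise(data):
    """Compute CRC-32 one input byte at a time (assumed zlib-compatible)."""
    crc = 0xFFFFFFFF
    for b in data:
        # One table lookup per byte; the next iteration cannot start
        # until this one has produced the updated crc.
        crc = CRC32_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    page = bytes(8192)  # an all-zero 8 KB block, purely as sample input
    print(hex(crc32_bytewise(page)), hex(zlib.crc32(page)))
```

Because the result matches zlib.crc32, the sketch can be checked against a
known implementation before timing anything.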
|
|
|
|
The current LZW compression code uses a 4096 byte lookback size, so that
|
|
would need to be modified to extend across a whole block. An
|
|
alternative, suggested originally by Tom and rediscovered by me because
|
|
I just don't read everybody's fine words in history, is to simply take
|
|
out the freespace in the middle of every heap block that consists of
|
|
zeros.
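
A minimal sketch of the "take out the freespace" alternative, assuming an
8192-byte block and assuming the caller already knows the byte offsets
bounding the zeroed free space (called lower and upper here; both names and
both helper functions are hypothetical, not taken from the PostgreSQL
source). The block is stored without the hole and the hole is re-created as
zeros on restore.

```python
BLOCK_SIZE = 8192  # assumed block size for this sketch


def strip_hole(page, lower, upper):
    """Return the page with the zero-filled bytes [lower, upper) removed."""
    if len(page) != BLOCK_SIZE or not (0 <= lower <= upper <= BLOCK_SIZE):
        raise ValueError("bad page size or hole bounds")
    if any(page[lower:upper]):
        # Only safe when the free space really is all zeros.
        raise ValueError("free space is not all zeros")
    return page[:lower] + page[upper:]


def restore_hole(stripped, lower, upper):
    """Rebuild the full page by re-inserting zeros at [lower, upper)."""
    hole_len = upper - lower
    if len(stripped) != BLOCK_SIZE - hole_len:
        raise ValueError("stripped image does not match hole bounds")
    return stripped[:lower] + bytes(hole_len) + stripped[lower:]


if __name__ == "__main__":
    # 24 bytes of "header", 8000 bytes of free space, 168 bytes of "tuples".
    page = b"H" * 24 + bytes(8000) + b"T" * 168
    image = strip_hole(page, 24, 8024)
    print(len(page), "->", len(image))
    assert restore_hole(image, 24, 8024) == page
```

The hole bounds would have to be stored alongside the stripped image so that
recovery can rebuild the block; a mostly full block gains almost nothing,
which is the variability discussed next.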
|
|
|
|
Any solution in this area must take into account the variability of the
|
|
size of freespace in database blocks. Some databases have mostly full
|
|
blocks, others vary. There would also be considerable variation in
|
|
compressibility of blocks, especially since some blocks (e.g. TOAST) are
|
|
likely to already be compressed. There'd need to be some testing done to
|
|
see exactly the point where the costs of compression produce realisable
|
|
benefits.
|
|
|
|
So any solution must be able to cope with both compressed blocks and
|
|
non-compressed blocks. My current thinking is that this could be
|
|
achieved by using the spare fourth bit of the BkpBlocks portion of the
|
|
XLog structure, so that either all included BkpBlocks are compressed or
|
|
none of them are, and hope that allows benefit to shine through. Not
|
|
thought about heap/index issues.
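
As a rough sketch of the all-or-nothing flag idea: one spare bit in the
record's info byte says whether the backup block payload is compressed, and
compression is only kept when it actually shrinks the data. The flag name
XLR_BKP_COMPRESSED and its value 0x01 are assumptions made for illustration,
and zlib stands in for whatever compressor would really be used.

```python
import zlib

XLR_BKP_COMPRESSED = 0x01  # hypothetical spare bit, assumed for this sketch


def pack_backup_blocks(info, blocks):
    """Compress the backup block payload only if that makes it smaller.

    Returns (new_info, payload); the flag bit records which form was kept.
    """
    compressed = zlib.compress(blocks)
    if len(compressed) < len(blocks):
        return info | XLR_BKP_COMPRESSED, compressed
    # Already-compressed or tiny data: store as-is and clear the flag.
    return info & ~XLR_BKP_COMPRESSED, blocks


def unpack_backup_blocks(info, payload):
    """Recover the original backup block bytes using the flag bit."""
    if info & XLR_BKP_COMPRESSED:
        return zlib.decompress(payload)
    return payload


if __name__ == "__main__":
    half_empty = b"row data " * 100 + bytes(8192 - 900)
    info, payload = pack_backup_blocks(0x08, half_empty)
    print(bool(info & XLR_BKP_COMPRESSED), len(half_empty), "->", len(payload))
    assert unpack_backup_blocks(info, payload) == half_empty
```

Falling back to the uncompressed form keeps the worst case at the current
record size, at the cost of one wasted compression attempt.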
|
|
|
|
It is possible that an XLogWriter process could be used to assist in the
|
|
CRC and compression calculations also, and a similar process used to
|
|
assist decompression for recovery, in time.
|
|
|
|
I regret I do not currently have time to pursue further.
|
|
|
|
Best Regards, Simon Riggs
|
|
|
|
From pgsql-hackers-owner+M65147=pgman=candle.pha.pa.us@postgresql.org Fri Mar 11 12:35:29 2005
|
|
Return-path: <pgsql-hackers-owner+M65147=pgman=candle.pha.pa.us@postgresql.org>
|
|
Received: from svr1.postgresql.org (svr1.postgresql.org [200.46.204.71])
|
|
by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id j2BIZSk26269
|
|
for <pgman@candle.pha.pa.us>; Fri, 11 Mar 2005 13:35:29 -0500 (EST)
|
|
Received: from localhost (unknown [200.46.204.144])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id 7C7DD545B4
|
|
for <pgman@candle.pha.pa.us>; Fri, 11 Mar 2005 18:35:25 +0000 (GMT)
|
|
Received: from svr1.postgresql.org ([200.46.204.71])
|
|
by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
|
|
with ESMTP id 31351-02 for <pgman@candle.pha.pa.us>;
|
|
Fri, 11 Mar 2005 18:35:25 +0000 (GMT)
|
|
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id D316E545A2
|
|
for <pgman@candle.pha.pa.us>; Fri, 11 Mar 2005 18:35:24 +0000 (GMT)
|
|
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
|
|
Received: from localhost (unknown [200.46.204.144])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id E70B65640C
|
|
for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 11 Mar 2005 18:33:21 +0000 (GMT)
|
|
Received: from svr1.postgresql.org ([200.46.204.71])
|
|
by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
|
|
with ESMTP id 29957-07
|
|
for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
|
|
Fri, 11 Mar 2005 18:33:19 +0000 (GMT)
|
|
Received: from hal.kabsi.at (bottom.kabsi.at [195.202.128.74])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id 9D9EE545EE
|
|
for <pgsql-hackers@postgresql.org>; Fri, 11 Mar 2005 18:33:17 +0000 (GMT)
|
|
Received: from [192.168.0.5] (h062040243020.plc.cm.kabsi.at [62.40.243.20])
|
|
by hal.kabsi.at (8.11.1/) with ESMTP id j2BIX7a0000748810;
|
|
Fri, 11 Mar 2005 19:33:07 +0100 (CET)
|
|
Message-ID: <4231E416.4030900@cybertec.at>
|
|
Date: Fri, 11 Mar 2005 19:31:50 +0100
|
|
From: =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= <postgres@cybertec.at>
|
|
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040922
|
|
X-Accept-Language: en-us, en
|
|
MIME-Version: 1.0
|
|
To: Simon Riggs <simon@2ndquadrant.com>
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>,
|
|
Mark Cave-Ayland <m.cave-ayland@webbased.co.uk>,
|
|
pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
|
|
References: <9EB50F1A91413F4FA63019487FCD251D113169@WEBBASEDDC.webbasedltd.local> <23031.1110206390@sss.pgh.pa.us> <1110239639.6117.197.camel@localhost.localdomain>
|
|
In-Reply-To: <1110239639.6117.197.camel@localhost.localdomain>
|
|
Content-Type: text/plain; charset=us-ascii; format=flowed
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Virus-Scanned: by amavisd-new at hub.org
|
|
X-Mailing-List: pgsql-hackers
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by amavisd-new at hub.org
|
|
Status: RO
|
|
|
|
|
|
> One of the things I was thinking about was whether we could use up those
|
|
> cycles more effectively. If we were to include a compression routine
|
|
> before we calculated the CRC that would
|
|
> - reduce the size of the blocks to be written, hence reduce size of xlog
|
|
> - reduce the following CRC calculation
|
|
>
|
|
> I was thinking about using a simple run-length encoding to massively
|
|
> shrink half-empty blocks with lots of zero padding, but we've already
|
|
> got code to LZW the data down also.
|
|
>
|
|
> Best Regards, Simon Riggs
|
|
|
|
|
|
Simon,
|
|
|
|
I think having a compression routine in there could make real sense.
|
|
We have done some major I/O testing involving compression for a large
|
|
customer some time ago. We have seen that compressing / decompressing on
|
|
the fly is in MOST cases much faster than uncompressed I/O (try a simple
|
|
"cat file | ..." vs. "zcat file.gz | ...") - the zcat version will be
|
|
faster on all platforms we have tried (Linux, AIX, Sun on some SAN
|
|
system, etc. ...).
|
|
Also, when building up a large database within one transaction the xlog
|
|
will eat a lot of storage - this can be quite annoying when you have to
|
|
deal with a lot of data.
|
|
Are there any technical reasons which would prevent somebody from
|
|
implementing compression?
|
|
|
|
Best regards,
|
|
|
|
Hans
|
|
|
|
--
|
|
Cybertec Geschwinde u Schoenig
|
|
Schoengrabern 134, A-2020 Hollabrunn, Austria
|
|
Tel: +43/660/816 40 77
|
|
www.cybertec.at, www.postgresql.at
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 5: Have you checked our extensive FAQ?
|
|
|
|
http://www.postgresql.org/docs/faq
|
|
|