mirror of
https://git.postgresql.org/git/postgresql.git
synced 2025-01-30 19:00:29 +08:00
Add to mmap emails.
This commit is contained in:
parent
b7e089fdbb
commit
2e6887df63
@ -1763,3 +1763,254 @@ message can get through to the mailing list cleanly
From pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org Thu Mar 6 19:37:25 2003
Return-path: <pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org>
Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143])
    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h270bM624923
    for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:24 -0500 (EST)
Received: from postgresql.org (postgresql.org [64.49.215.8])
    by relay2.pgsql.com (Postfix) with ESMTP id 4D5CDEE0411
    for <maillist@candle.pha.pa.us>; Thu, 6 Mar 2003 19:37:23 -0500 (EST)
X-Original-To: pgsql-committers@postgresql.org
Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251])
    by postgresql.org (Postfix) with ESMTP
    id 3120E47646F; Thu, 6 Mar 2003 19:36:58 -0500 (EST)
Received: by perrin.int.nxad.com (Postfix, from userid 1001)
    id 9CBE42105B; Thu, 6 Mar 2003 16:36:40 -0800 (PST)
Date: Thu, 6 Mar 2003 16:36:40 -0800
From: Sean Chittenden <sean@chittenden.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Christopher Kings-Lynne <chriskl@familyhealth.com.au>,
    pgsql-committers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ...
Message-ID: <20030307003640.GF79234@perrin.int.nxad.com>
References: <20030306031656.1876F4762E0@postgresql.org> <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
    protocol="application/pgp-signature"; boundary="HjNkcEWJ4DMx36DP"
Content-Disposition: inline
In-Reply-To: <15071.1046964336@sss.pgh.pa.us>
User-Agent: Mutt/1.4i
X-PGP-Key: finger seanc@FreeBSD.org
X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341
X-Web-Homepage: http://sean.chittenden.org/
Precedence: bulk
Sender: pgsql-committers-owner@postgresql.org
Status: OR

--HjNkcEWJ4DMx36DP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

[moving to -performance, please drop -committers from replies]

> > I've toyed with the idea of adding this because it is monstrously more
> > efficient than select()/poll() in basically every way, shape, and
> > form.
>
> From what I've looked at, kqueue only wins when you are watching a
> large number of file descriptors at the same time; which is an
> operation done nowhere in Postgres. I think the above would be a
> complete waste of effort.

It scales very well to many thousands of descriptors, but it works
well on small numbers too. kqueue is about 5x faster than
select() or poll() on the low end of number of fd's. As I said
earlier, I don't think there is _much_ to gain in this regard, but I
do think that it would be a speed improvement, though only on one OS
supported by PostgreSQL. I think that there are bigger speed
improvements to be had elsewhere in the code.

> > Is this one of the areas of PostgreSQL that just needs to get
> > slowly migrated to use mmap() or are there any gaping reasons why
> > to not use the family of system calls?
>
> There has been much speculation on this, and no proof that it
> actually buys us anything to justify the portability hit.

Actually, I think that it wouldn't be that big of a portability hit
because you still would read() and write() as always, but in
performance sensitive areas, an #ifdef HAVE_MMAP section would have
the appropriate mmap() calls. If the system doesn't have mmap(),
there isn't much to lose and we're in the same position we're in now.
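
To make the #ifdef HAVE_MMAP idea concrete, here is a minimal sketch
of a block read that uses mmap() when it is compiled in and otherwise
falls back to lseek()/read(). The helper name read_block() is
hypothetical and is not taken from fd.c, and the caller is assumed to
request only bytes that lie inside the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef HAVE_MMAP
#include <sys/mman.h>
#endif

/*
 * Hypothetical helper (not a PostgreSQL function): copy len bytes that
 * start at byte offset off of the file open on fd into dst.
 * Returns 0 on success, -1 on failure.
 */
static int
read_block(int fd, off_t off, void *dst, size_t len)
{
    assert(dst != NULL);
#ifdef HAVE_MMAP
    {
        /* mmap() offsets must be page-aligned, so map from the page start. */
        long    pagesz = sysconf(_SC_PAGESIZE);
        off_t   base = off - (off % pagesz);
        size_t  skew = (size_t) (off - base);
        char   *p = mmap(NULL, len + skew, PROT_READ, MAP_SHARED, fd, base);

        if (p != MAP_FAILED)
        {
            memcpy(dst, p + skew, len);
            munmap(p, len + skew);
            return 0;
        }
        /* Mapping failed: fall through to the portable path below. */
    }
#endif
    if (lseek(fd, off, SEEK_SET) < 0)
        return -1;
    return read(fd, dst, len) == (ssize_t) len ? 0 : -1;
}
```

Built without -DHAVE_MMAP, only the lseek()/read() path is compiled,
which is the "same position we're in now" case.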

> There would be some nontrivial problems to solve, such as the
> mechanics of accessing a large number of files from a large number
> of backends without running out of virtual memory. Also, is it
> guaranteed that multiple backends mmap'ing the same block will
> access the very same physical buffer, and not multiple copies?
> Multiple copies would be fatal. See the archives for more
> discussion.

Have read through the archives. Making a call to madvise() will speed
up access to the pages as it gives hints to the VM about what order
the pages are accessed/used. Here are a few bits from the BSD mmap()
and madvise() man pages:

mmap(2):
     MAP_NOSYNC     Causes data dirtied via this VM map to be flushed to
                    physical media only when necessary (usually by the
                    pager) rather than gratuitously. Typically this
                    prevents the update daemons from flushing pages dirtied
                    through such maps and thus allows efficient sharing of
                    memory across unassociated processes using a file-backed
                    shared memory map. Without this option any VM pages you
                    dirty may be flushed to disk every so often (every 30-60
                    seconds usually) which can create performance problems
                    if you do not need that to occur (such as when you are
                    using shared file-backed mmap regions for IPC purposes).
                    Note that VM/filesystem coherency is maintained whether
                    you use MAP_NOSYNC or not. This option is not portable
                    across UNIX platforms (yet), though some may implement
                    the same behavior by default.

                    WARNING! Extending a file with ftruncate(2), thus
                    creating a big hole, and then filling the hole by
                    modifying a shared mmap() can lead to severe file
                    fragmentation. In order to avoid such fragmentation you
                    should always pre-allocate the file's backing store by
                    write()ing zero's into the newly extended area prior to
                    modifying the area via your mmap(). The fragmentation
                    problem is especially sensitive to MAP_NOSYNC pages,
                    because pages may be flushed to disk in a totally random
                    order.

                    The same applies when using MAP_NOSYNC to implement a
                    file-based shared memory store. It is recommended that
                    you create the backing store by write()ing zero's to
                    the backing file rather than ftruncate()ing it. You
                    can test file fragmentation by observing the KB/t
                    (kilobytes per transfer) results from an ``iostat 1''
                    while reading a large file sequentially, e.g. using
                    ``dd if=filename of=/dev/null bs=32k''.

                    The fsync(2) function will flush all dirty data and
                    metadata associated with a file, including dirty NOSYNC
                    VM data, to physical media. The sync(8) command and
                    sync(2) system call generally do not flush dirty NOSYNC
                    VM data. The msync(2) system call is obsolete since
                    BSD implements a coherent filesystem buffer cache.
                    However, it may be used to associate dirty VM pages
                    with filesystem buffers and thus cause them to be
                    flushed to physical media sooner rather than later.

madvise(2):
     MADV_NORMAL      Tells the system to revert to the default paging
                      behavior.

     MADV_RANDOM      Is a hint that pages will be accessed randomly, and
                      prefetching is likely not advantageous.

     MADV_SEQUENTIAL  Causes the VM system to depress the priority of pages
                      immediately preceding a given page when it is faulted
                      in.

mprotect(2):
     The mprotect() system call changes the specified pages to have
     protection prot. Not all implementations will guarantee protection
     on a page basis; the granularity of protection changes may be as
     large as an entire region. A region is the virtual address space
     defined by the start and end addresses of a struct vm_map_entry.

     Currently these protection bits are known, which can be combined,
     OR'd together:

     PROT_NONE     No permissions at all.

     PROT_READ     The pages can be read.

     PROT_WRITE    The pages can be written.

     PROT_EXEC     The pages can be executed.

msync(2):
     The msync() system call writes any modified pages back to the
     filesystem and updates the file modification time. If len is 0, all
     modified pages within the region containing addr will be flushed;
     if len is non-zero, only those pages containing addr and len-1
     succeeding locations will be examined. The flags argument may be
     specified as follows:

     MS_ASYNC        Return immediately
     MS_SYNC         Perform synchronous writes
     MS_INVALIDATE   Invalidate all cached data
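
As a rough illustration of the advice in the man page excerpts above,
the sketch below pre-allocates a backing file by write()ing zeros
(rather than ftruncate()ing it), maps it shared, and flushes it with
msync(). The helper names are hypothetical. MAP_NOSYNC is
FreeBSD-specific, so the sketch defines it to 0 where it is missing,
which is an assumption made only to keep the example portable.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0            /* FreeBSD-only flag; no-op elsewhere */
#endif

/*
 * Hypothetical helper: create (or truncate) path, pre-allocate size bytes
 * by write()ing zeros, and return a shared read/write mapping of the file.
 * Returns MAP_FAILED on error.
 */
static void *
create_backing_store(const char *path, size_t size)
{
    static const char zeros[8192];      /* zero-initialized */
    size_t  done = 0;
    void   *p;
    int     fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return MAP_FAILED;
    while (done < size)
    {
        size_t  left = size - done;
        size_t  chunk = left < sizeof(zeros) ? left : sizeof(zeros);
        ssize_t n = write(fd, zeros, chunk);

        if (n <= 0)
        {
            close(fd);
            return MAP_FAILED;
        }
        done += (size_t) n;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_NOSYNC, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    return p;
}

/* Write dirtied pages of the mapping back synchronously; 0 on success. */
static int
flush_backing_store(void *p, size_t size)
{
    return msync(p, size, MS_SYNC);
}
```

A caller would modify the returned memory directly and call
flush_backing_store() at the points where it would otherwise fsync().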


A few thoughts come to mind:

1) backends could share buffers by mmap()'ing shared regions of data.
   While I haven't seen any numbers to reflect this, I'd wager that
   mmap() is a faster interface than ipc.

2) It looks like while there are various file IO schemes scattered all
   over the place, the bulk of the critical routines that would need
   to be updated are in backend/storage/file/fd.c, more specifically:

   *) fileNameOpenFile() would need the appropriate mmap() call made
      to it.

   *) FileTruncate() would need some attention to avoid fragmentation.

   *) a new "sync" GUC would have to be introduced to handle msync
      (affects only pg_fsync() and pg_fdatasync()).

3) There's a bit of code in pgsql/src/backend/storage/smgr that could
   be gutted/removed. Which of those storage types are even used any
   more? There's a reference in the code to PostgreSQL 3.0. :)
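
On point 1, and on the question of whether two backends mapping the
same block see one copy, here is a small experiment rather than a
proof. A child process opens and maps the file on its own, writes
through its mapping, and the parent checks whether the write shows up
through the parent's mapping. The function names and the 8192-byte
block size are assumptions made for the sketch; it reports what the
machine it runs on does and says nothing about other platforms.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* assumed block size for the sketch */

/* Map the first block of path shared and read/write; MAP_FAILED on error. */
static char *
map_block(const char *path)
{
    int     fd = open(path, O_RDWR);
    char   *p;

    if (fd < 0)
        return MAP_FAILED;
    p = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;
}

/*
 * Returns 1 if a write made by a child process through its own mapping is
 * visible through the parent's mapping, 0 if not, -1 on setup failure.
 */
static int
child_write_visible(const char *path)
{
    static const char zeros[BLOCK_SIZE];
    const char *msg = "written by child";
    char   *mine;
    pid_t   pid;
    int     status;
    int     seen;
    int     fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
    {
        close(fd);
        return -1;
    }
    close(fd);

    mine = map_block(path);
    if (mine == MAP_FAILED)
        return -1;

    pid = fork();
    if (pid == 0)
    {
        /* The "other backend": open and map the file independently. */
        char   *theirs = map_block(path);

        if (theirs == MAP_FAILED)
            _exit(1);
        strcpy(theirs, msg);
        _exit(0);
    }
    if (pid < 0 || waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
    {
        munmap(mine, BLOCK_SIZE);
        return -1;
    }
    seen = strcmp(mine, msg) == 0;
    munmap(mine, BLOCK_SIZE);
    return seen;
}
```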

And I think that'd be it. The LRU code could be used if necessary to
help manage the amount mmap()'ed into the VM at any one time; at the
very least, that could be handled by a shm var that various backends
would increment/decrement as files are open()'ed/close()'ed.

I didn't spend too long looking at this, but I _think_ that'd cover
80% of PostgreSQL's disk access needs. The next bit to possibly add
would be passing a flag on FileOpen operations that'd act as a hint to
madvise(); that way the VM could proactively react to PostgreSQL's
needs.

I don't have my copy of Stevens handy (it's some 700mi away atm
otherwise I'd cite it), but if Tom or someone else has it handy, look
up the example re: the performance gain from read()'ing an mmap()'ed
file versus a non-mmap()'ed file. The difference is non-trivial and
_WELL_ worth the time given the speed increase. The same speed
benefit held true for writes as well, iirc. It's been a while, but I
think it was around page 330. The index has it listed and it's not
that hard of an example to find. -sc
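
For readers without the book at hand, here is a sketch of a file copy
done with mmap() and memcpy() in place of a read()/write() loop. It
is not the book's listing, the function name copy_file_mmap() is
hypothetical, and it makes no claim about timings; measuring it
against a read()/write() copy is left to the reader.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy src_path to dst_path through two shared
 * mappings. Returns 0 on success, -1 on failure.
 */
static int
copy_file_mmap(const char *src_path, const char *dst_path)
{
    struct stat st;
    void   *src = MAP_FAILED;
    void   *dst = MAP_FAILED;
    int     out = -1;
    int     rc = -1;
    int     in;

    assert(src_path != NULL && dst_path != NULL);
    in = open(src_path, O_RDONLY);
    if (in < 0)
        return -1;
    if (fstat(in, &st) < 0)
        goto done;
    out = open(dst_path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (out < 0)
        goto done;
    if (st.st_size == 0)
    {
        rc = 0;                 /* empty file: nothing to map or copy */
        goto done;
    }
    /* Size the output first; writing through a mapping cannot extend it. */
    if (ftruncate(out, st.st_size) < 0)
        goto done;
    src = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, in, 0);
    dst = mmap(NULL, (size_t) st.st_size, PROT_READ | PROT_WRITE,
               MAP_SHARED, out, 0);
    if (src != MAP_FAILED && dst != MAP_FAILED)
    {
        memcpy(dst, src, (size_t) st.st_size);
        rc = 0;
    }
done:
    if (src != MAP_FAILED)
        munmap(src, (size_t) st.st_size);
    if (dst != MAP_FAILED)
        munmap(dst, (size_t) st.st_size);
    if (out >= 0)
        close(out);
    close(in);
    return rc;
}
```

Note that this sizes the output with ftruncate(), which is exactly
what the mmap(2) warning quoted earlier cautions against for large
files; a production version would pre-allocate by write()ing zeros.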

-- 
Sean Chittenden

--HjNkcEWJ4DMx36DP
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Comment: Sean Chittenden <sean@chittenden.org>

iD8DBQE+Z+mY3ZnjH7yEs0ERAjVkAJwMI1V7+HvMAA5ODadD5znsekI8TQCgvH0C
KwvG7YLsJ+xpsTUS67KD+4M=
=w8/7
-----END PGP SIGNATURE-----

--HjNkcEWJ4DMx36DP--