Discussion:
PostgreSQL vs super pages
Thomas Munro
2018-10-10 23:59:41 UTC
Hello FreeBSD hackers,

In version 11 of PostgreSQL (about to escape) we've introduced shared
memory-based parallel hash joins. Academic reports and basic
intuition tell me that big hash tables should benefit from super pages
due to random access. I'm interested in exploring this effect on
FreeBSD. How can I encourage super pages in short-lived mappings,
given our access pattern?

I don't have a good understanding of virtual memory (though I'm trying
to learn), but let me explain what we're doing, and what I see so far
on FreeBSD. We have two kinds of shared memory, and here I'm
interested in the second one:

1. We have a big fixed-size area that lives as long as the database
is up, that holds our buffer pool and other shared state. It is
inherited by every PostgreSQL process. Its size is set in
postgresql.conf with e.g. shared_buffers = '1GB'. On any long running
not-completely-idle system I eventually see that it is using super
pages (though I wish procstat -v would tell me how many):

752 0x802e00000 0x80bb9e000 rw- 3730 3730 5 0 --S- df

That's cool, and we know from benchmarks and experience on Linux and
Windows (where we explicitly request huge/large pages with MAP_HUGETLB
and MEM_LARGE_PAGES respectively) that this has beneficial performance
effects, so I'm happy that FreeBSD eventually reaches that state too
(though I haven't yet grokked exactly how and when that happens or
attempted to measure its impact on FreeBSD). No problem here AFAIK.

2. In recent versions of PostgreSQL we got parallel computing fever,
and decided we needed more dynamic shared memory for that. So we
create chunks of memory with shm_open(), size them with ftruncate()
and then map them into our main process + worker processes (yeah,
still no threads after 30 years). This is where parallel hash join
data goes.

To get rid of obvious obstacles to super page promotion, I patched my
local copy of PostgreSQL to make sure that we always ask for multiples
of 2MB and set MAP_ALIGNED_SUPER, but still no cigar (or not much
cigar, anyway). Here's what I see on my super slow laptop running
recent HEAD. I'll put repro instructions below in case anyone is
interested. I ran a 3-process ~380MB join that takes ~90s. These
mappings appeared in my memory:

18288 0x826600000 0x82a621000 rw- 16385 16385 4 0 ---- df
18288 0x82a800000 0x82ac00000 rw- 1024 1024 4 0 ---- df
18288 0x82ac00000 0x82b000000 rw- 1024 1024 4 0 ---- df
18288 0x82b000000 0x82b400000 rw- 1024 1024 4 0 ---- df
18288 0x82b400000 0x82b800000 rw- 1024 1024 4 0 ---- df
18288 0x82b800000 0x82c000000 rw- 2048 2048 4 0 --S- df
18288 0x82c000000 0x82c800000 rw- 2048 2048 4 0 ---- df
18288 0x82c800000 0x82d000000 rw- 2048 2048 4 0 ---- df
18288 0x82d000000 0x82d800000 rw- 2048 2048 4 0 ---- df
18288 0x82d800000 0x82e800000 rw- 4096 4096 4 0 ---- df
18288 0x82e800000 0x82f800000 rw- 4096 4096 4 0 ---- df
18288 0x82f800000 0x830800000 rw- 4096 4096 4 0 ---- df
18288 0x830800000 0x831800000 rw- 4096 4096 4 0 ---- df
18288 0x831800000 0x833800000 rw- 8192 8192 4 0 --S- df
18288 0x833800000 0x835800000 rw- 8192 8192 4 0 ---- df
18288 0x835800000 0x837800000 rw- 8192 8192 4 0 ---- df
18288 0x837800000 0x839800000 rw- 8192 8192 4 0 ---- df
18288 0x839800000 0x83d800000 rw- 16102 16102 4 0 ---- df

That's actually the best case I've seen, with two S. Usually there
are no cases of S, and sometimes just 1. The big mapping at the top
holds the hash table buckets, and I've never seen an S there. The
rest of them hold tuples.

Looking at the output of sysctl vm.pmap before and after a run, I saw:

vm.pmap.ad_emulation_superpage_promotions: 0
vm.pmap.num_superpage_accessed_emulations: 0
vm.pmap.num_accessed_emulations: 0
vm.pmap.num_dirty_emulations: 0
vm.pmap.pdpe.demotions: no change
vm.pmap.pde.promotions: +20
vm.pmap.pde.p_failures: +1
vm.pmap.pde.mappings: no change
vm.pmap.pde.demotions: +48
vm.pmap.pcid_save_cnt: 21392597
vm.pmap.pti: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1

With the attached patch, the syscalls look like this in truss in the
backend that creates each shm segment:

shm_open("/PostgreSQL.1721888107",O_RDWR|O_CREAT|O_EXCL,0600) = 46 (0x2e)
ftruncate(46,0x400000) = 0 (0x0)
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0) = 35081158656 (0x82b000000)
close(46) = 0 (0x0)

... and like this in other backends that map them in:

shm_open("/PostgreSQL.1214430502",O_RDWR,0600) = 46 (0x2e)
fstat(46,{ mode=-rw------- ,inode=20,size=8388608,blksize=4096 }) = 0 (0x0)
mmap(0x0,8388608,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0) = 35110518784 (0x82cc00000)
close(46) = 0 (0x0)

The access pattern for the memory is as follows:

1. In the "build" phase we first initialise the bucket segment with
zeroes (sequential), and then load all the tuples into the other
segments (sequential) and insert them into the buckets (random,
compare-and-swap). We add more segments as necessary, gradually
cranking up the sizes.

2. In the "probe" phase, all access is read only. We probe the
buckets (random) and follow pointers to tuples in the other segments
(random).

Afterwards we unmap them and shm_unlink() them, and the parallel
worker processes exit. It's possible that we'll want to recycle
memory segments and worker processes in future, but I thought I'd
point out that we don't do that in case it's relevant.

I understand that the philosophy is not to provide explicit control
over page size. That makes sense, but I'd be grateful for any tips on
how to encourage super pages for this use case.

Thanks,

Thomas Munro

====

How to see this (assuming you have git, gmake, flex, bison, readline,
curl, ccache):

# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres

# you might want to apply the attached patch to get aligned segments
patch -p1 < super-aligned.patch

# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include \
    --with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install

# create a db cluster
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '1GB'" >> ~/pgdata/postgresql.conf

# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata

# run the psql shell
~/install/bin/psql postgres

# inside psql, find your backend's pid
# (you can also find the parallel workers with top, but they come and
# go with each query)
select pg_backend_pid();

# create a table and set memory size to avoid more complicated
# batching behaviour
create table t as select generate_series(1, 8000000)::int i;
analyze t;
set work_mem = '1GB';

# if for some reason you want to change the number of parallel workers, try:
# set max_parallel_workers_per_gather = 2;

# this is quite handy for removing all disk IO from the picture
create extension pg_prewarm;
select pg_prewarm('t'::regclass);

# run a toy parallel hash join
explain analyze select count(*) from t t1 join t t2 using (i);

In procstat -v you should see that it spends about half its time
"building" which looks like slowly adding new mappings and touching
more and more pages, and then about half of its time "probing", where
there are no further changes visible in procstat -v. If your results
are like mine, only after building will you see any S mappings appear,
and then only rarely.
Konstantin Belousov
2018-10-11 00:19:54 UTC
Post by Thomas Munro
shm_open("/PostgreSQL.1721888107",O_RDWR|O_CREAT|O_EXCL,0600) = 46 (0x2e)
ftruncate(46,0x400000) = 0 (0x0)
Try to write zeroes instead of truncating.
This should activate the fast path in the fault handler, and if the
pages allocated for backing store of the shm object were from reservation,
you should get superpage mapping on the first fault without promotion.
Post by Thomas Munro
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0) = 35081158656 (0x82b000000)
close(46) = 0 (0x0)
Thomas Munro
2018-10-11 01:01:20 UTC
Post by Konstantin Belousov
Post by Thomas Munro
shm_open("/PostgreSQL.1721888107",O_RDWR|O_CREAT|O_EXCL,0600) = 46 (0x2e)
ftruncate(46,0x400000) = 0 (0x0)
Try to write zeroes instead of truncating.
This should activate the fast path in the fault handler, and if the
pages allocated for backing store of the shm object were from reservation,
you should get superpage mapping on the first fault without promotion.
If you just write() to a newly shm_open()'d fd you get a return code
of 0 so I assume that doesn't work. If you ftruncate() to the desired
size first, then loop writing 8192 bytes of zeroes at a time, it
works. But still no super pages. I tried also with a write buffer of
2MB of zeroes, but still no super pages. I tried abandoning
shm_open() and instead using a mapped file, and still no super pages.
Konstantin Belousov
2018-10-13 23:50:21 UTC
Post by Thomas Munro
If you just write() to a newly shm_open()'d fd you get a return code
of 0 so I assume that doesn't work. If you ftruncate() to the desired
size first, then loop writing 8192 bytes of zeroes at a time, it
works. But still no super pages. I tried also with a write buffer of
2MB of zeroes, but still no super pages. I tried abandoning
shm_open() and instead using a mapped file, and still no super pages.
I did a not quite scientific experiment, but you would need to try to find
the differences between what I did and what you observe. Below is the
naive test program that directly implements my suggestion, and the
output from procstat -v for it after everything was set up.

/* $Id: shm_super.c,v 1.1 2018/10/13 23:49:37 kostik Exp kostik $ */

#include <sys/param.h>
#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	M(x)	((x) * 1024 * 1024)
#define	SZ	M(4)

int
main(void)
{
	char buf[128];
	void *ptr;
	off_t cnt;
	int error, shmfd;

	/* Anonymous shm object, sized to 4MB. */
	shmfd = shm_open(SHM_ANON, O_CREAT | O_RDWR, 0600);
	if (shmfd == -1)
		err(1, "shm_open");
	error = ftruncate(shmfd, SZ);
	if (error == -1)
		err(1, "truncate");

	/* Fill the object with zeroes via write() before mapping it. */
	memset(buf, 0, sizeof(buf));
	for (cnt = 0; cnt < SZ; cnt += sizeof(buf)) {
		error = write(shmfd, buf, sizeof(buf));
		if (error == -1)
			err(1, "write");
		else if (error != sizeof(buf))
			errx(1, "short write %d", (int)error);
	}

	ptr = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED |
	    MAP_ALIGNED_SUPER, shmfd, 0);
	if (ptr == MAP_FAILED)
		err(1, "mmap");

	/* Touch one byte in every page of the mapping. */
	for (cnt = 0; cnt < SZ; cnt += PAGE_SIZE)
		*((char *)ptr + cnt) = 0;
	printf("ptr %p\n", ptr);

	/* Dump this process's mappings; look for the S flag. */
	snprintf(buf, sizeof(buf), "procstat -v %d", getpid());
	system(buf);
	return (0);
}

$ ./shm_super
ptr 0x800e00000
PID START END PRT RES PRES REF SHD FLAG TP PATH
98579 0x400000 0x401000 r-x 1 3 1 0 CN-- vn /usr/home/kostik/work/build/bsd/DEV/stuff/tests/shm_super
98579 0x600000 0x601000 rw- 1 1 1 0 ---- df
98579 0x800600000 0x800620000 r-x 32 34 146 72 CN-- vn /libexec/ld-elf.so.1
98579 0x800620000 0x800644000 rw- 24 24 1 0 ---- df
98579 0x80081f000 0x800820000 rw- 1 0 1 0 C--- vn /libexec/ld-elf.so.1
98579 0x800820000 0x800821000 rw- 1 1 1 0 ---- df
98579 0x800821000 0x8009b3000 r-x 402 440 146 72 CN-- vn /lib/libc.so.7
98579 0x8009b3000 0x800bb3000 --- 0 0 0 0 CN-- --
98579 0x800bb3000 0x800bbf000 rw- 12 0 1 0 C--- vn /lib/libc.so.7
98579 0x800bbf000 0x800bd9000 rw- 5 14 2 0 ---- df
98579 0x800c00000 0x800e00000 rw- 9 14 2 0 ---- df
98579 0x800e00000 0x801200000 rw- 1024 1030 3 0 --S- df
98579 0x801200000 0x801400000 rw- 6 1030 3 0 ---- df
98579 0x7fffdffff000 0x7ffffffdf000 --- 0 0 0 0 ---- --
98579 0x7ffffffdf000 0x7ffffffff000 rw- 4 4 1 0 ---D df
98579 0x7ffffffff000 0x800000000000 r-x 1 1 81 0 ---- ph
Thomas Munro
2018-10-14 09:58:08 UTC
Post by Konstantin Belousov
I did a not quite scientific experiment, but you would need to try to find
the differences between what I did and what you observe. Below is the
naive test program that directly implements my suggestion, and the
output from procstat -v for it after everything was set up.
...
Post by Konstantin Belousov
98579 0x800e00000 0x801200000 rw- 1024 1030 3 0 --S- df
Huh. Your program doesn't result in an S mapping on my laptop, but I
tried on an EC2 t2.2xlarge machine and there it promotes to S, even if
I comment out the write() loop (the loop that assigned to every byte
is enough). The difference might be the amount of memory on the
system: on my 4GB laptop, it is very reluctant to use super pages (but
I have seen it do it, so I know it can). On a 32GB system, it does it
immediately, and it works nicely for PostgreSQL too. So perhaps my
problem is testing on a small RAM system, though I don't understand
why.
Konstantin Belousov
2018-10-14 11:45:44 UTC
Post by Thomas Munro
Huh. Your program doesn't result in an S mapping on my laptop, but I
tried on an EC2 t2.2xlarge machine and there it promotes to S, even if
I comment out the write() loop (the loop that assigned to every byte
is enough). The difference might be the amount of memory on the
system: on my 4GB laptop, it is very reluctant to use super pages (but
I have seen it do it, so I know it can). On a 32GB system, it does it
immediately, and it works nicely for PostgreSQL too. So perhaps my
problem is testing on a small RAM system, though I don't understand
why.
How much free memory does your system have? Free as reported by top. If
the free memory is low and fragmented, and I suppose it is on a 4G laptop
which you use with X, a browser and other memory-consuming applications,
the system would have trouble filling the reserve, i.e. reserving 2M of
2M-aligned physical pages.

You can try the test programs right after booting into single user mode.
Thomas Munro
2018-10-14 22:42:15 UTC
Post by Konstantin Belousov
Post by Thomas Munro
Huh. Your program doesn't result in an S mapping on my laptop, but I
tried on an EC2 t2.2xlarge machine and there it promotes to S, even if
I comment out the write() loop (the loop that assigned to every byte
is enough). The difference might be the amount of memory on the
system: on my 4GB laptop, it is very reluctant to use super pages (but
I have seen it do it, so I know it can). On a 32GB system, it does it
immediately, and it works nicely for PostgreSQL too. So perhaps my
problem is testing on a small RAM system, though I don't understand
why.
How much free memory does your system have? Free as reported by top. If
the free memory is low and fragmented, and I suppose it is on a 4G laptop
which you use with X, a browser and other memory-consuming applications,
the system would have trouble filling the reserve, i.e. reserving 2M of
2M-aligned physical pages.
BTW, this can be explicitly verified with the vm.phys_free sysctl.
Superpage promotion requires free 2MB chunks from freelist 0,
pool 0.
Ah, I see. Straight after rebooting without X I get super pages and
vm.phys_free looks more healthy. I'd observed the same problem on
other machines including servers with a bit (but not a lot) more
memory, but clearly none of my FreeBSD systems are currently big
enough to keep suitable chunks around on the freelist. I wonder if
ZFS is a factor. Well, this was educational. Thanks very much for
your help!
