Thomas Munro
2018-10-10 23:59:41 UTC
Hello FreeBSD hackers,
In version 11 of PostgreSQL (about to escape) we've introduced shared
memory-based parallel hash joins. Academic reports and basic
intuition tell me that big hash tables should benefit from super pages
due to random access. I'm interested in exploring this effect on
FreeBSD. How can I encourage super pages in short-lived mappings,
given our access pattern?
I don't have a good understanding of virtual memory (though I'm trying
to learn), but let me explain what we're doing, and what I see so far
on FreeBSD. We have two kinds of shared memory, and here I'm
interested in the second one:
1. We have a big fixed sized area that lives as long as the database
is up, that holds our buffer pool and other shared state. It is
inherited by every PostgreSQL process. Its size is set in
postgresql.conf with eg shared_buffers = '1GB'. On any long running
not-completely-idle system I eventually see that it is using super
pages (though I wish procstat -v would tell me how many):
752 0x802e00000 0x80bb9e000 rw- 3730 3730 5 0 --S- df
That's cool, and we know from benchmarks and experience on Linux and
Windows (where we explicitly request huge/large pages with MAP_HUGETLB
and MEM_LARGE_PAGES respectively) that this has beneficial performance
effects, so I'm happy that FreeBSD eventually reaches that state too
(though I haven't yet grokked exactly how and when that happens or
attempted to measure its impact on FreeBSD). No problem here AFAIK.
2. In recent versions of PostgreSQL we got parallel computing fever,
and decided we needed more dynamic shared memory for that. So we
create chunks of memory with shm_open(), size them with ftruncate()
and then map them into our main process + worker processes (yeah,
still no threads after 30 years). This is where parallel hash join
data goes.
To get rid of obvious obstacles to super page promotion, I patched my
local copy of PostgreSQL to make sure that we always ask for multiples
of 2MB and set MAP_ALIGNED_SUPER, but still no cigar (or not much
cigar, anyway). Here's what I see on my super slow laptop running
recent HEAD. I'll put repro instructions below in case anyone is
interested. I ran a 3-process ~380MB join that takes ~90s. These
mappings appeared in my memory:
18288 0x826600000 0x82a621000 rw- 16385 16385 4 0 ---- df
18288 0x82a800000 0x82ac00000 rw- 1024 1024 4 0 ---- df
18288 0x82ac00000 0x82b000000 rw- 1024 1024 4 0 ---- df
18288 0x82b000000 0x82b400000 rw- 1024 1024 4 0 ---- df
18288 0x82b400000 0x82b800000 rw- 1024 1024 4 0 ---- df
18288 0x82b800000 0x82c000000 rw- 2048 2048 4 0 --S- df
18288 0x82c000000 0x82c800000 rw- 2048 2048 4 0 ---- df
18288 0x82c800000 0x82d000000 rw- 2048 2048 4 0 ---- df
18288 0x82d000000 0x82d800000 rw- 2048 2048 4 0 ---- df
18288 0x82d800000 0x82e800000 rw- 4096 4096 4 0 ---- df
18288 0x82e800000 0x82f800000 rw- 4096 4096 4 0 ---- df
18288 0x82f800000 0x830800000 rw- 4096 4096 4 0 ---- df
18288 0x830800000 0x831800000 rw- 4096 4096 4 0 ---- df
18288 0x831800000 0x833800000 rw- 8192 8192 4 0 --S- df
18288 0x833800000 0x835800000 rw- 8192 8192 4 0 ---- df
18288 0x835800000 0x837800000 rw- 8192 8192 4 0 ---- df
18288 0x837800000 0x839800000 rw- 8192 8192 4 0 ---- df
18288 0x839800000 0x83d800000 rw- 16102 16102 4 0 ---- df
That's actually the best case I've seen, with two S. Usually there
are no cases of S, and sometimes just 1. The big mapping at the top
holds the hash table buckets, and I've never seen an S there. The
rest of them hold tuples.
Looking at the output of sysctl vm.pmap before and after a run, I saw:
vm.pmap.ad_emulation_superpage_promotions: 0
vm.pmap.num_superpage_accessed_emulations: 0
vm.pmap.num_accessed_emulations: 0
vm.pmap.num_dirty_emulations: 0
vm.pmap.pdpe.demotions: no change
vm.pmap.pde.promotions: +20
vm.pmap.pde.p_failures: +1
vm.pmap.pde.mappings: no change
vm.pmap.pde.demotions: +48
vm.pmap.pcid_save_cnt: 21392597
vm.pmap.pti: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1
With the attached patch, the syscalls look like this in truss in the
backend that creates each shm segment:
shm_open("/PostgreSQL.1721888107",O_RDWR|O_CREAT|O_EXCL,0600) = 46 (0x2e)
ftruncate(46,0x400000) = 0 (0x0)
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0)
= 35081158656 (0x82b000000)
close(46) = 0 (0x0)
... and like this in other backends that map them in:
shm_open("/PostgreSQL.1214430502",O_RDWR,0600) = 46 (0x2e)
fstat(46,{ mode=-rw------- ,inode=20,size=8388608,blksize=4096 }) = 0 (0x0)
mmap(0x0,8388608,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0)
= 35110518784 (0x82cc00000)
close(46) = 0 (0x0)
The access pattern for the memory is as follows:
1. In the "build" phase we first initialise the bucket segment with
zeroes (sequential), and then load all the tuples into the other
segments (sequential) and insert them into the buckets (random,
compare-and-swap). We add more segments as necessary, gradually
cranking up the sizes.
2. In the "probe" phase, all access is read only. We probe the
buckets (random) and follow pointers to tuples in the other segments
(random).
Afterwards we unmap them and shm_unlink() them, and the parallel
worker processes exit. It's possibly that we'll want to recycle
memory segments and worker processes in future, but I thought I'd
point out that we don't do that in case it's relevant.
I understand that the philosophy is not to provide explicit control
over page size. That makes sense, but I'd be grateful for any tips on
how to encourage super pages for this use case.
Thanks,
Thomas Munro
====
How to see this (assuming you have git, gmake, flex, bison, readline,
curl, ccache):
# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres
# you might want to apply the attached patch to get aligned segments
patch -p1 < super-aligned.patch
# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include
--with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install
# create a db cluster
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '1GB'" >> ~/pgdata/postgresql.conf
# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata
# run the psql shell
~/install/bin/psql postgres
# inside psql, find your backend's pid
# (you can also find the parallel workers with top, but they come and
go with each query)
select pg_backend_pid();
# create a table and set memory size to avoid more complicated
batching behaviour
create table t as select generate_series(1, 8000000)::int i;
analyze t;
set work_mem = '1GB';
# if for some reason you want to change the number of parallel workers, try:
# set max_parallel_workers_per_gather = 2;
# this is quite handy for removing all disk IO from the picture
create extension pg_prewarm;
select pg_prewarm('t'::regclass);
# run a toy parallel hash join
explain analyze select count(*) from t t1 join t t2 using (i);
In procstat -v you should see that it spends about half its time
"building" which looks like slowly adding new mappings and touching
more and more pages, and then about half of its time "probing", where
there are no further changes visible in procstat -v. If your results
are like mine, only after building will you see any S mappings appear,
and then only rarely.
In version 11 of PostgreSQL (about to escape) we've introduced shared
memory-based parallel hash joins. Academic reports and basic
intuition tell me that big hash tables should benefit from super pages
due to random access. I'm interested in exploring this effect on
FreeBSD. How can I encourage super pages in short-lived mappings,
given our access pattern?
I don't have a good understanding of virtual memory (though I'm trying
to learn), but let me explain what we're doing, and what I see so far
on FreeBSD. We have two kinds of shared memory, and here I'm
interested in the second one:
1. We have a big fixed sized area that lives as long as the database
is up, that holds our buffer pool and other shared state. It is
inherited by every PostgreSQL process. Its size is set in
postgresql.conf with eg shared_buffers = '1GB'. On any long running
not-completely-idle system I eventually see that it is using super
pages (though I wish procstat -v would tell me how many):
752 0x802e00000 0x80bb9e000 rw- 3730 3730 5 0 --S- df
That's cool, and we know from benchmarks and experience on Linux and
Windows (where we explicitly request huge/large pages with MAP_HUGETLB
and MEM_LARGE_PAGES respectively) that this has beneficial performance
effects, so I'm happy that FreeBSD eventually reaches that state too
(though I haven't yet grokked exactly how and when that happens or
attempted to measure its impact on FreeBSD). No problem here AFAIK.
2. In recent versions of PostgreSQL we got parallel computing fever,
and decided we needed more dynamic shared memory for that. So we
create chunks of memory with shm_open(), size them with ftruncate()
and then map them into our main process + worker processes (yeah,
still no threads after 30 years). This is where parallel hash join
data goes.
To get rid of obvious obstacles to super page promotion, I patched my
local copy of PostgreSQL to make sure that we always ask for multiples
of 2MB and set MAP_ALIGNED_SUPER, but still no cigar (or not much
cigar, anyway). Here's what I see on my super slow laptop running
recent HEAD. I'll put repro instructions below in case anyone is
interested. I ran a 3-process ~380MB join that takes ~90s. These
mappings appeared in my memory:
18288 0x826600000 0x82a621000 rw- 16385 16385 4 0 ---- df
18288 0x82a800000 0x82ac00000 rw- 1024 1024 4 0 ---- df
18288 0x82ac00000 0x82b000000 rw- 1024 1024 4 0 ---- df
18288 0x82b000000 0x82b400000 rw- 1024 1024 4 0 ---- df
18288 0x82b400000 0x82b800000 rw- 1024 1024 4 0 ---- df
18288 0x82b800000 0x82c000000 rw- 2048 2048 4 0 --S- df
18288 0x82c000000 0x82c800000 rw- 2048 2048 4 0 ---- df
18288 0x82c800000 0x82d000000 rw- 2048 2048 4 0 ---- df
18288 0x82d000000 0x82d800000 rw- 2048 2048 4 0 ---- df
18288 0x82d800000 0x82e800000 rw- 4096 4096 4 0 ---- df
18288 0x82e800000 0x82f800000 rw- 4096 4096 4 0 ---- df
18288 0x82f800000 0x830800000 rw- 4096 4096 4 0 ---- df
18288 0x830800000 0x831800000 rw- 4096 4096 4 0 ---- df
18288 0x831800000 0x833800000 rw- 8192 8192 4 0 --S- df
18288 0x833800000 0x835800000 rw- 8192 8192 4 0 ---- df
18288 0x835800000 0x837800000 rw- 8192 8192 4 0 ---- df
18288 0x837800000 0x839800000 rw- 8192 8192 4 0 ---- df
18288 0x839800000 0x83d800000 rw- 16102 16102 4 0 ---- df
That's actually the best case I've seen, with two S. Usually there
are no cases of S, and sometimes just 1. The big mapping at the top
holds the hash table buckets, and I've never seen an S there. The
rest of them hold tuples.
Looking at the output of sysctl vm.pmap before and after a run, I saw:
vm.pmap.ad_emulation_superpage_promotions: 0
vm.pmap.num_superpage_accessed_emulations: 0
vm.pmap.num_accessed_emulations: 0
vm.pmap.num_dirty_emulations: 0
vm.pmap.pdpe.demotions: no change
vm.pmap.pde.promotions: +20
vm.pmap.pde.p_failures: +1
vm.pmap.pde.mappings: no change
vm.pmap.pde.demotions: +48
vm.pmap.pcid_save_cnt: 21392597
vm.pmap.pti: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1
With the attached patch, the syscalls look like this in truss in the
backend that creates each shm segment:
shm_open("/PostgreSQL.1721888107",O_RDWR|O_CREAT|O_EXCL,0600) = 46 (0x2e)
ftruncate(46,0x400000) = 0 (0x0)
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0)
= 35081158656 (0x82b000000)
close(46) = 0 (0x0)
... and like this in other backends that map them in:
shm_open("/PostgreSQL.1214430502",O_RDWR,0600) = 46 (0x2e)
fstat(46,{ mode=-rw------- ,inode=20,size=8388608,blksize=4096 }) = 0 (0x0)
mmap(0x0,8388608,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0)
= 35110518784 (0x82cc00000)
close(46) = 0 (0x0)
The access pattern for the memory is as follows:
1. In the "build" phase we first initialise the bucket segment with
zeroes (sequential), and then load all the tuples into the other
segments (sequential) and insert them into the buckets (random,
compare-and-swap). We add more segments as necessary, gradually
cranking up the sizes.
2. In the "probe" phase, all access is read only. We probe the
buckets (random) and follow pointers to tuples in the other segments
(random).
Afterwards we unmap them and shm_unlink() them, and the parallel
worker processes exit. It's possibly that we'll want to recycle
memory segments and worker processes in future, but I thought I'd
point out that we don't do that in case it's relevant.
I understand that the philosophy is not to provide explicit control
over page size. That makes sense, but I'd be grateful for any tips on
how to encourage super pages for this use case.
Thanks,
Thomas Munro
====
How to see this (assuming you have git, gmake, flex, bison, readline,
curl, ccache):
# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres
# you might want to apply the attached patch to get aligned segments
patch -p1 < super-aligned.patch
# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include
--with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install
# create a db cluster
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '1GB'" >> ~/pgdata/postgresql.conf
# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata
# run the psql shell
~/install/bin/psql postgres
# inside psql, find your backend's pid
# (you can also find the parallel workers with top, but they come and
go with each query)
select pg_backend_pid();
# create a table and set memory size to avoid more complicated
batching behaviour
create table t as select generate_series(1, 8000000)::int i;
analyze t;
set work_mem = '1GB';
# if for some reason you want to change the number of parallel workers, try:
# set max_parallel_workers_per_gather = 2;
# this is quite handy for removing all disk IO from the picture
create extension pg_prewarm;
select pg_prewarm('t'::regclass);
# run a toy parallel hash join
explain analyze select count(*) from t t1 join t t2 using (i);
In procstat -v you should see that it spends about half its time
"building" which looks like slowly adding new mappings and touching
more and more pages, and then about half of its time "probing", where
there are no further changes visible in procstat -v. If your results
are like mine, only after building will you see any S mappings appear,
and then only rarely.