Regression when trying to replace poll() with kqueue()

Discussion:

Thomas Munro

2018-10-02 02:24:23 UTC

Hello FreeBSD hackers,

(CCing mjg and a list of other FreeBSD hackers he suggested)

In a fit of enthusiasm for FreeBSD, a couple of years ago I wrote a
patch to teach PostgreSQL to use kqueue(2). That was after we
switched over to epoll(2) on Linux for performance reasons. Our
default is to use poll(2) unless we have something better. The most
common usage pattern is simply waiting for read/write readiness on the
socket that is connected to the client + a pipe connected to the
parent supervisor process ("postmaster"), but we have plans for more
interesting kinds of multiplexing involving many more descriptors, and
in general this sits behind our very thin abstraction called
WaitEventSet (see latch.c in the PostgreSQL source tree) that can be
used for many things.

We did some testing using "pgbench" (instructions below) on various
platforms that have kqueue(2), and we got some conflicting results
from FreeBSD. When the system is heavily overloaded (a scenario we
want to work well, or at least not get worse under kqueue, even if
it's not the ideal way to run your database server), mjg reported that
with the kqueue patch performance was way better than unpatched when
the pgbench test client was running on a different host. Huzzah!

Unfortunately, another tester reported the performance was worse when
running pgbench from the same host (originally he complained about
NetBSD performance and then we realised FreeBSD was the same under
those conditions), and I confirmed that was the case for both Unix
sockets and TCP sockets. In one 96 (!) thread test, the TPS reported
by pgbench dropped from 70k to 50k queries per second on an 8 CPU
system. As crazy as those test conditions may seem, that is not a
good result.

Curiously, when truss'd, in the overloaded scenario that performs
worse, we very rarely seem to actually reach kevent(2). It seems like
there is some kind of scheduling difference producing the change.
Each PostgreSQL server process looks like this over ~10 seconds:

syscall                     seconds   calls  errors
sendto                  0.396840146    3452       0
recvfrom                0.415802029    3443       6
kevent                  0.000626393       6       0
gettimeofday            2.723923249   24053       0
                      ------------- ------- -------
                        3.537191817   30954       6

(That was captured on a virtualised system which had gettimeofday as a
syscall, but the effect has been reported on bare metal too, where no
gettimeofday calls show up; I don't believe that is a factor.)
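
To compare per-call cost and error rates across syscalls in a table like
the one above, a small awk filter is enough. This is only a sketch: the
function name summarise_truss is made up for illustration, and it
assumes the four-column "truss -c" layout shown here (syscall, seconds,
calls, errors), skipping the header, separator and total rows.

```shell
#!/bin/sh
# Summarise `truss -c` output: mean time per call and error rate per syscall.
# summarise_truss is a hypothetical helper name, not part of truss or PostgreSQL.
# Reads the named files, or stdin when no file is given.
summarise_truss() {
    awk '
        # Keep only data rows: a syscall name followed by three numeric fields.
        # The header row, the dashed separator and the totals row do not match.
        NF == 4 && $2 ~ /^[0-9.]+$/ && $3 ~ /^[0-9]+$/ && $3 > 0 {
            printf "%-14s %10.2f us/call %8.2f%% errors\n",
                   $1, $2 / $3 * 1e6, $4 / $3 * 100
        }
    ' "$@"
}
```

Fed the server table above (for example "summarise_truss server.txt",
where server.txt is a saved copy), this works out to roughly 105 to 121
microseconds per call for each of the four syscalls, and about 0.17% of
recvfrom calls returning an error.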

The pgbench client looks like this:

syscall                     seconds   calls  errors
ppoll                   0.002773195       1       0
sendto                 16.597880468    7217       0
recvfrom               25.646406008    7238       0
                      ------------- ------- -------
                       42.247059671   14456       0

(For whatever reason pgbench uses ppoll() instead, but I assume that's
irrelevant here; it's also multi-threaded, unlike the server.) The
truss -c results for the server are not much different when using
poll(2) instead of kevent(2), although recvfrom in the pgbench client
seems to show a few seconds less total time, which is curious. You
can see that we're mostly able to do sendto() and recvfrom() without
seeing EWOULDBLOCK. So it's not direct access to the kqueue that is
affecting performance. It's something else, something caused by the
mere existence of the kqueue object holding the descriptor.

That led several people to speculate that there may be a difference in
the wakeup logic, when one end of a descriptor is in a kqueue (mjg
speculated wake-up-one vs broadcast could be a factor), and that may
be leading to worse scheduling behaviour.

To be clear, nobody thinks that 96 client threads talking to 96
processes on a single 8 CPU box is a great way to run a system in real
life! But it's still surprising that we lose performance when using
kqueue, and it'd be great to understand why, and hopefully improve it.

The complete discussion on pgsql-hackers is here:

https://www.postgresql.org/message-id/flat/CAEepm%3D37oF84-iXDTQ9MrGjENwVGds%2B5zTr38ca73kWR7ez_tA%40mail.gmail.com

Any ideas would be most welcome.

Thanks for reading!

====

Reproduction steps (assuming you have git, gmake, flex, bison,
readline, curl, ccache):

# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres

# grab kqueue patch
curl -O https://www.postgresql.org/message-id/attachment/65098/0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch
git checkout -b kqueue
git am 0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch

# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include \
    --with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install

# create a db cluster and set it to use 2GB of shmem so we can hold
# whole dataset
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '2GB'" >> ~/pgdata/postgresql.conf

# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata

# this should produce about 1.1GB of data under ~/pgdata
~/install/bin/pgbench -s 10 -i postgres

# install the prewarm extension, so we can run the test without doing
# any file IO
~/install/bin/psql postgres -c "create extension pg_prewarm"

# after that, after any server restart, prewarm like so:
~/install/bin/psql postgres -c "select pg_prewarm(c.oid::regclass)
from pg_class c where relkind in ('r', 'i')" | cat

# then 60 second pgbench runs are simply:
~/install/bin/pgbench -c 96 -j 96 -M prepared -S -T 60 postgres

# to make pgbench use TCP instead of Unix sockets, add -h localhost;
# to allow connection from another host, update ~/pgdata/postgresql.conf's
# listen_addresses
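
To see where the patched and unpatched builds diverge, the 60 second
run above could be wrapped in a loop over client counts. The following
is a rough sketch, not part of the original steps: the names sweep and
extract_tps and the PGBENCH, DBNAME and DURATION variables are invented
for illustration. It assumes pgbench reports throughput on lines
starting with "tps = " (some releases print two such lines; this takes
the first).

```shell
#!/bin/sh
# Sweep pgbench client counts and print "clients tps" pairs.
# PGBENCH, DBNAME and DURATION are illustrative knobs with defaults
# matching the reproduction steps; override them in the environment.
PGBENCH=${PGBENCH:-$HOME/install/bin/pgbench}
DBNAME=${DBNAME:-postgres}
DURATION=${DURATION:-60}

# Pull the first "tps = N" figure out of a pgbench report on stdin.
extract_tps() {
    awk '$1 == "tps" && $2 == "=" && !seen { print $3; seen = 1 }'
}

# Run one select-only, prepared-statement pass per client count argument,
# using as many pgbench threads as clients (as in the -c 96 -j 96 run).
sweep() {
    for clients in "$@"; do
        tps=$("$PGBENCH" -c "$clients" -j "$clients" -M prepared -S \
              -T "$DURATION" "$DBNAME" 2>/dev/null | extract_tps)
        # Print "failed" if pgbench produced no tps line.
        printf '%s %s\n' "$clients" "${tps:-failed}"
    done
}
```

For example, "sweep 8 16 32 64 96 > kqueue.txt" against the patched
build and the same against an unpatched build gives two columns that
can be plotted side by side; the client counts here are arbitrary.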

Thomas Munro

2018-10-02 02:26:25 UTC

... The most
common usage pattern is simply waiting for read/write readiness on the
socket that is connected to the client + a pipe connected to the
parent supervisor process ("postmaster"),

... + a self pipe.

Allan Jude

2018-10-03 05:01:03 UTC

I have started to look into this a bit. I have not really gotten
anywhere yet, but I have produced a graph comparing the performance of
vanilla postgres vs your patch.

https://imgur.com/a/gKycGxW

They scale identically up to the 20 threads of hardware on my test
machine, and then kqueue falls off much more quickly.

Hopefully I'll have more useful findings tomorrow.

--
Allan Jude

Thomas Munro

2018-10-03 05:16:11 UTC

Post by Allan Jude
I have started to look into this a bit. I have not really gotten
anywhere yet, but I have produced a graph comparing the performance of
vanilla postgres vs your patch.
https://imgur.com/a/gKycGxW
They scale identically up to the 20 threads of hardware on my test
machine, and then kqueue falls off much more quickly.

Great news, thanks for looking at this.