Discussion: The pagedaemon evicts ARC before scanning the inactive page list
Alan Somers
2021-05-18 21:07:44 UTC
I'm using ZFS on servers with tons of RAM and running FreeBSD
12.2-RELEASE. Sometimes they get into a pathological situation where most
of that RAM sits unused. For example, right now one of them has:

2 GB Active
529 GB Inactive
16 GB Free
99 GB ARC total
469 GB ARC max
86 GB ARC target

When a server gets into this situation, it stays there for days, with the
ARC target barely budging. All that inactive memory never gets reclaimed
and put to good use. Frequently the server never recovers until a reboot.

I have a theory for what's going on. Ever since r334508^ the pagedaemon
sends the vm_lowmem event _before_ it scans the inactive page list. If the
ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
any. Is that order really correct? For reference, here's the relevant
code, from vm_pageout_worker:

shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
if (shortage > 0) {
        ofree = vmd->vmd_free_count;
        /* vm_lowmem is broadcast here, before the inactive scan below. */
        if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
                /* Credit whatever the lowmem handlers managed to free. */
                shortage -= min(vmd->vmd_free_count - ofree,
                    (u_int)shortage);
        target_met = vm_pageout_scan_inactive(vmd, shortage,
            &addl_shortage);
} else
        addl_shortage = 0;

Raising vfs.zfs.arc_min seems to work around the problem. But ideally that
wouldn't be necessary.

-Alan

^ https://svnweb.freebsd.org/base?view=revision&revision=334508
Kevin Day
2021-05-18 21:37:22 UTC
I'm not sure if this is exactly the same thing, but I believe I'm seeing something similar on 12.2-RELEASE as well.

Mem: 5628M Active, 4043M Inact, 8879M Laundry, 12G Wired, 1152M Buf, 948M Free
ARC: 8229M Total, 1010M MFU, 6846M MRU, 26M Anon, 32M Header, 315M Other
7350M Compressed, 9988M Uncompressed, 1.36:1 Ratio
Swap: 2689M Total, 2337M Used, 352M Free, 86% Inuse

Inact keeps growing until it exhausts all swap, to the point that the kernel complains (swap_pager_getswapspace(xx): failed), and the system never recovers until it reboots. ARC keeps shrinking and growing, but Inact grows forever. It hasn't broken anything since the last reboot on this box, but on a bigger server (below) I can watch Inactive slowly grow and never be freed until it's swapping so badly I have to reboot.

Mem: 9648M Active, 604G Inact, 22G Laundry, 934G Wired, 1503M Buf, 415G Free
Mark Johnston
2021-05-18 21:50:30 UTC
Post by Kevin Day
I'm not sure if this is the exact same thing, but I believe I'm seeing similar in 12.2-RELEASE as well.
Mem: 5628M Active, 4043M Inact, 8879M Laundry, 12G Wired, 1152M Buf, 948M Free
ARC: 8229M Total, 1010M MFU, 6846M MRU, 26M Anon, 32M Header, 315M Other
7350M Compressed, 9988M Uncompressed, 1.36:1 Ratio
Swap: 2689M Total, 2337M Used, 352M Free, 86% Inuse
Inact will keep growing, then it will exhaust all swap to the point it's complaining (swap_pager_getswapspace(xx): failed), and never recover until it reboots. ARC will keep shrinking and growing, but inactive grows forever. While it hasn't hit a point it's breaking things since the last reboot, on a bigger server (below) I can watch Inactive slowly grow and never free until it's swapping so badly I have to reboot.
Mem: 9648M Active, 604G Inact, 22G Laundry, 934G Wired, 1503M Buf, 415G Free
This sounds somewhat unrelated. Under memory pressure the kernel will
reclaim clean pages from the inactive queue, making them available to
other memory consumers like the ARC. Dirty pages in the inactive queue
have to be written to stable storage before they may be reclaimed; pages
waiting for such treatment show up as "laundry". If swap space is all
used up, then the kernel likely has no way to reclaim dirty inactive
pages short of killing processes. So the real question is, what's the
main source of inactive memory on your servers?
Mark Johnston
2021-05-18 21:45:18 UTC
Post by Alan Somers
I have a theory for what's going on. Ever since r334508^ the pagedaemon
sends the vm_lowmem event _before_ it scans the inactive page list. If the
ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
any. Is that order really correct?
That was the case even before r334508. Note that prior to that revision
vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0, before
scanning the inactive queue. During a memory shortage we have pass > 0.
pass == 0 only when the page daemon is scanning the active queue.
Post by Alan Somers
Raising vfs.zfs.arc_min seems to work around the problem. But ideally that
wouldn't be necessary.
vm_lowmem is too primitive: it doesn't tell subscribing subsystems
anything about the magnitude of the shortage. At the same time, the VM
doesn't know much about how much memory those subsystems are consuming.
A better strategy, at least for the ARC, would be to reclaim memory based
on the relative memory consumption of each subsystem. In your case, when
the page daemon goes to reclaim memory, it should use the inactive queue
to make up ~85% of the shortfall and reclaim the rest from the ARC. Even
better would be if the ARC could use the page cache as a second-level
cache, like the buffer cache does.
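(For a rough sense of that split using the numbers you posted: the ARC is
about 99 GB against 529 GB of inactive memory, i.e. roughly
99 / (99 + 529) = ~16% of the reclaimable total, which is where the ~85%
share for the inactive queue comes from.)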

Today I believe the ARC treats vm_lowmem as a signal to shed some
arbitrary fraction of evictable data. If the ARC is able to quickly
answer the question, "how much memory can I release if asked?", then
the page daemon could use that to determine how much of its reclamation
target should come from the ARC vs. the page cache.
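
A minimal sketch of the kind of hook that would enable, purely
illustrative and not in-tree code (the sized event and the
arc_lowmem_sized() name are hypothetical; only arc_c, arc_shrink_shift
and arc_reduce_target_size() exist today):

        /*
         * Hypothetical handler for a vm_lowmem variant that carries the
         * page daemon's shortage in bytes instead of a bare flag.
         */
        static void
        arc_lowmem_sized(void *arg __unused, uint64_t bytes)
        {
                uint64_t to_free;

                /*
                 * Evict only the ARC's share of the shortfall, keeping
                 * the existing arc_c >> arc_shrink_shift fraction as a
                 * cap rather than as the unconditional amount.
                 */
                to_free = MIN(bytes, arc_c >> arc_shrink_shift);
                arc_reduce_target_size(to_free);
        }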
Alan Somers
2021-05-18 22:00:14 UTC
Post by Mark Johnston
vm_lowmem is too primitive: it doesn't tell subscribing subsystems
anything about the magnitude of the shortage.
[...]
If the ARC is able to quickly answer the question, "how much memory can I
release if asked?", then the page daemon could use that to determine how
much of its reclamation target should come from the ARC vs. the page cache.
I guess I don't understand why you would ever free from the ARC rather than
from the inactive list. When is inactive memory ever useful?
Mark Johnston
2021-05-18 22:10:42 UTC
Post by Alan Somers
I guess I don't understand why you would ever free from the ARC rather
than from the inactive list. When is inactive memory ever useful?
Pages in the inactive queue are either unmapped or haven't had their
mappings referenced recently. But they may still be frequently accessed
by file I/O operations like sendfile(2). That's not to say that
reclaiming from other subsystems first is always the right strategy, but
note also that the page daemon may scan the inactive queue many times in
between vm_lowmem calls.
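
As a toy illustration of that kind of consumer (the socket setup is
omitted and serve_file() is just an example name, not anything in the
tree):

        #include <sys/types.h>
        #include <sys/socket.h>
        #include <sys/uio.h>
        #include <err.h>
        #include <fcntl.h>
        #include <unistd.h>

        /*
         * Push a file over an already-connected socket with sendfile(2).
         * The file's pages are read through the page cache without ever
         * being mapped, so they can sit on the inactive queue while
         * still being hit on every request.
         */
        static void
        serve_file(int sock, const char *path)
        {
                off_t sbytes = 0;
                int fd;

                if ((fd = open(path, O_RDONLY)) == -1)
                        err(1, "open");
                /* nbytes == 0 means "send until end of file". */
                if (sendfile(fd, sock, 0, 0, NULL, &sbytes, 0) == -1)
                        err(1, "sendfile");
                close(fd);
        }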
Alan Somers
2021-05-18 23:55:36 UTC
Post by Mark Johnston
Pages in the inactive queue are either unmapped or haven't had their
mappings referenced recently. But they may still be frequently accessed
by file I/O operations like sendfile(2). That's not to say that
reclaiming from other subsystems first is always the right strategy, but
note also that the page daemon may scan the inactive queue many times in
between vm_lowmem calls.
So by default ZFS tries to free (arc_target / 128) bytes of memory in
arc_lowmem. That's huge! On this server, pidctrl_daemon typically
requests 0-10 MB, and arc_lowmem tries to free 600 MB. It looks like it
would be easy to modify vm_lowmem to include the total amount of memory
that it wants freed. I could make such a patch. My next question is:
what's the fastest way to generate a lot of inactive memory? My first
attempt was "find . | xargs md5", but that isn't terribly effective. The
production machines are doing a lot of "zfs recv" and running some busy Go
programs, among other things, but I can't easily replicate that workload on
a development system.
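
For the record, the crudest generator I can think of is to fault in a big
file mapping and then drop it, leaving the pages behind for the inactive
queue. A sketch only (the path is just a placeholder):

        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <err.h>
        #include <fcntl.h>
        #include <unistd.h>

        int
        main(int argc, char **argv)
        {
                const char *path = argc > 1 ? argv[1] : "/var/tmp/bigfile";
                long pgsz = sysconf(_SC_PAGESIZE);
                volatile char sink = 0;
                struct stat sb;
                char *p;
                int fd;

                if ((fd = open(path, O_RDONLY)) == -1 || fstat(fd, &sb) == -1)
                        err(1, "%s", path);
                p = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        err(1, "mmap");
                /* Touch every page to pull the whole file into memory. */
                for (off_t off = 0; off < sb.st_size; off += pgsz)
                        sink += p[off];
                /* Unmap; the now-unreferenced pages are left to go inactive. */
                munmap(p, sb.st_size);
                close(fd);
                return (0);
        }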
-Alan
Rozhuk Ivan
2021-05-19 01:59:42 UTC
On Tue, 18 May 2021 17:55:36 -0600
Post by Alan Somers
what's the fastest way to generate a lot of inactive memory? My first
attempt was "find . | xargs md5", but that isn't terribly effective.
Try this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195882
Konstantin Belousov
2021-05-19 03:24:58 UTC
Post by Alan Somers
So by default ZFS tries to free (arc_target / 128) bytes of memory in
arc_lowmem. That's huge! On this server, pidctrl_daemon typically
requests 0-10 MB, and arc_lowmem tries to free 600 MB.
[...]
My next question is: what's the fastest way to generate a lot of inactive
memory? My first attempt was "find . | xargs md5", but that isn't terribly
effective. The production machines are doing a lot of "zfs recv" and
running some busy Go programs, among other things, but I can't easily
replicate that workload on a development system.
Is your machine ZFS-only? If yes, then the typical sources of inactive
memory are of two kinds:
- anonymous memory that apps allocate with facilities like malloc(3).
  If your inactive queue is shrinkable then it is probably not this,
  because dirty pages from anon objects must go through the
  laundry->swap route to get evicted, and you did not mention swapping.
- double-copy pages cached in the v_objects of ZFS vnodes, clean or
  dirty. If unmapped, these are mostly a waste. Even if mapped, the
  source of truth for the data is the ARC, AFAIU, so they can be dropped
  as well, since the inactive state means that their content is not hot.

You can try to inspect the objects contributing the most to the
inactive queue with 'vmstat -o' to see where most of the inactive pages
come from.

If they are indeed double-copy, then perhaps ZFS can react even to the
current primitive vm_lowmem signal somewhat differently. First, it could
do a pass over its vnodes and
- free clean unmapped pages
- if some targets are not met after that, launder dirty pages,
then return to freeing clean unmapped pages
all of that before ever touching its cache (the ARC).
Alan Somers
2021-05-19 03:55:25 UTC
Post by Konstantin Belousov
Is your machine ZFS-only? If yes, then the typical sources of inactive
memory are of two kinds:
No, there is also FUSE. But there is typically < 1GB of Buf memory, so I
didn't mention it.
Post by Konstantin Belousov
- anonymous memory that apps allocate with facilities like malloc(3).
  If your inactive queue is shrinkable then it is probably not this,
  because dirty pages from anon objects must go through the
  laundry->swap route to get evicted, and you did not mention swapping.
No, there's no appreciable amount of swapping going on. Nor is the laundry
list typically more than a few hundred MB.
Post by Konstantin Belousov
- double-copy pages cached in the v_objects of ZFS vnodes, clean or
  dirty. If unmapped, these are mostly a waste. Even if mapped, the
  source of truth for the data is the ARC, AFAIU, so they can be dropped
  as well, since the inactive state means that their content is not hot.
So if a process mmap()'s a file on ZFS and reads from it but never writes
to it, will those pages show up as inactive?
Post by Konstantin Belousov
You can try to inspect the objects contributing the most to the
inactive queue with 'vmstat -o' to see where most of the inactive pages
come from.
Wow, that did it! About 99% of the inactive pages come from just a few
vnodes which are used by the FUSE servers. But I also see a few large
entries like
1105308 333933 771375 1 0 WB df
what does that signify?
Konstantin Belousov
2021-05-19 04:17:02 UTC
Post by Alan Somers
Post by Konstantin Belousov
Is your machine ZFS-only? If yes, then typical source of inactive memory
No, there is also FUSE. But there is typically < 1GB of Buf memory, so I
didn't mention it.
As Mark mentioned, buffers use the page cache as a second-level cache.
More precisely, there is a relatively limited number of buffers in the
system, which are just headers describing a set of pages. When a buffer
is recycled, its pages are put on the inactive queue.

This is why I asked whether your machine is ZFS-only or not: I/O on
bufcache-using filesystems typically adds to the inactive queue.
Post by Alan Somers
So if a process mmap()'s a file on ZFS and reads from it but never writes
to it, will those pages show up as inactive?
It depends on the workload, and it does not matter much whether the pages
are clean or dirty. Right after mapping, or under an intense access
pattern, they sit on the active list. If not touched for long enough, or
once cycled through the buffer cache for I/O (but ZFS pages do not go
through the buffer cache), they are moved to inactive.
Post by Alan Somers
Wow, that did it! About 99% of the inactive pages come from just a few
vnodes which are used by the FUSE servers. But I also see a few large
entries like
1105308 333933 771375 1 0 WB df
what does that signify?
These are anonymous memory objects.
Alan Somers
2021-05-19 20:28:51 UTC
Follow-up:
All of the big inactive-memory consumers were files on FUSE file systems
that were being exported as CTL LUNs. ZFS files exported by CTL do not use
any res or inactive memory. I didn't test UFS. Curiously, removing the
LUN does not free the memory, but shutting down the FUSE daemon does. A
valid workaround is to set the vfs.fusefs.data_cache_mode sysctl to 0.
That prevents the kernel from caching any data from the FUSE file system.
I've tested this on both FreeBSD 12.2 and 13.0. Should the kernel do a
better job of reclaiming inactive memory before ARC? Yes, but in my case
it's better not to create so much inactive memory in the first place.
Thanks for everybody's help, especially kib's tip about "vmstat -o".
-Alan
Andriy Gapon
2021-05-20 06:44:20 UTC
All of the big inactive-memory consumers were files on FUSE file systems that
were being exported as CTL LUNs. ZFS files exported by CTL do not use any res
or inactive memory. I didn't test UFS. Curiously, removing the LUN does not
free the memory, but shutting down the FUSE daemon does. A valid workaround is
to set the vfs.fusefs.data_cache_mode sysctl to 0. That prevents the kernel
from caching any data from the FUSE file system. I've tested this on both
FreeBSD 12.2 and 13.0. Should the kernel do a better job of reclaiming
inactive memory before ARC? Yes, but in my case it's better not to create so
much inactive memory in the first place. Thanks for everybody's help,
especially kib's tip about "vmstat -o".
Nevertheless, the larger problem still exists. I can confirm it on my
system, and right now I am not using fusefs at all. I do, however, use
some programs that like to mmap a lot of big data.

The current pageout + ARC reclaim code really oppresses the ARC.
I think that Kostik and Mark made some good suggestions on how that can
be fixed.
--
Andriy Gapon