The pagedaemon evicts ARC before scanning the inactive page list

Discussion:

Alan Somers

2021-05-18 21:07:44 UTC

I'm using ZFS on servers with tons of RAM and running FreeBSD
12.2-RELEASE. Sometimes they get into a pathological situation where most
of that RAM sits unused. For example, right now one of them has:

2 GB Active
529 GB Inactive
16 GB Free
99 GB ARC total
469 GB ARC max
86 GB ARC target

When a server gets into this situation, it stays there for days, with the
ARC target barely budging. All that inactive memory never gets reclaimed
and put to a good use. Frequently the server never recovers until a reboot.

I have a theory for what's going on. Ever since r334508^ the pagedaemon
sends the vm_lowmem event _before_ it scans the inactive page list. If the
ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
any. Is that order really correct? For reference, here's the relevant
code, from vm_pageout_worker:

shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
if (shortage > 0) {
ofree = vmd->vmd_free_count;
if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
shortage -= min(vmd->vmd_free_count - ofree,
(u_int)shortage);
target_met = vm_pageout_scan_inactive(vmd, shortage,
&addl_shortage);
} else
addl_shortage = 0

Raising vfs.zfs.arc_min seems to workaround the problem. But ideally that
wouldn't be necessary.

-Alan

^ https://svnweb.freebsd.org/base?view=revision&revision=334508

Kevin Day

2021-05-18 21:37:22 UTC

Permalink

I'm not sure if this is the exact same thing, but I believe I'm seeing similar in 12.2-RELEASE as well.

Mem: 5628M Active, 4043M Inact, 8879M Laundry, 12G Wired, 1152M Buf, 948M Free
ARC: 8229M Total, 1010M MFU, 6846M MRU, 26M Anon, 32M Header, 315M Other
7350M Compressed, 9988M Uncompressed, 1.36:1 Ratio
Swap: 2689M Total, 2337M Used, 352M Free, 86% Inuse

Inact will keep growing, then it will exhaust all swap to the point it's complaining (swap_pager_getswapspace(xx): failed), and never recover until it reboots. ARC will keep shrinking and growing, but inactive grows forever. While it hasn't hit a point it's breaking things since the last reboot, on a bigger server (below) I can watch Inactive slowly grow and never free until it's swapping so badly I have to reboot.

Mem: 9648M Active, 604G Inact, 22G Laundry, 934G Wired, 1503M Buf, 415G Free

Post by Alan Somers
2 GB Active
529 GB Inactive
16 GB Free
99 GB ARC total
469 GB ARC max
86 GB ARC target
When a server gets into this situation, it stays there for days, with the ARC target barely budging. All that inactive memory never gets reclaimed and put to a good use. Frequently the server never recovers until a reboot.
shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
if (shortage > 0) {
ofree = vmd->vmd_free_count;
if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
shortage -= min(vmd->vmd_free_count - ofree,
(u_int)shortage);
target_met = vm_pageout_scan_inactive(vmd, shortage,
&addl_shortage);
} else
addl_shortage = 0
Raising vfs.zfs.arc_min seems to workaround the problem. But ideally that wouldn't be necessary.
-Alan
^ https://svnweb.freebsd.org/base?view=revision&revision=334508 <https://svnweb.freebsd.org/base?view=revision&revision=334508>

Mark Johnston

2021-05-18 21:50:30 UTC

Permalink

Post by Kevin Day
I'm not sure if this is the exact same thing, but I believe I'm seeing similar in 12.2-RELEASE as well.
Mem: 5628M Active, 4043M Inact, 8879M Laundry, 12G Wired, 1152M Buf, 948M Free
ARC: 8229M Total, 1010M MFU, 6846M MRU, 26M Anon, 32M Header, 315M Other
7350M Compressed, 9988M Uncompressed, 1.36:1 Ratio
Swap: 2689M Total, 2337M Used, 352M Free, 86% Inuse
Inact will keep growing, then it will exhaust all swap to the point it's complaining (swap_pager_getswapspace(xx): failed), and never recover until it reboots. ARC will keep shrinking and growing, but inactive grows forever. While it hasn't hit a point it's breaking things since the last reboot, on a bigger server (below) I can watch Inactive slowly grow and never free until it's swapping so badly I have to reboot.
Mem: 9648M Active, 604G Inact, 22G Laundry, 934G Wired, 1503M Buf, 415G Free

This sounds somewhat unrelated. Under memory pressure the kernel will
reclaim clean pages from the inactive queue, making them available to
other memory consumers like the ARC. Dirty pages in the inactive queue
have to be written to stable storage before they may be reclaimed; pages
waiting for such treatment show up as "laundry". If swap space is all
used up, then the kernel likely has no way to reclaim dirty inactive
pages short of killing processes. So the real question is, what's the
main source of inactive memory on your servers?

Mark Johnston

2021-05-18 21:45:18 UTC

Permalink

Post by Alan Somers
I'm using ZFS on servers with tons of RAM and running FreeBSD
12.2-RELEASE. Sometimes they get into a pathological situation where most
2 GB Active
529 GB Inactive
16 GB Free
99 GB ARC total
469 GB ARC max
86 GB ARC target
When a server gets into this situation, it stays there for days, with the
ARC target barely budging. All that inactive memory never gets reclaimed
and put to a good use. Frequently the server never recovers until a reboot.
I have a theory for what's going on. Ever since r334508^ the pagedaemon
sends the vm_lowmem event _before_ it scans the inactive page list. If the
ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
any. Is that order really correct? For reference, here's the relevant

That was the case even before r334508. Note that prior to that revision
vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0, before
scanning the inactive queue. During a memory shortage we have pass > 0.
pass == 0 only when the page daemon is scanning the active queue.

Post by Alan Somers
shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
if (shortage > 0) {
ofree = vmd->vmd_free_count;
if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
shortage -= min(vmd->vmd_free_count - ofree,
(u_int)shortage);
target_met = vm_pageout_scan_inactive(vmd, shortage,
&addl_shortage);
} else
addl_shortage = 0
Raising vfs.zfs.arc_min seems to workaround the problem. But ideally that
wouldn't be necessary.

vm_lowmem is too primitive: it doesn't tell subscribing subsystems
anything about the magnitude of the shortage. At the same time, the VM
doesn't know much about how much memory they are consuming. A better
strategy, at least for the ARC, would be reclaim memory based on the
relative memory consumption of each subsystem. In your case, when the
page daemon goes to reclaim memory, it should use the inactive queue to
make up ~85% of the shortfall and reclaim the rest from the ARC. Even
better would be if the ARC could use the page cache as a second-level
cache, like the buffer cache does.

Today I believe the ARC treats vm_lowmem as a signal to shed some
arbitrary fraction of evictable data. If the ARC is able to quickly
answer the question, "how much memory can I release if asked?", then
the page daemon could use that to determine how much of its reclamation
target should come from the ARC vs. the page cache.

Alan Somers

2021-05-18 22:00:14 UTC