Various problems with 13.0 amd64 on vultr.com

Discussion:

Mark Delany

2021-04-18 10:47:00 UTC

Hi all.

I rarely if ever post here so if there's a better place, LMK.

I've been running 12.2 on vultr.com instances for a long time without any issues. However
I recently attempted an upgrade to 13.0 and the system now exhibits a number of issues.

The most critical issue is that the system randomly wedges after running for a while
(anywhere from 10 minutes to a couple of hours), requiring a reboot to recover. There is no
console response or messages and only limited network response (see below). No messages are
logged anywhere as best I can tell.

The second issue is more annoying than critical: the system doesn't reboot with the
reboot/shutdown commands. The shutdown sequence seems to complete but the reboot never
occurs. I compiled and ran a "reboot(RB_AUTOBOOT | RB_VERBOSE)" but nothing interesting
showed up.

I have no idea whether the two issues are related, except that neither occurs with 12.2.

Some details:

- I first upgraded with freebsd-update and then tried with a fresh ISO image and
completely overwrote the original file system.

- I've tried both UFS and ZFS root file systems.

- I tried with a fresh VM instance in case there was some sort of per-instance glitch.

- The system is 99% idle with no memory pressure. It normally runs nsd, openntpd and a few
other processes installed via pkg, but nothing weird as best I can tell.

- It has no kernel modules manually loaded.

- It's configured with ipv4 and ipv6 and when it gets wedged I get a ping response from
the ipv6 address, but not from ipv4. Furthermore, if I try a tcp connection to ipv6 I
get a connection setup, but no data.

- The VM is configured as a single-CPU system.

- I haven't raised the issue with vultr yet. Thought I'd see what the hive-mind thinks
first.
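
The ipv4/ipv6 symptom described in the list above can be checked from another host with
something like the following. This is only a minimal sketch: the function names are made up
for illustration, the addresses in the usage comment are documentation placeholders, and it
assumes a ping(8) that accepts -4 and -6 (where it does not, ping6 would be the substitute).

```sh
#!/bin/sh
# Classify the host state from two ping exit codes (0 = reply, non-zero = no reply).
classify() {
    v4=$1
    v6=$2
    if [ "$v4" -eq 0 ] && [ "$v6" -eq 0 ]; then
        echo "up"
    elif [ "$v4" -ne 0 ] && [ "$v6" -eq 0 ]; then
        echo "wedged: ipv6-only"
    elif [ "$v4" -eq 0 ]; then
        echo "ipv4-only"
    else
        echo "down"
    fi
}

# Send one echo request per address family and classify the result.
# $1 is the ipv4 address, $2 the ipv6 address of the instance.
probe() {
    ping -4 -c 1 "$1" >/dev/null 2>&1; r4=$?
    ping -6 -c 1 "$2" >/dev/null 2>&1; r6=$?
    classify "$r4" "$r6"
}

# Example with placeholder addresses (replace with the instance's own):
#   probe 192.0.2.10 2001:db8::10
```

Run from cron or a loop on a second machine, this would timestamp the moment the instance
goes into the "ipv6 answers, ipv4 does not" state.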

Not that it will surprise anyone, but I recently spun up 13.0 in Virtualbox on a lab
machine as well as on a different VM provider without any problems, so it's probably
something relatively unique to vultr.

That this is a virtually idle system on a single CPU with no oddball or unusual kernel
modules or network configs makes the situation surprising to me. There is no pattern that
I'm yet able to discern. The main thing I have left to try is to boot the system without
any networking activated, but apart from that I'm out of ideas in terms of identifying the
root cause.

So my questions are:

1. Anyone else having the same issue? Or not having the same issue?
2. Clues on how to diagnose? This is a non-critical system so I can try anything that
anyone suggests but I'm not particularly familiar with kernel-level debugging so a bit
of hand-holding might be needed if you have suggestions.

For those unfamiliar with vultr's VMs, here's the first part of dmesg:

FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr 9 04:24:09 UTC 2021
***@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
FreeBSD clang version 11.0.1 (***@github.com:llvm/llvm-project.git llvmorg-11.0.1-0-g43ff75f2c3fe)
VT(vga): text 80x25
CPU: Intel Xeon Processor (Cascadelake) (2993.02-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x50656 Family=0x6 Model=0x55 Stepping=6
Features=0x783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2>
Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
AMD Features2=0x21<LAHF,ABM>
Structured Extended Features=0xd18307a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,AVX512F,AVX512DQ,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL>
Structured Extended Features2=0x808<PKU,AVX512VNNI>
Structured Extended Features3=0xa4000000<IBPB,ARCH_CAP,SSBD>
XSAVE Features=0x1<XSAVEOPT>
IA32_ARCH_CAPS=0x2b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO>
Hypervisor: Origin = "KVMKVMKVM"
real memory = 1073741824 (1024 MB)
avail memory = 997744640 (951 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <BOCHS BXPCAPIC>
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
random: unblocking device.
ioapic0 <Version 1.1> irqs 0-23
Timecounter "TSC-low" frequency 1496510010 Hz quality 800

in case it shows up anything odd to those who can decode this sort of stuff.

Mark.

Mason Loring Bliss

2021-04-20 02:13:18 UTC

Post by Mark Delany
The second issue is more annoying than critical: the system doesn't
reboot with the reboot/shutdown commands.

I was curious about what you're describing, so I uploaded a 13.0 ISO and
noted that this hang on reboot is there, starting at the very first reboot
attempt, after booting from the ISO to install.

Since FreeBSD lacks "dmesg -w" I'm tailing dmesg in a loop (2s delay
between iterations) on the console to see if it catches anything funny. You
could do this as well, and if you do it from the console you could also
test the system sans networking.

I haven't seen a hang yet, but the test system hasn't been up much more
than ten minutes, so I'll report back later.
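
For anyone wanting to try the same thing, a polling loop along these lines approximates
"dmesg -w". It is a sketch rather than the exact loop used above: the function names are
invented for illustration, and it simply diffs each dmesg snapshot against the previous one
and prints the added lines.

```sh
#!/bin/sh
# Print the lines that are in file $2 but were not in file $1.
new_lines() {
    diff "$1" "$2" | sed -n 's/^> //p'
}

# Re-run a command every $1 seconds and print only output not seen before.
watch_new() {
    interval=$1
    shift
    prev=$(mktemp) || return 1
    cur=$(mktemp) || return 1
    "$@" > "$prev" 2>&1
    while sleep "$interval"; do
        "$@" > "$cur" 2>&1
        new_lines "$prev" "$cur"
        cp "$cur" "$prev"
    done
}

# Usage on the console (2s delay, as in the post):
#   watch_new 2 dmesg
```

Lines that scroll out of the kernel message buffer show up as deletions in the diff and are
ignored, so only newly logged messages are printed.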

--
Mason Loring Bliss (( If I have not seen as far as others, it is because
***@blisses.org )) giants were standing on my shoulders. - Hal Abelson

Mark Delany

2021-04-20 07:03:18 UTC

Post by Mason Loring Bliss
I haven't seen a hang yet, but the test system hasn't been up much more
than ten minutes, so I'll report back later.

I think I've isolated it to natd traffic.

And for what it's worth I was able to reproduce the problem on a completely different VPS
provider. So I don't think it's specific to vultr.com any more.

I guess I should raise a PR or move on over to freebsd-net. Is that the right thing to do?

Mark.

Mark Delany

2021-04-20 05:32:37 UTC

Post by Mason Loring Bliss
I haven't seen a hang yet, but the test system hasn't been up much more
than ten minutes, so I'll report back later.

I think I've isolated it to natd traffic.

The system stays up reliably with natd disabled but hangs within a couple of minutes of any
inbound ipv4 traffic.

If I just run with the ipfw rule and the divert kernel module, then there's no problem: the
system runs, albeit without any real ipv4 traffic working for obvious reasons. But I can
happily do anything I like in ipv6 and it runs fine.

But as soon as natd is run with inbound traffic such as an ssh session, then the system
mostly hangs and according to the vultr console, it's spinning at 100% CPU.

I say "mostly hangs" because I have now caused at least one core dump while ostensibly
reproducing the hang.

Here is a snippet of crashinfo data. Happy to provide more to anyone but it's 90K so I
didn't think it appropriate to post it here.

...
Unread portion of the kernel message buffer:
panic: sbappendaddr_locked
cpuid = 0
time = 1618895504
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff80ca51e0 at sbappendaddr_locked_internal+0
#4 0xffffffff827eafd0 at divert_packet+0x1a0
#5 0xffffffff827a2c81 at ipfw_check_packet+0x2c1
#6 0xffffffff80d41f87 at pfil_run_hooks+0x97
#7 0xffffffff80db2d71 at ip_output+0xb61
#8 0xffffffff80dc94b4 at tcp_output+0x1b04
#9 0xffffffff80dcf973 at tcp_ctlinput+0x313
#10 0xffffffff80daf105 at icmp_input+0x795
#11 0xffffffff80dafc15 at ip_input+0x125
#12 0xffffffff80d3fa7b at swi_net+0x12b
#13 0xffffffff80bcae5d at ithread_loop+0x24d
#14 0xffffffff80bc7c5e at fork_exit+0x7e
#15 0xffffffff8106282e at fork_trampoline+0xe
Uptime: 11m13s
Dumping 123 out of 982 MB:..13%..26%..39%..52%..65%..78%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c09916 in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d90 in vpanic (fmt=<optimized out>, ap=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b93 in panic (fmt=<unavailable>)
at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff80ca51e0 in sbappendaddr_locked (sb=0xfffff800069b4c58,
asa=0xfffffe00491bcd00, m0=0xfffff80006b7a000, control=0x0)
at /usr/src/sys/kern/uipc_sockbuf.c:1198
#6 0xffffffff827eafd0 in divert_packet (m=0xfffff80006b7a000,
incoming=<optimized out>) at /usr/src/sys/netinet/ip_divert.c:285
#7 0xffffffff827a2c81 in ipfw_divert (m0=0xfffffe00491bcf58,
args=0xfffffe00491bcd70, tee=<optimized out>)
at /usr/src/sys/netpfil/ipfw/ip_fw_pfil.c:525
#8 ipfw_check_packet (m0=0xfffffe00491bcf58, ifp=0xfffff8000358a000,
flags=131072, ruleset=<optimized out>, inp=0xfffff80006f92000)
at /usr/src/sys/netpfil/ipfw/ip_fw_pfil.c:283
#9 0xffffffff80d41f87 in pfil_run_hooks (head=<optimized out>, p=...,
ifp=0xfffff8000358a000, flags=flags@entry=131072,
inp=inp@entry=0xfffff80006f92000) at /usr/src/sys/net/pfil.c:187
#10 0xffffffff80db2d71 in ip_output_pfil (mp=0xfffffe00491bcf58,
ifp=0xfffff8000358a000, flags=0, inp=0xfffff80006f92000,
dst=0xfffff80006f921a8, fibnum=<optimized out>, error=<optimized out>)
at /usr/src/sys/netinet/ip_output.c:130
#11 ip_output (m=0x0, m@entry=0xfffff80006b7a000, opt=<optimized out>,
ro=<optimized out>, flags=0, imo=imo@entry=0x0, inp=<optimized out>)
at /usr/src/sys/netinet/ip_output.c:705
#12 0xffffffff80dc94b4 in tcp_output (tp=0xfffffe008b5e1c48)
at /usr/src/sys/netinet/tcp_output.c:1492
#13 0xffffffff80dcf973 in tcp_ctlinput (cmd=<unavailable>,
cmd@entry=<error reading variable: value is not available>,
sa=<unavailable>,
sa@entry=<error reading variable: value is not available>,
vip=0xfffff80006b511ac,
vip@entry=<error reading variable: value is not available>)
at /usr/src/sys/netinet/tcp_subr.c:2544
#14 0xffffffff80daf105 in icmp_input (mp=0xfffffe00491bd300,
mp@entry=<error reading variable: value is not available>,
offp=0xfffffe00491bd2fc,
offp@entry=<error reading variable: value is not available>,
proto=<unavailable>,
proto@entry=<error reading variable: value is not available>)
at /usr/src/sys/netinet/ip_icmp.c:571
#15 0xffffffff80dafc15 in ip_input (m=0x0)
at /usr/src/sys/netinet/ip_input.c:829
#16 0xffffffff80d3fa7b in netisr_process_workstream_proto (
nwsp=<optimized out>, proto=1) at /usr/src/sys/net/netisr.c:919
#17 swi_net (arg=<optimized out>) at /usr/src/sys/net/netisr.c:966
#18 0xffffffff80bcae5d in intr_event_execute_handlers (p=<optimized out>,
ie=0xfffff8000332bc00) at /usr/src/sys/kern/kern_intr.c:1168
#19 ithread_execute_handlers (p=<optimized out>, ie=0xfffff8000332bc00)
at /usr/src/sys/kern/kern_intr.c:1181
#20 ithread_loop (arg=arg@entry=0xfffff8000332fe00)
at /usr/src/sys/kern/kern_intr.c:1269
#21 0xffffffff80bc7c5e in fork_exit (
callout=0xffffffff80bcac10 <ithread_loop>, arg=0xfffff8000332fe00,
frame=0xfffffe00491bd480) at /usr/src/sys/kern/kern_fork.c:1069
#22 <signal handler called>
(kgdb)

...

Happy to provide further info and run anything that folk think might help provide more
useful diagnostic info.
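
For readers who want to pull the same summary out of their own crash report, something like
the following should work. It is a sketch under assumptions: crash dumps are enabled (for
example dumpdev="AUTO" in rc.conf), crashinfo(8) has written its report to the default
/var/crash directory, and the helper name is made up for illustration.

```sh
#!/bin/sh
# Print the panic string and the KDB stack frames from a crashinfo(8) report.
# Frame lines look like "#4 0xffffffff827eafd0 at divert_packet+0x1a0".
panic_summary() {
    grep -E '^(panic:|#[0-9]+ 0x[0-9a-f]+ at )' "$1"
}

# Usage, assuming savecore(8) saved dump number 0 in the default directory:
#   crashinfo -d /var/crash -n 0
#   panic_summary /var/crash/core.txt.0
```

The short summary is usually enough for a first mailing-list post or PR; the full report can
be attached separately.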

Oh, the interface, if it's relevant, is:

vtnet0: <VirtIO Networking Adapter> on virtio_pci0

Mark.

Freddie Cash

2021-04-20 16:39:21 UTC

Post by Mason Loring Bliss
I haven't seen a hang yet, but the test system hasn't been up much more
than ten minutes, so I'll report back later.

If you re-write your rules to use the in-kernel libalias support instead of
divert sockets sending traffic to natd, does it stay up while passing IPv4
traffic?

That would help narrow it down even further to natd issues.
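
As a rough idea of what that switch looks like, the sketch below prints (rather than runs)
the ipfw commands for in-kernel NAT. The NAT instance number, rule number and the same_ports
and reset options are assumptions chosen for illustration, not the original poster's ruleset;
natd itself would also need to be disabled (natd_enable="NO" in rc.conf) and the old divert
rule removed.

```sh
#!/bin/sh
# Print (do not run) ipfw commands that set up in-kernel NAT on one interface.
# Instance number 1 and rule number 500 are illustrative assumptions.
kernel_nat_rules() {
    ifname=$1
    echo "kldload ipfw_nat"
    echo "ipfw nat 1 config if $ifname same_ports reset"
    echo "ipfw add 500 nat 1 ip4 from any to any via $ifname"
}

# Review the output first, then feed it to sh as root once it looks right:
#   kernel_nat_rules vtnet0
```

If the box stays up with these rules while passing IPv4 traffic, that points at the divert
socket path rather than NAT in general.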

--
Freddie Cash
***@gmail.com

Mark Delany

2021-04-21 07:15:50 UTC

Post by Freddie Cash
If you re-write your rules to use the in-kernel libalias support instead of
divert sockets sending traffic to natd, does it stay up while passing IPv4
traffic?
That would help narrow it down even further to natd issues.

I've not used the in-kernel NAT support before so it'll take me a little while, but I'll
give it a shot and report back.

Mark.

Mark Delany

2021-05-27 01:45:19 UTC

Post by Mark Delany

I've not used the in-kernel NAT support before so it'll take me a little while, but I'll
give it a shot and report back.

Well, lucky me, I no longer need to do this as it looks like the problem was fixed in
13.0-RELEASE-p1 as part of an Errata fix for "Kernel double free when transmitting on a
divert socket".

I tested with the new kernel and now running natd no longer causes a kernel panic.
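
To check whether a given 13.0 machine already has a patched kernel, the patch level can be
read from the version string. This is a small sketch with a made-up helper name; it assumes
the usual "13.0-RELEASE-pN" format printed by freebsd-version(1) and treats a missing -pN
suffix as patch level 0.

```sh
#!/bin/sh
# Extract the patch level from a version string such as "13.0-RELEASE-p1".
# A string without a -pN suffix is reported as patch level 0.
patch_level() {
    case $1 in
        *-p[0-9]*) echo "${1##*-p}" ;;
        *) echo 0 ;;
    esac
}

# Usage on a 13.0 host; -k reports the installed (not necessarily running) kernel:
#   [ "$(patch_level "$(freebsd-version -k)")" -ge 1 ] && echo "p1 or later installed"
```

A reboot into the updated kernel is still needed after "freebsd-update fetch install" before
the fix takes effect.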

My one remaining beef with 13.0 relates to vultr more than anything else in that FBSD
doesn't reboot when requested. It gets into some late stage of the shutdown process and
just spins on CPU. I can live with this one, but will keep an eye on it.

Mark.

Nicolas Embriz

2021-05-27 06:45:46 UTC

Just in case, you can reboot using:

shutdown -o -n -r now

Not ideal but works for now as a workaround.

Mark Delany

2021-05-27 07:15:30 UTC

Post by Nicolas Embriz
shutdown -o -n -r now
Not ideal but works for now as a workaround.

Nicolas, what a great suggestion. This works perfectly on my vultr instance. Thanks.

Mark.

Gautam Mani

2021-05-28 13:18:25 UTC

Hi,

Post by Mark Delany

Post by Nicolas Embriz
shutdown -o -n -r now
Not ideal but works for now as a workaround.

Nicolas, what a great suggestion. This works perfectly on my vultr instance. Thanks.

I can confirm that I also see this issue on Vultr on a single-CPU, ZFS-based
system. Checking shutdown(8), I see:

-n If the -o option is specified, prevent the file system cache from
being flushed by passing -n to halt(8) or reboot(8). This option
should probably not be used.

So using -n could possibly result in filesystem corruption?

I also found: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253175
and also possibly related to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254513 which relates to
virtio_random.
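
One cautious variant of the workaround, offered as an assumption rather than anything
confirmed in this thread, is to run sync(8) immediately before the forced reboot so that
less unwritten data is in flight when -n skips the flush. The wrapper name and the DRY_RUN
switch below are made up for illustration.

```sh
#!/bin/sh
# Flush dirty buffers with sync(8), then use the workaround from the thread.
# With DRY_RUN=1 (the default here) the commands are only printed, not run.
forced_reboot() {
    for cmd in "sync" "shutdown -o -n -r now"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}

# Usage as root, once the printed commands look right:
#   DRY_RUN=0 forced_reboot
```

This narrows the window but does not close it: with -o, shutdown calls reboot(8) directly
instead of signalling init(8), so services are not stopped cleanly beforehand.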

Thanks,
Gautam

jon via freebsd-hackers

2021-04-20 00:45:23 UTC

Hello,

I happen to be running FreeBSD 13.0-RELEASE on a Vultr instance as well,
but haven't had any problems in the ~4 days since I updated from
12 RELEASE. My VM is a single CPU with 2G memory and UFS for a
filesystem. I do see that our VMs have different CPUs listed. Here is
the first part of my dmesg :

FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr 9 04:24:09 UTC 2021
***@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
FreeBSD clang version 11.0.1 (***@github.com:llvm/llvm-project.git llvmorg-11.0.1-0-g43ff75f2c3fe)
VT(vga): text 80x25
CPU: Intel Core Processor (Skylake, IBRS) (3792.08-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x506e3 Family=0x6 Model=0x5e Stepping=3
Features=0x783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2>
Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
AMD Features2=0x21<LAHF,ABM>
Structured Extended Features=0xfb9<FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM>
Structured Extended Features3=0x84000000<IBPB,SSBD>
XSAVE Features=0x1<XSAVEOPT>
Hypervisor: Origin = "KVMKVMKVM"
real memory = 2147483648 (2048 MB)
avail memory = 2047262720 (1952 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <BOCHS BXPCAPIC>
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
random: unblocking device.
ioapic0 <Version 1.1> irqs 0-23
Timecounter "TSC-low" frequency 1896040542 Hz quality 800
KTLS: Initialized 1 threads

- Jon

Mark Delany

2021-04-20 05:02:29 UTC

Post by jon via freebsd-hackers
I happen to be running FreeBSD 13.0-RELEASE on a Vultr instance as well,
but haven't had any problems in the ~4 days since I updated from
12 RELEASE. My VM is a single CPU with 2G memory and UFS for a
filesystem.

Ahh. Good to know, thanks for that.

Does /sbin/reboot work for you?

I think I've isolated the hangs, see my other post.

Mark.