Help diagnose my Ryzen build problem (in progress)

Discussion:

Meowthink

2018-08-28 15:47:20 UTC

Hi Peeter,

Unfortunately, that's for Ryzens family 17h model 00h-0fh, whereas my
Ryzen 5 2400G's model is 11h.
As for the microcode, it should be updated through UEFI/BIOS updates. I
think mine is now PinnaclePI-AM4_1.0.0.4 with microcode patch level
0x810100b.
It seems the only thing I can do is sit and wait?

The revision
https://svnweb.freebsd.org/base/head/sys/x86/x86/cpu_machdep.c?r1=336763&r2=336762&pathrev=336763
works around the mwait issue, i.e. it sets
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt

I think that should not apply to the 2400G, which is model 11h, not 1h.
machdep.idle: acpi
machdep.idle_available: spin, mwait, hlt, acpi
machdep.idle_apl31: 0
machdep.idle_mwait: 1
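
For anyone wanting to double-check the family/model numbers, here is a minimal sketch of how a CPUID signature (leaf 1, EAX) decodes into display family and model. The signature 0x00810F10 is an assumed example value for a 2400G-class part, not something read from the machine in this thread, and decode_cpuid is just a helper name made up for the sketch.

```sh
#!/bin/sh
# Decode an x86 CPUID signature (leaf 1, EAX) into display family/model/stepping.
# AMD rule: when the base family is 0xF, family = base family + extended family
# and model = (extended model << 4) | base model.
decode_cpuid() {
    sig=$(( $1 ))
    stepping=$(( sig & 0xF ))
    base_model=$(( (sig >> 4) & 0xF ))
    base_family=$(( (sig >> 8) & 0xF ))
    ext_model=$(( (sig >> 16) & 0xF ))
    ext_family=$(( (sig >> 20) & 0xFF ))
    if [ "$base_family" -eq 15 ]; then
        family=$(( base_family + ext_family ))
        model=$(( (ext_model << 4) | base_model ))
    else
        family=$base_family
        model=$base_model
    fi
    printf 'family=%02xh model=%02xh stepping=%xh\n' "$family" "$model" "$stepping"
}

# ASSUMPTION: example signature only; substitute the value your system reports.
decode_cpuid 0x00810F10
```

With that assumed input the extended model nibble is 1, so the model comes out as 11h, which falls outside the 00h-0Fh range.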

Now it may or may not relate to your problem, but it appears that
Ryzen 2400G also has another issue with HLT, see the DragonFly bug
report
https://bugs.dragonflybsd.org/issues/3131

Thanks a lot for that info.
It's much easier to prove your problem, since it's reproducible. But
mine is too random to catch...
Anyway, it seems the IRET issue [1] is still not fixed? I highly
doubt that my issue is related to it, because my system became
significantly more stable once I stopped that IRQ storm from the Bluetooth
module, though it still panics occasionally.
So could anybody tell me what the difference is between the FreeBSD
workaround [2] and the DragonFlyBSD one?

which AMD is aware of and is possibly working on, but it may not have
appeared in the errata yet. The bug report says that until this is
fixed, the workaround is to also disable HLT in cpu_idle. I am not
sure what is the correct value for the sysctl on FreeBSD, perhaps
sysctl machdep.idle=0
or some other value?

In the meantime, I have this microcode
# cpucontrol -m 0x8b /dev/cpuctl0
MSR 0x8b: 0x00000000 0x0810100b
So should I use mwait?
I still don't know what I should set. Any ideas?
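
A small sketch for comparing the reported patch level against the 0x0810100B mentioned in the DragonFly report. It only parses a line in the form shown above; patch_level and same_patch_level are helper names invented for this sketch, and the sample line is the one quoted in this thread.

```sh
#!/bin/sh
# Extract the patch level (the low 32 bits) from a line in the form printed by
# `cpucontrol -m 0x8b /dev/cpuctl0`, e.g. "MSR 0x8b: 0x00000000 0x0810100b".
patch_level() {
    # The last whitespace-separated field is the low half of the MSR.
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Compare two hex values numerically, so letter case and leading zeros
# do not matter.
same_patch_level() {
    [ $(( $1 )) -eq $(( $2 )) ]
}

line='MSR 0x8b: 0x00000000 0x0810100b'   # sample line quoted from this thread
level=$(patch_level "$line")
if same_patch_level "$level" 0x810100B; then
    echo "patch level $level matches 0x0810100B"
fi
```

On a live system the line would come from running the cpucontrol command as root instead of the hard-coded sample.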

If I was you, I'd play around with the sysctls mentioned above and see
if it helps. Start with disabling both mwait and hlt, perhaps
machdep.idle=spin
machdep.idle_mwait=0
(assuming that 'spin' means hlt will not be used) and then if that does
not lead to a panic, try enabling mwait. I can't test 2400G since I
don't have it any more. I booted FreeBSD a couple of times but did not
run it over long periods of time.
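
The test order suggested above could be sketched roughly like this. The step names are labels made up for this sketch; the sysctl names and values are the ones quoted in this thread. By default it only prints the commands; setting APPLY=1 (as root, on FreeBSD) would actually run them.

```sh
#!/bin/sh
# Print (or, with APPLY=1, run) the sysctl commands for one idle test step.
# ASSUMPTION: the step names are hypothetical labels, not anything FreeBSD defines.
idle_step() {
    case "$1" in
        no-hlt-no-mwait) set -- machdep.idle=spin machdep.idle_mwait=0 ;;
        mwait)           set -- machdep.idle=mwait ;;
        hlt)             set -- machdep.idle=hlt machdep.idle_mwait=0 ;;
        *) echo "unknown step: $1" >&2; return 1 ;;
    esac
    for setting in "$@"; do
        if [ "${APPLY:-0}" = 1 ]; then
            sysctl "$setting"
        else
            echo "sysctl $setting"
        fi
    done
}

# Start with both hlt and mwait avoided; move on only if this survives a
# stress run without a panic.
idle_step no-hlt-no-mwait
```

Changing one setting per run keeps it clear which idle method was active when a panic happens.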

It works!
After hours and hours of different stress tests, I got 8 copies of gcc
built without any problem.

But it costs a lot of power and the fan becomes very annoying. So
I don't think I'll test long-term stability in this state.

machdep.idle: acpi -> spin
- adds ~5W, maybe some deeper C-states are disabled?
machdep.idle_mwait: 1 -> 0
- adds another ~50W, the CPUs never sleep.

I tried setting machdep.idle_mwait to 1, or machdep.idle to mwait. Both
failed with panics when I started building gcc pass by pass.

I'm pretty sure mwait causes problems, as I once experienced a
panic immediately after I issued the sysctl command (see the 2nd dump
below).

So my next step will be hlt. Still need some time, though.

Cheers
Peeter
--

Cheers,
meowthink

------------------------------------------------------------------------
machdep.idle=mwait

panic: ffs_syncvnode: syncing truncated data.
cpuid = 7
KDB: stack backtrace:
#0 0xffffffff80b414b7 at kdb_backtrace+0x67
#1 0xffffffff80afa9e7 at vpanic+0x177
#2 0xffffffff80afa863 at panic+0x43
#3 0xffffffff80dcddc4 at ffs_syncvnode+0x5a4
#4 0xffffffff80dcc915 at ffs_fsync+0x25
#5 0xffffffff810ffcb2 at VOP_FSYNC_APV+0x82
#6 0xffffffff80bc3a62 at sched_sync+0x412
#7 0xffffffff80abd813 at fork_exit+0x83
#8 0xffffffff80f5cc7e at fork_trampoline+0xe

------------------------------------------------------------------------
machdep.idle_mwait=1

Fatal trap 9: general protection fault while in kernel mode
cpuid = 7; apic id = 07
instruction pointer = 0x20:0xffffffff80e094fe
stack pointer = 0x0:0xfffffe081e5df9e0
frame pointer = 0x0:0xfffffe081e5dfa50
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 17 (dom0)
trap number = 9
panic: general protection fault
cpuid = 7
KDB: stack backtrace:
#0 0xffffffff80b414b7 at kdb_backtrace+0x67
#1 0xffffffff80afa9e7 at vpanic+0x177
#2 0xffffffff80afa863 at panic+0x43
#3 0xffffffff80f7c14f at trap_fatal+0x35f
#4 0xffffffff80f7b70e at trap+0x5e
#5 0xffffffff80f5bccc at calltrap+0x8
#6 0xffffffff80e07a17 at vm_pageout+0x87
#7 0xffffffff80abd813 at fork_exit+0x83
#8 0xffffffff80f5cc7e at fork_trampoline+0xe

Meowthink

2018-08-29 02:28:40 UTC

Update:

machdep.idle = hlt with machdep.idle_mwait = 0 also failed. It doesn't
even last as long as machdep.idle = mwait, which would normally panic
after a few passes of building gcc. I tried hlt twice; neither run lasted
longer than half an hour.

Meanwhile, another round of building 4 gccs in parallel is about to finish with
machdep.idle = spin and machdep.idle_mwait = 0.
Can I say the Ryzen 2400G probably has issues with both mwait and hlt?

Regards,
meowthink

Fatal trap 12: page fault while in user mode
cpuid = 6; apic id = 06
fault virtual address = 0x819cd0000
fault code = user write data, reserved bits in PTE
instruction pointer = 0x43:0x80195de26
stack pointer = 0x3b:0x7fffffffb0b8
frame pointer = 0x3b:0x7fffffffb100
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 17888 (ld)
trap number = 12
panic: page fault
cpuid = 6
KDB: stack backtrace:
#0 0xffffffff80b414b7 at kdb_backtrace+0x67
#1 0xffffffff80afa9e7 at vpanic+0x177
#2 0xffffffff80afa863 at panic+0x43
#3 0xffffffff80f7c14f at trap_fatal+0x35f
#4 0xffffffff80f7c1a9 at trap_pfault+0x49
#5 0xffffffff80f7ba10 at trap+0x360
#6 0xffffffff80f5bccc at calltrap+0x8

karu.pruun

2018-08-29 08:11:24 UTC

Post by Meowthink
machdep.idle = hlt with machdep.idle_mwait = 0 also failed. It doesn't
even last as long as machdep.idle = mwait, which would normally panic
after a few passes of building gcc. I tried hlt twice; neither run lasted
longer than half an hour.
Meanwhile, another round of building 4 gccs in parallel is about to finish with
machdep.idle = spin and machdep.idle_mwait = 0.
Can I say the Ryzen 2400G probably has issues with both mwait and hlt?

I suppose we can conclude that HLT is a culprit as it causes issues
both on FreeBSD and DragonFly. On DragonFly it occurs in an isolated
case where I had to run java. Otherwise the desktop would last for 3 -
4 weeks with no problems (and I only had to reboot since I built a new
kernel). Also, when I finished running java, I switched on HLT again
to get better power saving. But that said, yes, Ryzen 2400G has an
issue with HLT in general. This has not been mentioned in the Revision
guide for family 17h, and the latter does not explicitly include
2400G, which is model 11h as you said.

https://support.amd.com/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf

Regarding MWAIT, the DragonFly bug report says that the microcode
update 0x0810100B fixes this. But I can't comment since I did not
touch that at all.

Cheers

Peeter

--