Discussion:
nvme0: async event occurred (log page id=0x2)
Craig Leres
2018-05-04 03:56:54 UTC
I have an Intel NUC (NUC6i3SYH) that ran 10.3-RELEASE until a few weeks
ago and now runs 11.1-RELEASE. The system disk is an Intel 600p M.2 SSD,
and there is also a 2TB Seagate laptop drive (ST2000LM007).

Occasionally the system SSD will go to sleep. It happened today with
this on the console:

nvme0: async event occurred (log page id=0x2)
nvme0: resetting controller
nvme0: nvme_ctrlr_wait_for_ready called with desired_val = 0 but cc.en = 1

Later it would occasionally print out:

swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1509, size: 12200

There was an app playing music from the 2TB drive that was still working
when I reset the box. But no I/O was occurring with the M.2 SSD.

I see PR 209571 might be related (it shows the same async event log message, at least).

Does anyone have suggestions for me?

Craig
Warner Losh
2018-05-04 04:07:09 UTC
Async events are 'something went wrong' messages. Log page 2 is the SMART
log page.

What does 'nvmecontrol logpage -p 2 nvme0' tell you right after this
happens? My guess is that it's overheating.
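
For anyone who wants to check the overheating guess against the raw page rather than the formatted output, here is a minimal sketch of decoding the first few bytes of log page 2. It assumes the field layout from the NVMe base specification (critical warning in byte 0, composite temperature in Kelvin in bytes 1-2, then spare and wear figures). The function name and the sample buffer are made up for illustration, and how you get the raw bytes off the drive is left out.

```python
import struct

# Bit positions in byte 0 ("critical warning") of the SMART / Health
# Information log page, as laid out in the NVMe base specification.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media in read-only mode",
    4: "volatile memory backup failed",
}


def decode_smart_health(page: bytes) -> dict:
    """Decode the first six bytes of a raw log page 0x02 buffer.

    Hypothetical helper for illustration; not part of nvmecontrol.
    """
    if len(page) < 6:
        raise ValueError("need at least 6 bytes of the log page")
    # Little-endian, no padding: u8 warning, u16 temperature, three u8 fields.
    warning, kelvin, spare, spare_threshold, percent_used = struct.unpack_from(
        "<BHBBB", page, 0
    )
    return {
        "warnings": [
            text for bit, text in CRITICAL_WARNING_BITS.items()
            if warning & (1 << bit)
        ],
        "temperature_c": round(kelvin - 273.15, 2),
        "available_spare_pct": spare,
        "spare_threshold_pct": spare_threshold,
        "percent_used": percent_used,
    }


if __name__ == "__main__":
    # Made-up sample data: temperature warning bit set, 343 K composite temperature.
    sample = (
        bytes([0x02]) + (343).to_bytes(2, "little")
        + bytes([100, 10, 3]) + bytes(506)
    )
    print(decode_smart_health(sample))
```

If the temperature bit is set in the critical warning byte right after the event, that would be consistent with the overheating theory; a clear byte would point elsewhere.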

Warner
Craig Leres
2018-05-04 04:28:42 UTC
Interesting. I try to run smartd anywhere it's supported and have
appended the last few entries before things went sideways; 60° C/140° F
is a bit toasty!

This system is a couple of years old; it might be time to blow the dust out
with compressed air and see if the BIOS has more aggressive fan settings.

Is the Raw_Read_Error_Rate change a problem?

(Thanks!)

Craig

May 3 13:59:22 tiny smartd[770]: Device: /dev/ada0, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 60
May 3 13:59:22 tiny smartd[770]: Device: /dev/ada0, SMART Usage Attribute: 194 Temperature_Celsius changed from 41 to 40
May 3 14:59:23 tiny smartd[770]: Device: /dev/ada0, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 58
May 3 14:59:23 tiny smartd[770]: Device: /dev/ada0, SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 42
May 3 17:29:23 tiny smartd[770]: Device: /dev/ada0, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 75 to 76
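
In case it helps with sifting through a longer log, here is a rough sketch that pulls the attribute changes out of smartd syslog entries like the ones above. The regular expression is only an assumption based on the format of these few lines, and `parse_changes` is a made-up helper name. It collapses whitespace first, so entries that a mail client wrapped across two lines still match.

```python
import re

# Pattern inferred from the log lines in this thread; other smartd
# message types are simply ignored.
SMARTD_CHANGE = re.compile(
    r"Device: (?P<device>[^,\s]+), SMART (?P<kind>Usage|Prefailure) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_changes(log_text: str) -> list:
    """Return (device, kind, id, name, old, new) tuples for each change entry."""
    # Collapse every run of whitespace, including newlines, so that
    # entries wrapped across lines become one continuous string.
    flat = " ".join(log_text.split())
    return [
        (m["device"], m["kind"], int(m["id"]), m["name"],
         int(m["old"]), int(m["new"]))
        for m in SMARTD_CHANGE.finditer(flat)
    ]


if __name__ == "__main__":
    sample = """\
May 3 13:59:22 tiny smartd[770]: Device: /dev/ada0, SMART Usage
Attribute: 190 Airflow_Temperature_Cel changed from 59 to 60
May 3 17:29:23 tiny smartd[770]: Device: /dev/ada0, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 75 to 76
"""
    for change in parse_changes(sample):
        print(change)
```

The device field in each tuple shows which drive an entry refers to (here /dev/ada0), which is handy when more than one disk is being monitored.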
Warner Losh
2018-05-04 04:33:53 UTC
Things are getting hot, and there was a recoverable error (recoverable,
since you didn't report a read error, though you could also check log page 1
for any errors). Chances are the controller shut down completely, though the
few data points you've given aren't enough for me to be sure.

Warner
