David Cross
2018-08-24 21:54:42 UTC
Ok, I am seeing something truly bizarre. I am sending this out as a shot
across the bow since I am not even sure where or how to begin debugging
this.
Some background. This is on an Intel Xeon 5520 based machine, 72G ECC
memory, 11.2, fully patched, though this has been a problem since at least
11.1, probably 11.0, and maybe earlier. It has ~4G of eli encrypted swap,
which it basically never even touches, even when problems are occurring.
The first symptom was (and I think these are all aspects of the same root
underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
would *randomly* error out on a corrupt entry. Upon investigating this I
discovered by watching gstat that as this happened the IO on the drives
would STOP. The L(q) would hover at 1 for a number of seconds, and then
when it returned fsck was complaining about various corrupt structures. A
ktrace of fsck shows that it got back data from the pread() that was
partially corrupted (I am guessing, but cannot confirm, that 'some part'
of the stack handed back a zeroed page, or otherwise 'not the right data',
that geli dutifully 'decrypted'). No errors are ever logged in the kernel
about da0 or da1 (the respective underlying USB disks). It *seems* this is
*always* in phase 2 of fsck (files and paths), and it's never the same
inode. No data is *ever* corrupted when in the filesystem, no matter how
hard I hit the disks (all data on these devices is fully checksummed).
The devices have passed multiple SMART full diag checks and full read/write
tests with no issues. Under heavy FS IO it does occasionally lock, but it
recovers, and again data and filesystem are fully consistent.
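One way to test the zeroed-page guess would be to dump the buffer that
pread() returned and scan it for page-sized runs of zero bytes. This is only
a minimal sketch: the function name and the 4096-byte page size are my
assumptions, not anything taken from geli or fsck, and a filesystem can
contain legitimately zero-filled blocks, so a hit is a hint and not proof.

```python
from typing import List


def find_zero_pages(data: bytes, page_size: int = 4096) -> List[int]:
    """Return the indexes of page-sized chunks of data that are all zero bytes.

    page_size is an assumed value; adjust it to the block size under test.
    A trailing partial chunk is checked like any other chunk.
    """
    zero_pages = []
    for offset in range(0, len(data), page_size):
        chunk = data[offset:offset + page_size]
        # any() is False only when every byte in the chunk is 0.
        if not any(chunk):
            zero_pages.append(offset // page_size)
    return zero_pages
```

Running this over the same range read twice (once during a stall, once
afterwards) would show whether the zero pages come and go.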
I was willing to live with that, weird as it was (these are backup disks,
data is fully checksummed, and I was only fscking out of extreme paranoia
every reboot). Then I added an internal drive, configured with gmirror
(broken mirror currently, second disk hasn't been added) and geli. On this
disk I have a postgres 10 database in WAL replication. This was working
fine, and then the other day the system just locked for a few hours. During
that time I saw the L(q) of the _internal_ disk in the 10,000+ range, and
it was doing _1_ operation a second to the underlying disk, all the while
geli was logging 'error 11' to the console (nothing about the underlying
disk). After this happened a static file on the disk (a zip file) had bad
data in the middle of a page (after reboot the file was ok, so the bad data
was just in cache). Again, this disk fully checks ok, no corruption on the
disk, no errors from the disk itself.
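To pin down which page of a static file goes bad, one option is to record
per-page digests while the file is known good and compare against them
later. The sketch below is just that, a sketch: the helper names and the
4096-byte page size are hypothetical choices of mine, and it only reports
which pages differ, not why.

```python
import hashlib
from typing import List


def page_digests(path: str, page_size: int = 4096) -> List[str]:
    """Return one SHA-256 hex digest per page_size chunk of the file at path."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(page_size)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests


def changed_pages(baseline: List[str], current: List[str]) -> List[int]:
    """Return page indexes whose digest differs, or that exist in only one list."""
    longest = max(len(baseline), len(current))
    return [
        i for i in range(longest)
        if i >= len(baseline) or i >= len(current) or baseline[i] != current[i]
    ]
```

Saving the baseline list somewhere off the affected disk and rerunning
page_digests() after a stall would give the exact page offsets to look at.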
Halp? Where do I even begin with this? It really feels like there is
some massive locking going on in geli in some way. Where should I even
begin looking? I run geli on most of my systems and don't have any issues.