Discussion:
weird geli behavior
David Cross
2018-08-24 21:54:42 UTC
Permalink
Ok, I am seeing something truely bizzare, I am sending this out as a shot
across the bow since I am not even sure where or how to begin debugging
this.

Some background. This in on an Intel Xeon 5520 based machine, 72G ECC
memory, 11.2, fully patched. Though this has been a problem since at least
11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap, which it
basically never even touches, even when problems are occuring)

The first symptom was (and I think these are all aspects of the same root
underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
would *randomly* error out on a corrupt entry. Upon investigating this I
discovered by watching gstat that as this happened the IO on the drives
would STOP. the L(q) would hover at 1 for a number of seconds, and then
when it returned fsck was complaining about various corrupt structures. a
ktrace of fsck shows that it got back data from the pread() that was
partially corrupted (I am guessing, but I cannot confirm that 'some part'
of the stack handed back a zeroed page, or otherwise 'not the right data'
that geli dutifully 'decrypted'. No errors are ever logged in the kernel
about da0 or da1 (the respective underlying USB disks). It *seems* this is
*always* on phase 2 of fsck (files and paths), and its never the same
inode. no data is *ever* corrupted when in the filesystem, no matter how
hard I hit the disks (all data on these devices is fully checksummed)
Devices have passed multiple SMART full diag checks, full read/write tests
with no issues. Under heavy FS IO it does occasionally lock.. but
recovers, and again data and filesystem are fully consistent.

I was willing to live with that.. weird as it was (these are backup disks,
data is fully checksummed, and I was only fscking out of extreme paranoia
every reboot) Then I added an internal drive, configured with gmirror
(broken mirror currently, second disk hasn't been added) and geli. On this
disk I have a postgres 10 database in WAL replication. This was working
fine and then the other day the system just locked for a few hours. During
that time I saw the L(q) of the _internal_ disk in the 10,000+ range, and
it doing _1_ operation a second to the underlying disk... all the while
geli is logging 'error 11' to the console (nothing about the underlying
disk) After this happened a static file on the disk (a zip file) had bad
data in the middle of a page (after reboot the file was ok.. so it was
just in cache). Again, this disk fully checks ok, no corruption on the
disk, no errors from the disk itself.


Halp? where do I even begin with this? It really feels like there is
some massive locking going on in geli in some way? Where should I even
begin looking? I run geli on most of my systems and don't have any issues.
John-Mark Gurney
2018-08-25 01:00:23 UTC
Permalink
Post by David Cross
Ok, I am seeing something truely bizzare, I am sending this out as a shot
across the bow since I am not even sure where or how to begin debugging
this.
Some background. This in on an Intel Xeon 5520 based machine, 72G ECC
memory, 11.2, fully patched. Though this has been a problem since at least
11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap, which it
basically never even touches, even when problems are occuring)
I assume you've applied the lazyfpu SA patch?
Post by David Cross
The first symptom was (and I think these are all aspects of the same root
underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
would *randomly* error out on a corrupt entry. Upon investigating this I
discovered by watching gstat that as this happened the IO on the drives
would STOP. the L(q) would hover at 1 for a number of seconds, and then
when it returned fsck was complaining about various corrupt structures. a
ktrace of fsck shows that it got back data from the pread() that was
partially corrupted (I am guessing, but I cannot confirm that 'some part'
of the stack handed back a zeroed page, or otherwise 'not the right data'
that geli dutifully 'decrypted'. No errors are ever logged in the kernel
about da0 or da1 (the respective underlying USB disks). It *seems* this is
*always* on phase 2 of fsck (files and paths), and its never the same
inode. no data is *ever* corrupted when in the filesystem, no matter how
hard I hit the disks (all data on these devices is fully checksummed)
Devices have passed multiple SMART full diag checks, full read/write tests
with no issues. Under heavy FS IO it does occasionally lock.. but
recovers, and again data and filesystem are fully consistent.
I was willing to live with that.. weird as it was (these are backup disks,
data is fully checksummed, and I was only fscking out of extreme paranoia
every reboot) Then I added an internal drive, configured with gmirror
(broken mirror currently, second disk hasn't been added) and geli. On this
disk I have a postgres 10 database in WAL replication. This was working
fine and then the other day the system just locked for a few hours. During
that time I saw the L(q) of the _internal_ disk in the 10,000+ range, and
it doing _1_ operation a second to the underlying disk... all the while
geli is logging 'error 11' to the console (nothing about the underlying
disk) After this happened a static file on the disk (a zip file) had bad
data in the middle of a page (after reboot the file was ok.. so it was
just in cache). Again, this disk fully checks ok, no corruption on the
disk, no errors from the disk itself.
Halp? where do I even begin with this? It really feels like there is
some massive locking going on in geli in some way? Where should I even
begin looking? I run geli on most of my systems and don't have any issues.
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
David Cross
2018-08-25 02:42:57 UTC
Permalink
Post by John-Mark Gurney
Post by David Cross
Ok, I am seeing something truely bizzare, I am sending this out as a shot
across the bow since I am not even sure where or how to begin debugging
this.
Some background. This in on an Intel Xeon 5520 based machine, 72G ECC
memory, 11.2, fully patched. Though this has been a problem since at least
11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap, which it
basically never even touches, even when problems are occuring)
I assume you've applied the lazyfpu SA patch?
I will definitely apply this, but I don’t think it applies to the problem in question. This system doesn’t have AESNI, this problem well preceded the lazyfpu patch, and i am not seeing any corruption on disk.
Post by John-Mark Gurney
Post by David Cross
The first symptom was (and I think these are all aspects of the same root
underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
would *randomly* error out on a corrupt entry. Upon investigating this I
discovered by watching gstat that as this happened the IO on the drives
would STOP. the L(q) would hover at 1 for a number of seconds, and then
when it returned fsck was complaining about various corrupt structures. a
ktrace of fsck shows that it got back data from the pread() that was
partially corrupted (I am guessing, but I cannot confirm that 'some part'
of the stack handed back a zeroed page, or otherwise 'not the right data'
that geli dutifully 'decrypted'. No errors are ever logged in the kernel
about da0 or da1 (the respective underlying USB disks). It *seems* this is
*always* on phase 2 of fsck (files and paths), and its never the same
inode. no data is *ever* corrupted when in the filesystem, no matter how
hard I hit the disks (all data on these devices is fully checksummed)
Devices have passed multiple SMART full diag checks, full read/write tests
with no issues. Under heavy FS IO it does occasionally lock.. but
recovers, and again data and filesystem are fully consistent.
I was willing to live with that.. weird as it was (these are backup disks,
data is fully checksummed, and I was only fscking out of extreme paranoia
every reboot) Then I added an internal drive, configured with gmirror
(broken mirror currently, second disk hasn't been added) and geli. On this
disk I have a postgres 10 database in WAL replication. This was working
fine and then the other day the system just locked for a few hours. During
that time I saw the L(q) of the _internal_ disk in the 10,000+ range, and
it doing _1_ operation a second to the underlying disk... all the while
geli is logging 'error 11' to the console (nothing about the underlying
disk) After this happened a static file on the disk (a zip file) had bad
data in the middle of a page (after reboot the file was ok.. so it was
just in cache). Again, this disk fully checks ok, no corruption on the
disk, no errors from the disk itself.
Halp? where do I even begin with this? It really feels like there is
some massive locking going on in geli in some way? Where should I even
begin looking? I run geli on most of my systems and don't have any issues.
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
--
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
John-Mark Gurney
2018-08-25 05:23:30 UTC
Permalink
Post by John-Mark Gurney
Post by David Cross
Ok, I am seeing something truely bizzare, I am sending this out as a shot
across the bow since I am not even sure where or how to begin debugging
this.
Some background. This in on an Intel Xeon 5520 based machine, 72G ECC
memory, 11.2, fully patched. Though this has been a problem since at least
11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap, which it
basically never even touches, even when problems are occuring)
I assume you've applied the lazyfpu SA patch?
I will definitely apply this, but I don???t think it applies to the problem in question. This system doesn???t have AESNI, this problem well preceded the lazyfpu patch, and i am not seeing any corruption on disk.
Hmm, k...
Post by John-Mark Gurney
Post by David Cross
The first symptom was (and I think these are all aspects of the same root
underlying cause) that fsck on a geli encrypted d stripe of 2 USB drives
would *randomly* error out on a corrupt entry. Upon investigating this I
discovered by watching gstat that as this happened the IO on the drives
would STOP. the L(q) would hover at 1 for a number of seconds, and then
when it returned fsck was complaining about various corrupt structures. a
ktrace of fsck shows that it got back data from the pread() that was
partially corrupted (I am guessing, but I cannot confirm that 'some part'
of the stack handed back a zeroed page, or otherwise 'not the right data'
that geli dutifully 'decrypted'. No errors are ever logged in the kernel
about da0 or da1 (the respective underlying USB disks). It *seems* this is
*always* on phase 2 of fsck (files and paths), and its never the same
inode. no data is *ever* corrupted when in the filesystem, no matter how
hard I hit the disks (all data on these devices is fully checksummed)
Devices have passed multiple SMART full diag checks, full read/write tests
with no issues. Under heavy FS IO it does occasionally lock.. but
recovers, and again data and filesystem are fully consistent.
I was willing to live with that.. weird as it was (these are backup disks,
data is fully checksummed, and I was only fscking out of extreme paranoia
every reboot) Then I added an internal drive, configured with gmirror
(broken mirror currently, second disk hasn't been added) and geli. On this
disk I have a postgres 10 database in WAL replication. This was working
fine and then the other day the system just locked for a few hours. During
that time I saw the L(q) of the _internal_ disk in the 10,000+ range, and
it doing _1_ operation a second to the underlying disk... all the while
geli is logging 'error 11' to the console (nothing about the underlying
disk) After this happened a static file on the disk (a zip file) had bad
data in the middle of a page (after reboot the file was ok.. so it was
just in cache). Again, this disk fully checks ok, no corruption on the
disk, no errors from the disk itself.
Halp? where do I even begin with this? It really feels like there is
some massive locking going on in geli in some way? Where should I even
begin looking? I run geli on most of my systems and don't have any issues.
Can you post actual log lines? geli has lots of error log lines, so w/o
more info, pretty hard to say WHAT in geli is returning EAGAIN. I do see
that _read_done and _write_done may not handle an EAGAIN error, which could
cause this problem, but to confirm, I need the actual log lines...
--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
David Cross
2018-08-25 20:22:47 UTC
Permalink
Top-posting because this just took a radical change in direction

Ok, I need to take some of this back; I do still believe that there is some
weird GELI-ness going on (or perhaps GEOM itself), in terms of hangs, but
it may be being tripped over by some weird USB-DA behavior I noticed in
trying to track this down.

I turned debugging ALL the way up on geli (3) and noticed the hangs always
happened when geli handed off a 20480 length read to the layer below (in
this case mirror).. I used the rebugging output to create a dummy program
called 'pread' to try to simulate these failures, and I was quite
successful. Attached is 'pread.c' which takes on stdin a offset and a
length to read. In that version I overwrote it to always ask for 32k., and
that works every time. If I eliminate that line and let the input data
control it, on some of the 20480 (and a different set each time) it hangs,
after the hang the pread returns _0_ (and obviously no errno), no kernel
logs. "pread" is run directly against /dev/da0, so no mirror or geli
layers at this point.

Kernel attach messages of relevance:
da0 at umass-sim0 bus 0 scbus7 target 0 lun 0
da0: <WD Elements 107C 1065> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number 574343344532485548465239
da0: 40.000MB/s transfers
da0: 3815415MB (976746240 4096 byte sectors)
da0: quirks=0x2<NO_6_BYTE>

Relevant output of a pread run:
~ # ./pread /dev/da0 < requestdata.txt
117535768576(20480)=0(0)
189095018496(20480)=0(0)

(last 0 is a trivial patch to include errno output)
Post by David Cross
Post by John-Mark Gurney
Post by David Cross
Ok, I am seeing something truely bizzare, I am sending this out as a
shot
Post by John-Mark Gurney
Post by David Cross
across the bow since I am not even sure where or how to begin
debugging
Post by John-Mark Gurney
Post by David Cross
this.
Some background. This in on an Intel Xeon 5520 based machine, 72G ECC
memory, 11.2, fully patched. Though this has been a problem since at
least
Post by John-Mark Gurney
Post by David Cross
11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap,
which it
Post by John-Mark Gurney
Post by David Cross
basically never even touches, even when problems are occuring)
I assume you've applied the lazyfpu SA patch?
I will definitely apply this, but I don???t think it applies to the
problem in question. This system doesn???t have AESNI, this problem well
preceded the lazyfpu patch, and i am not seeing any corruption on disk.
Hmm, k...
Post by John-Mark Gurney
Post by David Cross
The first symptom was (and I think these are all aspects of the same
root
Post by John-Mark Gurney
Post by David Cross
underlying cause) that fsck on a geli encrypted d stripe of 2 USB
drives
Post by John-Mark Gurney
Post by David Cross
would *randomly* error out on a corrupt entry. Upon investigating
this I
Post by John-Mark Gurney
Post by David Cross
discovered by watching gstat that as this happened the IO on the
drives
Post by John-Mark Gurney
Post by David Cross
would STOP. the L(q) would hover at 1 for a number of seconds, and
then
Post by John-Mark Gurney
Post by David Cross
when it returned fsck was complaining about various corrupt
structures. a
Post by John-Mark Gurney
Post by David Cross
ktrace of fsck shows that it got back data from the pread() that was
partially corrupted (I am guessing, but I cannot confirm that 'some
part'
Post by John-Mark Gurney
Post by David Cross
of the stack handed back a zeroed page, or otherwise 'not the right
data'
Post by John-Mark Gurney
Post by David Cross
that geli dutifully 'decrypted'. No errors are ever logged in the
kernel
Post by John-Mark Gurney
Post by David Cross
about da0 or da1 (the respective underlying USB disks). It *seems*
this is
Post by John-Mark Gurney
Post by David Cross
*always* on phase 2 of fsck (files and paths), and its never the same
inode. no data is *ever* corrupted when in the filesystem, no matter
how
Post by John-Mark Gurney
Post by David Cross
hard I hit the disks (all data on these devices is fully checksummed)
Devices have passed multiple SMART full diag checks, full read/write
tests
Post by John-Mark Gurney
Post by David Cross
with no issues. Under heavy FS IO it does occasionally lock.. but
recovers, and again data and filesystem are fully consistent.
I was willing to live with that.. weird as it was (these are backup
disks,
Post by John-Mark Gurney
Post by David Cross
data is fully checksummed, and I was only fscking out of extreme
paranoia
Post by John-Mark Gurney
Post by David Cross
every reboot) Then I added an internal drive, configured with gmirror
(broken mirror currently, second disk hasn't been added) and geli.
On this
Post by John-Mark Gurney
Post by David Cross
disk I have a postgres 10 database in WAL replication. This was
working
Post by John-Mark Gurney
Post by David Cross
fine and then the other day the system just locked for a few hours.
During
Post by John-Mark Gurney
Post by David Cross
that time I saw the L(q) of the _internal_ disk in the 10,000+ range,
and
Post by John-Mark Gurney
Post by David Cross
it doing _1_ operation a second to the underlying disk... all the
while
Post by John-Mark Gurney
Post by David Cross
geli is logging 'error 11' to the console (nothing about the
underlying
Post by John-Mark Gurney
Post by David Cross
disk) After this happened a static file on the disk (a zip file) had
bad
Post by John-Mark Gurney
Post by David Cross
data in the middle of a page (after reboot the file was ok.. so it
was
Post by John-Mark Gurney
Post by David Cross
just in cache). Again, this disk fully checks ok, no corruption on
the
Post by John-Mark Gurney
Post by David Cross
disk, no errors from the disk itself.
Halp? where do I even begin with this? It really feels like there
is
Post by John-Mark Gurney
Post by David Cross
some massive locking going on in geli in some way? Where should I
even
Post by John-Mark Gurney
Post by David Cross
begin looking? I run geli on most of my systems and don't have any
issues.
Can you post actual log lines? geli has lots of error log lines, so w/o
more info, pretty hard to say WHAT in geli is returning EAGAIN. I do see
that _read_done and _write_done may not handle an EAGAIN error, which could
cause this problem, but to confirm, I need the actual log lines...
--
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
Eugene Grosbein
2018-08-25 20:44:43 UTC
Permalink
Post by David Cross
I turned debugging ALL the way up on geli (3) and noticed the hangs always
happened when geli handed off a 20480 length read to the layer below (in
this case mirror).. I used the rebugging output to create a dummy program
called 'pread' to try to simulate these failures, and I was quite
successful. Attached is 'pread.c' which takes on stdin a offset and a
length to read. In that version I overwrote it to always ask for 32k., and
that works every time. If I eliminate that line and let the input data
control it, on some of the 20480 (and a different set each time) it hangs,
after the hang the pread returns _0_ (and obviously no errno), no kernel
logs. "pread" is run directly against /dev/da0, so no mirror or geli
layers at this point.
The lists strips non-textual attachments, so requestdata.txt get passed
but pread.c was stripped from the message. You need to re-send it changing
name to something like pread.c.txt
David Cross
2018-08-27 17:54:30 UTC
Permalink
Post by David Cross
Top-posting because this just took a radical change in direction
Ok, I need to take some of this back; I do still believe that there is
some weird GELI-ness going on (or perhaps GEOM itself), in terms of hangs,
but it may be being tripped over by some weird USB-DA behavior I noticed in
trying to track this down.
Ok... I think I found the smoking gun, I think between all of the multiple
layers of GEOM (geli->mirror->intermediate physical->physical) and
something that postgres is doing its filling up the 'transient arena'; I
even found a few pause()s in geom_io.c that seem to correspond to what I am
witnessing.

Also there's:
kern.geom.transient_map_soft_failures: 356796
kern.geom.transient_map_hard_failures: 35650

My *guess* as to what is going on is that there is a MASSIVE spike of IO
operations triggered by postgres doing 'something' with the the data on
disk to close out a transaction, this causes a spike in mappings to ELI...
which cannot pass things off to mirror because the transient area is full..
the system then sleeps for a second seeing if something has cleared (which
it cannot because it cannot get to disk).. it then fails the request.
Eventually this bleeds down to where the queue can flow (and is where I see
the spike in operations and everything clearing up at once).

The default transient arena seems to be just 1024 pages (4M?) and is
hard-coded. It also looks like there aren't any kernel messages at the
default level when you hit this (or if they are they seem to get eaten by
the fact that nothing can log... because, well... IO is deadlocked).

I won't be able to test this until this evening when I can bump this up...
any reason to not set this to 10240 or higher? its a 64bit machine with
72G of RAM.

Longer term, if this is the cause, and with the inherent nesting nature of
GEOM I think we need a way to handle this better.

Next up: figure out why random 20480 byte reads are choking on this
machine.. I tried the earlier mentioned pread test on a different machine
with the EXACT same model external array, and it worked perfectly. so its
either the machine or the drives... more experimenting to do!
Post by David Cross
I turned debugging ALL the way up on geli (3) and noticed the hangs always
happened when geli handed off a 20480 length read to the layer below (in
this case mirror).. I used the rebugging output to create a dummy program
called 'pread' to try to simulate these failures, and I was quite
successful. Attached is 'pread.c' which takes on stdin a offset and a
length to read. In that version I overwrote it to always ask for 32k., and
that works every time. If I eliminate that line and let the input data
control it, on some of the 20480 (and a different set each time) it hangs,
after the hang the pread returns _0_ (and obviously no errno), no kernel
logs. "pread" is run directly against /dev/da0, so no mirror or geli
layers at this point.
da0 at umass-sim0 bus 0 scbus7 target 0 lun 0
da0: <WD Elements 107C 1065> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number 574343344532485548465239
da0: 40.000MB/s transfers
da0: 3815415MB (976746240 4096 byte sectors)
da0: quirks=0x2<NO_6_BYTE>
~ # ./pread /dev/da0 < requestdata.txt
117535768576(20480)=0(0)
189095018496(20480)=0(0)
(last 0 is a trivial patch to include errno output)
Post by David Cross
Post by John-Mark Gurney
Post by David Cross
Ok, I am seeing something truely bizzare, I am sending this out as a
shot
Post by John-Mark Gurney
Post by David Cross
across the bow since I am not even sure where or how to begin
debugging
Post by John-Mark Gurney
Post by David Cross
this.
Some background. This in on an Intel Xeon 5520 based machine, 72G
ECC
Post by John-Mark Gurney
Post by David Cross
memory, 11.2, fully patched. Though this has been a problem since
at least
Post by John-Mark Gurney
Post by David Cross
11.1, probably 11.0, and maybe earlier. ~4G of eli encrypted swap,
which it
Post by John-Mark Gurney
Post by David Cross
basically never even touches, even when problems are occuring)
I assume you've applied the lazyfpu SA patch?
I will definitely apply this, but I don???t think it applies to the
problem in question. This system doesn???t have AESNI, this problem well
preceded the lazyfpu patch, and i am not seeing any corruption on disk.
Hmm, k...
Post by John-Mark Gurney
Post by David Cross
The first symptom was (and I think these are all aspects of the same
root
Post by John-Mark Gurney
Post by David Cross
underlying cause) that fsck on a geli encrypted d stripe of 2 USB
drives
Post by John-Mark Gurney
Post by David Cross
would *randomly* error out on a corrupt entry. Upon investigating
this I
Post by John-Mark Gurney
Post by David Cross
discovered by watching gstat that as this happened the IO on the
drives
Post by John-Mark Gurney
Post by David Cross
would STOP. the L(q) would hover at 1 for a number of seconds, and
then
Post by John-Mark Gurney
Post by David Cross
when it returned fsck was complaining about various corrupt
structures. a
Post by John-Mark Gurney
Post by David Cross
ktrace of fsck shows that it got back data from the pread() that was
partially corrupted (I am guessing, but I cannot confirm that 'some
part'
Post by John-Mark Gurney
Post by David Cross
of the stack handed back a zeroed page, or otherwise 'not the right
data'
Post by John-Mark Gurney
Post by David Cross
that geli dutifully 'decrypted'. No errors are ever logged in the
kernel
Post by John-Mark Gurney
Post by David Cross
about da0 or da1 (the respective underlying USB disks). It *seems*
this is
Post by John-Mark Gurney
Post by David Cross
*always* on phase 2 of fsck (files and paths), and its never the same
inode. no data is *ever* corrupted when in the filesystem, no
matter how
Post by John-Mark Gurney
Post by David Cross
hard I hit the disks (all data on these devices is fully checksummed)
Devices have passed multiple SMART full diag checks, full read/write
tests
Post by John-Mark Gurney
Post by David Cross
with no issues. Under heavy FS IO it does occasionally lock.. but
recovers, and again data and filesystem are fully consistent.
I was willing to live with that.. weird as it was (these are backup
disks,
Post by John-Mark Gurney
Post by David Cross
data is fully checksummed, and I was only fscking out of extreme
paranoia
Post by John-Mark Gurney
Post by David Cross
every reboot) Then I added an internal drive, configured with
gmirror
Post by John-Mark Gurney
Post by David Cross
(broken mirror currently, second disk hasn't been added) and geli.
On this
Post by John-Mark Gurney
Post by David Cross
disk I have a postgres 10 database in WAL replication. This was
working
Post by John-Mark Gurney
Post by David Cross
fine and then the other day the system just locked for a few hours.
During
Post by John-Mark Gurney
Post by David Cross
that time I saw the L(q) of the _internal_ disk in the 10,000+
range, and
Post by John-Mark Gurney
Post by David Cross
it doing _1_ operation a second to the underlying disk... all the
while
Post by John-Mark Gurney
Post by David Cross
geli is logging 'error 11' to the console (nothing about the
underlying
Post by John-Mark Gurney
Post by David Cross
disk) After this happened a static file on the disk (a zip file)
had bad
Post by John-Mark Gurney
Post by David Cross
data in the middle of a page (after reboot the file was ok.. so it
was
Post by John-Mark Gurney
Post by David Cross
just in cache). Again, this disk fully checks ok, no corruption on
the
Post by John-Mark Gurney
Post by David Cross
disk, no errors from the disk itself.
Halp? where do I even begin with this? It really feels like there
is
Post by John-Mark Gurney
Post by David Cross
some massive locking going on in geli in some way? Where should I
even
Post by John-Mark Gurney
Post by David Cross
begin looking? I run geli on most of my systems and don't have any
issues.
Can you post actual log lines? geli has lots of error log lines, so w/o
more info, pretty hard to say WHAT in geli is returning EAGAIN. I do see
that _read_done and _write_done may not handle an EAGAIN error, which could
cause this problem, but to confirm, I need the actual log lines...
--
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
Continue reading on narkive:
Loading...