Discussion:
Request for opinions - gvinum or ccd?
x***@googlemail.com
2009-05-30 17:52:39 UTC
Permalink
Hello.

I'm planning to stripe two disks into a RAID0 configuration. As
far as I can tell, my hardware has no hardware RAID support and
therefore I'll be going the software route.

The machine in question is a workstation used to process large
datasets (audio and video) and do lots of compilation.

Simple question then as the handbook describes both ccd and gvinum -
which should I pick?
Mike Meyer
2009-05-30 18:43:54 UTC
Permalink
On Sat, 30 May 2009 18:52:39 +0100
Post by x***@googlemail.com
Simple question then as the handbook describes both ccd and gvinum -
which should I pick?
My first reaction was "neither", then I realized - you didn't say what
version of FreeBSD you're running. But if you're running a supported
version of FreeBSD, that doesn't change my answer.

If you're running 5.3 or later, you probably want gstripe. If you're
running something older than that, then gvinum won't be available
either, so you'll need to use ccd. I always figured gvinum was a
transition tool to help move from vinum to geom, which is why it's
managed to get to the 7.0 release with some pretty painful bugs in it,
which don't show up in gstripe.

The handbook clearly needs to be rewritten - ccd isn't supported
anymore, except via the geom ccd class. However, I think zfs is going
to change it all again, so such a rewrite won't be useful for very
long. I don't think zfs supports a two-disk stripe, though it does do
JBOD.

If you're running a 7.X 64-bit system with a couple of gigs of RAM,
expect it to be in service for years without having to reformat the
disks, and can afford another drive, I'd recommend going to raidz on a
three-drive system. That will give you close to the size/performance
of your RAID0 system, but let you lose a disk without losing data. The
best you can do with zfs on two disks is a mirror, which means write
throughput will suffer.
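
(For reference, creating a three-drive raidz pool is a one-liner; the pool
name and device names below are just placeholders for whatever your drives
show up as:)

  # sketch only - substitute your actual device names for ad4/ad6/ad8
  zpool create tank raidz ad4 ad6 ad8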


<mike
--
Mike Meyer <***@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
x***@googlemail.com
2009-05-30 19:18:40 UTC
Permalink
Post by Mike Meyer
On Sat, 30 May 2009 18:52:39 +0100
Post by x***@googlemail.com
Simple question then as the handbook describes both ccd and gvinum -
which should I pick?
My first reaction was "neither", then I realized - you didn't say what
version of FreeBSD you're running. But if you're running a supported
version of FreeBSD, that doesn't change my answer.
Sorry, yeah. FreeBSD 7.2-RELEASE on AMD64.
Post by Mike Meyer
If you're running 5.3 or later, you probably want gstripe. If you're
running something older than that, then gvinum won't be available
either, so you'll need to use ccd. I always figured gvinum was a
transition tool to help move from vinum to geom, which is why it's
managed to get to the 7.0 release with some pretty painful bugs in it,
which don't show up in gstripe.
That sounds like the kind of entertainment I don't particularly want!
Post by Mike Meyer
The handbook clearly needs to be rewritten - ccd isn't supported
anymore, except via the geom ccd class. However, I think zfs is going
to change it all again, so such a rewrite wont' be useful for very
long. I don't think zfs supports a two-disk stripe, thought it does do
JBOD.
If you're running a 7.X 64-bit system with a couple of GIG of ram,
expect it to be in service for years without having to reformat the
disks, and can afford another drive, I'd recommend going to raidz on a
three-drive system. That will give you close to the size/performance
of your RAID0 system, but let you lose a disk without losing data. The
best you can do with zfs on two disks is a mirror, which means write
throughput will suffer.
Certainly a lot to think about.

The system has 12 GB currently, with room to upgrade. I currently have
two 500 GB drives and one 1 TB drive. I wanted the setup to be essentially
two drives striped, backed up onto the larger one nightly. I wanted the
large backup drive to be as "isolated" as possible, e.g., in the event of
some catastrophic hardware failure, I can remove it and place it in
another machine without a lot of stressful configuration to recover the
data (not possible with a RAID configuration involving all three drives,
as far as I'm aware).

xw
Mike Meyer
2009-05-30 20:27:44 UTC
Permalink
On Sat, 30 May 2009 20:18:40 +0100
Post by x***@googlemail.com
Post by Mike Meyer
If you're running a 7.X 64-bit system with a couple of GIG of ram,
expect it to be in service for years without having to reformat the
disks, and can afford another drive, I'd recommend going to raidz on a
three-drive system. That will give you close to the size/performance
of your RAID0 system, but let you lose a disk without losing data. The
best you can do with zfs on two disks is a mirror, which means write
throughput will suffer.
Certainly a lot to think about.
The system has 12gb currently, with room to upgrade. I currently have
two 500gb drives and one 1tb drive. I wanted the setup to be essentially
two drives striped, backed up onto one larger one nightly. I wanted the
large backup drive to be as "isolated" as possible, eg, in the event of
some catastrophic hardware failure, I can remove it and place it in
another machine without a lot of stressful configuration to recover the
data (not possible with a RAID configuration involving all three drives,
as far as I'm aware).
The last bit is wrong. Moving a zfs pool between two systems is pretty
straightforward. The configuration information is on the drives; you
just do "zpool import <pool>" after plugging them in, and if the mount
point exists, it'll mount it. If the system crashed with the zfs pool
active, you might have to do -f to force an import. Geom is pretty
much the same way, except you can configure it to not write the config
data to disk, thus forcing you to do it manually (what you
expect). I'm not sure geom is as smart if the drives change names,
though.
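
(Roughly, with a pool named "data" as a placeholder:)

  zpool export data     # on the old box, if it's still running
  zpool import          # on the new box: lists pools found on attached disks
  zpool import data     # imports the pool and mounts its filesystems
  zpool import -f data  # force it if the pool wasn't cleanly exported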

RAID support and volume management have come a long way from the days
of ccd and vinum. zfs in particular is a major advance. If you aren't
aware of its advantages, take the time to read the zfs & zpool man
pages, at the very least, before committing to geom (not that geom
isn't pretty slick in and of itself, but zfs solves a more pressing
problem).

Hmm. Come to think of it, you ought to be able to use gstripe to stripe
your disks, then put a zpool on that, which should get you the
advantages of zfs with a striped disk. But that does seem odd to me.
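
(An untested sketch of that idea, assuming the two disks show up as ad4 and
ad6:)

  kldload geom_stripe
  gstripe label -v st0 /dev/ad4 /dev/ad6
  zpool create data /dev/stripe/st0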


<mike
--
Mike Meyer <***@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
x***@googlemail.com
2009-05-30 21:36:43 UTC
Permalink
Post by Mike Meyer
The last bit is wrong. Moving a zfs pool between two systems is pretty
straightforward. The configuration information is on the drives; you
just do "zpool import <pool>" after plugging them in, and if the mount
point exists, it'll mount it. If the system crashed with the zfs pool
active, you might have to do -f to force an import. Geom is pretty
much the same way, except you can configure it to not write the config
data to disk, thus forcing you to do it manually (what you
expect). I'm not sure geom is as smart if the drives change names,
though.
RAID support and volume management has come a long way from the days
of ccd and vinum. zfs in particular is a major advance. If you aren't
aware of it's advantages, take the time to read the zfs & zpool man
pages, at the very least, before committing to geom (not that geom
isn't pretty slick in and of itself, but zfs solves a more pressing
problem).
Hmm. Come to think of it, you ought to be able to use gstrip to stripe
your disks, then put a zpool on that, which should get you the
advantages of zfs with a striped disk. But that does seem odd to me.
I'll definitely be looking at ZFS. Thanks for the info.

I've never been dead set on any option in particular, it's just that I
wasn't aware of anything that would do what I wanted that wasn't just
simple RAID0 and manual backups.
Michael Reifenberger
2009-05-31 07:45:30 UTC
Permalink
On Sat, 30 May 2009, ***@googlemail.com wrote:
...
Post by x***@googlemail.com
I'll definitely be looking at ZFS. Thanks for the info.
I've never been dead set on any option in particular, it's just that I
wasn't aware of anything that would do what I wanted that wasn't just
simple RAID0 and manual backups.
Just for the record:
ZFS is the only FreeBSD filesystem that verifies your data integrity.
I've had dozens of silent block corruptions on SAMSUNG HD103UJ drives.
Thanks to 'zfs scrub' I was able to recognize and repair the corruptions.
Traditional RAIDs only help if the whole drive fails.
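
(A scrub is just the following, with "data" standing in for your pool name:)

  zpool scrub data       # read and verify every block in the pool
  zpool status -v data   # shows scrub progress and any checksum errors found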

Bye/2
---
Michael Reifenberger
***@Reifenberger.com
http://www.Reifenberger.com
krad
2009-05-31 12:13:24 UTC
Permalink
Please don't whack gstripe and zfs together. It should work, but it's ugly
and you might run into issues, and getting out of them will be harder than
with a pure zfs solution.

ZFS does support striping by default across vdevs

E.g.

  zpool create data da1
  zpool add data da2

would create a striped dataset across da1 and da2.

  zpool create data mirror da1 da2
  zpool add data mirror da3 da4

This would create a RAID 10 across all four drives.

  zpool create data raidz2 da1 da2 da3 da5
  zpool add data raidz2 da6 da7 da8 da9

would create a RAID 60.

If you replace the add keyword with attach, mirroring is performed rather
than striping.
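
For example (device names are placeholders):

  zpool create data da1        # single-disk pool
  zpool attach data da1 da2    # turns it into a two-way mirror of da1 and da2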

Just for fun, here is one of the configs off one of our Sun X4500s at work.
It's OpenSolaris, not FreeBSD, but it is zfs. One whopping big array of ~28 TB:

zpool create -O compression=lzjb -O atime=off data raidz2 c3t0d0 c4t0d0
c8t0d0 c10t0d0 c11t0d0 c3t1d0 c4t1d0 c8t1d0 c9t1d0 c10t1d0 c11t1d0 raidz2
c3t2d0 c4t2d0 c8t2d0 c9t2d0 c11t2d0 c3t3d0 c4t3d0 c8t3d0 c9t3d0 c10t3d0
c11t3d0 raidz2 c3t4d0 c4t4d0 c8t4d0 c10t4d0 c11t4d0 c3t5d0 c4t5d0 c8t5d0
c9t5d0 c10t5d0 c11t5d0 raidz2 c3t6d0 c4t6d0 c8t6d0 c9t6d0 c10t6d0 c11t6d0
c3t7d0 c4t7d0 c9t7d0 c10t7d0 c11t7d0 spare c10t2d0 c8t7d0

$ zpool status
pool: archive-2
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scrub: scrub completed after 11h9m with 0 errors on Sun May 31 01:09:22
2009
config:

NAME STATE READ WRITE CKSUM
archive-2 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c10t0d0 ONLINE 0 0 0
c11t0d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c8t1d0 ONLINE 0 0 0
c9t1d0 ONLINE 0 0 0
c10t1d0 ONLINE 0 0 0
c11t1d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c8t2d0 ONLINE 0 0 0
c9t2d0 ONLINE 0 0 0
c11t2d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c8t3d0 ONLINE 0 0 0
c9t3d0 ONLINE 0 0 0
c10t3d0 ONLINE 0 0 0
c11t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c8t4d0 ONLINE 0 0 0
c10t4d0 ONLINE 0 0 0
c11t4d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c8t5d0 ONLINE 0 0 0
c9t5d0 ONLINE 0 0 0
c10t5d0 ONLINE 0 0 0
c11t5d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t6d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c8t6d0 ONLINE 0 0 0
c9t6d0 ONLINE 0 0 0
c10t6d0 ONLINE 0 0 0
c11t6d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c9t7d0 ONLINE 0 0 0
c10t7d0 ONLINE 0 0 0
c11t7d0 ONLINE 0 0 0
spares
c10t2d0 AVAIL
c8t7d0 AVAIL

errors: No known data errors

ZFS also checksums all data blocks written to the drive, so silent
corruption can be detected. If you are paranoid you can also set it to keep
multiple copies of each file. This will eat up loads of disk space, so it's
best to use it sparingly on the most important stuff. You can only do it on
a per-filesystem basis, but that isn't a big deal with zfs:

  zfs create data/important_stuff
  zfs set copies=3 data/important_stuff

You can also enable compression; the big example above has this.

In the near future, encryption and deduplication are also getting
integrated into zfs. This is probably happening in the next few months on
OpenSolaris, but if you want those features in FreeBSD I guess it will take
at least 6 months after that.

With regards to your backup, I suggest you definitely look at doing regular
fs snapshots. To be really safe, I'd install the 1 TB drive (probably worth
getting another as well, as they are cheap) into another machine, and have
it in another room, or another building if possible. Replicate your data
using incremental zfs sends, as this is the most efficient way; you can
easily push it through ssh for security as well. Rsync will work fine, but
you will lose all your zfs filesystem settings with it, as it works at the
user level, not the fs level.
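
(A rough sketch of the nightly incremental send - the dataset name
"data/projects", the target pool "backup" and the host "backuphost" are
made-up names:)

  # first night: full copy
  zfs snapshot data/projects@2009-06-01
  zfs send data/projects@2009-06-01 | ssh backuphost zfs receive backup/projects
  # following nights: send only the changes since the previous snapshot
  zfs snapshot data/projects@2009-06-02
  zfs send -i data/projects@2009-06-01 data/projects@2009-06-02 | ssh backuphost zfs receive backup/projects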

Hope this helps. I'm really looking forward to zfs maturing on FreeBSD and
having pure zfs systems 8)

x***@googlemail.com
2009-05-31 20:14:45 UTC
Permalink
Post by krad
Please don't whack gstripe and zfs together. It should work but is ugly and
you might run into issues. Getting out of them will be harder than a pure
zfs solution
Yeah, will be using pure ZFS having read everything I can find on it so far.
I was skeptical of ZFS at first as it appeared to have come out of nowhere
but it seems it's older (and more mature) than I thought.
Post by krad
ZFS does support striping by default across vdevs
Eg
Zpool create data da1
Zpool add data da2
Would create a striped data set across da1 and da2
What kind of performance gain can I expect from this? I'm purely thinking
about performance now - the integrity checking stuff of ZFS is a pleasant
extra.
Post by krad
Just for fun here is one of the configs off one of our sun x4500 at work,
its opensolaris not freebsd, but it is zfs. One whoping big array of ~ 28 TB
Impressive!
Post by krad
Hope this helps, im really looking forward to zfs maturing on bsd and having
pure zfs systems 8)
Absolutely.

xw
Wojciech Puchar
2009-05-31 21:57:27 UTC
Permalink
Post by x***@googlemail.com
Post by krad
Would create a striped data set across da1 and da2
What kind of performance gain can I expect from this? I'm purely thinking
about performance now - the integrity checking stuff of ZFS is a pleasant
extra.
With striping - as much as with gstripe; ZFS does roughly the same thing.

With RAID-Z - faster transfer, but roughly the same IOps as a single disk.
After I read the ZFS papers I know that RAID-Z is actually more like RAID-3
than RAID-5.
Wojciech Puchar
2009-05-31 23:55:53 UTC
Permalink
should really use raidz2 in zfs (or some double parity raid on other
systems) if you are worried about data integrity. The reason being the odds
of the crc checking not detecting an error are much more likely these days.
The extra layer of parity pushes these odds into being much bigger
You are right about capacity, but not performance. Once again - RAID-Z is
more like RAID-3 than RAID-5, and RAID-Z2 is somewhat like RAID-3 with a
double parity disk.

You will get IOps from a RAID-Z/RAID-Z2 set not much higher than from a
single drive, even on reads.

But if it's used mostly for linear reading of big files, you are right.
x***@googlemail.com
2009-05-31 23:59:43 UTC
Permalink
There is one last thing I'd like clarified. From the zpool
manpage:

In order to take advantage of these features, a pool must make use of
some form of redundancy, using either mirrored or raidz groups. While
ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged. A single
case of bit corruption can render some or all of your data unavailable.

Is this supposed to mean:

"ZFS is more fragile than most. If you don't use redundancy, one
case of bit corruption will destroy the filesystem"

Or:

"Hard disks explode often. Use redundancy."
Mike Meyer
2009-06-01 00:14:08 UTC
Permalink
On Mon, 1 Jun 2009 00:59:43 +0100
Post by x***@googlemail.com
There is one last thing I'd like clarified. From the zpool
In order to take advantage of these features, a pool must make use of
some form of redundancy, using either mirrored or raidz groups. While
ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged. A single
case of bit corruption can render some or all of your data unavailable.
"ZFS is more fragile than most. If you don't use redundancy, one
case of bit corruption will destroy the filesystem"
"Hard disks explode often. Use redundancy."
How about (from an old disk recovery paper):

Disks, unlike software, sometimes fail. Using redundancy can help
you prevent this from resulting in data loss.

That said, there aren't many file systems that can recover from data
errors in the underlying storage. ZFS, appropriately configured, is one.
I don't believe the default config is appropriate, though. You need
both checksum on and copies > 1 on, and the latter isn't the
default. It's probably better to let zpool provide the redundancy via
a mirror or raid configuration than to let zfs do it anyway.
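
(i.e., with a made-up dataset name:)

  zfs get checksum,copies tank/data   # checksum is on by default, copies defaults to 1
  zfs set copies=2 tank/data          # only affects blocks written after this point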

<mike
--
Mike Meyer <***@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
Wojciech Puchar
2009-06-01 00:26:29 UTC
Permalink
Post by Mike Meyer
Disks, unlike software, sometimes fail. Using redundancy can help
Modern SATA drives fail VERY often - about 30% of the drives I bought
recently failed in less than a year.
Post by Mike Meyer
both checksum on and copies > 1 on, and the latter isn't the
default. It's probably better to let zpool provide the redundancy via
a mirror or raid configuration than to let zfs do it anyway.
ZFS copies are far from what I consider useful.

For example, you set copies=2. You write a file and get 2 copies.

Then one disk holding one copy fails; you put in another and do a resilver,
but ZFS DOES NOT rebuild the second copy.

You need to write a program that just rewrites all files to get them back.
Wojciech Puchar
2009-06-01 09:04:57 UTC
Permalink
You shouldn't need to alter the copies attribute to recover from disk
failures as the normal raid should take care of that. What the copies is
I don't think we understand each other. I'm saying that when I want 2
copies, ZFS should rebuild the second copy if it's gone and I run a
resilver.

It does not, which doesn't make sense to me.
krad
2009-06-01 12:19:54 UTC
Permalink
It's all done on write, so if you update the file it will have multiple
copies again.

This explains it quite well

http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection

Wojciech Puchar
2009-06-01 15:49:52 UTC
Permalink
Post by krad
Its all done on write, so if you update the file it will have multiple
copies again
Which is exactly what I said in the beginning.
krad
2009-06-02 08:52:17 UTC
Permalink
" You need to write a program that will just rewrite all files to make
this."

No you don't - you just make sure you scrub the pools regularly, once a
week for instance. This way you will hopefully see small block errors way
before you have a full drive failure.
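
(e.g. a weekly scrub from /etc/crontab, with "data" as a placeholder pool
name:)

  # min hour mday month wday who  command
  0     3    *    *     0    root /sbin/zpool scrub data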

At the end of the day, if the data is super critical to you, then you
should follow best practices and build an array with a sufficient amount of
redundancy for your needs.

If you can't justify getting 4+ drives to build a raidz2 or RAID 10 type
scenario, then the data on said array can't actually be worth much itself;
losing it might be annoying, but in the whole scale of things not that
costly. If it were, you would do things properly in the first place.



Wojciech Puchar
2009-06-02 15:17:36 UTC
Permalink
Post by krad
this."
No you don't you just make sure you scrub the pools regularly once a week
for instance.
AGAIN, an example - I had one drive fail; I took it out but recovered all
data, as copies was set to more than one for everything.

Or most data - the things with copies=1 weren't critical for me.

Then I added a new drive and did a scrub, and the missing copies are NOT
REBUILT.
krad
2009-06-01 08:42:24 UTC
Permalink
You shouldn't need to alter the copies attribute to recover from disk
failures, as the normal RAID should take care of that. What copies is
useful for is when you get undetected write errors on the drive due to CRC
collisions or the drive simply being rubbish.

Zfs will intelligently assign the copies across multiple drives if
possible, so if you had 3 vdevs and copies set to three, one copy should
end up on each vdev. Note this isn't the same as mirroring, as each vdev
could be a raidz2 group. With copies=1 you would get a third of the file on
each; with copies=3 you would get a full version on each.

This is obviously very costly on drive space, though. It is tunable per
file system, so you don't have to enable it for the whole pool, just the
bits you want.

Freddie Cash
2009-06-01 02:14:06 UTC
Permalink
Post by x***@googlemail.com
There is one last thing I'd like clarified. From the zpool
 In order  to take advantage of these features, a pool must make use of
 some form of redundancy, using either mirrored or raidz  groups.  While
 ZFS  supports running in a non-redundant configuration, where each root
 vdev is simply a disk or file, this is strongly discouraged.  A  single
 case of bit corruption can render some or all of your data unavailable.
 "ZFS is more fragile than most. If you don't use redundancy, one
  case of bit corruption will destroy the filesystem"
 "Hard disks explode often. Use redundancy."
Unless you specify mirror or raidz on the create/add line, zfs (in
essence) creates a RAID0 stripe of all the vdevs. Hence, if a single
drive dies, the whole thing dies. Just like in a normal
hardware/software RAID0 array. Nothing special or new here.

Just like "normal" RAID, unless you add redundancy (RAID1/5/6) to a
stripe set, losing a single disk means losing the whole array.
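
(i.e. the difference is just the keyword on the create line; device names
are placeholders:)

  zpool create data da1 da2          # plain stripe: lose either disk and the pool is gone
  zpool create data mirror da1 da2   # mirror: survives a single-disk failure
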
--
Freddie Cash
***@gmail.com
krad
2009-06-01 08:32:09 UTC
Permalink
Zfs has been designed for highly scalable redundant disk pools, so using
it on a single drive kind of goes against its ethos. Remember, a lot of the
blurb in the man page was written by Sun and is therefore written with
corporates in mind: with the cost of the data vs an extra drive being so
lopsided, why wouldn't you make it redundant?

Having said that, SATA drives are cheap these days, so you would have to be
on the tightest of budgets not to do a mirror.

Having said all this, we quite often use zfs on a single drive - well, sort
of. The Sun clusters have external storage for the shared file systems.
These are usually a bunch of drives, RAID 5, 10 or whatever, which export a
single LUN that is presented to the various nodes. There is a zpool created
on this LUN, so to all intents and purposes zfs thinks it's on a single
drive (the redundancy is provided by the external array). This is common
practice and we see no issues with it.
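
(So the pool creation is just something like the following, with da0
standing in for the LUN the array exports:)

  zpool create data da0   # a single "disk" as far as zfs is concerned; redundancy lives in the array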

Tom Evans
2009-06-01 12:50:12 UTC
Permalink
Post by krad
Zfs has been designed for highly scalable redundant disk pools therefore
using it on a single drive kind of goes against it ethos. Remember a lot of
the blurb in the man page was written by sun and therefore is written with
corporates in mind, therefore the cost with of the data vs an extra drive
being so large why wouldn't you make it redundant.
Having said that sata drives are cheap these days so you would have to be on
the tightest of budgets not to do a mirror.
Having said all this we quite often us zfs on a single drive, well sort of.
The sun clusters have external storage for the shared file systems. These
are usually a bunch of drives, raid 5, 10 or whatever. Then export a single
lun, which is presented to the various nodes. There is a zpool created on
this LUN. So to all intents and purposes zfs thinks its on a single drive
(the redundancy provided by the external array). This is common practice and
we see no issues with it.
By doing this surely you lose a lot of the self healing that ZFS offers?
For instance, if the underlying vdev is just a raid5, then a disk
failure combined with an undetected checksum error on a different disk
would lead you to lose all your data. Or am I missing something?

(PS, top posting is bad)

Tom
krad
2009-06-01 13:19:50 UTC
Permalink
No, you would only lose the data for that block. Zfs also checksums
metadata, but by default keeps multiple copies of it, so that's fairly
resilient. If you had copies set to > 1 then you wouldn't lose the block
either, unless you were really unlucky.

It's just about pushing the odds back further and further. If you are super
paranoid, by all means put in 48 drives, group them into 5 x 8-drive raidz2
vdevs, have a bunch of hot spares, and enable copies=5 for blocks and
metadata, then duplicate the system, put the other box on another continent
and zfs send all your updates every 15 mins via a private dedicated link.
This will all prove very resilient, but you will get very little % storage
from your drives, and have quite a large bandwidth bill 8)

Oh, and don't forget to scrub your disks regularly. BTW, that would rebuild
any missing copies as well (e.g. if you increase the number of copies after
data is stored on the fs).

Tom Evans
2009-06-01 14:15:43 UTC
Permalink
Post by krad
no you would only loose the data for that block. Zfs also checksums meta
data, but by default keeps multiple copies of it so that's fairly resilient.
If you had the copies set to > 1 then you wouldn't loose the block either,
unless you were real unlucky.
It's just about pushing the odds back further and further. If you are super
paranoid by all means put in 48 drive, group them into 5 x 8 drive raidz2
vdevs, have a bunch of hot spares, and enable copies=5 for blocks and
metadata, then duplicate the system and put the other box on another
continent and zfs send all you updates every 15 mins via a private
deadicated. This will all prove very resilient, but you will get very little
% storage from your drives, and have quite a large bandwidth bill 8)
Oh and don't forget the scrub you disk regularly. BTW that would rebuild any
missing copies as well (eg if you increase the number of copies after data
is stored on the fs)
Well, no, you wouldn't, because ZFS would never get to try to recover that
error. Since that one block is bad, and you lost a disk, your underlying
RAID-5 would not be able to recover, and you would have just lost the
entire contents of the RAID-5. ZFS wouldn't be able to recover anything
from it. The only time ZFS could recover from this scenario is if you
scrubbed before you had your disk failure. Hard to predict disk failures...

What I'm trying to say (badly) is that this is redundancy that ZFS knows
nothing about, so it cannot recover from it in the same manner that a 5
disk raidz can. If this happened to a 5 disk raid-z, you would lose just
the corrupted block, rather than all your data.

PS, top posting is still bad. Thanks for making me cut the context out
of all these emails.

Cheers

Tom
krad
2009-05-31 22:13:19 UTC
Permalink
Yep. It's also worth noting that with the capacities of drives these days
you should really use raidz2 in zfs (or some double-parity RAID on other
systems) if you are worried about data integrity. The reason being that the
odds of the CRC checking failing to detect an error are much higher these
days; the extra layer of parity pushes those odds back out much further.

Mike Meyer
2009-05-31 20:33:05 UTC
Permalink
On Sun, 31 May 2009 13:13:24 +0100
Post by krad
Please don't whack gstripe and zfs together. It should work but is ugly and
you might run into issues. Getting out of them will be harder than a pure
zfs solution
Yeah, I sorta suspected that might be the case.
Post by krad
ZFS does support striping by default across vdevs
This isn't documented - at least not in my copies of the manual
page. Not being able to find that was the only reason to even consider
mixing technologies like that.

Thanks,
<mike
--
Mike Meyer <***@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
krad
2009-05-31 22:00:35 UTC
Permalink
Yep, it probably isn't clear enough; it does mention stuff about spreading
data across vdevs, but doesn't say striped. But that's Sun for you.

The man page should probably be BSDified more, as it's more or less pulled
from Solaris. Note the devices don't look anything like BSD ones (c0t0d0).


Wojciech Puchar
2009-06-01 00:31:07 UTC
Permalink
Post by krad
Yep it probably isn't clear enough, it does mention stuff about spreading it
across vdevs, but doesn't say striped.
Isn't spreading and striping actually the same thing?
Wojciech Puchar
2009-06-01 09:02:51 UTC
Permalink
In the case of zfs yes, but not always. Eg you could have a concatenated
volume. Where you only start writing to the second disk when the 1st is
full.
I don't know exactly how ZFS allocates space, but I use gconcat with UFS
and that isn't true there.

UFS does "jump" between zones (called cylinder groups) when files are
written to prevent filling them unevenly, so every zone always has some
space to allocate.

The effect is that both drives get filled quite evenly.
krad
2009-06-01 08:36:12 UTC
Permalink
In the case of zfs, yes, but not always. E.g. you could have a
concatenated volume, where you only start writing to the second disk when
the first is full.

Ivan Voras
2009-06-06 19:54:15 UTC
Permalink
Sorry to come into the discussion late, but I just want to confirm
something.

The configuration below is a stripe of four components, each of which is
RAIDZ2, right?

If, as was discussed later in the thread, RAIDZ(2) is more similar to
RAID3 than RAID5 for random performance, can the given configuration be
expected (very roughly, in the non-sequential access case) to deliver the
performance of four drives in a RAID0 array?
Post by krad
zpool create -O compression=lzjb -O atime=off data raidz2 c3t0d0 c4t0d0
c8t0d0 c10t0d0 c11t0d0 c3t1d0 c4t1d0 c8t1d0 c9t1d0 c10t1d0 c11t1d0 raidz2
c3t2d0 c4t2d0 c8t2d0 c9t2d0 c11t2d0 c3t3d0 c4t3d0 c8t3d0 c9t3d0 c10t3d0
c11t3d0 raidz2 c3t4d0 c4t4d0 c8t4d0 c10t4d0 c11t4d0 c3t5d0 c4t5d0 c8t5d0
c9t5d0 c10t5d0 c11t5d0 raidz2 c3t6d0 c4t6d0 c8t6d0 c9t6d0 c10t6d0 c11t6d0
c3t7d0 c4t7d0 c9t7d0 c10t7d0 c11t7d0 spare c10t2d0 c8t7d0
NAME STATE READ WRITE CKSUM
archive-2 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c8t0d0 ONLINE 0 0 0
c10t0d0 ONLINE 0 0 0
c11t0d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c8t1d0 ONLINE 0 0 0
c9t1d0 ONLINE 0 0 0
c10t1d0 ONLINE 0 0 0
c11t1d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c8t2d0 ONLINE 0 0 0
c9t2d0 ONLINE 0 0 0
c11t2d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c8t3d0 ONLINE 0 0 0
c9t3d0 ONLINE 0 0 0
c10t3d0 ONLINE 0 0 0
c11t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c8t4d0 ONLINE 0 0 0
c10t4d0 ONLINE 0 0 0
c11t4d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c8t5d0 ONLINE 0 0 0
c9t5d0 ONLINE 0 0 0
c10t5d0 ONLINE 0 0 0
c11t5d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c3t6d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c8t6d0 ONLINE 0 0 0
c9t6d0 ONLINE 0 0 0
c10t6d0 ONLINE 0 0 0
c11t6d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c9t7d0 ONLINE 0 0 0
c10t7d0 ONLINE 0 0 0
c11t7d0 ONLINE 0 0 0
spares
c10t2d0 AVAIL
c8t7d0 AVAIL
errors: No known data errors
Freddie Cash
2009-06-06 20:16:32 UTC
Permalink
Post by Ivan Voras
Sorry to come into the discussion late, but I just want to confirm
something.
The configuration below is a stripe of four components, each of which is
RAIDZ2, right?
If, as was discussed later in the thread, RAIDZ(2) is more similar to
RAID3 than RAID5 for random performance, the given configuration can be
(very roughly, in the non-sequential access case) expected to deliver
performance of four drives in a RAID0 array?
According to all the Sun documentation, the I/O throughput of a raidz
configuration is equal to that of a single drive.

Hence their recommendation to not use more than 8 or 9 drives in a
single raidz vdev, and to use multiple raidz vdevs. As you add vdevs,
the throughput increases.

We made the mistake early on of creating a 24-drive raidz2 vdev.
Performance was not very good. And when we had to replace a drive, it
spent over a week trying to resilver. But the resilver operation has
to touch every single drive in the raidz vdev. :(

We remade the pool using 3x 8-drive raidz2 vdevs, and performance has
been great (400 MBytes/s write, almost 3 GBytes/s sequential read, 800
MBytes/s random read).
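
(For reference, a layout like that is created in one go - the pool name and
device names below are placeholders:)

  zpool create tank \
    raidz2 da0  da1  da2  da3  da4  da5  da6  da7  \
    raidz2 da8  da9  da10 da11 da12 da13 da14 da15 \
    raidz2 da16 da17 da18 da19 da20 da21 da22 da23
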
--
Freddie Cash
***@gmail.com
krad
2009-06-07 00:01:17 UTC
Permalink
Post by Freddie Cash
We remade the pool using 3x 8-drive raidz2 vdevs, and performance has
been great (400 MBytes/s write, almost 3 GBytes/s sequential read, 800
MBytes/s random read).
Yep, that corresponds with what we saw, although we were getting slightly
higher write rates with our 46-drive configuration.
Wojciech Puchar
2009-06-07 07:01:34 UTC
Permalink
Post by Freddie Cash
Post by Ivan Voras
(very roughly, in the non-sequential access case) expected to deliver
performance of four drives in a RAID0 array?
According to all the Sun documentation, the I/O throughput of a raidz
configuration is equal to that of a single drive.
Exactly what I said - it's like RAID3, not RAID5, which has close to n
times single-drive throughput on reads and roughly n/4 on writes.
Post by Freddie Cash
We remade the pool using 3x 8-drive raidz2 vdevs, and performance has
been great (400 MBytes/s write, almost 3 GBytes/s sequential read, 800
Why is write performance so slow? In Sun's theory it should have the same
speed as reads. I would even say it should be a bit better - the filesystem
gets the data into cache first and can plan ahead.
Post by Freddie Cash
MBytes/s random read).
Random reads of how big chunks?

Are you sure you get 3GB/s on reads? That would mean each drive must be
able to do 140MB/s.

What disks do you use?
Freddie Cash
2009-06-08 19:11:17 UTC
Permalink
On Sun, Jun 7, 2009 at 12:01 AM, Wojciech Puchar wrote:
Post by Wojciech Puchar
Post by Freddie Cash
Post by Ivan Voras
(very roughly, in the non-sequential access case) expected to deliver
performance of four drives in a RAID0 array?
According to all the Sun documentation, the I/O throughput of a raidz
configuration is equal to that of a single drive.
exactly what i say. it's like RAID3. Not RAID5 which have close to n times
single drive throughput on read and rougly n/4 on writes.
Post by Freddie Cash
We remade the pool using 3x 8-drive raidz2 vdevs, and performance has
been great (400 MBytes/s write, almost 3 GBytes/s sequential read, 800
why write performance is so slow? in Sun theory it should have the same
speed as reads. I would say that it should be even better a bit -
filesystem get data first in cache and can plan ahead.
Post by Freddie Cash
MBytes/s random read).
random read on how big chunks?
Are you sure you get 3GB/s on read? it would mean each drive must be able
to do 140MB/s
What disks do you use?
12x 500 GB Seagate EL2 SATA drives, part of their enterprise near-line
storage line.
12x 500 GB WD SATA drives, generic off-the-shelf drives

I re-ran the iozone tests, letting them run to completion, and here are the
results:

The iozone command: iozone -M -e -+u -T -t <threads> -r 128k -s 40960 -i 0
-i 1 -i 2 -i 8 -+p 70 -C
I ran the command using 32, 64, 128, and 256 for <threads>

Write speeds range from 236 MBytes/sec to 582 MBytes/sec for sequential; and
from 242 MBytes/sec to 550 MBytes/sec for random.

Read speeds range from 3.3 GBytes/sec to 5.5 GBytes/sec for sequential; and
from 1.8 GBytes/sec to 5.5 GBytes/sec for random.

All the gory details are below.

32-threads: Children see ... 32 initial writers = 582468.13 KB/sec
32-threads: Parent sees ... 32 initial writers = 108808.46 KB/sec
64-threads: Children see ... 64 initial writers = 236144.47 KB/sec
64-threads: Parent sees ... 64 initial writers = 86942.94 KB/sec
128-threads: Children see ... 128 initial writers = 284706.68 KB/sec
128-threads: Parent sees ... 128 initial writers = 10850.40 KB/sec
256-threads: Children see ... 256 initial writers = 258260.59 KB/sec
256-threads: Parent sees ... 256 initial writers = 9882.16 KB/sec

32-threads: Children see ... 32 rewriters = 545347.52 KB/sec
32-threads: Parent sees ... 32 rewriters = 339308.08 KB/sec
64-threads: Children see ... 64 rewriters = 419838.51 KB/sec
64-threads: Parent sees ... 64 rewriters = 335620.45 KB/sec
128-threads: Children see ... 128 rewriters = 350668.51 KB/sec
128-threads: Parent sees ... 128 rewriters = 319452.97 KB/sec
256-threads: Children see ... 256 rewriters = 317751.52 KB/sec
256-threads: Parent sees ... 256 rewriters = 295579.66 KB/sec

32-threads: Children see ... 32 random writers = 379256.37 KB/sec
32-threads: Parent sees ... 32 random writers = 95298.44 KB/sec
64-threads: Children see ... 64 random writers = 551767.68 KB/sec
64-threads: Parent sees ... 64 random writers = 113397.95 KB/sec
128-threads: Children see ... 128 random writers = 241980.60 KB/sec
128-threads: Parent sees ... 128 random writers = 74584.01 KB/sec
256-threads: Children see ... 256 random writers = 398427.84 KB/sec
256-threads: Parent sees ... 256 random writers = 20219.56 KB/sec

32-threads: Children see ... 32 readers = 5023742.86 KB/sec
32-threads: Parent sees ... 32 readers = 4661309.72 KB/sec
64-threads: Children see ... 64 readers = 5516460.71 KB/sec
64-threads: Parent sees ... 64 readers = 3949337.61 KB/sec
128-threads: Children see ... 128 readers = 4748635.74 KB/sec
128-threads: Parent sees ... 128 readers = 3208982.03 KB/sec
256-threads: Children see ... 256 readers = 4358453.38 KB/sec
256-threads: Parent sees ... 256 readers = 2741593.08 KB/sec

32-threads: Children see ... 32 re-readers = 5502926.62 KB/sec
32-threads: Parent sees ... 32 re-readers = 4650327.75 KB/sec
64-threads: Children see ... 64 re-readers = 5509400.02 KB/sec
64-threads: Parent sees ... 64 re-readers = 4526444.40 KB/sec
128-threads: Children see ... 128 re-readers = 4072363.55 KB/sec
128-threads: Parent sees ... 128 re-readers = 2840317.47 KB/sec
256-threads: Children see ... 256 re-readers = 3329375.95 KB/sec
256-threads: Parent sees ... 256 re-readers = 2183894.33 KB/sec

32-threads: Children see ... 32 random readers = 5555090.45 KB/sec
32-threads: Parent sees ... 32 random readers = 4602383.62 KB/sec
64-threads: Children see ... 64 random readers = 4402270.77 KB/sec
64-threads: Parent sees ... 64 random readers = 2059081.52 KB/sec
128-threads: Children see ... 128 random readers = 3070466.93 KB/sec
128-threads: Parent sees ... 128 random readers = 525076.11 KB/sec
256-threads: Children see ... 256 random readers = 1888676.12 KB/sec
256-threads: Parent sees ... 256 random readers = 293304.53 KB/sec

32-threads: Children see ... 32 mixed workload = 3130000.18 KB/sec
32-threads: Parent sees ... 32 mixed workload = 123281.78 KB/sec
64-threads: Children see ... 64 mixed workload = 1587053.33 KB/sec
64-threads: Parent sees ... 64 mixed workload = 294586.82 KB/sec
128-threads: Children see ... 128 mixed workload = 807349.95 KB/sec
128-threads: Parent sees ... 128 mixed workload = 98998.77 KB/sec
256-threads: Children see ... 256 mixed workload = 393469.55 KB/sec
256-threads: Parent sees ... 256 mixed workload = 112394.90 KB/sec
--
Freddie Cash
***@gmail.com