Discussion:
File locking, closes and performance in a distributed file system env
Richard Sharpe
2002-05-14 17:31:12 UTC
Hi,

I might be way off base here, but we have run into what looks like a
performance issue with locking and file closes.

We have implemented a distributed file system and were looking at some
performance issues.

At the moment, if a process locks a file, the code in kern_descrip.c that
handles it does the following:

p->p_flag |= P_ADVLOCK;

to indicate that files might be locked.

Later in closef, we see the following:

	if (p && (p->p_flag & P_ADVLOCK) && fp->f_type == DTYPE_VNODE) {
		lf.l_whence = SEEK_SET;
		lf.l_start = 0;
		lf.l_len = 0;
		lf.l_type = F_UNLCK;
		vp = (struct vnode *)fp->f_data;
		(void) VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCK, &lf,
		    F_POSIX);
	}

This seems to mean that once a process locks a file, every close after
that will pay the penalty of calling the underlying vnode unlock call. In
a distributed file system with a simple implementation, that unlock could
mean an RPC to the lock manager.
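
To make the penalty concrete, here is a minimal userland sketch (the
/dfs paths are made up): once a process takes any advisory lock, every
later close() in that process walks the unlock path, whether or not
that particular file was ever locked.

#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	struct flock fl;
	int locked = open("/dfs/locked", O_RDWR | O_CREAT, 0644);
	int untouched = open("/dfs/untouched", O_RDWR | O_CREAT, 0644);

	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;
	fl.l_type = F_WRLCK;
	fcntl(locked, F_SETLK, &fl);	/* sets P_ADVLOCK on the process */

	close(untouched);		/* never locked, but still walks the
					 * F_UNLCK path in closef() */
	close(locked);
	return 0;
}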

Now, there seem to be a few ways to mitigate this:

1. Keep (more) state at the vnode layer that allows us to avoid issuing a
network-traversing unlock if the file was not locked. This means that any
process that has opened the file will have to issue the network-traversing
unlock request once the flag is set on the vnode.

2. Place a flag in the struct file structure that keeps the state of any
locks on the file. This means that any processes that share the struct
(those related by fork) will need to issue unlock requests if one of them
locks the file.

3. Change the file descriptor table that hangs off the process structure so
that it includes state about whether or not this process has locked the
file.

It seems that each of these reduces the performance penalty paid by
processes that share the file but have not locked it.

Option 2 looks easy.

Are there any comments?

Regards
-----
Richard Sharpe, ***@ns.aus.com, ***@samba.org,
***@ethereal.com


Terry Lambert
2002-05-15 00:19:44 UTC
Post by Richard Sharpe
I might be way off base here, but we have run into what looks like a
performance issue with locking and file closes.
[ ... ]
Post by Richard Sharpe
This seems to mean that once a process locks a file, every close after
that will pay the penalty of calling the underlying vnode unlock call. In
a distributed file system, with a simple implementation, that could be an
RPC to the lock manager to implement.
Yes. This is pretty much required by the POSIX locking
semantics, which require that the first close remove all
locks. Unfortunately, you can't know on a per process
basis that there are no locks remaining on *any* vnode for
a given process, so the overhead is sticky.
Post by Richard Sharpe
1. Keep (more) state at the vnode layer that allows us to not issue a
network traversing unlock if the file was not locked. This means that any
process that has opened the file will have to issue the network traversing
unlock request once the flag is set on the vnode.
2. Place a flag in the struct file structure that keeps the state of any
locks on the file. This means that any processes that share the struct
(those related by fork) will need to issue unlock requests if one of them
locks the file.
3. Change a file descriptor table that hangs off the process structure so
that it includes state about whether or not this process has locked the
file.
It seems that each of these reduces the performance penalty that processes
that might be sharing the file, but which have not locked the file, might
have to pay.
Option 2 looks easy.
Are there any comments?
#3 is really unreasonable. It implies non-coalescing. I know that
CIFS requires this, and so does NFSv4, so it's not an unreasonable
thing to do eventually (historical behaviour can be maintained by
removing all locks in the overlap region on an unlock, yielding
logical coalescing). The amount of code that would need to be
touched by this, though, means it's probably not worth doing now.

In reality, for remote FS's, you want to assert the lock locally
before transiting the network anyway, in case there is a local
conflict, in which case you avoid propagating the request over
the network. For union mounts of local and remote FS's, for
which there is a local lock against the local FS by another process
that doesn't respect the union (a legitimate thing to have happen),
it's actually a requirement, since the remote system may promote
or coalesce locks, and that means that there is no reverse process
for a remote success followed by a local failure.

This is basically a twist on #1:

a) Assert the lock locally before asserting it remotely;
if the assertion fails, then you have avoided a network
operation which is doomed to failure (the RPC call you
are trying to avoid is similar).

b) When unlocking, verify that the lock exists locally
before attempting to deassert it remotely. This means
that there is still the same local overhead as there
always was, but at least you avoid the RPC in the case
where there are no outstanding locks that will be
cleared by the call.

I've actually wanted VOP_ADVLOCK to be veto-based for going
on 6 years now, to avoid precisely the type of problem you are
now facing. If the upper layer code did local assertion on vnodes,
and called the lower layer code only in the success cases, then the
implementation would actually be done for you already.
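
A minimal sketch of that veto-based arrangement, under assumed names
(struct dfsnode, VTODFS(), dn_lockf, dn_size and remote_advlock_rpc()
are all hypothetical; lf_advlock() is the generic local lock code that
ufs_advlock() already uses):

#include <sys/param.h>
#include <sys/fcntl.h>
#include <sys/lockf.h>
#include <sys/vnode.h>

struct dfsnode {			/* hypothetical per-file node */
	struct lockf	*dn_lockf;	/* local lock list head */
	u_quad_t	 dn_size;	/* current file size */
};
#define	VTODFS(vp)	((struct dfsnode *)(vp)->v_data)

static int remote_advlock_rpc(struct vop_advlock_args *ap); /* hypothetical */

static int
dfs_advlock(struct vop_advlock_args *ap)
{
	struct dfsnode *np = VTODFS(ap->a_vp);
	int error;

	/* (b) Unlock with nothing on the local list: provably nothing
	 * to clear remotely, so skip the RPC entirely. */
	if (ap->a_op == F_UNLCK && np->dn_lockf == NULL)
		return (0);

	/* (a) Assert (or release) the lock locally first; a local
	 * conflict vetoes the doomed network operation. */
	error = lf_advlock(ap, &np->dn_lockf, np->dn_size);
	if (error)
		return (error);

	/* Only local successes are propagated to the lock manager. */
	return (remote_advlock_rpc(ap));
}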

-- Terry

Richard Sharpe
2002-05-15 01:52:45 UTC
On Tue, 14 May 2002, Terry Lambert wrote:

Hmmm, I wasn't very clear ...

What I am proposing is a 'simple' fix that simply changes

p->p_flag |= P_ADVLOCK;

to

fp->l_flag |= P_ADVLOCK;

And never resets it, and then in closef,

	if ((fp->l_flag & P_ADVLOCK) && fp->f_type == DTYPE_VNODE) {
		lf.l_whence = SEEK_SET;
		lf.l_start = 0;
		lf.l_len = 0;
		lf.l_type = F_UNLCK;
		vp = (struct vnode *)fp->f_data;
		(void) VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCK, &lf,
		    F_POSIX);
	}

Which still means that the correct functionality is implemented, but we
only try to unlock files that have ever been locked before, or where we
are sharing a file struct with another (related) process and one of them
has locked the file.
Post by Terry Lambert
[ ... ]
--
Regards
-----
Richard Sharpe, ***@ns.aus.com, ***@samba.org,
***@ethereal.com


Terry Lambert
2002-05-15 05:57:54 UTC
Post by Richard Sharpe
Hmmm, I wasn't very clear ...
What I am proposing is a 'simple' fix that simply changes
p->p_flag |= P_ADVLOCK;
to
fp->l_flag |= P_ADVLOCK;
And never resets it, and then in closef,
	if ((fp->l_flag & P_ADVLOCK) && fp->f_type == DTYPE_VNODE) {
		lf.l_whence = SEEK_SET;
		lf.l_start = 0;
		lf.l_len = 0;
		lf.l_type = F_UNLCK;
		vp = (struct vnode *)fp->f_data;
		(void) VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCK, &lf,
		    F_POSIX);
	}
Which still means that the correct functionality is implemented, but we
only try to unlock files that have ever been locked before, or where we
are sharing a file struct with another (related) process and one of them
has locked the file.
Do you expect to share the same "fp" between multiple open
instances for a given file within a single process?

I think your approach will fail to implement proper POSIX
file locking semantics.

I really hate POSIX semantics, but you have to implement
them exactly (at least by default), because programs are
written to expect them.

Basically, this means that if you open a file twice, lock it
via the first fd, then close the second fd, all locks are
released. In your code, it looks like when you close the
second fd, its fp->l_flag won't have the bit set, so the
unlock would wrongly be skipped. Correct me if I'm wrong?

The reason for the extra overhead now is that you can't do
this on an open instance basis because of POSIX, so it does it
on a process instance basis.
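
To make the scenario concrete, here is a minimal userland sketch (the
file name is made up). Under POSIX semantics the close of fd2 must
drop the lock taken through fd1, which a per-fp flag would miss:

#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	struct flock fl;
	int fd1 = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0644);
	int fd2 = open("/tmp/lockdemo", O_RDWR);  /* second open, own fp */

	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;
	fl.l_type = F_WRLCK;
	fcntl(fd1, F_SETLK, &fl);	/* lock via the first descriptor */

	close(fd2);			/* POSIX: this must release the lock,
					 * but fd2's fp never had the flag
					 * set, so the unlock would wrongly
					 * be skipped */
	close(fd1);
	return 0;
}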

The only other alternative is to do it on a vp basis -- and
since multiple fp's can point to the same vp, your option #2
will fail, as described above, but my suggestion to do the
locking locally, associating it with the vp (or the v_data,
depending on which version of FreeBSD and on whether VOP_ADVLOCK
hangs the lock list off the vnode or the inode), will
maintain the proper semantics.

Your intent isn't really to avoid the VOP_ADVLOCK call, it's
to avoid making an RPC call to satisfy the VOP_ADVLOCK call,
right?

You can't really avoid *all* the "avoidable overhead", without
restructuring the VOP_ADVLOCK interface, which is politically
difficult.

-- Terry

Alfred Perlstein
2002-05-15 06:24:25 UTC
Post by Richard Sharpe
Hmmm, I wasn't very clear ...
What I am proposing is a 'simple' fix that simply changes
p->p_flag |= P_ADVLOCK;
to
fp->l_flag |= P_ADVLOCK;
As Terry stated, you can't do that. However, you could cache the
fact that the vnode has a lock; that would reduce the need to
call the ADVLOCK VOP.
--
-Alfred Perlstein [***@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'
Tax deductible donations for FreeBSD: http://www.freebsdfoundation.org/

Terry Lambert
2002-05-15 08:35:48 UTC
Post by Alfred Perlstein
As Terry stated you can't do that, however you could cache that the
VNODE has a lock, that would reduce the requirement for calling the
ADVLOCK VOP.
You'd really have to know when the lock list went to NULL, to get
any benefit out of it, since locking would still end up being per-file
sticky. You could post-check after every successful unlock... but to
cache the remote state would mean another RPC to ask for locks, which
would just be front-loading the expense, instead of back-loading it.

I don't think this would be a win: applications don't really choose
selective locking: either they tend to lock everything or they don't
respect locking at all.

Also, it's very common to check a lock before doing something, so
you'll still have all that unnecessary traffic.

You really need to cache the entire local lock list.

The most reasonable way to do this would be a "magic" return
value from the lock primitive. But since you can't get cooperation
from the other end on that, the only reasonable way is to maintain
the lock list locally as well as remotely, and then check for
NULL (instead of the flag) before attempting to proxy the unnecessary
unlock request.

Which is basically what I said the first time. 8-).
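
A sketch of the contrast, with made-up names (struct dfsnode,
dn_haslocks and remote_unlock_rpc() are hypothetical): a per-vnode
boolean hint is cheap to set on the first lock, but only a full local
mirror of the list tells you when the last lock is gone, which is what
lets the unlock RPC be skipped safely.

#include <sys/lockf.h>
#include <sys/vnode.h>

struct dfsnode {			/* hypothetical per-file node */
	struct lockf	*dn_lockf;	/* full local mirror of the lock list */
	int		 dn_haslocks;	/* boolean hint: cheap, but sticky */
};

static int remote_unlock_rpc(struct vop_advlock_args *ap); /* hypothetical */

static int
dfs_unlock_on_close(struct dfsnode *np, struct vop_advlock_args *ap)
{
	/*
	 * With only dn_haslocks there is never a safe point to clear
	 * the hint (you can't tell when the list went to NULL), so
	 * once it is set, every close pays the unlock RPC anyway.
	 */
	if (np->dn_lockf == NULL)
		return (0);		/* mirror empty: no RPC needed */
	return (remote_unlock_rpc(ap));
}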

I suspect there are other things he's not telling us, too, like
session handoff or other clustering protocols, which means that
you could have your locks handed off, and end up without the
"there's a lock" flag set, in any case, so there's still room
to shoot your foot off. This may be an incorrect suspicion,
but... generally, distributed file systems are built that way
for a particular reason/application set, which includes things
like client virtualization/migration/handoff.

Too fun. 8-).

-- Terry

Alfred Perlstein
2002-05-15 15:51:01 UTC
Post by Terry Lambert
Post by Alfred Perlstein
As Terry stated you can't do that, however you could cache that the
VNODE has a lock, that would reduce the requirement for calling the
ADVLOCK VOP.
You'd really have to know when the lock list went to NULL, to get
any benefit out of it, since locking would still end up being per-file
sticky. You could post-check after every successful unlock... but to
cache the remote state would mean another RPC to ask for locks, which
would just be front-loading the expense, instead of back-loading it.
[snip]

He could also maintain a local cache of this per vnode: basically,
mirror the lock list locally in order to see whether a remote
op must be done.
--
-Alfred Perlstein [***@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'
Tax deductible donations for FreeBSD: http://www.freebsdfoundation.org/

Andrew R. Reiter
2002-05-15 16:53:38 UTC
On Wed, 15 May 2002, Alfred Perlstein wrote:

:* Terry Lambert <***@mindspring.com> [020515 01:36] wrote:
:> Alfred Perlstein wrote:
:> > As Terry stated you can't do that, however you could cache that the
:> > VNODE has a lock, that would reduce the requirement for calling the
:> > ADVLOCK VOP.
:> You'd really have to know when the lock list went to NULL, to get
:> any benefit out of it, since locking would still end up being per-file
:> sticky. You could post-check after every successful unlock... but to
:> cache the remote state would mean another RPC to ask for locks, which
:> would just be front-loading the expense, instead of back-loading it.
:
:[snip]
:
:He could also maintain a local cache of this per vnode, basically
:maintain a mirror of the lock list locally in order to see if a remote
:op must be done.

Isn't this sorta like Coda?

--
Andrew R. Reiter
***@watson.org
***@FreeBSD.org


Alfred Perlstein
2002-05-15 17:06:36 UTC
Post by Andrew R. Reiter
:> > As Terry stated you can't do that, however you could cache that the
:> > VNODE has a lock, that would reduce the requirement for calling the
:> > ADVLOCK VOP.
:> You'd really have to know when the lock list went to NULL, to get
:> any benefit out of it, since locking would still end up being per-file
:> sticky. You could post-check after every successful unlock... but to
:> cache the remote state would mean another RPC to ask for locks, which
:> would just be front-loading the expense, instead of back-loading it.
:[snip]
:He could also maintain a local cache of this per vnode, basically
:maintain a mirror of the lock list locally in order to see if a remote
:op must be done.
Isn't this sorta like coda?
I'm not a coda expert so I wouldn't know, but I wasn't professing to
have invented something profound by suggesting a client cache. :)
--
-Alfred Perlstein [***@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'
Tax deductible donations for FreeBSD: http://www.freebsdfoundation.org/

O Senhor
2002-05-15 18:08:17 UTC
Do you know about performance in Postfix? I have a FreeBSD (4.5) box
running Postfix and delivering mail to 65,000 mailboxes... I know about
maildirs... but how would maildir help me? The Postfix delivery agent
simply can't do the job. Is this because of the large number of entries?

help please.


void
2002-05-16 17:41:47 UTC
Post by O Senhor
Do you know about performance in Postfix? I have a FreeBSD (4.5) box
running Postfix and delivering mail to 65,000 mailboxes... I know about
maildirs... but how would maildir help me? The Postfix delivery agent
simply can't do the job. Is this because of the large number of entries?
Please don't cross-post between hackers and questions. Also, please
don't start a new thread by responding to a message in an unrelated
thread.

You don't provide nearly enough information to tell why your system is
underperforming. It could be a tuning issue, or your system may simply
not have enough RAM or (more likely) not enough I/O bandwidth.

I recommend that you contact the postfix-users mailing list. The people
there give good Postfix tuning advice. The list is closed so you'll
have to subscribe before you can post.

http://www.postfix.org/lists.html

When you post there, you should describe your hardware, including CPU
speed, how much RAM you have, and the number, type and layout of your
disks. If you've made any significant changes to the default Postfix
config, describe them; if you haven't, tell the list that you haven't.
Good luck.
--
Ben

"An art scene of delight
I created this to be ..." -- Sun Ra

Terry Lambert
2002-05-15 18:38:27 UTC
Post by Andrew R. Reiter
:He could also maintain a local cache of this per vnode, basically
:maintain a mirror of the lock list locally in order to see if a remote
:op must be done.
Isn't this sorta like coda?
Lock cache, not data cache.

It's "sort of like":

http://www.blackflag.ru/patches/nfs-client-and-server-locking-4.5-STABLE-20020312.diff

-- Terry

Terry Lambert
2002-05-15 18:36:42 UTC
Post by Alfred Perlstein
He could also maintain a local cache of this per vnode, basically
maintain a mirror of the lock list locally in order to see if a remote
op must be done.
I think we are talking past each other.

This is what I've been suggesting since my first message,
though I suggested it would be a political problem to make
the necessary VOP_ADVLOCK interface changes.

It's also one of the patches I put up a long time ago,
which has been updated to FreeBSD 4.5 by Andrey.

He could just take Andrey's code, but if it isn't committed
back to FreeBSD, he would have compatibility issues.

-- Terry

Richard Sharpe
2002-05-15 18:51:41 UTC
Post by Terry Lambert
Post by Richard Sharpe
Hmmm, I wasn't very clear ...
What I am proposing is a 'simple' fix that simply changes
p->p_flag |= P_ADVLOCK;
to
fp->l_flag |= P_ADVLOCK;
And never resets it, and then in closef,
	if ((fp->l_flag & P_ADVLOCK) && fp->f_type == DTYPE_VNODE) {
		lf.l_whence = SEEK_SET;
		lf.l_start = 0;
		lf.l_len = 0;
		lf.l_type = F_UNLCK;
		vp = (struct vnode *)fp->f_data;
		(void) VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCK, &lf,
		    F_POSIX);
	}
Which still means that the correct functionality is implemented, but we
only try to unlock files that have ever been locked before, or where we
are sharing a file struct with another (related) process and one of them
has locked the file.
Do you expect to share the same "fp" between multiple open
instances for a given file within a single process?
I think your approach will fail to implement proper POSIX
file locking semantics.
I really hate POSIX semantics, but you have to implement
them exactly (at least by default), because programs are
written to expect them.
Basically, this means that if you open a file twice, lock it
via the first fd, then close the second fd, all locks are
released. In your code, it looks like when you close the
second fd, its fp->l_flag won't have the bit set, so the
unlock would wrongly be skipped. Correct me if I'm wrong?
The reason for the extra overhead now is that you can't do
this on an open instance basis because of POSIX, so it does it
on a process instance basis.
OK, you have convinced me. I have looked at the POSIX spec in this area,
and agree that I can't do what I want to do.
Post by Terry Lambert
The only other alternative is to do it on a vp basis -- and
since multiple fp's can point to the same vp, your option #2
will fail, as described above, but my suggestion to do the
locking locally, associating it with the vp (or the v_data,
depending on which version of FreeBSD and on whether VOP_ADVLOCK
hangs the lock list off the vnode or the inode), will
maintain the proper semantics.
Your intent isn't really to avoid the VOP_ADVLOCK call, it's
to avoid making an RPC call to satisfy the VOP_ADVLOCK call,
right?
Yes, correct. We will have to do it in the vnode layer as you suggest.
Currently we are using 4.3 and moving to 4.5, so we will have to figure
out the differences.
Post by Terry Lambert
You can't really avoid *all* the "avoidable overhead", without
restructuring the VOP_ADVLOCK interface, which is politically
difficult.
I wouldn't want to try. Too much code to change and too much chance of a
massive screw-up.

Thanks for persevering with me.

Regards
-----
Richard Sharpe, ***@ns.aus.com, ***@samba.org,
***@ethereal.com


Andrey Alekseyev
2002-05-15 07:10:54 UTC
Btw, Terry's implementation of this, ported to 4.5-STABLE, can be found here:

http://www.blackflag.ru/patches/nfs-client-and-server-locking-4.5-STABLE-20020312.diff

I've been testing it continuously for a month or so with an NFS server
on Solaris. In particular, that was a combination of the Connectathon NFS
Testsuite and several hundred Perl scripts doing flock on remote and
local files. So far I have found no problems with it.

:)

JFYI
Post by Terry Lambert
I've actually wanted VOP_ADVLOCK to be veto-based for going
on 6 years now, to avoid precisely the type of problem you are
now facing. If the upper layer code did local assertion on vnodes,
and called the lower layer code only in the success cases, then the
implementation would actually be done for you already.
-- Terry
--
Andrey Alekseyev. Zenon N.S.P.
Senior Unix systems administrator
