NFS locking: lockf freezes (rpc.lockd problem?)

Michael Abbott michael at araneidae.co.uk
Mon Aug 28 09:48:54 UTC 2006


Well, the result I have to report is so interesting, I'll give the 
executive summary right away:

 	If rpc.lockd is run with -d255 it works perfectly
 	If rpc.lockd is run with -d1 (or no -d option) it locks up

Sweet.

Does anyone out there who understands rpc.lockd fancy a deeper look?

On Sun, 27 Aug 2006, Greg Byshenk wrote:
> The problem here is that the process is waiting for something, and
> thus not listening to signals (including your 'kill').
>
> I'm not an expert on this, but my first guess would be that saturn (your
> server) is offering something that it can't deliver.  That is, the client
> asks the server "can you do X?", and the server says "yes I can", so the
> client says "do X" and waits -- and the server never does it.

I buy that analysis.

However, I think that the client's (venus) behaviour is unreasonable, and 
there must be some problem at the driver level: unkillable processes? 
(Tries to bang the drum about this!)

Interesting: it looks like `nfslocking stop` releases the processes.

> Or alternatively (based on your rpc.statd dying), rpc.lockd on your
> client is trying to use rpc.statd to communicate with your server.  And
> it starts successfully, but then rpc.statd dies (for some reason) and
> your lock ends up waiting forever for it to answer.

Not quite: it was the first instantiation of rpc.lockd that went away, 
almost as if it was just waiting for something to happen!  However, it 
doesn't do this on the simple one-line test, so yes, I think there's 
something here to investigate.

Definitely: after running the test with no failures (see below), both 
lockd instances are still up and about.

> I would recommend starting both rpc.lockd and rpc.statd with the '-d'
> flag, to see if this provides any information as to what is going on.
> There may well be a bug somewhere, but you need to find where it is.
> I suspect that it is not actually in rpc.statd, as nothing in the
> source has changed since January 2005.

Ok, I'll try that.  I'll start with -d1 and see what I get.

venus# /etc/rc.d/nfslocking stop
venus# /etc/rc.d/nfslocking start
Oddly, running restart only restarts statd, not lockd.  Strange naming 
convention for the rc.conf variables, too -- I have to put the following in 
rc.conf:
 	rpc_lockd_enable=YES
 	lockd_flags=-d1
 	rpc_statd_enable=YES
 	statd_flags=-d
Hmm.  Not terribly consistent.
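
For what it's worth, a quick sanity check that both ends actually register 
the lock manager and status monitor would be something like this (assuming 
saturn resolves from venus; I haven't leant on this heavily):

venus# rpcinfo -p localhost | egrep 'nlockmgr|status'
venus# rpcinfo -p saturn | egrep 'nlockmgr|status'

If nlockmgr or status were missing on either side, the client's NLM calls 
would have nowhere to go, which would fit Greg's "offers something it can't 
deliver" theory.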

Ok, let's see what a single lockf exchange looks like:

venus# mount saturn:$EXPORT /mnt
venus# lockf /mnt/test ls /mnt
test
venus# tail -n3 /var/log/debug.log
Aug 28 08:52:44 venus rpc.statd: unmon_all for host: NFS NLM prog: 0 ver: 0 proc: 0
Aug 28 08:54:19 venus rpc.lockd: nlm_lock_res from saturn.araneidae.co.uk
Aug 28 08:54:19 venus rpc.lockd: nlm_unlock_res from saturn.araneidae.co.uk

Good.  Now let's run the test:
venus# cd /usr/src; make installworld DESTDIR=/mnt
and, at the same time, in another shell:
venus# tail -f /var/log/debug.log

Well, that's odd.  I get five nlm_lock_res/nlm_unlock_res pairs, with the 
last three in less than a second... and then nothing: the last, blocking, 
lockf doesn't generate any messages at all!
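
If anyone wants to provoke the blocking case without a full installworld, I 
think the same thing can be done by hand -- hold the lock in one shell and 
contend for it in another (a sketch, not something I've exercised as 
thoroughly as the build):

venus# lockf -k /mnt/test sleep 60 &	# hold the lock on the NFS mount
venus# lockf -t 0 /mnt/test echo ok	# non-blocking attempt: should fail at once
venus# lockf /mnt/test echo ok		# blocking attempt: the case that hangs here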

Interesting: stopping the lock daemon, by running `/etc/rc.d/nfslocking 
stop`, releases the lock!  Good.  Now I can run the test again with more 
logging (I'll set lockd_flags=-d255, though a quick grep of the source 
suggests that 6 would suffice).
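
Concretely, and assuming the same rc.conf knobs as above, that just means 
changing the flags line in rc.conf and bouncing the daemons:

 	lockd_flags=-d255
venus# /etc/rc.d/nfslocking stop
venus# /etc/rc.d/nfslocking start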

Hmmm.  It's working perfectly now!  Well well well.  What are the 
differences?
  1.  I'm rerunning the test after restarting the lock and stat daemons 
without an intervening reboot.
  2.  I'm running lockd with maximum debugging.
  3.  I'm running the test in a different virtual console (I think we can 
ignore that difference!)

Fantastic: it ran to completion without a fault!  Wow.  I'll reboot (keep 
the same debug level) and try again...

Astounding!  Watch carefully:

1. Reboot
2. Login on virtual consoles 1 & 2
3. On console 2 run
venus# tail -f /var/log/debug.log
4. On console 1 run
venus# mount saturn:$EXPORT /mnt
venus# rm -rf /mnt/*
venus# cd /usr/src; make installworld DESTDIR=/mnt
5. Switch to console 2 and watch the console and the PC's activity lights. 
Lots and lots of network activity (well, there's a surprise), and the 
occasional flurry of rpc.lockd messages.
6. When the lights stop flashing (five minutes or so) switch back to 
console 1 and see that everything ran perfectly.

Runs ok.

Well.  I can think of two possibilities:
1. High levels of logging do more than just logging: there's an 
inadvertent side effect.
2. There's a tight timing issue that is changed by the extra delays 
introduced by logging.

On the whole I buy 2, and it's not altogether clear whether the issue is 
on the client or the server.  Hmm.  Maybe I need to try Kostik Belousov's 
suggestion of running tcpdump.  Another message for that...
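
Presumably something along these lines would do -- capture everything 
between the two hosts so the dynamically assigned NLM ports aren't missed 
(untested here, and the interface name is a guess):

venus# tcpdump -i fxp0 -s 0 -w /var/tmp/nlm.pcap host saturn
venus# tcpdump -r /var/tmp/nlm.pcap	# read the capture back later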

> An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
> and then try again.

This machine is so painfully slow that I'll probably have to do that 
overnight, and then I'm out of time until next weekend.  I've just installed 
cvsup (from ports; oops, my mistake -- I'd forgotten about having to build 
ezm3!).
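
For my own reference, the supfile would be something along these lines, 
saved as, say, stable-supfile (the host and tag are guesses to be adjusted):

 	*default host=cvsup.FreeBSD.org
 	*default base=/var/db
 	*default prefix=/usr
 	*default release=cvs tag=RELENG_6_1
 	*default delete use-rel-suffix compress
 	src-all

venus# cvsup -g -L 2 stable-supfile

followed by the usual buildworld / installworld cycle.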

