High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release
Niels Kobschätzki
niels at kobschaetzki.net
Sat Apr 14 05:14:29 UTC 2018
On 04/14/2018 03:49 AM, Rick Macklem wrote:
> Niels Kobschätzki wrote:
>> sorry for the cross-posting but so far I had no real luck on the forum
>> or on question, thus I want to try my luck here as well.
> I read email lists but don't do the other stuff, so I just saw this yesterday.
> Short answer: I haven't a clue why the cache hit rate would have changed.
>
> The code that decides if there is a hit/miss for the attribute cache is in
> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
> except the old code did a mtx_lock(&Giant), but I can't imagine how that
> would affect the code.
>
> You might want to:
> # sysctl -a | fgrep vfs.nfs
> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
> been changed. (I don't recall any being changed, but??)
I did that, and nothing has changed.
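
A minimal sketch of how the two sysctl dumps could be compared mechanically, assuming the output of the command above was saved on each system and uses the usual "name: value" line format. The helper names and the sample values are made up for illustration; only the comparison technique is the point.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl, into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that do not have a 'name: value' shape
            settings[name.strip()] = value.strip()
    return settings


def diff_settings(old, new):
    """Return {name: (old_value, new_value)} for every setting that differs."""
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Invented sample values, for illustration only (not taken from the real hosts).
old_text = "vfs.nfs.access_cache_timeout: 60\nvfs.nfs.prime_access_cache: 0\n"
new_text = "vfs.nfs.access_cache_timeout: 60\nvfs.nfs.prime_access_cache: 1\n"

changes = diff_settings(parse_sysctl(old_text), parse_sysctl(new_text))
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

A setting that exists on only one of the two systems shows up with None on the other side, so added or removed sysctls are reported as well.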
> You could go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
> top, where it calculates "timeo" from it.
> Running this hacked kernel might show you if either of these fields is bogus.
> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
> clause that increments "attrcache_misses", which is where the cache misses
> happen to see why it is missing the cache.)
> If you could do this for the 10.3 kernel as well, this might indicate why the
> miss rate has increased?
I will do this next week. On Monday we are switching to other NFS
servers for unrelated reasons, and once we see that they run stably, I
will do this next.
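
To make it clearer what the suggested printf() output would be checked against, here is a simplified user-space model of the kind of decision described above. It is a sketch under assumptions, not the kernel code: it assumes the timeout is one tenth of the time since the last modification, clamped between a minimum and a maximum (3 and 60 seconds are assumed defaults here), and that a miss is counted once the cached attributes are at least that old. The real function has additional conditions (for example np->n_attrtimeo) that are left out.

```python
def attr_timeout(now, mtime, acmin=3, acmax=60, modified=False):
    """Attribute cache timeout in seconds under the assumed heuristic."""
    timeo = (now - mtime) // 10  # one tenth of the file's age
    if modified or timeo < acmin:
        return acmin  # recently changed files get the shortest timeout
    if timeo > acmax:
        return acmax
    return timeo


def is_cache_miss(now, attrstamp, timeo):
    """True when the cached attributes are at least 'timeo' seconds old."""
    return (now - attrstamp) >= timeo


now = 1_000_000
cases = [
    ("modified 100 s ago", now - 100),
    ("modified long ago", now - 1_000_000),
    ("bogus mtime in the future", now + 5_000),
]
for label, mtime in cases:
    print(f"{label}: timeout {attr_timeout(now, mtime)} s")
```

In this model a bogus modification time that lies in the future always yields the minimum timeout, so attributes would expire as quickly as possible. That is one way a wrong "np->n_mtime.tv_sec" could raise the miss rate, which is what the printf() output would confirm or rule out.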
By the way, I have now calculated the percentages. The old servers had
an attr miss rate of something like 0.004%, while the upgraded one has
more like 2.7%. This is still low from what I've read (I remember that
you should start adjusting acreg* when you hit more than 40% misses),
but far higher than before.
The output of nfsstat -c on one of the working servers looks like this
(I ran it with -cz beforehand to reset the counters and took this sample
a couple of seconds later):
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
 10085375       255   9163995       577       540         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
     1380         0         0         0         0         0   9169427       277
and for the non-working server:
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
  1606365     20647   1418205       239       581         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
      895         0         0         0         0         0   1439080       337
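
For reference, a small sketch of how an attribute miss rate can be derived from the "Attr Hits" and "Misses" counters in the two samples above, taking the rate as misses divided by hits plus misses. The function name is made up for illustration. These samples cover only a short window after a counter reset, so the results need not match the longer-running percentages quoted earlier.

```python
def miss_rate(hits, misses):
    """Miss rate in percent: misses as a share of all lookups."""
    total = hits + misses
    return 100.0 * misses / total if total else 0.0


# "Attr Hits" and "Misses" counters copied from the two nfsstat -c samples.
samples = {
    "working server": (10085375, 255),
    "non-working server": (1606365, 20647),
}

for name, (hits, misses) in samples.items():
    print(f"{name}: {miss_rate(hits, misses):.4f}% attr misses")
```

The same function works for the other hit and miss pairs (Lkup, BioR, Accs and so on), which makes it easy to see whether only the attribute cache is affected.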
>> I upgraded a machine from 10.3-Prerelease (custom kernel with
>> tcp_fastopen added) to 11.1-Release (standard kernel) with
>> freebsd-update. I have two other machines that are still on
>> 10.3-Prerelease. Those machines mount an NFS-export from a
>> Linux-NFS-server and use NFSv3. The machine that got upgraded now shows
>> far more cache misses for getattr than the 10.3-machines (we are talking
>> about a factor of 100) in munin. munin also shows a lot more cache-misses
>> for other metrics like biow, biorl, biod (where can I find what those
>> metrics mean…currently I do not even have an understanding of what they
>> are) etc.
>>
>> Can anybody help me debug this problem, or does anyone have an idea what
>> could cause it? The result of this behavior is that this machine
>> performs worse than the others, and I cannot upgrade the other machines
>> until I have fixed this bug.
> I haven't run a 10.x system in quite a while. When I get home in a few days,
> I might be able to reproduce this. If I can, I can poke at it, but it would be at
> least a week before I might have an answer and I may not figure it out for a
> long time.
Ok, thanks a lot. That would be great.
Niels