when has a pNFS data server failed?
Ronald Klop
ronald-lists at klop.ws
Tue Aug 22 09:26:43 UTC 2017
On Fri, 18 Aug 2017 23:52:12 +0200, Rick Macklem <rmacklem at uoguelph.ca>
wrote:
> This is kind of a "big picture" question that I thought I'd throw out.
>
> As a brief background, I now have the code for running mirrored pNFS Data
> Servers working for normal operation. You can look at:
> http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
> if you are interested in details related to the pNFS server code/testing.
>
> So, now I am facing the interesting part:
> 1 - The Metadata Server (MDS) needs to decide that a mirrored DS has failed
>     at some point. Once that happens, it stops using the DS, etc.
>     --> This brings me to the question of "when should the MDS decide that
>         the DS has failed and should be taken offline?".
>     - I'm not up to date w.r.t. the TCP stack, so I'm not sure how long it
>       will take for the TCP connection to decide that a DS server is no
>       longer working and fail the TCP connection. I think it takes a fair
>       amount of time, so I'm not sure if TCP connection loss is a good
>       indicator of DS server failure or not?
>     - It seems to me that the MDS should wait a fairly long time before
>       failing the DS, since this will have a major impact on the pNFS
>       server, requiring repair/resilvering by a sysadmin once it happens.
> So, any comments or thoughts on this? rick
Hi,
This is quite a common problem for all clustered/connected systems. I
think there is no general answer, and a lot of papers have been written
about it.
For example, NFS has the 'soft' mount option. It is recommended not to
use it. I can imagine why if your home directory or /usr is mounted over
NFS, but at work I want my HTTP servers to not hang and just return an
I/O error when the backend file server holding the data is gone.
Something similar happens here.
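To make the trade-off a bit more concrete, here is a minimal sketch of a
timeout-based failure detector, written in Python only for brevity. It is
not how the pNFS server code works; the class name, the two thresholds and
their values are all made up for illustration. The idea is just that a DS
that has been silent for a short while is only "suspect", and it is taken
offline only after a much longer silence, because that step needs a repair
by a sysadmin afterwards.

```python
from dataclasses import dataclass


@dataclass
class DSHealth:
    """Hypothetical view of one data server (DS) as seen from the MDS."""
    suspect_after: float   # seconds of silence before the DS is "suspect"
    fail_after: float      # seconds of silence before the DS is taken offline
    last_ok: float = 0.0   # timestamp of the last successful reply
    failed: bool = False   # sticky: a failed DS stays offline until repaired

    def record_reply(self, now: float) -> None:
        # Replies are ignored once the DS has been failed: its mirror is
        # stale by then and has to be resilvered before it can be used.
        if not self.failed:
            self.last_ok = now

    def state(self, now: float) -> str:
        if self.failed:
            return "failed"
        silent = now - self.last_ok
        if silent >= self.fail_after:
            self.failed = True      # the expensive, one-way decision
            return "failed"
        if silent >= self.suspect_after:
            return "suspect"        # cheap: retry, log, but keep using it
        return "ok"

    def repaired(self, now: float) -> None:
        # Called by the (hypothetical) sysadmin tooling after resilvering.
        self.failed = False
        self.last_ok = now
```

With suspect_after=5 and fail_after=60, a DS that goes quiet for 10 seconds
and then answers again goes ok -> suspect -> ok without any repair, while a
DS that stays quiet for a full minute is failed and stays failed until
repaired() is called. Picking the actual numbers is exactly the open
question.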
Doesn't the protocol definition say something about this? Or what do other
implementations do?
Regards,
Ronald.