zfs deadlock on r360452 relating to busy vm page

Thu May 14 15:26:47 UTC 2020

On Thu, May 14, 2020 at 02:16:15PM +0300, Andriy Gapon wrote:
> On 13/05/2020 17:42, Mark Johnston wrote:
> > On Wed, May 13, 2020 at 10:45:24AM +0300, Andriy Gapon wrote:
> >> On 13/05/2020 10:35, Andriy Gapon wrote:
> >>> In r329363 I re-worked zfs_getpages and introduced range locking to it.
> >>> At the time I believed that it was safe and maybe it was, please see the commit
> >>> message.
> >>> There, indeed, have been many performance / concurrency improvements to the VM
> >>> system and r358443 is one of them.
> >>
> >> Thinking more about it, it could be r352176.
> >> I think that vm_page_grab_valid (and later vm_page_grab_valid_unlocked) are not
> >> equivalent to the code that they replaced.
> >> The original code would check valid field before any locking and it would
> >> attempt any locking / busing if a page is invalid.  The object was required to
> >> be locked though.
> >> The new code tries to busy the page in any case.
> >>
> >>> I am not sure how to resolve the problem best.  Maybe someone who knows the
> >>> latest VM code better than me can comment on my assumptions stated in the commit
> >>> message.
> > 
> > The general trend has been to use the page busy lock as the single point
> > of synchronization for per-page state.  As you noted, updates to the
> > valid bits were previously interlocked by the object lock, but this is
> > coarse-grained and hurts concurrency.  I think you are right that the
> > range locking in getpages was ok before the recent change, but it seems
> > preferable to try and address this in ZFS.
> > 
> >>> In illumos (and, I think, in OpenZFS/ZoL) they don't have the range locking in
> >>> this corner of the code because of a similar deadlock a long time ago.
> > 
> > Do they just not implement readahead?
> 
> I think so, but not 100% sure.
> I recall seeing a comment in illumos code that they do not care about read-ahead
> because there is ZFS prefetch and the data will be cached in ARC.  That makes
> sense from the I/O point of view, but it does not help with page faults.
> 
> > Can you explain exactly what the
> > range lock accomplishes here - is it entirely to ensure that znode block
> > size remains stable?
> 
> As far as I can recall, this is the reason indeed.

It seems to me that zfs_getpages() could use a non-blocking
rangelock_enter() operation to avoid the deadlock.  The ZFS rangelock
implementation doesn't have one, but it is easy to add.  I'm not able to
trigger the deadlock with this patch: https://reviews.freebsd.org/D24839