Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad

From: Kyle Evans <kevans_at_FreeBSD.org>
Date: Fri, 01 Sep 2023 15:46:00 UTC
On 9/1/23 08:41, Alexander Motin wrote:
> On 31.08.2023 22:18, Kyle Evans wrote:
>> It seems to have clearly been stomped on by uma trashing. Encountered 
>> while running a pkgbase build, I think while it was in the packaging 
>> phase. I note in particular in that frame:
>>
>> (kgdb) p/x lwb->lwb_issued_timestamp
>> $4 = 0xdeadc0dedeadc0de
>>
>> So I guess it was freed sometime during one of the previous two 
>> zio_nowait() calls.
> 
> Thank you, Kyle.  If the source lines are resolved correctly and it 
> really crashes on lwb_child_zio access, then I do see there a possible 
> race condition, even though I think it would involve at least 2 or may 
> be even 3 different threads.
> 

Oh, sorry- yes, it was the access to lwb_child_zio there.


> I've just created this new PR to address it:
> https://github.com/openzfs/zfs/pull/15233
> 
> If you'll be able to test it, include also the two previous:
> https://github.com/openzfs/zfs/pull/15227
> https://github.com/openzfs/zfs/pull/15228
> 
> Thank you for something actionable, it really feels much better! :)
> 

Perfect, thanks! I haven't been able to reproduce it since the first 
time, but your explanation sounds plausible to me.

I'm not a ZFS developer, but it's not clear to me how I didn't end up 
tripping over other assertions, though; e.g., in zil_lwb_flush_vdevs_done:

1442         ASSERT3S(lwb->lwb_state, ==, LWB_STATE_WRITE_DONE); 

1443         lwb->lwb_state = LWB_STATE_FLUSH_DONE; 


lwb_state seems to only be set to LWB_STATE_WRITE_DONE in 
zil_lwb_write_done (lwb_write_zio's completion routine). I would've 
thought all three of these were executed synchronously in 
__zio_execute(), which would presumably put us in LWB_STATE_ISSUED at 
the time of completing the lwb_root_zio?

Thanks,

Kyle Evans