Re: madvise(MADV_FREE) doesn't work in some cases?
Date: Mon, 05 Jul 2021 18:54:58 UTC
On Mon, Jul 05, 2021 at 07:32:00PM +0300, Vitaliy Gusev wrote:
> Hi,
>
> > > Does it mean madvise() doesn't work well in FreeBSD or test does something wrong?
> >
> > Your program does not exactly what you described above. There is a generic
> > race to consume memory, and some specific details about madvise(2) on FreeBSD.
> >
> > From the code, you do:
> > - mmap anonymous private region
> > - fork
> > - both child and parent start touching the mmaped region.
> >
> > Two processes race to consume 1/2 of RAM on your system. If one of
> > them happen to execute faster then another, you do get to the case where
> > one of them does madvise(). But it could be that processes execute in
> > lockstep, and try to eat all the memory before going to madvise().
> > Did you excluded this case?
> I believe I did all things right. You can see sleeps that serialise execution. To check again I modified test and added time printing and use MADV_DONTNEED:
>
> Here is source http://cpp.sh/2rd4f
>
> I’ve run:
>
> $ ./mmapfork 2300
> mmap 0x801000000 pid 40628
> end 0x890c00000 len 0x8fc00000
> pid 40628
> pid 40629
> 40629: [1625500831] touch
> 40629: [1625500832] sleep before madvise
> 40629: [1625500833] madvise
> 40629: [1625500834] Press enter to exit
> 40628: [1625500845] touch
> 40628: [1625500846] sleep before madvise
> 40628: [1625500851] madvise
> 40628: [1625500852] Press enter to exit
>
> And you can see that child started running in 11 seconds after parent had already called madvise() for all scope of touched memory.
>
> And finally in dmesg:
>
> pid 40629 (mmapfork), jid 0, uid 1001, was killed: out of swap space
>
> So the same result as I wrote in the first email.
>
> > Now, about the specific of madvise(MADV_FREE) on FreeBSD. Due to the way
> > CoW is implemented with the shadow chain of objects, we cannot drop the
> > top of the shadow chain, otherwise instead of returning zeroed pages next
> > time, we would return content back in the time. It was relatively recent
> > discovery, see bf5661f4a1af6931ec4b6, PR 240061.
> >
> Thanks, I will look at it.
> > To explain it in simplified form, when there is potential old content
> > under the CoW copy for the mapping, we cannot drop CoW-ed pages. This
> > is the motivation why madvise(MADV_FREE) does nothing for your program.
> > When you run two instances without fork, there is no previous content
> > and no Cow, so madvise() can safely remove the pages from the object,
> > and on the next access they are zero-filled.
>
> Do I understand right, that it should work with MADV_DONTNEED? But "dontneed" variant doesn’t work.
MADV_DONTNEED does not allow the system to free pages at all. It only means
that the pages are less useful and can be paged out with higher priority.

> >
> > You can read more details in the referenced commit, as well as some musings
> > about way to make it somewhat better.
> >
> > I must say, that trying to allocated 1/2 + 1/2 of RAM this way, on a system
> > without swap, is the way to ask for troubles anyway.
> I’ve just notify that other operation systems work well with that, whereas FreeBSD has troubles. Probably something in madvise() is not finished ?
Well, yes, as I said, non-trivial shadow chains are not handled for MADV_FREE
due to the 'old content revival' bug.

For your specific case, the following patch might help (modulo bugs).
But it is very specific to your example: for instance, it would not work
if you mark only some significant part of the mapped area as _FREE instead
of the whole of it. We would need to start fragmenting the map to handle
such partial madvises better.

commit 0392eb3c93b7dacc31dbdf8ec2fc40fa5ba67c62
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Mon Jul 5 21:53:22 2021 +0300

    madvise(MADV_FREE): try harder to handle shadow chain

    In particular, collapse top object and see if there is no backing
    object after, which means that we would not revert to older content
    if drop the top object.

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 1ac4ccf72f11..80abac223f29 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -3033,6 +3033,7 @@ vm_map_madvise(
 			entry = vm_map_entry_succ(entry);
 		for (; entry->start < end;
 		    entry = vm_map_entry_succ(entry)) {
+			vm_object_t obj;
 			vm_offset_t useEnd, useStart;
 
 			if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) != 0)
@@ -3046,9 +3047,16 @@ vm_map_madvise(
 			 * backing object can change.
 			 */
 			if (behav == MADV_FREE &&
-			    entry->object.vm_object != NULL &&
-			    entry->object.vm_object->backing_object != NULL)
-				continue;
+			    (obj = entry->object.vm_object) != NULL &&
+			    obj->backing_object != NULL) {
+				VM_OBJECT_WLOCK(obj);
+				if ((obj->flags & OBJ_DEAD) != 0)
+					continue;
+				vm_object_collapse(obj);
+				VM_OBJECT_WUNLOCK(obj);
+				if (obj->backing_object != NULL)
+					continue;
+			}
 
 			pstart = OFF_TO_IDX(entry->offset);
 			pend = pstart + atop(entry->end - entry->start);
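
For anyone who wants to try the call sequence discussed in this thread (mmap an anonymous private region, fork, touch, madvise) without the original test source, here is a minimal userland sketch in C. It is not the mmapfork program from the report: the function names and the page-count parameter are made up for illustration, and the child exits before the parent starts touching, so it serialises the two processes but does not recreate the memory pressure. A zero return only means the kernel accepted the advice, not that any page was actually freed.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON and madvise() on glibc; not needed on FreeBSD */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: dirty every page of the region, then advise the kernel. */
static int
touch_and_advise(char *p, size_t len, int advice)
{
	memset(p, 0xA5, len);	/* write faults produce the private (CoW) copies */
	return (madvise(p, len, advice));
}

/*
 * Hypothetical reproduction of the sequence from the thread: mmap an
 * anonymous private region, fork, then let the child and the parent touch
 * and advise the whole region, one after the other.
 * Returns 0 if every call succeeded, -1 otherwise.
 */
int
touch_and_advise_after_fork(size_t npages, int advice)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t len;
	char *p;
	pid_t pid;
	int rc, status;

	assert(npages > 0);
	if (pagesz <= 0)
		return (-1);
	len = npages * (size_t)pagesz;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return (-1);

	pid = fork();
	if (pid == -1) {
		munmap(p, len);
		return (-1);
	}
	if (pid == 0)	/* child: touch, advise, report through the exit status */
		_exit(touch_and_advise(p, len, advice) == 0 ? 0 : 1);

	/* Parent: wait for the child first so the two never run in lockstep. */
	rc = -1;
	if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
	    WEXITSTATUS(status) == 0)
		rc = touch_and_advise(p, len, advice);
	munmap(p, len);
	return (rc);
}
```

Call it from a small main() as, e.g., touch_and_advise_after_fork(1024, MADV_FREE) and watch resident memory with top(1) or procstat(1) to see whether the pages are really dropped. Passing the advice as a parameter makes it easy to compare MADV_FREE with MADV_DONTNEED on the same kernel.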