Stale memory during post-fork COW pmap update

Elliott.Rabe at dell.com
Sat Feb 10 05:13:24 UTC 2018


Greetings-

I've been hunting for the root cause of elusive, subtle memory 
corruptions in a large, complex process that manages many threads.  All 
failures and experimentation thus far have been on x86_64 machines, and 
pmap_pcid is not in use.

I believe I have stumbled into a very unlikely race condition in the way 
the vm code updates the pmap during write fault processing following a 
fork of the process.  In this situation, when the process is forked, 
appropriate vm entries are marked copy-on-write. One such entry 
allocated by static process initialization is frequently used by many 
threads in the process.  This makes it a prime candidate to write-fault 
shortly after a fork system call is made.  In this scenario, such a 
fault normally burdens the faulting thread with the task of allocating a 
new page, entering the page as part of managed memory, and updating the 
pmap with the new physical address and the change to writeable status.  
This action is followed by an invalidation of the TLB on the current 
CPU and, in this case, by IPI_INVLPG IPIs to do the same on the other 
CPUs (there are often many active threads in this process).  Before 
this remote TLB invalidation has completed, other CPUs are free to act 
on either the old OR the new page characteristics.  If other threads 
are alive and using the contents of the faulting page on other CPUs, 
bad things can occur.
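
To make the window concrete, here is a rough sketch of that 
single-stage update written against the pmap interface.  The helper 
name, the flag usage, and the header list are simplified assumptions on 
my part, and all of the surrounding vm_fault() bookkeeping is omitted; 
it is only meant to mark where the window opens.

/*
 * Rough sketch of the single-stage update described above.  The helper
 * name, the flag usage, and the header list are simplified assumptions;
 * all of the surrounding vm_fault() bookkeeping is omitted.
 */
#include <sys/param.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <vm/vm_page.h>

static void
cow_install_single_stage(pmap_t pmap, vm_offset_t va, vm_page_t old_m,
    vm_page_t new_m)
{
        /* Copy the contents of the page that took the write fault. */
        pmap_copy_page(old_m, new_m);

        /*
         * Replace the old read-only mapping with the new page, writable,
         * in a single step.  As described above, the stale entry is
         * flushed on the current CPU and IPI_INVLPG IPIs go out to the
         * other CPUs; until each remote CPU has handled its IPI, threads
         * there can still translate va to old_m while threads here are
         * already using new_m.
         */
        (void)pmap_enter(pmap, va, new_m, VM_PROT_READ | VM_PROT_WRITE,
            VM_PROT_WRITE, 0);
}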

In one simplified and somewhat contrived example, one thread attempts to 
write to a location on the faulting page under the protection of a lock 
while another thread attempts to read from the same location twice in 
succession under the protection of the same lock.  If the writing 
thread and the reading thread are running on different CPUs, and if the 
write is directed to the new physical address, the two reads may come 
from different physical addresses when a TLB invalidation occurs 
between them.
This seemingly violates the guarantees provided by the locking 
primitives and can result in subtle memory corruption symptoms.
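
The access pattern looks roughly like the user-space sketch below.  It 
is not a reliable reproducer, since hitting the window is entirely a 
matter of timing, and every name in it is illustrative, but it shows 
the shape of the failure: both threads hold the same mutex, yet the two 
reads can still disagree.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long shared;    /* statically allocated; COW after fork() */

static void *
writer(void *arg)
{
        long i;

        (void)arg;
        for (i = 1;; i++) {
                pthread_mutex_lock(&lock);
                shared = i;     /* possibly the first write after a fork(),
                                 * triggering the copy-on-write fault */
                pthread_mutex_unlock(&lock);
        }
}

static void *
reader(void *arg)
{
        long a, b;

        (void)arg;
        for (;;) {
                pthread_mutex_lock(&lock);
                a = shared;
                b = shared;     /* same lock hold: a and b must match */
                pthread_mutex_unlock(&lock);
                if (a != b) {
                        fprintf(stderr, "torn read: %ld != %ld\n", a, b);
                        abort();
                }
        }
}

int
main(void)
{
        pthread_t w, r;

        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);

        /* Fork repeatedly so the page holding 'shared' keeps going COW. */
        for (;;) {
                pid_t pid = fork();
                if (pid == 0)
                        _exit(0);       /* child does nothing */
                waitpid(pid, NULL, 0);
                usleep(1000);
        }
}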

It took me quite a while to chase these symptoms from user-space down 
into the operating system, and even longer to end up with a stand-alone 
test fixture able to reproduce the situation described above on demand.  
If I alter the kernel code to perform a two-stage update of the pmap 
entry, the observed corruption symptoms disappear.  This two-stage 
mechanism updates and invalidates the new physical address in a 
read-only state first, and then does a second pmap update and 
invalidation to change the status to writeable.  The intended effect 
is to block any other threads writing to the faulting page until the 
earlier fault is complete, eliminating the possibility of the two 
physical pages having different contents before the new physical 
address is fully visible.  This is goofy, and from an efficiency 
standpoint it is obviously undesirable, but it was the first thing that 
came to mind, and it seems to be working fine.
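
In code, the workaround amounts to something like the sketch below.  It 
is again a simplification using the same interfaces as the earlier 
fragment rather than the actual patch, and the explicit 
pmap_invalidate_page() calls (the amd64 interface) may well be 
redundant with work pmap_enter() already does, but they make the two 
stages explicit.

/*
 * Two-stage variant of the earlier sketch: publish the new physical
 * page read-only first, shoot down the stale translations, and only
 * then grant write permission, so concurrent writers fault and wait
 * instead of racing the first invalidation.
 */
static void
cow_install_two_stage(pmap_t pmap, vm_offset_t va, vm_page_t new_m)
{
        /* Stage 1: enter the new physical page read-only. */
        (void)pmap_enter(pmap, va, new_m, VM_PROT_READ, VM_PROT_READ, 0);
        pmap_invalidate_page(pmap, va);

        /* Stage 2: upgrade the same mapping to writeable. */
        (void)pmap_enter(pmap, va, new_m, VM_PROT_READ | VM_PROT_WRITE,
            VM_PROT_WRITE, 0);
        pmap_invalidate_page(pmap, va);
}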

I am not terribly familiar with the higher-level design here, so it is 
unclear to me if this problem is simply a very unlikely race condition 
that hasn't yet been diagnosed or if this is instead the breakdown of 
some other mechanism of which I am not aware.  I would appreciate the 
insights of those of you who have more history and experience with this 
area of the code.

Thank you for your time!

Elliott Rabe
elliott_rabe at dell.com

