Stale memory during post fork cow pmap update

Sat Feb 10 21:57:44 UTC 2018

On 02/10/2018 05:18 AM, Konstantin Belousov wrote:
> On Sat, Feb 10, 2018 at 05:12:11AM +0000, Elliott.Rabe at dell.com wrote:
>> Greetings-
>>
>> I've been hunting for the root cause of elusive, slight memory
>> corruptions in a large, complex process that manages many threads. All
>> failures and experimentation thus far has been on x86_64 architecture
>> machines, and pmap_pcid is not in use.
>>
>> I believe I have stumbled into a very unlikely race condition in the way
>> the vm code updates the pmap during write fault processing following a
>> fork of the process.  In this situation, when the process is forked,
>> appropriate vm entries are marked copy-on-write. One such entry
>> allocated by static process initialization is frequently used by many
>> threads in the process.  This makes it a prime candidate to write-fault
>> shortly after a fork system call is made.  In this scenario, such a
>> fault normally burdens the faulting thread with the task of allocating a
>> new page, entering the page as part of managed memory, and updating the
>> pmap with the new physical address and the change to writeable status.
>> This action is followed with an invalidation of the TLB on the current
>> CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same
>> on other CPUs (there are often many active threads in this process).
>> Before this remote TLB invalidation has completed, other CPUs are free
>> to act on either the old OR new page characteristics.  If other threads
>> are alive and using contents of the faulting page on other CPUs, bad
>> things can occur.
>>
>> In one simplified and somewhat contrived example, one thread attempts to
>> write to a location on the faulting page under the protection of a lock
>> while another thread attempts to read from the same location twice in
>> succession under the protection of the same lock.  If both the writing
>> thread and reading thread are running on different CPUs, and if the
>> write is directed to the new physical address, the reads may come from
>> different physical addresses if a TLB invalidation occurs between them.
>> This seemingly violates the guarantees provided by the locking
>> primitives and can result in subtle memory corruption symptoms.
>>
>> It took me quite a while to chase these symptoms from user-space down
>> into the operating system, and even longer to end up with a stand-alone
>> test fixture able to reproduce the situation described above on demand.
>> If I alter the kernel code to perform a two-stage update of the pmap
>> entry, the observed corruption symptoms disappear.  This two-stage
>> mechanism updates and invalidates the new physical address in a
>> read-only state first, and then does a second pmap update and
>> invalidation to change the status to writeable.  The intended effect was
>> to cause any other threads writing to the faulting page to become
>> obstructed until the earlier fault is complete, thus eliminating the
>> possibility of the physical pages having different contents until the
>> new physical address was fully visible.  This is goofy, and from an
>> efficiency standpoint it is obviously undesirable, but it was the first
>> thing that came to mind, and it seems to be working fine.
>>
>> I am not terribly familliar with the higher level design here, so it is
>> unclear to me if this problem is simply a very unlikely race condition
>> that hasn't yet been diagnosed or if this is instead the breakdown of
>> some other mechanism of which I am not aware.  I would appreciate the
>> insights of those of you who have more history and experience with this
>> area of the code.
> You are right describing the operation of the memory copy on fork. But I
> cannot understand what parts of it, exactly, are problematic, from your
> description. It is necessary for you to provide the test and provide
> some kind of the test trace or the output which illustrates the issue
> you found.
>
> Do you mean something like that:
> - after fork
> - thread A writes into the page, causing page fault and COW because the
>    entry has write permissions removed
> - thread B reads from the page, and since invalidation IPI was not yet
>    delivered, it reads from the need-copy page, effectively seeing the
>    old content, before thread A write
>
> And you claim is that you can create a situation where both threads A
> and B owns the same lock around the write and read ?  I do not understand
> this, since if thread A owns a (usermode) lock, it prevents thread B from
> taking the same lock until fault is fully handled, in particular, the IPI
> is delivered.

Here is the sequence of actions I am referring to.  There is only one 
lock, and all the writes/reads are on one logical page.

+The process is forked transitioning a map entry to COW
+Thread A writes to a page on the map entry, faults, updates the pmap to 
writable at a new phys addr, and starts TLB invalidations...
+Thread B acquires a lock, writes to a location on the new phys addr, 
and releases the lock
+Thread C acquires the lock, reads from the location on the old phys addr...
+Thread A ...continues the TLB invalidations which are completed
+Thread C ...reads from the location on the new phys addr, and releases 
the lock

In this example Thread B and C [lock, use and unlock] properly and 
neither own the lock at the same time.  Thread A was writing somewhere 
else on the page and so never had/needed the lock.  Thread B sees a 
location that is only ever read|modified under a lock change beneath it 
while it is the lock owner.

I will get a test patch together and make it available as soon as I can.