Stale memory during post fork cow pmap update
Elliott.Rabe at dell.com
Sat Feb 10 05:13:24 UTC 2018
Greetings-
I've been hunting for the root cause of elusive, subtle memory
corruption in a large, complex process that manages many threads. All
failures and experimentation thus far have been on x86_64 architecture
machines, and pmap_pcid is not in use.
I believe I have stumbled into a very unlikely race condition in the way
the vm code updates the pmap during write fault processing following a
fork of the process. In this situation, when the process is forked,
appropriate vm entries are marked copy-on-write. One such entry
allocated by static process initialization is frequently used by many
threads in the process. This makes it a prime candidate to write-fault
shortly after a fork system call is made. In this scenario, such a
fault normally burdens the faulting thread with the task of allocating a
new page, entering the page as part of managed memory, and updating the
pmap with the new physical address and the change to writeable status.
This action is followed with an invalidation of the TLB on the current
CPU, and in this case is also followed by IPI_INVLPG IPIs to do the same
on other CPUs (there are often many active threads in this process).
Before this remote TLB invalidation has completed, other CPUs are free
to act on either the old OR new page characteristics. If other threads
are alive and using contents of the faulting page on other CPUs, bad
things can occur.
In one simplified and somewhat contrived example, one thread attempts to
write to a location on the faulting page under the protection of a lock
while another thread attempts to read from the same location twice in
succession under the protection of the same lock. If the writing
thread and the reading thread are running on different CPUs, and the
write is directed to the new physical address, the two reads may come
from different physical addresses if the TLB invalidation on the
reader's CPU occurs between them.
This seemingly violates the guarantees provided by the locking
primitives and can result in subtle memory corruption symptoms.
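
To make the interleaving above concrete, here is a minimal, deterministic
sketch in user-space C. It is a toy model, not kernel code: the struct
and every function name (model_init, cow_fault_local, handle_invlpg_ipi,
race_demo) are hypothetical, each "page" holds a single int, a per-CPU
"TLB" is just the index of the page that CPU currently translates to,
and I assume for simplicity that the writer runs on the CPU that took
the fault. The lock handoff is represented only by program order.

```c
#include <assert.h>

/* Toy model (assumption): two physical pages, two CPUs. */
enum { OLD_PAGE = 0, NEW_PAGE = 1, NCPU = 2 };

struct model {
    int phys[2];    /* one word of data on each physical page */
    int pte;        /* page currently named by the page table entry */
    int tlb[NCPU];  /* page each CPU's cached translation points at */
};

static void model_init(struct model *m, int value)
{
    m->phys[OLD_PAGE] = value;
    m->phys[NEW_PAGE] = 0;
    m->pte = OLD_PAGE;
    for (int cpu = 0; cpu < NCPU; cpu++)
        m->tlb[cpu] = OLD_PAGE;
}

/* Write fault on `cpu`: copy the page, repoint the PTE, and invalidate
 * only the local TLB. The remote invalidation is still pending. */
static void cow_fault_local(struct model *m, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    m->phys[NEW_PAGE] = m->phys[OLD_PAGE];
    m->pte = NEW_PAGE;
    m->tlb[cpu] = m->pte;
}

/* A remote CPU services the invalidation IPI and reloads from the PTE. */
static void handle_invlpg_ipi(struct model *m, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    m->tlb[cpu] = m->pte;
}

static void cpu_write(struct model *m, int cpu, int v)
{
    m->phys[m->tlb[cpu]] = v;
}

static int cpu_read(const struct model *m, int cpu)
{
    return m->phys[m->tlb[cpu]];
}

/* Returns 1 if the two reads made by the reader disagree. */
int race_demo(int *first, int *second)
{
    struct model m;

    model_init(&m, 1);
    cow_fault_local(&m, 0);        /* CPU 0 faults; IPI to CPU 1 pending */
    cpu_write(&m, 0, 2);           /* writer (lock held) hits the new page */
    /* lock passes to the reader on CPU 1 */
    *first = cpu_read(&m, 1);      /* stale translation: old page */
    handle_invlpg_ipi(&m, 1);      /* invalidation lands between the reads */
    *second = cpu_read(&m, 1);     /* fresh translation: new page */
    return *first != *second;
}
```

In this model the reader sees the pre-write value on its first read and
the written value on its second, even though no write happened between
them from the reader's point of view.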
It took me quite a while to chase these symptoms from user-space down
into the operating system, and even longer to end up with a stand-alone
test fixture able to reproduce the situation described above on demand.
If I alter the kernel code to perform a two-stage update of the pmap
entry, the observed corruption symptoms disappear. This two-stage
mechanism first installs the new physical address in a read-only state
and invalidates the TLB, and then does a second pmap update and
invalidation to change the status to writeable. The intended effect was
to cause any other threads writing to the faulting page to become
obstructed until the earlier fault is complete, thus eliminating the
possibility of the physical pages having different contents until the
new physical address was fully visible. This is goofy, and from an
efficiency standpoint it is obviously undesirable, but it was the first
thing that came to mind, and it seems to be working fine.
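
The two-stage idea can be sketched in a toy user-space model as well.
This is not the kernel change itself, only an illustration under stated
assumptions: the struct and function names (mapping, shootdown_all,
cpu_try_write, two_stage_demo) are hypothetical, each "page" holds one
int, a per-CPU "TLB" caches a page index plus a writable bit, and a
store through a read-only translation is simply refused where real
hardware would fault.

```c
#include <assert.h>

/* Toy model (assumption): two physical pages, two CPUs. */
enum { OLD_PAGE = 0, NEW_PAGE = 1, NCPU = 2 };

struct mapping { int page; int writable; };

struct model {
    int phys[2];                /* one word of data per physical page */
    struct mapping pte;         /* current page table entry */
    struct mapping tlb[NCPU];   /* per-CPU cached translation */
};

struct result { int write_blocked, read_a, read_b, read_c, read_d; };

static void model_init(struct model *m, int value)
{
    m->phys[OLD_PAGE] = value;
    m->phys[NEW_PAGE] = 0;
    m->pte = (struct mapping){ OLD_PAGE, 0 };   /* COW: read-only */
    for (int cpu = 0; cpu < NCPU; cpu++)
        m->tlb[cpu] = m->pte;
}

/* Models the invalidation having completed on every CPU. */
static void shootdown_all(struct model *m)
{
    for (int cpu = 0; cpu < NCPU; cpu++)
        m->tlb[cpu] = m->pte;
}

/* Returns 1 if the store was performed, 0 if the mapping is read-only. */
static int cpu_try_write(struct model *m, int cpu, int v)
{
    assert(cpu >= 0 && cpu < NCPU);
    if (!m->tlb[cpu].writable)
        return 0;
    m->phys[m->tlb[cpu].page] = v;
    return 1;
}

static int cpu_read(const struct model *m, int cpu)
{
    return m->phys[m->tlb[cpu].page];
}

struct result two_stage_demo(void)
{
    struct model m;
    struct result r;

    model_init(&m, 1);

    /* Stage 1: copy, install the new page read-only, local flush only. */
    m.phys[NEW_PAGE] = m.phys[OLD_PAGE];
    m.pte = (struct mapping){ NEW_PAGE, 0 };
    m.tlb[0] = m.pte;
    r.write_blocked = !cpu_try_write(&m, 0, 2); /* no CPU can write yet */
    r.read_a = cpu_read(&m, 1);                 /* stale: old page */
    shootdown_all(&m);                          /* stage 1 complete */
    r.read_b = cpu_read(&m, 1);                 /* new page, same data */

    /* Stage 2: upgrade to writeable; every CPU already sees the new page. */
    m.pte.writable = 1;
    m.tlb[0] = m.pte;
    cpu_try_write(&m, 0, 2);
    r.read_c = cpu_read(&m, 1);                 /* stale permission only */
    shootdown_all(&m);
    r.read_d = cpu_read(&m, 1);
    return r;
}
```

In this model a stale translation during stage 1 can only reach data
identical to the new page, and a stale translation during stage 2 is
stale in permission only, so each pair of reads agrees.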
I am not terribly familiar with the higher level design here, so it is
unclear to me if this problem is simply a very unlikely race condition
that hasn't yet been diagnosed or if this is instead the breakdown of
some other mechanism of which I am not aware. I would appreciate the
insights of those of you who have more history and experience with this
area of the code.
Thank you for your time!
Elliott Rabe
elliott_rabe at dell.com