The arm64 fork-then-swap-out-then-swap-in failures: a program source for exploring them

[I've identified the code path involved in arm64 small allocations
turning into zeros after a later fork-then-swapout-then-back-in,
specifically the path behind the ongoing RES(ident memory) size
decrease that "top -PCwaopid" shows before the fork/swap sequence.
Hopefully I've also exposed enough related information for someone
who knows what they are doing to get started with a specific
investigation, looking for a fix. I'd like a pine64+ 2GB to be
able to complete buildworld despite the forking and swapping
involved (yep: for a time zero RES(ident memory) for some
processes involved in the build).]

> [I now can: (A) crudely control the number of allocated
> pages that get zeros (that should not). (B) Watch a
> "top -PCwaopid" display and predict if the
> test-architecture will fail or not before the fork()
> or swap-out happens.]
>> Uncommenting/commenting parts of the below program allows
>> exploring the problems with fork-then-swap-out-then-in on
>> arm64.
>> Note: By swap-out I mean that zero RES(ident memory) results,
>>     for the process(es) of interest, as shown by
>>     "top -PCwaopid" .
>> I discovered recently that swapping-out just before the
>> fork() prevents the failure from the swapping after the
>> fork().
>> Note:
>> Without the fork() no problem happens. Without the later
>> swap-out no problem happens. Both are required. But some
>> activities before the fork() or between fork() and the
>> swap-out prevent the failures.
>> Some of the comments are based on a pine64+ 2GB context.
>> I use stress to force swap-outs during some sleeps in
>> the program. See also Bugzilla 217239 and 217138. (I now
>> expect that they have the same cause.)
>> In my environment I've seen the fork-then-swap-out/swap-in
>> failures on a pine64+ 2GB and a rpi3. They are repeatable
>> on both. I do not have access to server-class machines, or
>> any other arm64 machines.
>> // swap_testing5.c
>> // Built via (cc was clang 4.0 in my case):
>> //
>> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
>> // -O0 and -O2 also get the problem.
>> // Note: jemalloc's tcache needs to be enabled to get the failure.
>> //       But FreeBSD can get into a state where /etc/malloc.conf
>> //       -> 'tcache:false' is ineffective. Also: the allocation
>> //       size needs to be sufficiently small (<= SMALL_MAXCLASS)
>> //       to see the problem. Other comments are based on a specific
>> //       context (pine64+ 2GB).
>> #include <signal.h>     // for raise(.), SIGABRT (induce core dump)
>> #include <unistd.h>     // for fork(), sleep(.)
>> #include <sys/types.h>  // for pid_t
>> #include <sys/wait.h>   // for wait(.)
>> extern void test_setup(void);         // Sets up the memory byte patterns.
>> extern void test_check(void);         // Tests the memory byte patterns.
>> extern void memory_willneed(void); // For seeing if
>>                                  // posix_madvise(.,.,POSIX_MADV_WILLNEED)
>>                                  // makes a difference.
>> int main(void) {
>>   sleep(30); // Potentially force swap-out here.
>>              // [Swap-out here does not avoid later failures.]
>>   test_setup();
>>   test_check(); // Before potential sleep(.)/swap-out or fork(.) [passes]
>>   sleep(30); // Potentially force swap-out here.
>>              // [Everything below passes if swapped-out here,
>>              //  no matter if there are later swap-outs
>>              //  or not.]
>>   pid_t pid = fork(); // To test no-fork use: = 0; no-fork does not fail.
>>   int wait_status = 0;
>>   // HERE: After fork; before sleep/swap-out/wait.
>>   // if (0 <  pid) memory_willneed(); // Does not prevent either parent or
>>                                    // child failure if enabled.
>>   // if (0 == pid) memory_willneed(); // Prevents both the parent and the
>>                                    // child failure. Disable to see
>>                                    // failure of both parent and child.
>>                                    // [Presuming no prior swap-out: that
>>                                    // would make everything pass.]
>>   // During sleep/wait: manually force this process to
>>   // swap out. I use something like:
>>   //     stress -m 1 --vm-bytes 1800M
>>   // in another shell and ^C'ing it after top shows the
>>   // swapped status desired. 1800M just happened to work
>>   // on the Pine64+ 2GB that I was using. I watch with
>>   // top -PCwaopid [checking for zero RES(ident memory)].
>>   if (0 < pid) {
>>       sleep(30);    // Intend to swap-out during sleep.
>>       // test_check(); // Test in parent before child runs (longer sleep).
>>                     // This test fails if run for a failing region_size
>>                     // unless earlier preventing-activity happened.
>>       wait(&wait_status); // Only if test_check above passes or is
>>                           // disabled above.
>>   }
>>   if (-1 != wait_status && 0 <= pid) {
>>       if (0 == pid) { sleep(90); } // Intend to swap-out during sleep.
>>       test_check(); // Fails for small-enough region_size, both
>>                     // parent and child processes, unless earlier
>>                     // preventing-activity happened.
>>   }
>> }
>> // The memory and test code follows.
>> #include <stddef.h>     // for size_t, NULL
>> #include <stdlib.h>     // for malloc(.), free(.)
>> #include <sys/mman.h>   // for POSIX_MADV_WILLNEED, posix_madvise(.,.,.)
>> #define region_size (14u*1024u)
>>       // Bad dyn_region pattern, parent and child processes examples:
>>       // 256u, 2u*1024u, 4u*1024u, 8u*1024u, 9u*1024u, 12u*1024u, 14u*1024u
>>       // No failure examples:
>>       // 14u*1024u+1u, 15u*1024u, 16u*1024u, 32u*1024u, 256u*1024u*1024u
>> #define num_regions (256u*1024u*1024u/region_size)
>> typedef volatile unsigned char value_type;
>> struct region_struct { value_type array[region_size]; };
>> typedef struct region_struct region;
>> static region * volatile dyn_regions[num_regions] = {NULL,};
>> static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
>>                 // value avoids zero values: the bad values are zeros.
>> void test_setup(void) {
>>   for(size_t i=0u; i<num_regions; i++) {
>>       dyn_regions[i] = malloc(sizeof(region));
>>       if (!dyn_regions[i]) raise(SIGABRT);
>>       for(size_t j=0u; j<region_size; j++) {
>>           (*dyn_regions[i]).array[j] = value(j);
>>       }
>>   }
>> }
>> void memory_willneed(void) {
>>   for(size_t i=0u; i<num_regions; i++) {
>>       (void) posix_madvise(dyn_regions[i], region_size, POSIX_MADV_WILLNEED);
>>   }
>> }
>> static volatile size_t first_failure_idx = 0u; // dyn_regions index
>> static volatile size_t first_failure_pos = 0u; //   sub-array index
>> static volatile size_t after_bad_idx     = 0u; // dyn_regions index
>> static volatile size_t after_bad_pos     = 0u; //   sub-array index
>> static volatile size_t after_good_idx    = 0u; // dyn_regions index
>> static volatile size_t after_good_pos    = 0u; //   sub-array index
>> // Note: Some failing cases get (conjunctive notation):
>> //
>> //    0 == first_failure_idx < after_bad_idx < after_good_idx == num_regions
>> // && 0 == first_failure_pos && 0<=after_bad_pos<=region_size && after_good_pos==0
>> // && (after_bad_pos is a multiple of the page size in Bytes, here:
>> //     after_bad_pos==N*4096 for some non-negative integral value N)
>> //
>> // other failing cases instead fail with:
>> //
>> //    0 == first_failure_idx && num_regions == after_bad_idx == after_good_idx
>> // && 0 == first_failure_pos == after_bad_pos == after_good_pos
>> //
>> // after_bad_idx strongly tends to vary from failing run to failing run
>> // as does after_bad_pos.
>> // Note: The working cases get:
>> //
>> //    num_regions == first_failure_idx == after_bad_idx == after_good_idx
>> // && 0 == first_failure_pos == after_bad_pos == after_good_pos
>> void test_check(void) {
>>   first_failure_idx = first_failure_pos = 0u;
>>   while (first_failure_idx < num_regions) {
>>       while (  first_failure_pos < region_size
>>             && (  value(first_failure_pos)
>>                == (*dyn_regions[first_failure_idx]).array[first_failure_pos]
>>                )
>>             ) {
>>           first_failure_pos++;
>>       }
>>       if (region_size != first_failure_pos) break;
>>       first_failure_idx++;
>>       first_failure_pos = 0u;
>>   }
>>   after_bad_idx = first_failure_idx;
>>   after_bad_pos = first_failure_pos;
>>   while (after_bad_idx < num_regions) {
>>       while (  after_bad_pos < region_size
>>             && (  value(after_bad_pos)
>>                != (*dyn_regions[after_bad_idx]).array[after_bad_pos]
>>                )
>>             ) {
>>           after_bad_pos++;
>>       }
>>       if(region_size != after_bad_pos) break;
>>       after_bad_idx++;
>>       after_bad_pos = 0u;
>>   }
>>   after_good_idx = after_bad_idx;
>>   after_good_pos = after_bad_pos;
>>   while (after_good_idx < num_regions) {
>>       while (  after_good_pos < region_size
>>             && (  value(after_good_pos)
>>                == (*dyn_regions[after_good_idx]).array[after_good_pos]
>>                )
>>             ) {
>>           after_good_pos++;
>>       }
>>       if(region_size != after_good_pos) break;
>>       after_good_idx++;
>>       after_good_pos = 0u;
>>   }
>>   if (num_regions != first_failure_idx) raise(SIGABRT);
>> }
> I've found that for the above swap_testing5.c
> I can make variations that change how much of the
> allocated region prefix ends up zero vs. stays good.
> I vary the sleep time between testing the initialized
> allocations and doing the fork. The longer the sleep
> the more zero pages show up (be sure to read the
> comments):
> # diff swap_testing[56].c
> 1c1
> < // swap_testing5.c
> ---
>> // swap_testing6.c
> 5c5
> < // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
> ---
>> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing6.c
> 33c33
> <     sleep(30); // Potentially force swap-out here.
> ---
>>    sleep(150); // Potentially force swap-out here.
> 37a38,48
>>               // For no-swap-out here cases:
>>               //
>>               // The longer the sleep here the more allocations
>>               // that end up as zero.
>>               //
>>               // top's Mem Active, Inact, Wired, Buf, Free and
>>               // Swap Total, Used, and Free stay unchanged.
>>               // What does change is the process RES decreases
>>               // while the process SIZE and SWAP stay unchanged
>>               // during this sleep.
> NOTE: On other architectures that I've tried (such as armv6/v7)
>      RES does not decrease during the sleep, and the problem
>      does not happen even for the longest sleeps that I've tried.
>      (I use "stress -m 2 --vm-bytes 900M" on armv6/v7 instead
>      of -m 1 --vm-bytes 1800M because an allocation that large
>      in one process is not allowed there.)
> So watching top's RES during the sleep (longer than a few
> seconds) just before the fork() predicts the later
> fails-vs.-not status: If RES decreases (while other things
> associated with the process status stay the same) then
> there will be a failure.
> At this point I've no clue why the sleeping process has
> a decreasing RES(ident memory) size.
> I infer that without the sleep there still is a small
> amount of loss of RES but on too short of a timescale
> to observe in a "top -PCwaopid" or other such: in other
> words that the same behavior is causing the failure then
> as well, possibly for a loss of only one page of RES.
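
For anyone wanting to watch this from inside a test program
instead of from top, here is a minimal sketch (not part of
swap_testing5.c or swap_testing6.c). It assumes that mincore(2)
reflects the same page residency that top's RES reports, which is
unverified for this arm64 context. resident_pages and
res_decreased are hypothetical helper names.

```c
#define _DEFAULT_SOURCE  /* glibc only: expose mincore(); harmless elsewhere */
#include <stddef.h>      /* size_t */
#include <stdlib.h>      /* malloc, free */
#include <sys/mman.h>    /* mincore */
#include <unistd.h>      /* sysconf */

/* Count the pages of [addr, addr+len) that mincore() reports as
 * resident.  addr must be page aligned.  Returns -1 on error. */
long resident_pages(void *addr, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || 0u == len) return -1;
    size_t npages = (len + (size_t)page - 1u) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (!vec) return -1;
    long count = -1;
    /* The vec parameter is char* on FreeBSD and unsigned char* on
     * Linux; a void* argument converts to either. */
    if (0 == mincore(addr, len, (void *)vec)) {
        count = 0;
        for (size_t i = 0u; i < npages; i++) count += vec[i] & 1;
    }
    free(vec);
    return count;
}

/* Nonzero if any sample is smaller than the sample before it. */
int res_decreased(const long *samples, size_t n) {
    for (size_t i = 1u; i < n; i++) {
        if (samples[i] < samples[i - 1u]) return 1;
    }
    return 0;
}
```

Sampling resident_pages() about once a second on a page-aligned
region during the sleep just before the fork(), then passing the
samples to res_decreased(), would give the same fails-vs.-not
prediction as watching top, if the assumption above holds.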

I've been able to identify what code sequence
is gradually removing the "small_mappings" via
some breakpointing in the kernel after reaching
the "should be just sleeping" status. Specifically
I started with breakpointing when
pmap_resident_count_dec was on the call stack
in order to see the call chain(s) that lead to
it being called while RES(ident memory) is
gradually decreasing during the sleep that
is just before forking.

(tid 100067 is [pagedaemon{pagedaemon}], which
is in vm_pageout_worker. bt does not show inlined
functions.)

[ thread pid 17 tid 100067 ]
Breakpoint at   $x.1:   undefined       d65f03c0
db> bt
Tracing pid 17 tid 100067 td 0xfffffd0001c4aa00
. . .
handle_el1h_sync() at pmap_remove_l3+0xdc
        pc = 0xffff000000604870  lr = 0xffff000000611158
        sp = 0xffff000083a49980  fp = 0xffff000083a49a40

pmap_remove_l3() at pmap_ts_referenced+0x580
        pc = 0xffff000000611158  lr = 0xffff000000615c50
        sp = 0xffff000083a49a50  fp = 0xffff000083a49ac0

pmap_ts_referenced() at vm_pageout+0xe60
        pc = 0xffff000000615c50  lr = 0xffff0000005d1f74
        sp = 0xffff000083a49ad0  fp = 0xffff000083a49b50

vm_pageout() at fork_exit+0x94
        pc = 0xffff0000005d1f74  lr = 0xffff0000002e01c0
        sp = 0xffff000083a49b60  fp = 0xffff000083a49b90

fork_exit() at fork_trampoline+0x10
        pc = 0xffff0000002e01c0  lr = 0xffff0000006177b4
        sp = 0xffff000083a49ba0  fp = 0x0000000000000000

It turns out that pmap_ts_referenced is on its:

. . .

path for the above so the pmap_remove_l3 call is
the one from that execution path. (Found by more
breakpointing after enabling such on the paths.)

So this is the path with:
(breakpoint hook not shown)

                                /*
                                 * Wired pages cannot be paged out so
                                 * doing accessed bit emulation for
                                 * them is wasted effort. We do the
                                 * hard work for unwired pages only.
                                 */
                                pmap_remove_l3(pmap, pte, pv->pv_va, tpde,
                                    &free, &lock);
                                pmap_invalidate_page(pmap, pv->pv_va);
                                if (pvf == pv)
                                        pvf = NULL;
                                pv = NULL;
                                . . .

pmap_remove_l3 decrements the resident_count in
this sequence.

From what I can tell this code is eliminating the
content of pages that, in the failing tests, have
no backing store yet (not swapped-out yet by test
design). The observed behavior is that the pages
that this happens to end up as zero pages after
the later fork-then-swapout-then-back-in.
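
The zero pages described above could be counted from user space
with something like the following minimal sketch. It is not part
of the test program, and count_zero_pages is a hypothetical
helper name. The chunks it checks only line up with real pages if
buf is page aligned, which is an assumption that malloc() does not
guarantee.

```c
#include <stddef.h>   /* size_t */

/* Return how many whole page_size-byte chunks of buf[0..len)
 * contain only zero bytes.  A trailing partial chunk is ignored. */
size_t count_zero_pages(const volatile unsigned char *buf, size_t len,
                        size_t page_size) {
    size_t zero_pages = 0u;
    if (0u == page_size) return 0u;
    for (size_t base = 0u; base + page_size <= len; base += page_size) {
        size_t j = 0u;
        /* Stop at the first nonzero byte in this chunk. */
        while (j < page_size && 0u == buf[base + j]) j++;
        if (j == page_size) zero_pages++;
    }
    return zero_pages;
}
```

Because value() in the test program never produces a zero byte, any
chunk counted here would be one that lost its content.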

I do not see anything putting the pages that this
happens to into any other lists to keep track of
the page content. The swap-out and swap-in seem to
have ignored these pages and to have been based on
automatically zeroed pages.

Note that the (or a) question might be if these
pages should have ever gotten to this code at
all. (I'm no expert overall.) But that might
get into why POSIX_MADV_WILLNEED spanning each
page is sufficient to avoid the zeros issue for
fork-then-swapout-and-back-in. I'll only write
here about what the backtrace code seems to be
doing if I'm interpreting correctly.
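
To make "spanning each page" explicit, here is a minimal sketch
that widens a region to whole pages before advising. page_span
and willneed_pages are hypothetical helper names, not part of the
test program. Whether the page rounding makes any difference on
FreeBSD is an assumption; the kernel may well round the range
itself.

```c
#define _DEFAULT_SOURCE  /* glibc only: expose posix_madvise(); harmless elsewhere */
#include <stddef.h>      /* size_t */
#include <stdint.h>      /* uintptr_t */
#include <sys/mman.h>    /* posix_madvise, POSIX_MADV_WILLNEED */
#include <unistd.h>      /* sysconf */

/* Widen [addr, addr+len) to whole pages: *start is addr rounded
 * down to a page boundary and *span reaches the first page
 * boundary at or after addr+len.  page must be a power of two. */
void page_span(uintptr_t addr, size_t len, uintptr_t page,
               uintptr_t *start, size_t *span) {
    uintptr_t mask = page - 1u;
    uintptr_t end = (addr + len + mask) & ~mask;
    *start = addr & ~mask;
    *span = (size_t)(end - *start);
}

/* Advise WILLNEED for every page that the region overlaps.
 * Returns 0 on success, nonzero on failure. */
int willneed_pages(const volatile void *p, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start;
    size_t span;
    if (page <= 0) return -1;
    page_span((uintptr_t)p, len, (uintptr_t)page, &start, &span);
    return posix_madvise((void *)start, span, POSIX_MADV_WILLNEED);
}
```

In the test program, memory_willneed() could call
willneed_pages(dyn_regions[i], region_size) in place of its direct
posix_madvise() call to try this variation.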

One oddity here is that pmap_remove_l3 does its own
pmap_invalidate_page to invalidate the same tlb entry as
the above pmap_invalidate_page, so a double-invalidate.
(I've no clue if such is just suboptimal vs. a form of

pmap_remove_l3 here does things that the analogous
sys/arm/arm/pmap-v6.c's pmap_ts_referenced does not
do and pmap-v6 does something this code does not.

arm64's pmap_remove_l3 does (in summary):

  decrements the resident_count
(then pmap_ts_referenced's small_mappings code
 does another pmap_invalidate_page for the
 same argument values)

arm pmap-v6's pmap_ts_referenced's small_mappings
code does:

  conditional vm_page_dirty
  pte2_clear_bit for PTE2_A

There is, for example, no decrement of the
resident_count involved (that I found anyway). 

But I've no clue just what should be analogous
vs. what should not between pmap-v6 and arm64's
pmap code in this area.
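
The contrast can be put into a toy user-space model. This is a
sketch only: every type and function name below is hypothetical,
nothing here is kernel code, and it captures only the two
differences listed above (the resident_count decrement vs. just
clearing the accessed flag). The condition used for marking the
page dirty is an assumption.

```c
#include <stdbool.h>

/* Toy stand-ins; none of these are kernel types. */
struct toy_pte  { bool valid; bool accessed; bool dirty; };
struct toy_pmap { long resident_count; };
struct toy_page { bool dirty; };

/* arm64-like small_mappings step: the mapping itself is removed,
 * so the pmap's resident count drops by one. */
void toy_ts_referenced_remove(struct toy_pmap *pm, struct toy_pte *pte) {
    if (!pte->valid) return;
    pte->valid = false;
    pte->accessed = false;
    pm->resident_count--;
}

/* pmap-v6-like small_mappings step: a dirty mapping marks the page
 * dirty, then only the accessed flag is cleared.  The mapping and
 * the resident count are left alone. */
void toy_ts_referenced_clear(struct toy_pmap *pm, struct toy_pte *pte,
                             struct toy_page *pg) {
    (void)pm;  /* intentionally unused: no count change on this path */
    if (!pte->valid) return;
    if (pte->dirty) pg->dirty = true;
    pte->accessed = false;
}
```

In the model, repeated passes of the first function shrink
resident_count one mapping at a time, which matches the gradual
RES decrease seen during the sleep; the second never changes it.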

I'll also note that the code before the
arm64 small_mappings code also uses
pmap_remove_l3 but does not do the
decrement nor the extra pmap_invalidate_page
(for example). But again I do not know
how analogous the two paths should be.

Only the small_mappings path seems to have the
end-up-with-zeros problem for the later
fork-then-swap-out and then swap-back-in.

