arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

Sat Mar 18 13:26:58 UTC 2017

[Summary: I've now tested on a rpi3 in addition to a
pine64+ 2GB. Both contexts show the problem.]

On 2017-Mar-16, at 2:07 AM, Mark Millard <markmi at dsl-only.net> wrote:

> On 2017-Mar-15, at 11:07 PM, Scott Bennett <bennett at sdf.org> wrote:
> 
>> Mark Millard <markmi at dsl-only.net> wrote:
>> 
>>> [Something strange happened to the automatic CC: fill-in for my original
>>> reply. Also I should have mentioned that for my test program if a
>>> variant is made that does not fork the swapping works fine.]
>>> 
>>> On 2017-Mar-15, at 9:37 AM, Mark Millard <markmi at dsl-only.net> wrote:
>>> 
>>>> On 2017-Mar-15, at 6:15 AM, Scott Bennett <bennett at sdf.org> wrote:
>>>> 
>>>>>  On Tue, 14 Mar 2017 18:18:56 -0700 Mark Millard
>>>>> <markmi at dsl-only.net> wrote:
>>>>>> On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso at cicely7.cicely.de> wrote:
>>>>>> 
>>>>>>> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>>>>>>>> [test_check() between the fork and the wait/sleep prevents the
>>>>>>>> failure from occurring. Even a small access to the memory at
>>>>>>>> that stage prevents the failure. Details follow.]
>>>>>>> 
>>>>>>> Maybe a stupid question, since you might have written it somewhere.
>>>>>>> What medium do you swap to?
>>>>>>> I've seen broken firmware on microSD cards doing silent data
>>>>>>> corruption for some access patterns.
>>>>>> 
>>>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>>> 
>>>>>> Only the kernel is from the microSD card.
>>>>>> 
>>>>>> I have several examples of the USB SSD model and have
>>>>>> never observed such problems in any other context.
>>>>>> 
>>>>>> [remainder of irrelevant material deleted  --SB]
>>>>> 
>>>>>  You gave a very long-winded non-answer to Bernd's question, so I'll
>>>>> repeat it here.  What medium do you swap to?
>>>> 
>>>> My wording of:
>>>> 
>>>> The root filesystem is on a USB SSD on a powered hub.
>>>> 
>>>> was definitely poor. It should have explicitly mentioned the
>>>> swap partition too:
>>>> 
>>>> The root filesystem and swap partition are both on the same
>>>> USB SSD on a powered hub.
>>>> 
>>>> More detail from dmesg -a for usb:
>>>> 
>>>> usbus0: 12Mbps Full Speed USB v1.0
>>>> usbus1: 480Mbps High Speed USB v2.0
>>>> usbus2: 12Mbps Full Speed USB v1.0
>>>> usbus3: 480Mbps High Speed USB v2.0
>>>> ugen0.1: <Generic OHCI root HUB> at usbus0
>>>> uhub0: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
>>>> ugen1.1: <Allwinner EHCI root HUB> at usbus1
>>>> uhub1: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
>>>> ugen2.1: <Generic OHCI root HUB> at usbus2
>>>> uhub2: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
>>>> ugen3.1: <Allwinner EHCI root HUB> at usbus3
>>>> uhub3: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
>>>> . . .
>>>> uhub0: 1 port with 1 removable, self powered
>>>> uhub2: 1 port with 1 removable, self powered
>>>> uhub1: 1 port with 1 removable, self powered
>>>> uhub3: 1 port with 1 removable, self powered
>>>> ugen3.2: <GenesysLogic USB2.0 Hub> at usbus3
>>>> uhub4 on uhub3
>>>> uhub4: <GenesysLogic USB2.0 Hub, class 9/0, rev 2.00/90.20, addr 2> on usbus3
>>>> uhub4: MTT enabled
>>>> uhub4: 4 ports with 4 removable, self powered
>>>> ugen3.3: <OWC Envoy Pro mini> at usbus3
>>>> umass0 on uhub4
>>>> umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 3> on usbus3
>>>> umass0:  SCSI over Bulk-Only; quirks = 0x0100
>>>> umass0:0:0: Attached to scbus0
>>>> . . .
>>>> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
>>>> da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device
>>>> da0: Serial Number <REPLACED>
>>>> da0: 40.000MB/s transfers
>>>> 
>>>> (Edited a bit because there is other material interlaced, even
>>>> internal to some lines. Also: I removed the serial number of the
>>>> specific example device.)
>> 
>>    Thank you.  That presents a much clearer picture.
>>>> 
>>>>>  I will further note that any kind of USB device cannot automatically
>>>>> be trusted to behave properly.  USB devices are notorious, for example,
>>>>> 
>>>>> [reasons why deleted  --SB]
>>>>> 
>>>>>  You should identify where you page/swap to and then try substituting
>>>>> a different device for that function as a test to eliminate the possibility
>>>>> of a bad storage device/controller.  If the problem still occurs, that
>>>>> means there still remains the possibility that another controller or its
>>>>> firmware is defective instead.  It could be a kernel bug, it is true, but
>>>>> making sure there is no hardware or firmware error occurring is important,
>>>>> and as I say, USB devices should always be considered suspect unless and
>>>>> until proven innocent.
>>>> 
>>>> [FYI: This is a ufs context, not a zfs one.]
>> 
>>    Right.  It's only a Pi, after all. :-)
> 
> It is a Pine64+ 2GB, not an rpi3.
> 
>>>> 
>>>> I'm aware of such things. So far no evidence suggests that the
>>>> USB devices that I can replace are a problem. Otherwise
>>>> I'd not be going down this path. I only have access to the one arm64
>>>> device (a Pine64+ 2GB) so I've no ability to substitution-test what
>>>> is on that board.
>> 
>>    There isn't even one open port on that hub that you could plug a
>> flash drive into temporarily to be the paging device?
> 
> Why do you think that I've never tried alternative devices? It
> is just that the result was no evidence that my usually-in-use
> SSD is having a special/local problem: the behavior continues
> across all such contexts when the Pine64+ 2GB is involved. (Again
> I have not had access to an alternate to the one arm64 board.
> That limits my substitution testing possibilities.)
> 
> Why would you expect a Flash drive to be better than another SSD
> for such testing? (The SSD that I usually use even happens to be
> a USB 3.0 SSD, capable of USB 3.0 speeds in USB 3.0 contexts. So
> is the hub that I usually use for that matter.)

FYI: I now have access to a rpi3 in addition to a pine64+ 2GB.

I've tested on the rpi3 using a different USB hub and a different
SSD: no hardware device in common with the recent Pine64+ 2GB
tests (other than console cabling and what handles the serial
console).

The fork-then-swap-out-then-swap-in failure happens in the
rpi3 context as well.

Because the rpi3 has only 1 GiByte of RAM the stress commands
that I used were more like:

stress -m 1 --vm-bytes 1000M

to get zero RES(ident memory) for the two processes from my
test program after it forks while they are waiting/sleeping.
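
For readers who do not have the test program from the earlier
messages at hand, the overall shape of the
fork-then-swap-out-then-swap-in test can be sketched roughly as
below. This is only a minimal illustration, not the ~110 line
program itself: the names REGION_BYTES, fill_pattern,
count_mismatches and fork_sleep_check are hypothetical, as are the
region size and the byte pattern, and having the parent block in
waitpid() while the child sleeps is an assumption about the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical size: large enough to span many pages. */
#define REGION_BYTES ((size_t)4 * 1024 * 1024)

/* Fill the region with a position-dependent pattern that is never
 * zero, so that a page that comes back zeroed is always detected. */
static void fill_pattern(unsigned char *p, size_t n)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)((i % 255u) + 1u);
}

/* Count the bytes that no longer match the pattern. */
static size_t count_mismatches(const unsigned char *p, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] != (unsigned char)((i % 255u) + 1u))
            bad++;
    return bad;
}

/* Fill a region, fork, idle in both processes, then re-check the
 * region in both. Returns 0 if both processes saw intact memory,
 * 1 if either saw a mismatch, and -1 on a setup failure. */
static int fork_sleep_check(unsigned int sleep_seconds)
{
    unsigned char *region = malloc(REGION_BYTES);
    if (region == NULL)
        return -1;
    fill_pattern(region, REGION_BYTES);

    pid_t pid = fork();
    if (pid < 0) {
        free(region);
        return -1;
    }

    if (pid == 0) {
        /* Child: this sleep is the window in which both processes
         * are forced out to swap from another shell. */
        sleep(sleep_seconds);
        _exit(count_mismatches(region, REGION_BYTES) != 0 ? 1 : 0);
    }

    /* Parent: idle in waitpid() until the child has finished. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        free(region);
        return -1;
    }
    int parent_bad = count_mismatches(region, REGION_BYTES) != 0;
    free(region);
    return (parent_bad || WEXITSTATUS(status) != 0) ? 1 : 0;
}
```

With something like fork_sleep_check(60), the stress command would
be run in another shell during the sleep until top shows zero RES
for both processes. A return value of 1 would then mean that at
least one of the two processes found bytes that no longer match
the pattern.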

>> You could then
>> try your tests before returning to the normal configuration.  If there
>> isn't an open port, then how about plugging a second hub into one of
>> the first hub's ports and moving the displaced device to the second
>> hub?  A flash drive could then be plugged in.  That kind of configuration
>> is obviously a bad idea for the long run, but just to try your tests it
>> ought to work well enough.
> 
> I have access to more SSDs that I can use than I do to Flash drives. I
> see no reason to specifically use a Flash drive.
> 
>> (BTW, if a USB storage device containing a
>> paging area drops off-line even momentarily and the system needs to use
>> it, that is the beginning of the end, even though it may take up to a few
>> minutes for everything to lock up.
> 
> The system does not lock up, even days or weeks later, despite my having
> done dozens of experiments that show memory corruption failures over those
> days. The only processes showing memory corruption so far are those
> that were the parent or child for a fork that were later swapped out
> to have zero RES(ident memory) and then even later swapped back in.
> 
> The context has no such issues. You are inventing problems that do
> not exist in my context. That is why none of my list submittals
> mention such problems: they did not occur.
> 
>> You probably won't be able to do an
>> orderly shutdown, but will instead have to crash it with the power switch.
>> In the case of something like a Pi, this is an unpleasant fact of life,
>> to be sure.)
> 
> Such things did not occur and have nothing to do with my context so far.
> 
>>    I think I buy your arguments, given the evidence you've collected
>> thus far, including what you've added below.  I just like to eliminate
>> possibilities that are much simpler to deal with before facing nastinesses
>> like bugs in the VM subsystem. :-)
> 
> When I started this I found no evidence of device-specific problems.
> My investigation activity goes back to long before my list submittals.
> 
> And I repeat: Other people have reported the symptoms that started
> this investigation. They did so before I ever started my activities.
> They were using none of the specific devices that I have access to.
> Likely the types of devices were frequently even different, such as
> a rpi3 instead of a Pine64+ 2GB or a different USB drive. I was able
> to get the symptoms that they reported.
> 
>>>> It would be neat if some folks used my code to test other arm64
>>>> contexts and reported the results. I'd be very interested.
>>>> (This is easier to do on devices that do not have massive
>>>> amounts of RAM, which may limit the range of devices or
>>>> device configurations that are reasonable to test.)
>>>> 
>>>> There is also the fact that other people using other devices have reported
>>>> the behavior that started this investigation. I can produce the
>>>> behavior that they reported, although I've not seen anyone else
>>>> listing specific steps that lead to the problem or ways to tell
>>>> if the symptom is going to happen before it actually does. Nor
>>>> have I seen any other core dump analysis. (I have bugzilla
>>>> submittals 217138 and 217239 tied to symptoms others have
>>>> reported as well as this test program material.)
>>>> 
>>>> Also, considering that for my test program I can control which pages
>>>> get the zeroed-problem by read-accessing even one byte of any 4K
>>>> Byte page that I want to make work normally, doing so in the child
>>>> process of the fork, between the fork and the sleep/swap-out, it does
>>>> not suggest USB-device-specific behavior. The read-access is changing
>>>> the status of the page in some way as far as I can tell.
>>>> 
>>>> (Such read-accesses in the parent process make no difference to the
>>>> behavior.)
>>> 
>>> I should have noted another comparison/contrast between
>>> having memory corruption and not in my context:
>>> 
>>> I've tried variants of my test program that do not fork but
>>> just sleep for 60s to allow me to force the swap-out. I
>>> did this before adding fork and before using
>>> partial_test_check, for example. I gradually added things
>>> apparently involved in the reports others had made
>>> until I found a combination that produced a memory
>>> corruption test failure.
>>> 
>>> These tests, with no fork involved, find no problems with
>>> the memory content after the swap-in.
>>> 
>>> For my test program it appears that fork-before-swap-out
>>> or the like is essential to having the problem occur.
>>> 
>>    A comment about terminology seems in order here.  It bothers
>> me considerably to see you writing "swap out" or "swapping" where
>> it seems like you mean to write "page out" or "paging".  A BSD
>> system whose swapping mechanism gets activated has already waded
>> very deeply into the quicksand and frequently cannot be gotten out
>> in a reasonable amount of time even with manual assistance.  It is
>> often quicker to crash it, reboot, and wait for the fsck(8) cleanups
>> to complete.  Orderly shutdowns, even of the kind that results from
>> a quick poke to the power button, typically get mired in the same
>> mess that already has the system in knots.  Also, BSD systems since
>> 3.0BSD, unlike older AT&T (pre-SysVR2.3) systems, do not swap in,
>> just out.  A swapped out process, once the system determines that it
>> has adequate resources again to attempt to run the process, will have
>> the interrupted text page paged in and the rest will be paged in by
>> the normal mechanism of page faults and page-in operations.  I assume
>> you must already know all this, which is a large part of why it grates
>> on me that you appear to be using the wrong terms.
> 
> You apparently did not read any of the material about how the test
> is done or are unfamiliar with what "stress -m 1 --vm-bytes 1800M"
> does when there is only 2GB of RAM. I am deliberately inducing
> swapping in other processes, including the 2 from my test program
> (after the fork), not just paging. (stress is a port, not part of
> the base system.)
> 
> When I say swap-out and swap-in I mean it.
> 
> From the source code of my test program:
> 
>            sleep(60);
> 
>            // During this manually force this process to
>            // swap out. I use something like:
> 
>            // stress -m 1 --vm-bytes 1800M
> 
>            // in another shell and ^C'ing it after top
>            // shows the swapped status desired. 1800M
>            // just happened to work on the Pine64+ 2GB
>            // that I was using. I watch with top -PCwaopid .
> 
> That type of stress run uses about 1.8 GiBytes after a bit,
> which is enough to cause the swapping of other processes,
> including the two that I am testing (post-fork). (Some RAM
> is in use already before the stress run, which explains not
> needing 2 GiBytes to be in use by stress.)
> 
> Look at a "top -PCwaopid" display: there are columns for
> RES(ident memory) and SWAP. I cause my 2 test processes to
> show zero RES and everything under SWAP, starting sometime
> during the 60s sleep/wait.
> 
> Why would I cause swapping? Because buildworld causes such
> swap-outs at times when there is only 2GBytes of RAM,
> including processes that forked earlier, and as a result
> the corrupted memory problems show up later in some processes
> that were swapped out at the time. The build eventually
> stops for process failures tied to the corruptions of memory
> in the failing processes. (At least that is what my testing
> strongly suggests.)
> 
> But that is a very complicated context to use for analysis or
> testing of the problem. My test program is vastly simpler
> and easier/quicker to set up and test when used with stress
> as well. Such was the kind of thing I was trying to find.
> 
> I want the Pine64+ 2GB to work well enough to be able to have
> buildworld (-j 4) complete correctly without having to restart
> the build --even when everything has to be rebuilt. So I'm
> trying to find and provide enough evidence to help someone fix
> the problems that are observed to block such buildworld
> activity.
> 
> Again: others have reported such arm64 problems on the lists
> before I ever got into this activity. The evidence is that
> the issues are not a local property of my environment.
> 
> Swapping is supposed to work. I can do buildworld (-j 4)
> on armv6 (really -mcpu=cortex-a7 so armv7-a) and the
> swapping it causes works fine. This is true for both a
> bpim3 (2 GiBytes of RAM) and a rpi2 (1 GiByte of RAM
> so even more swapping). On a powerpc64 with 16 GiBytes
> I've built things that caused 26 GiBytes of swap to be
> in use some of the time (during 4 ld's running in
> parallel), with lots of processes having zero for
> RES(ident memory) and all their space listed under SWAP
> in a "top -PCwaopid" display. This too has no problems
> with swapping of previously forked processes (or of any
> other processes).
> 
> For the likes of a Pine64+ 2GB to be "self hosted" 
> for source-code based updates, swapping of previously
> forked processes must work and currently such
> swapping is unreliable.
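
The quoted observation that read-accessing even one byte of a 4K
Byte page in the child, between the fork and the sleep/swap-out,
makes that page come back intact can be illustrated with a small
helper. This is a sketch under assumptions, not code from the test
program: touch_each_page is a hypothetical name, and the page size
is taken from sysconf(_SC_PAGESIZE) with 4096 as a fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Read one byte at each page-size step through [p, p + n), plus the
 * final byte in case the region is not page aligned. Returns the
 * number of page-size steps that were read. */
static size_t touch_each_page(const unsigned char *p, size_t n)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : (size_t)4096;

    /* Reading through a volatile pointer keeps the compiler from
     * discarding loads whose values are otherwise unused. */
    const volatile unsigned char *vp = p;
    unsigned char sink = 0;
    size_t steps = 0;

    assert(p != NULL || n == 0);
    for (size_t off = 0; off < n; off += page) {
        sink ^= vp[off];
        steps++;
    }
    if (n > 0)
        sink ^= vp[n - 1];
    (void)sink;
    return steps;
}
```

Called in the child right after fork() on all or part of the tested
region, a helper like this would select which pages are expected to
survive the swap-out and swap-in. Per the quoted report, the same
reads done in the parent make no difference.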

===
Mark Millard
markmi at dsl-only.net