arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
Mark Millard
markmi at dsl-only.net
Sat Mar 18 13:26:58 UTC 2017
[Summary: I've now tested on a rpi3 in addition to a
pine64+ 2GB. Both contexts show the problem.]
On 2017-Mar-16, at 2:07 AM, Mark Millard <markmi at dsl-only.net> wrote:
> On 2017-Mar-15, at 11:07 PM, Scott Bennett <bennett at sdf.org> wrote:
>
>> Mark Millard <markmi at dsl-only.net> wrote:
>>
>>> [Something strange happened to the automatic CC: fill-in for my original
>>> reply. Also I should have mentioned that for my test program if a
>>> variant is made that does not fork the swapping works fine.]
>>>
>>> On 2017-Mar-15, at 9:37 AM, Mark Millard <markmi at dsl-only.net> wrote:
>>>
>>>> On 2017-Mar-15, at 6:15 AM, Scott Bennett <bennett at sdf.org> wrote:
>>>>
>>>>> On Tue, 14 Mar 2017 18:18:56 -0700 Mark Millard
>>>>> <markmi at dsl-only.net> wrote:
>>>>>> On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso at cicely7.cicely.de> wrote:
>>>>>>
>>>>>>> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>>>>>>>> [test_check() between the fork and the wait/sleep prevents the
>>>>>>>> failure from occurring. Even a small access to the memory at
>>>>>>>> that stage prevents the failure. Details follow.]
>>>>>>>
>>>>>>> Maybe a stupid question, since you might have written it somewhere.
>>>>>>> What medium do you swap to?
>>>>>>> I've seen broken firmware on microSD cards doing silent data
>>>>>>> corruption for some access patterns.
>>>>>>
>>>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>>>
>>>>>> Only the kernel is from the microSD card.
>>>>>>
>>>>>> I have several examples of the USB SSD model and have
>>>>>> never observed such problems in any other context.
>>>>>>
>>>>>> [remainder of irrelevant material deleted --SB]
>>>>>
>>>>> You gave a very long-winded non-answer to Bernd's question, so I'll
>>>>> repeat it here. What medium do you swap to?
>>>>
>>>> My wording of:
>>>>
>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>
>>>> was definitely poor. It should have explicitly mentioned the
>>>> swap partition too:
>>>>
>>>> The root filesystem and swap partition are both on the same
>>>> USB SSD on a powered hub.
>>>>
>>>> More detail from dmesg -a for usb:
>>>>
>>>> usbus0: 12Mbps Full Speed USB v1.0
>>>> usbus1: 480Mbps High Speed USB v2.0
>>>> usbus2: 12Mbps Full Speed USB v1.0
>>>> usbus3: 480Mbps High Speed USB v2.0
>>>> ugen0.1: <Generic OHCI root HUB> at usbus0
>>>> uhub0: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
>>>> ugen1.1: <Allwinner EHCI root HUB> at usbus1
>>>> uhub1: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
>>>> ugen2.1: <Generic OHCI root HUB> at usbus2
>>>> uhub2: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
>>>> ugen3.1: <Allwinner EHCI root HUB> at usbus3
>>>> uhub3: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
>>>> . . .
>>>> uhub0: 1 port with 1 removable, self powered
>>>> uhub2: 1 port with 1 removable, self powered
>>>> uhub1: 1 port with 1 removable, self powered
>>>> uhub3: 1 port with 1 removable, self powered
>>>> ugen3.2: <GenesysLogic USB2.0 Hub> at usbus3
>>>> uhub4 on uhub3
>>>> uhub4: <GenesysLogic USB2.0 Hub, class 9/0, rev 2.00/90.20, addr 2> on usbus3
>>>> uhub4: MTT enabled
>>>> uhub4: 4 ports with 4 removable, self powered
>>>> ugen3.3: <OWC Envoy Pro mini> at usbus3
>>>> umass0 on uhub4
>>>> umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 3> on usbus3
>>>> umass0: SCSI over Bulk-Only; quirks = 0x0100
>>>> umass0:0:0: Attached to scbus0
>>>> . . .
>>>> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
>>>> da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device
>>>> da0: Serial Number <REPLACED>
>>>> da0: 40.000MB/s transfers
>>>>
>>>> (Edited a bit because there is other material interlaced, even
>>>> internal to some lines. Also: I removed the serial number of the
>>>> specific example device.)
>>
>> Thank you. That presents a much clearer picture.
>>>>
>>>>> I will further note that any kind of USB device cannot automatically
>>>>> be trusted to behave properly. USB devices are notorious, for example,
>>>>>
>>>>> [reasons why deleted --SB]
>>>>>
>>>>> You should identify where you page/swap to and then try substituting
>>>>> a different device for that function as a test to eliminate the possibility
>>>>> of a bad storage device/controller. If the problem still occurs, that
>>>>> means there still remains the possibility that another controller or its
>>>>> firmware is defective instead. It could be a kernel bug, it is true, but
>>>>> making sure there is no hardware or firmware error occurring is important,
>>>>> and as I say, USB devices should always be considered suspect unless and
>>>>> until proven innocent.
>>>>
>>>> [FYI: This is a ufs context, not a zfs one.]
>>
>> Right. It's only a Pi, after all. :-)
>
> It is a Pine64+ 2GB, not an rpi3.
>
>>>>
>>>> I'm aware of such things. No evidence so far suggests that the
>>>> USB devices that I can replace are a problem. Otherwise
>>>> I'd not be going down this path. I only have access to the one arm64
>>>> device (a Pine64+ 2GB) so I've no ability to substitution-test what
>>>> is on that board.
>>
>> There isn't even one open port on that hub that you could plug a
>> flash drive into temporarily to be the paging device?
>
> Why do you think that I've never tried alternative devices? It
> is just that the result was no evidence that my usually-in-use
> SSD is having a special/local problem: the behavior continues
> across all such contexts when the Pine64+ 2GB is involved. (Again
> I have not had access to an alternate to the one arm64 board.
> That limits my substitution testing possibilities.)
>
> Why would you expect a Flash drive to be better than another SSD
> for such testing? (The SSD that I usually use even happens to be
> a USB 3.0 SSD, capable of USB 3.0 speeds in USB 3.0 contexts. So
> is the hub that I usually use for that matter.)
FYI: I now have access to a rpi3 in addition to a pine64+ 2GB.
I've tested on the rpi3 using a different USB hub and a different
SSD: no hardware device in common with the recent Pine64+ 2GB
tests (other than console cabling and what handles the serial
console).

The fork-then-swap-out-then-swap-in failure happens in the
rpi3 context as well.

Because the rpi3 has only 1 GiByte of RAM the stress commands
that I used were more like:

stress -m 1 --vm-bytes 1000M

to get zero RES(ident memory) for the two processes from my
test program after it forks while they are waiting/sleeping.
>> You could then
>> try your tests before returning to the normal configuration. If there
>> isn't an open port, then how about plugging a second hub into one of
>> the first hub's ports and moving the displaced device to the second
>> hub? A flash drive could then be plugged in. That kind of configuration
>> is obviously a bad idea for the long run, but just to try your tests it
>> ought to work well enough.
>
> I have access to more SSDs that I can use than I do to Flash drives. I
> see no reason to specifically use a Flash drive.
>
>> (BTW, if a USB storage device containing a
>> paging area drops off-line even momentarily and the system needs to use
>> it, that is the beginning of the end, even though it may take up to a few
>> minutes for everything to lock up.
>
> The system does not lock up, even days or weeks later, despite my having done
> dozens of experiments that show memory corruption failures over those
> days. The only processes showing memory corruption so far are those
> that were the parent or child for a fork that were later swapped out
> to have zero RES(ident memory) and then even later swapped back in.
>
> The context has no such issues. You are inventing problems that do
> not exist in my context. That is why none of my list submittals
> mention such problems: they did not occur.
>
>> You probably won't be able to do an
>> orderly shutdown, but will instead have to crash it with the power switch.
>> In the case of something like a Pi, this is an unpleasant fact of life,
>> to be sure.)
>
> Such things did not occur and have nothing to do with my context so far.
>
>> I think I buy your arguments, given the evidence you've collected
>> thus far, including what you've added below. I just like to eliminate
>> possibilities that are much simpler to deal with before facing nastinesses
>> like bugs in the VM subsystem. :-)
>
> When I started this I found no evidence of device-specific problems.
> My investigation activity goes back to long before my list submittals.
>
> And I repeat: Other people have reported the symptoms that started
> this investigation. They did so before I ever started my activities.
> They were using none of the specific devices that I have access to.
> Likely the types of devices were frequently even different, such as
> a rpi3 instead of a Pine64+ 2GB or a different USB drive. I was able
> to get the symptoms that they reported.
>
>>>> It would be neat if some folks used my code to test other arm64
>>>> contexts and reported the results. I'd be very interested.
>>>> (This is easier to do on devices that do not have massive
>>>> amounts of RAM, which may limit the range of devices or
>>>> device configurations that are reasonable to test.)
>>>>
>>>> Other people using other devices have also reported
>>>> the behavior that started this investigation. I can produce the
>>>> behavior that they reported, although I've not seen anyone else
>>>> listing specific steps that lead to the problem or ways to tell
>>>> if the symptom is going to happen before it actually does. Nor
>>>> have I seen any other core dump analysis. (I have bugzilla
>>>> submittals 217138 and 217239 tied to symptoms others have
>>>> reported as well as this test program material.)
>>>>
>>>> Also, considering that for my test program I can control which pages
>>>> get the zeroed-problem by read-accessing even one byte of any 4K
>>>> Byte page that I want to make work normally, doing so in the child
>>>> process of the fork, between the fork and the sleep/swap-out, it does
>>>> not suggest USB-device-specific behavior. The read-access is changing
>>>> the status of the page in some way as far as I can tell.
>>>>
>>>> (Such read-accesses in the parent process make no difference to the
>>>> behavior.)
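The read-access described above can be sketched roughly as follows. This is not code from the actual test program: the function name touch_one_byte_per_page and its parameters are hypothetical, and the page size is passed in by the caller (4096 would match the 4K Byte pages mentioned above). It reads one byte at each page-size step, plus the final byte in case the buffer is not page-aligned, and returns how many reads it made.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the original test program):
 * read one byte at every page_size step of buf, plus the last
 * byte, so that each page backing the buffer is read at least
 * once even if buf does not start on a page boundary.
 * Returns the number of reads performed.
 */
size_t touch_one_byte_per_page(const unsigned char *buf, size_t nbytes,
                               size_t page_size)
{
    /* volatile keeps the compiler from optimizing the reads away */
    const volatile unsigned char *p = buf;
    volatile unsigned char sink = 0;
    size_t reads = 0;

    assert(page_size > 0);

    for (size_t off = 0; off < nbytes; off += page_size) {
        sink = p[off];
        reads++;
    }
    if (nbytes > 0) {
        sink = p[nbytes - 1];   /* covers a trailing partial page */
        reads++;
    }
    (void)sink;
    return reads;
}
```

In a variant of the test, a call like this would presumably go in the child process between the fork and the sleep, with the page size taken from sysconf(_SC_PAGESIZE).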
>>>
>>> I should have noted another comparison/contrast between
>>> having memory corruption and not in my context:
>>>
>>> I've tried variants of my test program that do not fork but
>>> just sleep for 60s to allow me to force the swap-out. I
>>> did this before adding fork and before using
>>> partial_test_check, for example. I gradually added things
>>> apparently involved in the reports others had made
>>> until I found a combination that produced a memory
>>> corruption test failure.
>>>
>>> These tests without fork involved find no problems with
>>> the memory content after the swap-in.
>>>
>>> For my test program it appears that fork-before-swap-out
>>> or the like is essential to having the problem occur.
>>>
>> A comment about terminology seems in order here. It bothers
>> me considerably to see you writing "swap out" or "swapping" where
>> it seems like you mean to write "page out" or "paging". A BSD
>> system whose swapping mechanism gets activated has already waded
>> very deeply into the quicksand and frequently cannot be gotten out
>> in a reasonable amount of time even with manual assistance. It is
>> often quicker to crash it, reboot, and wait for the fsck(8) cleanups
>> to complete. Orderly shutdowns, even of the kind that results from
>> a quick poke to the power button, typically get mired in the same
>> mess that already has the system in knots. Also, BSD systems since
>> 3.0BSD, unlike older AT&T (pre-SysVR2.3) systems, do not swap in,
>> just out. A swapped out process, once the system determines that it
>> has adequate resources again to attempt to run the process, will have
>> the interrupted text page paged in and the rest will be paged in by
>> the normal mechanism of page faults and page-in operations. I assume
>> you must already know all this, which is a large part of why it grates
>> on me that you appear to be using the wrong terms.
>
> You apparently did not read any of the material about how the test
> is done or are unfamiliar with what "stress -m 1 --vm-bytes 1800M"
> does when there is only 2GB of RAM. I am deliberately inducing
> swapping in other processes, including the 2 from my test program
> (after the fork), not just paging. (stress is a port, not part of
> the base system.)
>
> When I say swap-out and swap-in I mean it.
>
> From the source code of my test program:
>
> sleep(60);
>
> // During this manually force this process to
> // swap out. I use something like:
>
> // stress -m 1 --vm-bytes 1800M
>
> // in another shell and ^C'ing it after top
> // shows the swapped status desired. 1800M
> // just happened to work on the Pine64+ 2GB
> // that I was using. I watch with top -PCwaopid .
>
> That type of stress run uses about 1.8 GiBytes after a bit,
> which is enough to cause the swapping of other processes,
> including the two that I am testing (post-fork). (Some RAM
> is in use already before the stress run, which explains not
> needing 2 GiBytes to be in use by stress.)
>
> Look at a "top -PCwaopid" display: there are columns for
> RES(ident memory) and SWAP. I cause my 2 test processes to
> show zero RES and everything under SWAP, starting sometime
> during the 60s sleep/wait.
>
> Why would I cause swapping? Because buildworld causes such
> swap-outs at times when there is only 2GBytes of RAM,
> including processes that forked earlier, and as a result
> the corrupted memory problems show up later in some processes
> that were swapped out at the time. The build eventually
> stops for process failures tied to the corruptions of memory
> in the failing processes. (At least that is what my testing
> strongly suggests.)
>
> But that is a very complicated context to use for analysis or
> testing of the problem. My test program is vastly simpler
> and easier/quicker to set up and test when used with stress
> as well. Such was the kind of thing I was trying to find.
>
> I want the Pine64+ 2GB to work well enough to be able to have
> buildworld (-j 4) complete correctly without having to restart
> the build --even when everything has to be rebuilt. So I'm
> trying to find and provide enough evidence to help someone fix
> the problems that are observed to block such buildworld
> activity.
>
> Again: others have reported such arm64 problems on the lists
> before I ever got into this activity. The evidence is that
> the issues are not a local property of my environment.
>
> Swapping is supposed to work. I can do buildworld (-j 4)
> on armv6 (really -mcpu=cortex-a7 so armv7-a) and the
> swapping it causes works fine. This is true for both a
> bpim3 (2 GiBytes of RAM) and a rpi2 (1 GiByte of RAM
> so even more swapping). On a powerpc64 with 16 GiBytes
> I've built things that caused 26 GiBytes of swap to be
> in use some of the time (during 4 ld's running in
> parallel), with lots of processes having zero for
> RES(ident memory) and all their space listed under SWAP
> in a "top -PCwaopid" display. This too has no problems
> with swapping of previously forked processes (or of any
> other processes).
>
> For the likes of a Pine64+ 2GB to be "self hosted"
> for source-code based updates, swapping of previously
> forked processes must work and currently such
> swapping is unreliable.
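For readers who want the shape of the experiment without the full program, here is a minimal sketch of the fork-then-wait-then-verify sequence described in this thread. It is not the ~110 line test program itself: the names fill_pattern, count_mismatches and fork_sleep_check, the byte pattern, and the use of malloc for the region are all assumptions made for illustration. The pattern never contains a zero byte, so a page that comes back zeroed shows up as a mismatch.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical pattern: every byte is in 1..255, never zero. */
static void fill_pattern(unsigned char *buf, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        buf[i] = (unsigned char)((i % 255u) + 1u);
}

/* Count bytes that no longer match the pattern. */
static size_t count_mismatches(const unsigned char *buf, size_t nbytes)
{
    size_t bad = 0;
    for (size_t i = 0; i < nbytes; i++)
        if (buf[i] != (unsigned char)((i % 255u) + 1u))
            bad++;
    return bad;
}

/*
 * Fill a region, fork, let the child sleep while the parent waits,
 * then have each process verify its own copy of the region.
 * Returns 0 if both copies are intact, 1 if either process saw a
 * mismatch, and -1 on a setup error.
 */
int fork_sleep_check(size_t nbytes, unsigned int sleep_seconds)
{
    assert(nbytes > 0);

    unsigned char *buf = malloc(nbytes);
    if (buf == NULL)
        return -1;
    fill_pattern(buf, nbytes);

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }

    if (pid == 0) {
        /* Child: this sleep is the window in which both
         * processes are forced out of RAM by outside load. */
        sleep(sleep_seconds);
        _exit(count_mismatches(buf, nbytes) != 0);
    }

    /* Parent: blocks here until the child has finished. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        free(buf);
        return -1;
    }
    int parent_bad = count_mismatches(buf, nbytes) != 0;
    free(buf);
    return (parent_bad || WEXITSTATUS(status) != 0) ? 1 : 0;
}
```

A main() that calls fork_sleep_check with a 60 second sleep would give time to run something like "stress -m 1 --vm-bytes 1800M" in another shell and to watch "top -PCwaopid" until both processes show zero RES. On a system where swapping of forked processes works, the call should return 0.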
===
Mark Millard
markmi at dsl-only.net