Re: Armv7 (rpi2) getting stuck in buildworld for -current
Date: Mon, 15 May 2023 03:12:23 UTC
On May 14, 2023, at 16:58, bob prohaska <fbsd@www.zefox.net> wrote:

> On Sun, May 14, 2023 at 12:31:29PM -0700, Mark Millard wrote:
>>
>>
>> In my environment, I use /etc/sysctl.conf , which
>> is a place appropriate for non-tunable but writable
>> sysctl values:
>>
>> # grep vm.swap_ /etc/sysctl.conf
>> vm.swap_enabled=0
>> vm.swap_idle_enabled=0
>>
>> I suggest moving the assignments to /etc/sysctl.conf .
>> I expect that this will get rid of your problem once
>> you reboot with them in a right place. (You can also
>> interactively set them via sysctl use.)
>>
>
> At some point in the past I did that and failed to clean
> up /boot/loader.conf .
>
>> I suggest avoiding confusions by not having copies of
>> those 2 lines in /boot/loader.conf (where they will
>> not work).
>>
> I elected to comment the incorrect lines out with a note
> indicating why. If I got confused once it may happen again.
>
> IIRC the lines were added because ssh connections tend to
> drop when the system gets busy. That's still happening, so
> they're not the cure, or at least not the whole cure.
>
>>> A running diary of experiments is at
>>> http://www.zefox.net/~fbsd/rpi2/crashes/20230514/armv7hang
>>
>> There you report reducing the swap space partition size.
>> Were you getting the message about the swap possibly being
>> mistuned prior to that?
>>
>> For 1 GiByte of RAM 3647M looks to me to likely be a little
>> below where that message about mistuning shows up. If you
>> were not getting the message, the size should have been
>> fine.
>>
>
> The last "too much swap" message I can find was:
> warning: total configured swap (1048576 pages) exceeds maximum recommended amount (922200 pages).
> Space was reserved for 4GB of swap, suggesting that only about 1.6 GB is recommended
> if I did the arithmetic right. Resizing the swap partition is easy and 1 GB should
> have been more than enough, but the machine stalled again with 30-odd MB in use.

My screwup: about 3.6*RAM_SIZE is for aarch64, not armv7.
armv7 is more like 1.7*RAM_SIZE. For armv7 I've used:

# gpart show -pl
. . .
     534528  3563520  da0p2  BPIM3swap  (1.7G)  # For 1 GiByte of RAM
. . .
    4311040  6291456  da0p3  BPIM3swp2  (3.0G)  # For 2 GiByte of RAM

Going in another direction:

Note that when top displays something, it is showing a point
in the past by the time you get to see it. "32M Used" need not
be even approximately true at the point of failure. And your
first top output shows "358M Used", indicating that swap use
staying as small as 32M over the whole build is not likely.

> In the distant past armv7 seemed to use little or no swap with a
> -j4 buildworld,

Not just armv7.

> now it seems to require at least some when building
> llvm. So far having too much swap hasn't caused visible problems,
> but that may have been an artifact of it not being used.
>
>> In other words, I expect it is appropriate to put back
>> the original size (or some approximation of it that
>> avoids the message about possibly being mistuned).

So much for that claim. Sorry.

>> Everything that you reported looks to me to be consistent
>> with some kernel stacks having been swapped out for some
>> processes/threads that would otherwise be involved in
>> interactive I/O activity.
>>
>
> For the moment I've updated /usr/src, set buildworld to -j4 and
> am expecting it to hang sometime overnight if the problem is
> repeatable. As I write this swap use is pushing 600MB

As with the "358M Used" figure, there is plenty of evidence
that expecting a -j4 build to use little swap space with
1 GiByte of RAM is not reasonable for FreeBSD and its use of
LLVM (even on/for armv7, as well as the other architectures).
This goes back a fair ways: the status is not a recent change.

I'm unsure whether you have avoided having any tmpfs-based
space or the like that would compete for RAM and use some of
the RAM+SWAP. In the low RAM environments, I avoid such
competition and use UFS exclusively.
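
As a rough cross-check of the armv7 swap partition sizes listed earlier in this message, the arithmetic can be sketched in sh. This is only an illustration: the helper names sectors_to_mib and swap_ceiling_mib are hypothetical, the sector counts are assumed to be in 512-byte sectors, and the 1.7 factor is the rule of thumb from this message, not a value read from the kernel.

```shell
#!/bin/sh
# Hypothetical helper: convert a gpart sector count to MiB.
# Assumes 512-byte sectors, i.e. 2048 sectors per MiB.
sectors_to_mib() {
    echo $(( $1 / 2048 ))
}

# Hypothetical helper: rule-of-thumb swap ceiling in MiB.
#   $1 = RAM size in MiB
#   $2 = factor in tenths (17 means 1.7*RAM_SIZE)
swap_ceiling_mib() {
    echo $(( $1 * $2 / 10 ))
}

sectors_to_mib 3563520      # da0p2, used with 1 GiByte of RAM
sectors_to_mib 6291456      # da0p3, used with 2 GiByte of RAM
swap_ceiling_mib 1024 17    # 1.7*RAM_SIZE for 1 GiByte of RAM
```

Under those assumptions the da0p2 size and the 1.7*RAM_SIZE figure for 1 GiByte of RAM both come out at 1740 MiB, and da0p3 comes out at 3072 MiB.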

I'll note that causing swap space thrashing can make builds
take longer. "Thrashing" is not directly the space used but
the frequency/backlog of swap space I/O. I always avoided
configurations that thrashed for notable periods of time, by
choosing the -j value, given that I'd already avoided RAM+SWAP
competition. But thrashing is also tied to the likes of
spinning rust vs., for example, NVMe-based USB media. It is
probably generally easier to make spinning rust thrash for
notable periods. I'd also avoided spinning rust.

> with ~60%
> idle time, which is far more than I recall seeing for armv7 in
> the past. It's still running, and the scheduler does seem to
> find threads to favor.
>
> The behavior starts to resemble aarch64 on a Pi3 but less extreme.
>
> For some reason the ssh session controlling buildworld
> tends to live longer than an ssh session running a tip connection
> to an adjacent Pi's serial console. Since the problem of dropped
> ssh connections hasn't been cured by use of
> vm.swap_enabled=0
> vm.swap_idle_enabled=0
> perhaps it's best to remove them, for sake of simplicity.
>

No. Removing them would just mean there would be more ways for
you to lose interactive control, including over a serial console
without ssh involved if you had such at the time, not just over
ssh sessions.

I never claimed there was only one cause of control loss. I have
claimed that these lines have been used by various folks to avoid
one mode of failure. (Sometimes one is lucky enough to have one
access path fail while another still works, so that one can
inspect to find out what the cause of the failed path was. Such
inspection has shown examples of kernel stacks swapped out. The
folks that added the lines cut down the frequency of, and the
conditions that would lead to, lack of access/control.)

Separately . . .

Your online file report says:

QUOTE
The disk activity light pulsed steadily, the the time display in top
stopped updating and the system was unresponsive to the
enter-tilda-control-B debugger escape.
END QUOTE

The disk activity light suggests that the system was still
doing the build and that what you lost was just interactive
control and interactive monitoring. If you could tolerate
waiting for it without access beyond the activity light, you
might have ended up with a completed build.

I'll also remind you that having one or more logs with an
overall high frequency of updates being written to the media
adds to the I/O issues.

===
Mark Millard
marklmi at yahoo.com