Re: -current on armv7 stuck with flashing disk light

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 27 Jun 2023 16:59:40 UTC
On Jun 27, 2023, at 09:47, Mark Millard <marklmi@yahoo.com> wrote:

> On Jun 27, 2023, at 09:29, bob prohaska <fbsd@www.zefox.net> wrote:
> 
>> On Mon, Jun 26, 2023 at 07:57:05PM -0700, Mark Millard wrote:
>>> On Jun 26, 2023, at 19:12, bob prohaska <fbsd@www.zefox.net> wrote:
>>> 
>>>> A Pi2 freshly updated to 
>>>> FreeBSD 14.0-CURRENT #41 main-c3e58ace31: Mon Jun 26 17:06:01 PDT 2023
>>>>  bob@www.zefox.com:/usr/obj/usr/src/arm.armv7/sys/GENERIC arm
>>>> got stuck with a flashing USB disk LED after starting a -j3 buildworld.
>>>> No response to debugger escape, had to pull the plug.
> 
> I'm confused.
> 
> That says "stuck with a flashing USB disk LED". But:
> 
> http://nemesis.zefox.com/~bob/fbsd/rpi2/20230623/readme
> 
> says: "the disk had gone to sleep mode. Both LEDs were off"
> 
> Are these two different examples with variable behavior
> across the examples?
> 
>>> If I understand right, the LED flashing means the disk
>>> had not stopped doing I/O: the system was still running,
>>> doing disk activity. (But I do not have a description
>>> of what your drive documentation says about how the
>>> drive handles the LED and what various patterns/colors
>>> may mean.)
>>> 
>>> If the processes associated with processing input that
>>> would identify the debugger escape had the kernel stacks
>>> involved swapped out to swap space, I doubt that the
>>> debugger escape would work until/unless the kernel
>>> stacks are brought back into kernel RAM.
>>> 
>>> Avoiding this specific way of losing control is why I
>>> have in /etc/sysctl.conf :
>>> 
>>> #
>>> # Together this pair avoids swapping out the process kernel stacks.
>>> # This keeps the processes used for interacting with the system from
>>> # being hung up by such swapping.
>>> vm.swap_enabled=0
>>> vm.swap_idle_enabled=0
>>> 
>> 
>> This combination was tried and didn't seem to have any consistent
>> effect. It's commented out at the moment.
> 
> By not having them, we have no way to know if the
> relevant kernel stacks had been moved to swap space.
> Having them is part of problem isolation/identification
> even when other forms of loss of control happen.
> 
> The 2 lines serve more than one goal.
> 
>>> (No claim such is the only way to lose control.)
>>> 
>>> You might be able to get a clue if there was disk I/O going
>>> on based on modification times on files you know would have
>>> been modified periodically for some time (minutes) before
>>> you pulled the plug, but not modified on reboot and later
>>> activity. Maybe a log file that would only be modified by
>>> the build that you had been trying to do?
>>> 
>> 
>> There are log files for build and disk activity (for a cold
>> hang, no disk activity at all) at
>> http://nemesis.zefox.com/~bob/fbsd/rpi2/20230623/
> 
> So this is a different hangup?

j4swapscript.log has internal timestamp pairs:

Wed Jun 21 16:34:06 PDT 2023
. . .
Fri Jun 23 07:26:10 PDT 2023

It would be interesting to know whether "Jun 23 07:26:10"
came after the apparent hangup was identified or
before.
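
One way to look for such evidence after a reboot is to list
files whose modification times are later than some reference
time. The following is only a rough sketch: modified_after is a
hypothetical helper name, and the directory and timestamp in the
usage comment are illustrative assumptions, not known values
from this hangup. The stamp uses the touch -t format
[[CC]YY]MMDDhhmm[.SS] in local time.

```shell
#!/bin/sh
# Sketch: list regular files under a directory that were modified
# after a given local time. modified_after is a hypothetical helper.
modified_after() {
    dir=$1
    stamp=$2
    ref=$(mktemp) || return 1
    # Give the scratch file the reference modification time.
    touch -t "$stamp" "$ref" || { rm -f "$ref"; return 1; }
    # -newer selects files with a modification time later than $ref's.
    find "$dir" -type f -newer "$ref"
    rm -f "$ref"
}

# Example use (directory and time are assumptions for illustration):
#   modified_after /usr/obj 202306230726.10
```

Files showing up with times between when the hang was first
noticed and when the plug was pulled would suggest that disk
activity was still happening.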

>> In this case the top window was via ssh. Lately I've
>> taken to running top on the serial console in hopes
>> that will help distinguish system hangs from USB hangs.
> 
> If you want to identify system hangs, please
> put back:
> 
> vm.swap_enabled=0
> vm.swap_idle_enabled=0
> 
> otherwise all you may be seeing is the relevant
> kernel stacks having been moved to swap space.
> That is not a form of system hang relative to
> overall activity, leaving more uncertainty about
> what top no longer displaying updates implies.
> 
> You can use sysctl to adjust the live context
> as well.
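
As a rough sketch of putting the pair back, assuming a
hypothetical helper name (ensure_setting) that is not part of
FreeBSD: append each line to a sysctl.conf-style file only if
an exact, uncommented copy is not already there. The live
adjustment via sysctl(8) is shown in the trailing comments and
would need to be run as root.

```shell
#!/bin/sh
# Sketch: make sure a setting line is present in a sysctl.conf-style
# file. ensure_setting is a hypothetical helper, not a FreeBSD tool.
ensure_setting() {
    conf=$1
    setting=$2
    # -x: match the whole line, -F: fixed string, -q: no output.
    # A commented-out copy ("#vm.swap_enabled=0") does not match.
    if ! grep -qxF "$setting" "$conf" 2>/dev/null; then
        printf '%s\n' "$setting" >> "$conf"
    fi
}

# Example use (as root):
#   ensure_setting /etc/sysctl.conf vm.swap_enabled=0
#   ensure_setting /etc/sysctl.conf vm.swap_idle_enabled=0
#
# For the live context (as root), without a reboot:
#   sysctl vm.swap_enabled=0
#   sysctl vm.swap_idle_enabled=0
```
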
> 
>> 
>>> (You did not indicate how long you let it run with the
>>> status "possibly hung up".)
>>> 
>> IIRC it was about half an hour. It was already stuck, so I
>> don't know the actual time
> 
> No logs or other files with modification times that
> might indicate if there was activity during that
> roughly 0.5 hr? (Timestamps in files can also serve.)
> 
>>>> Reboot with kernel.old,
>>>> FreeBSD 14.0-CURRENT #40 main-c1cbabe8ae: Tue Jun 20 03:58:47 PDT 2023
>>>>  bob@www.zefox.com:/usr/obj/usr/src/arm.armv7/sys/GENERIC arm
>>>> seems ok, I'll try to run buildworld with that.
>> 
>> The kernel.old  -j3 buildworld is still running, no complaints so far.
>> If it succeeds I'll experiment with usbtop.

===
Mark Millard
marklmi at yahoo.com