Re: -current on armv7 stuck with flashing disk light

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 27 Jun 2023 16:47:12 UTC

On Jun 27, 2023, at 09:29, bob prohaska <fbsd@www.zefox.net> wrote:

> On Mon, Jun 26, 2023 at 07:57:05PM -0700, Mark Millard wrote:
>> On Jun 26, 2023, at 19:12, bob prohaska <fbsd@www.zefox.net> wrote:
>> 
>>> A Pi2 freshly updated to 
>>> FreeBSD 14.0-CURRENT #41 main-c3e58ace31: Mon Jun 26 17:06:01 PDT 2023
>>>   bob@www.zefox.com:/usr/obj/usr/src/arm.armv7/sys/GENERIC arm
>>> got stuck with a flashing USB disk LED after starting a -j3 buildworld.
>>> No response to debugger escape, had to pull the plug.

I'm confused.

That says "stuck with a flashing USB disk LED". But:

http://nemesis.zefox.com/~bob/fbsd/rpi2/20230623/readme

says: "the disk had gone to sleep mode. Both LEDs were off"

Are these two different incidents, with the behavior
differing between them?

>> If I understand right, the LED flashing means the disk
>> had not stopped doing I/O: the system was still running,
>> doing disk activity. (But I do not have a description
>> of what your drive documentation says about how the
>> drive handles the LED and what various patterns/colors
>> may mean.)
>> 
>> If the processes associated with processing input that
>> would identify the debugger escape had the kernel stacks
>> involved swapped out to swap space, I doubt that the
>> debugger escape would work until/unless the kernel
>> stacks are brought back into kernel RAM.
>> 
>> Avoiding that specific way of losing control is why I
>> have in /etc/sysctl.conf:
>> 
>> #
>> # Together this pair avoids swapping out the process kernel stacks.
>> # This keeps the processes used for interacting with the system from
>> # being hung up by such swapping.
>> vm.swap_enabled=0
>> vm.swap_idle_enabled=0
>> 
> 
> This combination was tried and didn't seem to have any consistent
> effect. It's commented out at the moment.

By not having them, we have no way to know if the
relevant kernel stacks had been moved to swap space.
Having them is part of problem isolation/identification
even when other forms of loss of control happen.

The 2 lines serve more than one goal.

>> (No claim such is the only way to lose control.)
>> 
>> You might be able to get a clue if there was disk I/O going
>> on based on modification times on files you know would have
>> been modified periodically for some time (minutes) before
>> you pulled the plug --but not modified on reboot and later
>> activity. Maybe a log file that would only be modified by
>> the build that you had been trying to do?
>> 
> 
> There are log files for build and disk activity (for a cold
> hang, no disk activity at all) at
> http://nemesis.zefox.com/~bob/fbsd/rpi2/20230623/

So this is a different hangup?

> In this case the top window was via ssh. Lately I've
> taken to running top on the serial console in hopes
> that will help distinguish system hangs from USB hangs.

If you want to identify system hangs, please
put back:

vm.swap_enabled=0
vm.swap_idle_enabled=0

Otherwise, all you may be seeing is that the relevant
kernel stacks were moved to swap space. That is not a
system hang as far as overall activity goes, and it
leaves more uncertainty about what it means when top
stops displaying updates.

You can also use sysctl to change these settings on
the running system.
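
A minimal sketch of doing that as root, assuming both OIDs
are present on the kernel in use (the guard just reports
any that are not, instead of failing):

```sh
# Run as root. The two OID names are the same ones used in
# the /etc/sysctl.conf lines; nothing else is assumed.
for oid in vm.swap_enabled vm.swap_idle_enabled; do
    if sysctl -n "$oid" >/dev/null 2>&1; then
        # OID exists: set it to 0 on the live system.
        sysctl "$oid=0"
    else
        echo "$oid: not present on this kernel" >&2
    fi
done
```

Settings made this way do not survive a reboot, so the
/etc/sysctl.conf lines are still needed for that.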

> 
>> (You did not indicate how long you let it run with the
>> status "possibly hung up".)
>> 
> IIRC it was about half an hour. It was already stuck, so I
> don't know the actual time

No logs or other files with modification times that
might indicate whether there was activity during that
roughly half hour? (Timestamps inside files can also serve.)
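
One possible way to look for such files after the reboot is
sketched below. The function name recent_activity is made up
for illustration, and the directory and times in the usage
line are placeholders, not values from your logs. It lists
files whose modification time falls inside a window:

```sh
# recent_activity DIR START END   (hypothetical helper)
# Prints files under DIR modified after START and not after END.
# START and END use the touch -t format [[CC]YY]MMDDhhmm[.SS],
# interpreted in local time.
recent_activity() {
    dir=$1 start=$2 end=$3
    tmp=$(mktemp -d) || return 1
    # Reference files carrying the window boundaries as mtimes.
    touch -t "$start" "$tmp/start"
    touch -t "$end" "$tmp/end"
    # Skip the reference files themselves in case DIR contains them.
    find "$dir" -type f ! -path "$tmp/*" \
        -newer "$tmp/start" ! -newer "$tmp/end" -print
    rm -rf "$tmp"
}
```

Usage would be something like:

    recent_activity /usr/obj 202306261900 202306261930

with the window set to the period when the system looked
stuck. Any file listed would suggest disk writes were still
completing during that time.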

>>> Reboot with kernel.old,
>>> FreeBSD 14.0-CURRENT #40 main-c1cbabe8ae: Tue Jun 20 03:58:47 PDT 2023
>>>   bob@www.zefox.com:/usr/obj/usr/src/arm.armv7/sys/GENERIC arm
>>> seems ok, I'll try to run buildworld with that.
> 
> The kernel.old  -j3 buildworld is still running, no complaints so far.
> If it succeeds I'll experiment with usbtop.


===
Mark Millard
marklmi at yahoo.com