Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 18 Feb 2023 00:23:59 UTC
On Feb 17, 2023, at 15:25, bob prohaska <fbsd@www.zefox.net> wrote:

> On Wed, Feb 15, 2023 at 11:39:13AM -0800, Mark Millard wrote:
>> On Feb 15, 2023, at 11:08, bob prohaska <fbsd@www.zefox.net> wrote:
>> 
>>> On Wed, Feb 15, 2023 at 09:40:51AM -0800, Mark Millard wrote:
>>>> 
>>>> Looking in my /usr/main-src/sbin/fsck_ffs/inode.c
>>>> I see that the original file has a leading tab
>>>> instead of spaces.
>>>> 
>>>> The following mostly ignores the 1st column that
>>>> should have a space, -, or + in the diff output for
>>>> the file-content lines. It is mostly about the text
>>>> after the first column.
>>>> 
>>>> So, if you have spaces instead after the first column
>>>> for the lines that start with a space, those lines
>>>> will not match, leading to a rejection for the
>>>> context matching done by patch.
>>> 
>>> Replacing spaces with tabs allowed patch to find the 
>>> location, but it still fails with 
>>> patch: **** malformed patch at line 5: printf("SIZE=%ju ", (uintmax_t)DIP(dp, di_size));
>> 
>> My guess is that when you made the adjustment to have
>> the tabs, the leading space was also removed on this
>> line. The first column is not part of the original
>> text but is instead a directive to the tool. The
>> missing space would be that directive and it needs to
>> be there. So:
>> 
>> <space><tab>printf("SIZE=%ju ", (uintmax_t)DIP(dp, di_size));
>> 
>> The space indicates to use the rest of the line just
>> for context identification.
>> 
>> Of course, since I've no access to the file to check my
>> hypothesis, it is just a guess.
>> 
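To make the first-column rule concrete, here is a minimal sh sketch
that classifies one hunk body line by its first character. The
function name hunk_line_kind is made up for illustration, and this is
a simplified model: it ignores header lines (---, +++, @@), and a real
patch(1) may be more lenient than this about damaged lines.

```shell
# hunk_line_kind LINE
# Classify one body line of a unified-diff hunk by its first column.
# (Hypothetical helper; a simplified model of the format, not of any
# particular patch implementation.)
hunk_line_kind() {
    case $1 in
        ' '*) echo context ;;   # unchanged line, used only for matching
        '+'*) echo add ;;       # line to be added
        '-'*) echo remove ;;    # line to be removed
        *)    echo unmarked ;;  # first-column directive is missing
    esac
}
```

A line that begins directly with a tab or with "printf" would come
back as "unmarked", which is the situation I was guessing at above.
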
>>> Editing by hand looks like a good way to drive myself crazy 8-)
> 
> Turns out to be true, but not in the manner expected. Editing in 
> the changes by hand seems to have worked, in that fsck_ffs recompiled
> and no longer segfaults when examining the -stable filesystem.
> 
> However, repeated runs of fsck continue to emit errors starting with
> root@www:/usr/src # fsck -y /dev/da1s2d
> ** /dev/da1s2d
> ** Last Mounted on /usr
> ** Phase 1 - Check Blocks and Sizes
> 7912408300994173476 BAD I=69393345
> 4313599915630302063 BAD I=69393345
> -4473632163892877928 BAD I=69393345
> 8068741989830080453 BAD I=69393345
> ....
> This continues through a succession of I values, 
> ending with  
> 
> .....
> 
> 3857159125896022134 BAD I=74682090
> -4354179704011695453 BAD I=74682090
> 7611175298055105740 BAD I=74682090
> 3985638883347136889 BAD I=74682090
> -2495754894521232470 BAD I=74682090
> 7739654885841380823 BAD I=74682090
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> LINK COUNT FILE I=69316035  OWNER=root MODE=100644
> SIZE=36680 MTIME=Feb 11 12:06 2023  COUNT 2 SHOULD BE 1
> ADJUST? yes
> 
> BAD/DUP FILE I=69393345  OWNER=root MODE=100644
> SIZE=720896 MTIME=Jul 22 23:00 2022 
> 
> CLEAR? yes
> 
> fsck_ffs: cglookup: out of range cylinder group 175966913
> root@www:/usr/src

Looks like that is one of the messages for problems
fsck_ffs does not attempt to deal with (probably for
good reasons in each case/context). The listing below
does not show the specific conditions, just the calls
with the message texts used for the various exits of
the "errx(EEXIT" form:

# grep -r "errx(EEXIT," /usr/main-src/sbin/fsck_ffs/ | more
/usr/main-src/sbin/fsck_ffs/pass5.c:                            errx(EEXIT, "BAD STATE %d FOR INODE I=%ju",
/usr/main-src/sbin/fsck_ffs/inode.c:            errx(EEXIT, "bad inode number %ju to ginode",
/usr/main-src/sbin/fsck_ffs/inode.c:            errx(EEXIT, "bad inode number %ju to nextinode",
/usr/main-src/sbin/fsck_ffs/inode.c:                    errx(EEXIT, "cannot allocate space for inode buffer");
/usr/main-src/sbin/fsck_ffs/inode.c:            errx(EEXIT, "cannot increase directory list");
/usr/main-src/sbin/fsck_ffs/inode.c:                    errx(EEXIT, "cannot increase directory list");
/usr/main-src/sbin/fsck_ffs/inode.c:            errx(EEXIT, "BAD STATE %d TO BLKERR", inoinfo(ino)->ino_state);
/usr/main-src/sbin/fsck_ffs/dir.c:              errx(EEXIT, "wrong type to dirscan %d", idesc->id_type);
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "inoinfo: inumber %ju out of range",
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "Initial malloc(%d) failed", sblock.fs_bsize);
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "%s", failreason);
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "cglookup: out of range cylinder group %d", cg);
/usr/main-src/sbin/fsck_ffs/fsutil.c:                   errx(EEXIT, "Cannot allocate cylinder group buffers");
/usr/main-src/sbin/fsck_ffs/fsutil.c:                   errx(EEXIT,"Ran out of memory during journal recovery");
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "Excessive buffer size %ld > %d\n", size,
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "panic: lost %d buffers", numbufs - cnt);
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "ABORTING DUE TO READ ERRORS");
/usr/main-src/sbin/fsck_ffs/fsutil.c:                   errx(EEXIT, "cannot allocate buffer pool");
/usr/main-src/sbin/fsck_ffs/fsutil.c:           errx(EEXIT, "UNKNOWN INODESC FIX MODE %d", idesc->id_fix);
/usr/main-src/sbin/fsck_ffs/pass4.c:                            errx(EEXIT, "BAD STATE %d FOR INODE I=%ju",
/usr/main-src/sbin/fsck_ffs/pass1.c:                    errx(EEXIT, "cannot alloc %u bytes for inoinfo",
/usr/main-src/sbin/fsck_ffs/pass1.c:                    errx(EEXIT, "cannot alloc %u bytes for inoinfo",
/usr/main-src/sbin/fsck_ffs/setup.c:                    errx(EEXIT, "cannot allocate space for snapshot "
/usr/main-src/sbin/fsck_ffs/setup.c:            errx(EEXIT, "cannot allocate space for superblock");
/usr/main-src/sbin/fsck_ffs/setup.c:            errx(EEXIT, "calcsb: cannot allocate recovery buffer");
/usr/main-src/sbin/fsck_ffs/main.c:                             errx(EEXIT, "cannot do level %d conversion",
/usr/main-src/sbin/fsck_ffs/main.c:                             errx(EEXIT, "bad mode to -m: %o", lfmode);
/usr/main-src/sbin/fsck_ffs/main.c:             errx(EEXIT, "-%c flag requires a %s", flag, req);
/usr/main-src/sbin/fsck_ffs/pass2.c:                    errx(EEXIT, "CANNOT ALLOCATE ROOT INODE");
/usr/main-src/sbin/fsck_ffs/pass2.c:                            errx(EEXIT, "CANNOT ALLOCATE ROOT INODE");
/usr/main-src/sbin/fsck_ffs/pass2.c:                            errx(EEXIT, "CANNOT ALLOCATE ROOT INODE");
/usr/main-src/sbin/fsck_ffs/pass2.c:            errx(EEXIT, "BAD STATE %d FOR ROOT INODE",
/usr/main-src/sbin/fsck_ffs/pass2.c:                    errx(EEXIT, "BAD STATE %d FOR INODE I=%ju",
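
To see the condition guarding any one of these exits, grep can print
some leading context for the message text. A rough sketch, where the
function name show_exit_context and the default of 4 context lines are
just my choices for illustration, and the default path assumes the
source tree location used above:

```shell
# show_exit_context MESSAGE [SRCDIR] [LINES]
# Print each match of MESSAGE under SRCDIR with LINES lines of leading
# context, so the test in front of the errx() call is visible.
# (Hypothetical helper; the defaults are assumptions.)
show_exit_context() {
    msg=$1
    srcdir=${2:-/usr/main-src/sbin/fsck_ffs}
    before=${3:-4}
    # -F: treat MESSAGE as a fixed string, -n: show line numbers,
    # -B: show that many lines before each match.
    grep -rn -F -B "$before" -- "$msg" "$srcdir"
}
```

For the message reported here, something like:

show_exit_context "cglookup: out of range cylinder group"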

> It's unclear whether the patch is preventing fsck
> from repairing the filesystem, or the problems are
> inherently beyond fixing.

Looks like it is in the do-not-fix category. If no
prior adjustments were made in the run, then things
have stayed as they were.

(These messages could be clearer about the status
that they imply and what one should do in response.)

> Repeated fsck runs seem
> to just reproduce the same output.

So, apparently, no prior adjustments were made in
the re-runs either.

> There's no prompt 
> to re-run fsck.  


I expect that is true of all the above "errx(EEXIT"
lines: the report is of a "did not fix" issue that
blocks progress.

> Thanks to both Marks for the patch and essential
> help in making it stick. If anything else is
> worth trying I'm game, there's little to lose.

I've no clue if there is more to try. But, even if
there is, there may be other issues/constraints that
lead to not bothering to try?

Beyond that, floating-point use in multi-threading
contexts looks to be significantly broken in main
[so: 14] for now. (This was involved in your FreeBSD
crash, based on what the backtrace showed.)

If you try to set up another armv7 context, I suggest,
for now, staying before:

commit 6926e2699ae55080f860488895a2a9aa6e6d9b4d
Author: Kornel Dulęba <kd@FreeBSD.org>
AuthorDate: 2023-02-04 12:59:30 +0000
Commit: Kornel Dulęba <kd@FreeBSD.org>
CommitDate: 2023-02-04 19:21:43 +0000

    arm: Add support for using VFP in kernel

This would be until a list of issues has been
addressed. I've reported how to produce 3
distinct failures: 2 of them hit KASSERT
panics and the other ends up with
floating-point values from the wrong thread
(but same process). More may be identified
and fixed before things generally work again
for main for armv7 FreeBSD.

===
Mark Millard
marklmi at yahoo.com