Question(s) related to "cylinder checksum failed" during "cp" of huge files (RPi4B EDK2 UEFI/ACPI context)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 29 Nov 2022 21:41:47 UTC
The questions below are only intended to gather background
information that might help me collect evidence for why certain
failure reports are generated in a particular (somewhat odd)
RPi4B context. I'm not suggesting any problems with the cylinder
group checksum checks themselves.

If I do a:

# cp -aRx larger-than-RAM.tar larger-than-RAM.tar.copied_via_RPi4B_C0T_Rev_1.5

in a UFS context with a file like, for example,

-rw-r--r--   1 root  wheel  27707039744 Nov 28 19:11:23 2022 larger-than-RAM.tar

I eventually get messages like the following before the system
completely fails during the attempted copy (I've been testing
only 13.1-STABLE so far):

UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 92, cgp: 0x0 != bp: 0x43bc4552
. . .
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 107, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 123, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 155, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 219, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 347, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 236, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 94, cgp: 0x0 != bp: 0x43bc4552
. . .
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 99, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 100, cgp: 0x0 != bp: 0x8b9e9592
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 101, cgp: 0x0 != bp: 0x43bc4552
. . .
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 293, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 295, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 296, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 298, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 302, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 310, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 326, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 358, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 55, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 183, cgp: 0x0 != bp: 0x43bc4552
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 72, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 297, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 298, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 299, cgp: 0xffffffff != bp: 0x544dd2da
UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 300, cgp: 0xffffffff != bp: 0x544dd2da
. . .
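
To summarize messages like the above when reporting, a small
Python sketch can tally them by (cgp, bp) pair and keep the cg
order. The helper name tally and the regular expression are
illustrative assumptions based only on the message format shown
here; the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Assumption: the message layout is exactly as in the excerpt above.
MSG_RE = re.compile(
    r"cylinder checksum failed: cg (\d+), "
    r"cgp: (0x[0-9a-fA-F]+) != bp: (0x[0-9a-fA-F]+)"
)


def tally(lines):
    """Count messages per (cgp, bp) pair and list cg numbers in order."""
    pairs = Counter()
    groups = []
    for line in lines:
        m = MSG_RE.search(line)
        if m is None:
            continue  # skip lines that are not checksum messages
        cg = int(m.group(1))
        cgp = int(m.group(2), 16)
        bp = int(m.group(3), 16)
        pairs[(cgp, bp)] += 1
        groups.append(cg)
    return pairs, groups


sample = [
    "UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 99, cgp: 0x0 != bp: 0x43bc4552",
    "UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 100, cgp: 0x0 != bp: 0x8b9e9592",
    "UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 295, cgp: 0xffffffff != bp: 0x544dd2da",
    "UFS /dev/ufs/rootfs (/) cylinder checksum failed: cg 296, cgp: 0xffffffff != bp: 0x544dd2da",
]

pairs, groups = tally(sample)
for (cgp, bp), count in pairs.most_common():
    print(f"cgp={cgp:#x} bp={bp:#x} count={count}")
print("cg order:", groups)
```

In the excerpt only three distinct (cgp, bp) pairs appear, and
cgp is always either 0x0 or 0xffffffff.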

(That such messages can be generated may allow better validation
that media I/O is well supported in the kernel for specific
contexts than has historically been possible. This may be an
example of that.)

(Note: My use of "-aRx" is habitual, not special to causing the
messages.)

I've not seen problems during basic normal operation, but that
might just be a context where it takes a long time to have a
significant probability of observing such a failure. The "cp"
activity above has never completed in this recent testing.

Are the above messages likely to come from the fairly recent
cylinder group validation updates? Or do they come from checks
that such "cp" activity would have exercised long before those
changes? (I've been guessing that the tests involved are fairly
new.)

Is there anything about when the checks happen (or do not
happen) relative to other activity during a "cp" that might be
important to gathering or reporting evidence? (There might be
other questions that I should ask but did not manage to think
of.)
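
One pattern in the excerpt may bear on the timing question: the
cg numbers advance by doubling steps and then continue linearly.
The Python sketch below reproduces that order. The group count
of 367 is an inference from the wrap-around points (347 to 236,
358 to 55), not something reported by the system, and
probe_order is a hypothetical helper, not kernel code. The order
resembles the rehash-then-sweep search that the FFS allocator
uses when looking for a cylinder group with free space, which
would suggest the checks fire on the allocation path; that
reading is an assumption to verify against the source.

```python
from itertools import islice


def probe_order(start, ncg):
    """Yield cg numbers: start, doubling steps, then a linear sweep."""
    cg = start
    yield cg
    step = 1
    while step < ncg:
        # Doubling-step phase: offsets 1, 3, 7, 15, ... from start.
        cg = (cg + step) % ncg
        yield cg
        step *= 2
    # Linear phase: begins two groups past the starting group.
    cg = (start + 2) % ncg
    for _ in range(2, ncg):
        yield cg
        cg = (cg + 1) % ncg


NCG = 367  # assumption: inferred from the excerpt, not reported

print(list(islice(probe_order(92, NCG), 11)))
print(list(islice(probe_order(295, NCG), 14)))
```

The second list matches the run of messages from cg 295 through
cg 300 in the excerpt, and the first matches the visible part of
the run that starts at cg 92.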

It seems unlikely that I'll get to the point of being able to
point at specific source code that has problems. But it would be
nice to have presented more/better evidence if I can gather
some.


Notes:

The oddity with the context is using EDK2 UEFI/ACPI instead
of U-Boot/DeviceTree. What is reported here only happens
with UEFI/ACPI. An apparently separate problem happens for
U-Boot and can happen after the above kind of messages for
UEFI/ACPI if the system manages to run long enough. The
only media is a USB3 SSD in my testing.

I first got such messages via use of:

FreeBSD-13.1-STABLE-arm64-aarch64-RPI-20221123-b51ee7ac252c-253133.img

but I also got such messages from my somewhat older builds that
have some past experimental patches for ACPI and DMA range
handling. I'd been using those builds for some time, but mostly
in a ZFS context as far as ongoing UEFI/ACPI use was concerned.
(I've reverted to U-Boot for that ongoing-use ZFS environment,
which I do not want corrupted.)

What started this was getting access to an 8 GiByte RPi4B that
no longer has the DMA size restrictions that the original
parts had (a "C0T" part instead of a "B0T" at the end of the
part identification printed on the SOC top). I was testing and
comparing vs. old "B0T" parts, repeating old experiments that
had originally shown the ACPI support did not work for the
"B0T" parts. I'd not run such tests in some time and all the
failures seem to be new types of evidence.

My normal builds had patches that, prior to this, I thought
were handling the "B0T" DMA range limitation associated with
XHCI, as presented via ACPI. Part of what I had intended was
to see if the behavior still looked good for the new "C0T"
RPi4B and if things were well behaved without the patches
(for official FreeBSD builds). Instead I found failures
spanning into the old type of tests done on a "B0T" RPi4B.

I had not thought to rerun the tests as the cylinder related
tests were being added. Too bad. This possibly could have
been noticed earlier.

So far, I've not seen problems via U-Boot/DeviceTree. So
that is what I now use in every context that I do not want
corrupted (to avoid likely having to regenerate it from
scratch). (My ZFS use is for bectl use, not redundancy.)

===
Mark Millard
marklmi at yahoo.com