ns8250: UART FCR is broken, message might be misleading

From: John Hay <john_at_sanren.ac.za>
Date: Sat, 06 Jul 2024 18:36:09 UTC
Hi,

I have 3 machines running FreeBSD 14.0 (I recently upgraded one to 14.1).
All 3 have a uart with a GPS behind it, but no program reading from the
uart. I see the "ns8250: UART FCR is broken" message on all of them, on
average a little less than once per day.

For example, one machine has an uptime of 48 days. In that time the
message was printed 44 times, but there were 12846 overruns according to
the sysctl below:

dev.uart.2.rx_overruns: 12846

If the FCR was really broken, I would have expected the message to be
printed for every overrun.
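
To put numbers on that mismatch, here is a small sketch that uses only the
figures quoted above. The macro and helper names are made up for this
illustration and do not come from the driver.

```c
/* Figures quoted above for the machine with 48 days of uptime. */
#define UPTIME_DAYS	48
#define FCR_MESSAGES	44
#define RX_OVERRUNS	12846

/* Hypothetical helper: average "FCR is broken" messages per day. */
static double
messages_per_day(void)
{
	return ((double)FCR_MESSAGES / UPTIME_DAYS);
}

/* Hypothetical helper: share of overruns that also produced the message. */
static double
message_fraction_of_overruns(void)
{
	return ((double)FCR_MESSAGES / RX_OVERRUNS);
}
```

That works out to roughly 0.92 messages per day, and about 0.34% of the
overruns (around one message per 292 overruns).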

The 16550D documentation that I could find on the internet has this about
Bit 1 of the FIFO Control Register (FCR):

Bit 1: Writing a 1 to FCR1 clears all bytes in the RCVR FIFO
and resets its counter logic to 0. The shift register is not
cleared. The 1 that is written to this bit position is self-clearing.

So what I think is happening is that occasionally, when the RCVR FIFO is
cleared, a character has almost been received in the shift register.
Between the moment the FIFO is cleared and the moment LSR bit LSR_RXRDY is
checked, that character completes and shows up in the FIFO.
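
The suspected race can be sketched with a toy model. This is not driver
code: struct mock_uart and the mock_* functions are hypothetical and model
only the behaviour described in the datasheet text above (FCR bit 1 clears
the FIFO but not the shift register). The bit values are the standard 16550
ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define FCR_RCV_RST	0x02	/* FCR bit 1: clear the receiver FIFO */
#define LSR_RXRDY	0x01	/* LSR bit 0: receive data ready */

/* Hypothetical, heavily simplified model of a 16550 receiver. */
struct mock_uart {
	int	fifo_count;	/* bytes waiting in the RCVR FIFO */
	bool	shifting;	/* a character is partly in the shift register */
};

/* Writing FCR bit 1 empties the FIFO but leaves the shift register alone. */
static void
mock_write_fcr(struct mock_uart *u, uint8_t fcr)
{
	if (fcr & FCR_RCV_RST)
		u->fifo_count = 0;
}

/* The last bit arrives: the character moves from shift register to FIFO. */
static void
mock_finish_char(struct mock_uart *u)
{
	if (u->shifting) {
		u->shifting = false;
		u->fifo_count++;
	}
}

static uint8_t
mock_read_lsr(const struct mock_uart *u)
{
	return (u->fifo_count > 0 ? LSR_RXRDY : 0);
}

/*
 * Flush, let any in-flight character complete, then check LSR the way
 * ns8250_flush() does.  Returns true if the check would report a broken FCR.
 */
static bool
flush_reports_broken(struct mock_uart *u)
{
	mock_write_fcr(u, FCR_RCV_RST);
	mock_finish_char(u);	/* the race window */
	return ((mock_read_lsr(u) & LSR_RXRDY) != 0);
}
```

In this model a working FCR still triggers the report whenever a character
was in flight at flush time, and exactly one byte is left in the FIFO.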

The piece of code in ns8250_flush() looks like this:
<snip>
        uart_setreg(bas, REG_FCR, fcr);
        uart_barrier(bas);

        /*
         * Detect and work around emulated UARTs which don't implement the
         * FCR register; on these systems we need to drain the FIFO since
         * the flush we request doesn't happen.  One such system is the
         * Firecracker VMM, aka. the rust-vmm/vm-superio emulation code:
         * https://github.com/rust-vmm/vm-superio/issues/83
         */
        lsr = uart_getreg(bas, REG_LSR);
        if (((lsr & LSR_TEMT) == 0) && (what & UART_FLUSH_TRANSMITTER))
                drain |= UART_DRAIN_TRANSMITTER;
        if ((lsr & LSR_RXRDY) && (what & UART_FLUSH_RECEIVER))
                drain |= UART_DRAIN_RECEIVER;
        if (drain != 0) {
                printf("ns8250: UART FCR is broken\n");
                ns8250_drain(bas, drain);
        }
</snip>

So how do we distinguish between a real FCR error and this case? Maybe
ns8250_drain() could return the number of bytes it drained instead; if it
returned one, then it isn't an FCR error. Currently ns8250_drain() returns
0 on no error or EIO if there is a hardware problem. Maybe that could be
changed to return -EIO, and handled properly where its return value is used?
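
A rough sketch of that idea, against a stand-in receiver rather than the
real registers. Everything here is hypothetical: struct mock_rx,
drain_rx_count(), fcr_looks_broken() and the DRAIN_LIMIT bound are
illustrative names and values, not the existing ns8250_drain() interface.

```c
#include <errno.h>
#include <stdint.h>

#define LSR_RXRDY	0x01	/* LSR bit 0: receive data ready */
#define DRAIN_LIMIT	64	/* assumed bound; give up after this many reads */

/* Hypothetical stand-in for the receiver: only tracks bytes left to read. */
struct mock_rx {
	int	pending;
};

static uint8_t
rx_read_lsr(const struct mock_rx *rx)
{
	return (rx->pending > 0 ? LSR_RXRDY : 0);
}

static void
rx_read_data(struct mock_rx *rx)
{
	if (rx->pending > 0)
		rx->pending--;
}

/*
 * Hypothetical variant of the receiver drain: returns the number of bytes
 * read, or -EIO if the receiver never goes idle.
 */
static int
drain_rx_count(struct mock_rx *rx)
{
	int n = 0;

	while (rx_read_lsr(rx) & LSR_RXRDY) {
		if (n == DRAIN_LIMIT)
			return (-EIO);
		rx_read_data(rx);
		n++;
	}
	return (n);
}

/* Only complain when more than one byte survived the FIFO reset. */
static int
fcr_looks_broken(int drained)
{
	return (drained > 1);
}
```

With this shape, a single drained byte is treated as the race described
above, several drained bytes still produce the warning, and a negative
return stays distinguishable as a hardware error that the caller has to
handle separately.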

Note that these uarts are implemented on Xilinx/AMD FPGAs using the v2.0 IP
documented at the link below, but I think this can probably happen on other
16x50 uarts too.
https://docs.amd.com/v/u/en-US/pg143-axi-uart16550

Regards

John