Investigating jh7100 (VisionFive) MMC issue

From: JMT Sihvola <jsihv_at_gmx.com>
Date: Tue, 07 Feb 2023 13:27:21 UTC
Hello,

Here is a description of my efforts to investigate what is missing from 
jh7100 (VisionFive) MMC host driver

--

the driver file is here:
https://github.com/mhorne/freebsd/blob/beaglev/sys/riscv/
starfive/starfive_if_dwc.c

I use device trees provided in Linux repository:
https://github.com/starfive-tech/linux/blob/
594a662ba20ee2234e2c09a4b25d79f8fb99cfef/arch/riscv/boot/dts/starfive/

I also have my own GPIO and reset drivers in use plus one clock added.
These may affect the boot process. Ask me if interested.

--

Without "sc->use_pio = 1" in dwmmc_starfive.c the execution halts at msleep()
call which is located in the file sys/dev/mmc/mmc.c

It seems that it remains waiting for an external event which never occurs.

After switching pio mode on (like is done in several other board specific
dwmmc driver codes) booting does not stall anymore but we get an error message:
mmcsd0: Error indicated: 4 Failed

Possibly related to this error – which appears multiple times during the boot –
mountroot fails (error 19)

Using printfs to track the execution suggests that the earlier manifestation of
mmscd0's error 4 happens in the function dwmmc_intr() which is found in a file
sys/dev/mmc/host/dwmmc.c. The error number 4 matches to this error code
(MMC_ERR_FAILED defined in mmcreg.h) and printfs rule out other places where
the same error is assigned.

Printing opcodes and register data out at this point indicates that opcodes used
when the error manifests are 17, MMC_READ_SINGLE_BLOCK and 18,
MMC_READ_MULTIPLE_BLOCK and MINTSTS register has bits 3 and 7 on. These bits
indicate data_transfer_over_interrupt and data_crc_error_interrupt according
to two datasheets I've found. So it looks like there is a CRC error. 

Printfs reveal that during the boot many data packets are transferred.
CRC error appears with some of these.

Data packets transfered during the boot consists of 3735929054 sequences
(0xDEADC0DE), only tails of transfers may be different:

example 1:
...
3735929054
8309528
4294967232
9

example 2:
...
3735929054
8313624
4294967232
3735929054

Curiously sometimes the very last row is another 3735929054 sequence instead of
9 (example 2) but this difference is seen between different boots and concerns
all the transfers instead of marking a difference between error indicating and
non-error indicating transfers.

The quantity of data transferred seems to be always correct (...at least if
value 9 is not an error)

Printing out all the possible variables related to these data transfers (struct
mmc_command) does not seem to reveal much differences between error giving and
non-error giving transfers. Only suspectible variable I have found is "arg"
variable. It is usually 0 and sometimes 2 when the error occurs. With errorless
transfers (when opcodes are 17, 18) these values seem not to appear.

Tracking this arg variable takes us to function mmcsd_rw() in file mmscd.c.
There arg gets its value from bp->bio_pblkno (physical block address) via
"block" variable. This argument comes from function's parameter struct bio *bp.
Ultimately the origin of this value seems to lie in the struct bio's bio_offset
which is initialised somewhere in geom system (files like geom_slice.c,
geom_disk.c)

Might this path be worth of studying further...

Another hypothesis is that the error is related to bus/clock rates:

fedora linux boot prints:
[    3.071120] mmc_host mmc0: Bus speed (slot 0) = 100000000Hz
(slot req 400000Hz, actual 400000HZ div = 125)
[    3.168274] mmc_host mmc0: Bus speed (slot 0) = 100000000Hz
(slot req 50000000Hz, actual 50000000HZ div = 1)
[    9.240888] mmc_host mmc1: Bus speed (slot 0) = 123750000Hz
(slot req 400000Hz, actual 399193HZ div = 155)
[    9.398503] mmc_host mmc1: Bus speed (slot 0) = 123750000Hz
(slot req 50000000Hz, actual 30937500HZ div = 2)

freebsd boot prints (with my own printfs):
dwmmc_setup_bus(), div: 1, bus_hz: 100000000, freq: 50000000
dwmmc_setup_bus(), div: 125, bus_hz: 100000000, freq: 400000
dwmmc_setup_bus(), div: 155, bus_hz: 123750000, freq: 400000

It seems one combination of rates is missing from FreeBSD (or the booting
simply doesn't proceed far enough?) ...though it's related to mmc1 while
the relevant action seems to take place in mmc0.

I wonder if those rates seen are actually supported by FreeBSD. In some previous
cases FreeBSD has used different rates to linux due to less extensive drivers.

So here's where I currently stand. Tips are welcome.

-- 
Jari Sihvola <jsihv at gmx.com>