lsi
Aurelien "beorn" ROUGEMONT
beorn at binaries.fr
Fri Mar 22 09:06:22 UTC 2019
Hi list,
I have been using FreeBSD at home and in production for years, and today
I ran into a question I could not answer.
Context
-----------------------------------------
I'm building a backup server on a machine with this HBA:
3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
Flags: bus master, fast devsel, latency 0, IRQ 34
I/O ports at e000
Memory at fb160000 (64-bit, non-prefetchable)
Memory at fb100000 (64-bit, non-prefetchable)
Expansion ROM at fb140000 [disabled]
Capabilities: [50] Power Management version 3
Capabilities: [68] Express Endpoint, MSI 00
Capabilities: [d0] Vital Product Data
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [1e0] Secondary PCI Express <?>
Capabilities: [1c0] Power Budgeting <?>
Capabilities: [190] Dynamic Power Allocation <?>
Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
After pushing the server's I/O to its limits, the machine had a very nasty
crash.
This happens very seldom. In roughly 10 years of running petabytes of
storage servers, it has always been hardware or driver/firmware
related.
Shortening read at 4292967280 from 16 to 15
ZFS: i/o error - all block copies unavailable
ZFS: can't read object set for dataset 52
ZFS: can't open root filesystem
gptzfsboot: failed to mount default pool zroot
After reinstalling the bootloaders (which changed nothing) and checking the
partition tables, I dug through the FreeBSD codebase and found that it was
a ZFS problem.
The nasty crash was indeed due to ZFS data corruption, hence the
checksum errors while scrubbing the zpool from a rescue network boot image:
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Mar 15 15:15:25 2019
        49.6G scanned out of 1.65T at 109M/s, 4h15m to go
        677M repaired, 2.94% done
config:

        NAME              STATE     READ WRITE CKSUM
        zroot             ONLINE       0     0     0
          raidz2-0        ONLINE       0     0     0
            mfisyspd0p3   ONLINE       0     0 5.44K  (repairing)
            mfisyspd1p3   ONLINE       0     0 4.76K  (repairing)
            mfisyspd10p3  ONLINE       0     0 4.35K  (repairing)
            mfisyspd11p3  ONLINE       0     0 5.17K  (repairing)
            mfisyspd2p3   ONLINE       0     0 4.76K  (repairing)
            mfisyspd3p3   ONLINE       0     0 4.24K  (repairing)
            mfisyspd4p3   ONLINE       0     0 4.75K  (repairing)
            mfisyspd5p3   ONLINE       0     0 5.20K  (repairing)
            mfisyspd6p3   ONLINE       0     0 4.51K  (repairing)
            mfisyspd7p3   ONLINE       0     0 4.65K  (repairing)
            mfisyspd8p3   ONLINE       0     0 4.70K  (repairing)
            mfisyspd9p3   ONLINE       0     0 3.81K  (repairing)
At this point the server was still unable to boot. I had to force the
boot files to be rewritten with a crude:
mv /boot{,.dist}
cp -pr /boot{.dist,}
That turned out to be enough.
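The move-then-copy trick above can be wrapped in a small function. This is only a sketch: rewrite_tree is a hypothetical helper name, not something from FreeBSD, and it assumes the point of the exercise is simply that cp writes every file out again as new data. It avoids brace expansion so it also works in plain /bin/sh.

```shell
#!/bin/sh
# rewrite_tree DIR (hypothetical helper, not part of FreeBSD):
# move DIR aside as DIR.dist, then copy it back so every file is written anew.
rewrite_tree() {
    dir=${1%/}    # strip one trailing slash, if any
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    [ ! -e "$dir.dist" ] || { echo "refusing to overwrite $dir.dist" >&2; return 1; }
    mv "$dir" "$dir.dist" || return 1
    # -R recurses, -p preserves mode, ownership and timestamps
    cp -pR "$dir.dist" "$dir"
}
```

Calling rewrite_tree /boot would reproduce the two commands above. The .dist copy is left in place, so it has to be removed by hand once the machine boots again.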
Going further, I eventually killed the zpool for good. It took me some time,
but I came across the mfi(4) and mrsas(4) man pages and code.
The mfi driver supports the following hardware:
o LSI MegaRAID SAS 1078
o LSI MegaRAID SAS 8408E
o LSI MegaRAID SAS 8480E
o LSI MegaRAID SAS 9240
o LSI MegaRAID SAS 9260
o Dell PERC5
o Dell PERC6
o IBM ServeRAID M1015 SAS/SATA
o IBM ServeRAID M1115 SAS/SATA
o IBM ServeRAID M5015 SAS/SATA
o IBM ServeRAID M5110 SAS/SATA
o IBM ServeRAID-MR10i
o Intel RAID Controller SRCSAS18E
o Intel RAID Controller SROMBSAS18E
The mrsas driver supports the following hardware:
[ Thunderbolt 6Gb/s MR controller ]
o LSI MegaRAID SAS 9265
o LSI MegaRAID SAS 9266
o LSI MegaRAID SAS 9267
o LSI MegaRAID SAS 9270
o LSI MegaRAID SAS 9271
o LSI MegaRAID SAS 9272
o LSI MegaRAID SAS 9285
o LSI MegaRAID SAS 9286
o DELL PERC H810
o DELL PERC H710/P
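There was a detection priority problem: the 9271 is on the mrsas(4) list,
yet mfi(4) attached to it (hence the mfisyspd* device names above). This
tunable gives mrsas(4) priority:
hw.mfi.mrsas_enable=1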
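For reference, a minimal sketch of where such a tunable is usually set. The post only gives the tunable name, so the choice of /boot/loader.conf here is an assumption (mrsas(4) also describes /boot/device.hints):

```
# /boot/loader.conf
hw.mfi.mrsas_enable="1"
```

After a reboot, pciconf -lv should show which driver attached to the controller (mrsas0 rather than mfi0). Disks can appear under different device names once mrsas(4) takes over, so it is worth checking how the pool's vdevs are referenced before switching on a machine that boots from ZFS.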