lsi
Aurelien "beorn" ROUGEMONT
beorn at binaries.fr
Fri Mar 22 09:12:05 UTC 2019
On 3/22/19 10:06 AM, Aurelien "beorn" ROUGEMONT wrote:
> Hi the list,
>
> I have been using FreeBSD at home and in production for years, and today
> I stumbled upon a question I could not answer.
>
>
> Context
>
> -----------------------------------------
>
> I'm building a backup server on a machine with this HBA:
>
> 3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
>     Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
>     Flags: bus master, fast devsel, latency 0, IRQ 34
>     I/O ports at e000
>     Memory at fb160000 (64-bit, non-prefetchable)
>     Memory at fb100000 (64-bit, non-prefetchable)
>     Expansion ROM at fb140000 [disabled]
>     Capabilities: [50] Power Management version 3
>     Capabilities: [68] Express Endpoint, MSI 00
>     Capabilities: [d0] Vital Product Data
>     Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
>     Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
>     Capabilities: [100] Advanced Error Reporting
>     Capabilities: [1e0] Secondary PCI Express <?>
>     Capabilities: [1c0] Power Budgeting <?>
>     Capabilities: [190] Dynamic Power Allocation <?>
>     Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
>
> After pushing the server's I/O to its limits, the server had a very nasty
> crash.
>
> This happens very seldom: in roughly 10 years, across the petabytes of
> storage servers I have kept running, it has always been hardware or
> driver/firmware related.
>
> Shortening read at 4292967280 from 16 to 15
> ZFS: i/o error - all block copies unavailable
> ZFS: can't read object set for dataset 52
> ZFS: can't open root filesystem
> gptzfsboot: failed to mount default pool zroot
>
> After reinstalling the bootloaders (to no effect) and checking the
> partition tables, I dug deep into the FreeBSD codebase and found
> that it was a ZFS problem.
>
> The nasty crash was indeed due to ZFS data corruption, hence the
> checksum errors while scrubbing the zpool from a rescue network boot image:
>
>   pool: zroot
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub in progress since Fri Mar 15 15:15:25 2019
>         49.6G scanned out of 1.65T at 109M/s, 4h15m to go
>         677M repaired, 2.94% done
> config:
>
>         NAME               STATE   READ  WRITE  CKSUM
>         zroot              ONLINE     0      0      0
>           raidz2-0         ONLINE     0      0      0
>             mfisyspd0p3    ONLINE     0      0  5.44K  (repairing)
>             mfisyspd1p3    ONLINE     0      0  4.76K  (repairing)
>             mfisyspd10p3   ONLINE     0      0  4.35K  (repairing)
>             mfisyspd11p3   ONLINE     0      0  5.17K  (repairing)
>             mfisyspd2p3    ONLINE     0      0  4.76K  (repairing)
>             mfisyspd3p3    ONLINE     0      0  4.24K  (repairing)
>             mfisyspd4p3    ONLINE     0      0  4.75K  (repairing)
>             mfisyspd5p3    ONLINE     0      0  5.20K  (repairing)
>             mfisyspd6p3    ONLINE     0      0  4.51K  (repairing)
>             mfisyspd7p3    ONLINE     0      0  4.65K  (repairing)
>             mfisyspd8p3    ONLINE     0      0  4.70K  (repairing)
>             mfisyspd9p3    ONLINE     0      0  3.81K  (repairing)
>
> At this point the server was still unable to reboot. I had to force
> a re-copy of the boot data with a dumb:
>
> mv /boot{,.dist}
>
> cp -pr /boot{.dist,}
>
> Which turned out to be fine.
>
> Going further, I finally killed the zpool for good. It took me some time,
> and I stumbled upon the mfi(4) and mrsas(4) man pages and code.
>
> The mfi driver supports the following hardware:
>
> o LSI MegaRAID SAS 1078
>
> o LSI MegaRAID SAS 8408E
>
> o LSI MegaRAID SAS 8480E
>
> o LSI MegaRAID SAS 9240
>
> o LSI MegaRAID SAS 9260
>
> o Dell PERC5
>
> o Dell PERC6
>
> o IBM ServeRAID M1015 SAS/SATA
>
> o IBM ServeRAID M1115 SAS/SATA
>
> o IBM ServeRAID M5015 SAS/SATA
>
> o IBM ServeRAID M5110 SAS/SATA
>
> o IBM ServeRAID-MR10i
>
> o Intel RAID Controller SRCSAS18E
>
> o Intel RAID Controller SROMBSAS18E
>
>
> The mrsas driver supports the following hardware:
>
> [ Thunderbolt 6Gb/s MR controller ]
>
> o LSI MegaRAID SAS 9265
>
> o LSI MegaRAID SAS 9266
>
> o LSI MegaRAID SAS 9267
>
> o LSI MegaRAID SAS 9270
>
> o LSI MegaRAID SAS 9271
>
> o LSI MegaRAID SAS 9272
>
> o LSI MegaRAID SAS 9285
>
> o LSI MegaRAID SAS 9286
>
> o DELL PERC H810
>
> o DELL PERC H710/P
>
There was a detection priority problem: mfi(4) wins for this HBA, even
though the 9271 is on the mrsas(4) list.
The fix was to add hw.mfi.mrsas_enable=1 to /boot/loader.conf.
After this the server behaved correctly.
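
For anyone wanting to apply the same change, here is a minimal sketch that adds the tunable only if it is not already present. It assumes a POSIX sh; the function name enable_mrsas and its path argument are made up for illustration, and only the tunable name comes from the fix described above.

```shell
#!/bin/sh
# Hypothetical helper: append hw.mfi.mrsas_enable to a loader.conf-style
# file unless a line setting it already exists.
enable_mrsas() {
    conf="${1:-/boot/loader.conf}"
    if grep -q '^hw\.mfi\.mrsas_enable=' "$conf" 2>/dev/null; then
        echo "already set in $conf"
    else
        # Value is quoted, as is conventional in loader.conf.
        echo 'hw.mfi.mrsas_enable="1"' >> "$conf"
        echo "added to $conf"
    fi
}
```

Run as root, something like `enable_mrsas /boot/loader.conf` followed by a reboot should be enough; `dmesg | grep -E 'mfi|mrsas'` then shows which driver attached. The disks will probably show up under different device names once mrsas(4) takes over, so anything referring to mfisyspd* by name may need a look.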
Should it be fixed for everyone?
NB: sorry, my last email was mistakenly sent unfinished.