MRSAS driver/LSI MegaRAID 92XX-93XX admin question: When one of the RAID's physical drives breaks, how is it reported in the logs?

Thu Feb 18 17:33:31 UTC 2016

On Wed, Feb 17, 2016 at 02:38:10PM +0700, Tinker wrote:
| Hi Doug,
| 
| Would you mind sharing your kernel patch for that functionality (if I
| understand you right, you patched your kernel to route the events
| to dmesg)?

I need to do some work on mrsas stuff at work, so I plan to sync our
changes to -current etc.

I'll send them to you.

Doug A.

| On 2016-02-17 07:00, Doug Ambrisko wrote:
| > On Sun, Feb 14, 2016 at 10:13:31PM +0700, Tinker wrote:
| > | (Will send any followup from now only to freebsd-scsi@ .)
| > |
| > | Did some additional research and found that the disk failure indeed
| > | is reported in MRSAS' "event log".
| > |
| > | So my final question then is, how do you extract it into userland (in
| > | the absence of an "mfiutil" as the MFI driver has)?
| > 
| > I have local changes to print the event log in dmesg, which gets
| > syslogged.  We then watch syslog for issues to report things to our
| > customers automatically.  This is similar to mfi(4).
| > 
| > Thanks,
| > 
| > Doug A.
| > | Details below. Thanks.
| > |
| > | On 2016-02-14 19:59, Tinker wrote:
| > | [...]
| > | > http://www.cisco.com/c/dam/en/us/td/docs/unified_computing/ucs/3rd-party/lsi/mrsas/userguide/LSI_MR_SAS_SW_UG.pdf
| > | > on page 305, that is section "A.2 Event Messages" - I don't know
| > | > which LSI chip this document is for, but it does not clearly list
| > | > a particular event message for when an individual underlying disk
| > | > has broken. I don't even see any event for when a hot spare is
| > | > taken into use!
| > |
| > |
| > | Wait - this page:
| > |
| > | https://www.schirmacher.de/display/Linux/Replace+failed+disk+in+MegaRAID+array
| > |
| > | (and also
| > | http://serverfault.com/questions/485147/drive-is-failing-but-lsi-megaraid-controller-does-not-detect-it
| > | )
| > |
| > | gives an example of how the host system learns about broken disks:
| > |
| > |
| > | Code: 0x00000051 .. Event Description: State change on VD 00/1 from
| > | OPTIMAL(3) to DEGRADED(2)
| > |
| > |
| > | Code: 0x00000072 .. Event Description: State change on PD 05(e0xfc/s0)
| > | from ONLINE(18) to FAILED(11)
| > |
| > | (An unclean disk failure seems to be shown as:)
| > |
| > | Code: 0x00000071 .. Event Description: Unexpected sense: PD 05(e0xfc/s0)
| > | Path 4433221103000000, CDB: 2e 00 3a 38 1b c7 00 00 01 00, Sense:
| > | b/00/00
| > |
| > |
| > | And this version of the LSI documentation
| > |
| > | http://hwraid.le-vert.net/raw-attachment/wiki/LSIMegaRAIDSAS/megacli_user_guide.pdf
| > |
| > | gives a clearer definition of the physical and virtual drive states in
| > | "1.4.16 Physical Drive States"
| > | and "1.4.17 Virtual Disk States" on pages 1-11 to 1-12.
| > |
| > | So as we see, a physical drive breaking would
| > |
| > |   * set the physical drive to "FAILED"
| > |
| > |   * set the Virtual Drive (that is, the exported logical drive) to
| > |     "DEGRADED" (from "OPTIMAL")
| > |
| > |
| > | So then, it is indeed the card's "event log" that contains this info.
| > |
| > |
| > |
| > | The only remaining question, then, is *where* FreeBSD's MRSAS driver
| > | sends its event log.
| > |
| > |
| > |
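
The syslog-watching approach described above could be sketched roughly
as follows. This is only an illustration, not the actual patch or
monitoring script: it assumes the syslogged lines carry the same
"State change on PD/VD ..." text as the MegaCLI-style event
descriptions quoted in this thread, and the names STATE_CHANGE,
ALERT_STATES and find_alerts are made up for the example.

```python
import re

# Pattern modeled on the event descriptions quoted in this thread, e.g.
#   State change on PD 05(e0xfc/s0) from ONLINE(18) to FAILED(11)
#   State change on VD 00/1 from OPTIMAL(3) to DEGRADED(2)
# ASSUMPTION: the real mrsas/syslog output may be formatted differently.
STATE_CHANGE = re.compile(
    r"State change on (?P<kind>PD|VD) (?P<target>\S+) "
    r"from (?P<old>[A-Z_ ]+)\(\d+\) "
    r"to (?P<new>[A-Z_ ]+)\(\d+\)"
)

# ASSUMPTION: the two states named in the thread are the ones worth alerting on.
ALERT_STATES = {"FAILED", "DEGRADED"}


def find_alerts(lines):
    """Return (kind, target, old_state, new_state) for alarming transitions."""
    alerts = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match and match.group("new") in ALERT_STATES:
            alerts.append(
                (
                    match.group("kind"),
                    match.group("target"),
                    match.group("old"),
                    match.group("new"),
                )
            )
    return alerts
```

Any iterable of log lines can be passed to find_alerts(), for example
an open log file or sys.stdin, so the same function works for a
one-off scan and for a pipe fed by a log follower.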